The primary goal of a genome project is the identification and functional characterization of the entire catalog of genes within a particular species. While genome sequencing projects have provided a wealth of data, finding and cataloging genes and their variants remains a significant challenge. For this reason, among others, the sequencing of Expressed Sequence Tags (ESTs) derived from gene transcripts remains an important tool for biological inquiry despite the growing number of species for which genome sequencing projects have been initiated. ESTs provide valuable data for gene identification in sequenced species, provide evidence for non-coding but biologically important genomic features, and, in many species, provide the only information available about gene content. The Gene Index databases (TGI; http://compbio.dfci.harvard.edu/tgi/plant.html) were developed to provide a high-quality, publicly available analysis of EST sequences and currently represent more than 34 different plant species. The TGI provide a consistent view of these data across species. This project will more than double the number of plant and plant parasite species represented in the TGI databases. The databases and the associated software tools will facilitate a wide range of plant functional genomics studies, will assist in the identification of genes that can be used, for example, in plant breeding and the study of pathogen resistance, and will contribute to the annotation of plant genomes that will be sequenced in the coming years. All methods developed through this proposal will be implemented in freely available, open-source software tools. This will allow other researchers to faithfully reproduce the TGI at their home institutions and will offer an alternative approach to the development and maintenance of gene indices.

Broader Impact

Over the years, the TGI project has made a number of important contributions to the research community, including the creation of a heavily used and well-cited public collection of databases, widely used software tools for the analysis of EST data, and the training of a number of students and postdocs. The databases created through this project will continue to be available without restriction through http://compbio.dfci.harvard.edu/tgi/plant.html, and web services access will allow other plant databases to link more effectively to the resources provided through the TGI. In the coming years, the databases are expected to see far more than the nearly 15 million web hits they received in 2006. This project will continue to support collaboration with a variety of plant genome research groups, to welcome their personnel as visitors in order to link resources across projects more effectively, and to offer workshops on the use of the TGI resources.

Project Report

DBI-0649614 PI: John Quackenbush

Since the completion of the first draft human genome sequence in 2000, the cost of sequencing genomes has fallen dramatically and the speed has increased just as sharply. Projects that would have taken years and cost tens or hundreds of millions of dollars can now be completed in weeks or months at a cost of a few tens of thousands of dollars. This has made it possible to apply genomics to the study of a wide range of species, including plants and animals that are economically important or interesting from a scientific or evolutionary perspective. As a result, we now have whole-genome sequence from many diverse species. However, finding genes within a species' genome and defining their structure remains an ongoing challenge. Even with complete genome sequence, the best evidence is often direct gene sequencing using RNA transcripts from a range of tissues. The Gene Index Project (http://compbio.dfci.harvard.edu/tgi/) was funded by a grant from the National Science Foundation (NSF DBI-0649614) to use RNA transcript sequence data to reconstruct genes and their structures for a large number of species. During the period of NSF support, we developed software, databases, and web resources providing access to the reconstructed gene sequences for 131 different species: 60 plants, 46 animals (including plant pests), 15 protists, and 10 fungi. The web-based gene index database is accessed millions of times each year and has been instrumental in developing an understanding of the genetic content of these organisms, providing insight into how they evolved, how they develop, and how they adapt to a wide range of challenges. An essential part of this process was developing software to reconstruct gene sequences. The TGICL2 software we developed for this purpose is freely available as open-source software through SourceForge (http://sourceforge.net/projects/tgicl/).
The software has been downloaded 3,872 times, including 1,385 times in 2012 alone. We have also trained a large number of students and postdoctoral fellows, presented this work at national and international scientific meetings, and collaborated with scientists on projects ranging from sequencing the maize genome to understanding hibernation in the black bear. We have published six scientific papers directly describing the work done in the context of this project:

1. Antonescu C, Antonescu V, Sultana R, Quackenbush J. Using the DFCI gene index databases for biological discovery. Current protocols in bioinformatics / editorial board, Andreas D Baxevanis [et al]. 2010;Chapter 1:Unit 1.6.1-36. Epub 2010/03/06. doi: 10.1002/0471250953.bi0106s29. PubMed PMID: 20205187.
2. Cannon EK, Birkett SM, Braun BL, Kodavali S, Jennewein DM, Yilmaz A, Antonescu V, Antonescu C, Harper LC, Gardiner JM, Schaeffer ML, Campbell DA, Andorf CM, Andorf D, Lisch D, Koch KE, McCarty DR, Quackenbush J, Grotewold E, Lushbough CM, Sen TZ, Lawrence CJ. POPcorn: An Online Resource Providing Access to Distributed and Diverse Maize Project Data. International journal of plant genomics. 2011;2011:923035. Epub 2012/01/19. doi: 10.1155/2011/923035. PubMed PMID: 22253616; PubMed Central PMCID: PMC3255282.
3. Danley PD, Mullen SP, Liu F, Nene V, Quackenbush J, Shaw KL. A cricket Gene Index: a genomic resource for studying neurobiology, speciation, and molecular evolution. BMC genomics. 2007;8:109. Epub 2007/04/27. doi: 10.1186/1471-2164-8-109. PubMed PMID: 17459168; PubMed Central PMCID: PMC1878485.
4. Fedorov VB, Goropashnaya AV, Toien O, Stewart NC, Gracey AY, Chang C, Qin S, Pertea G, Quackenbush J, Showe LC, Showe MK, Boyer BB, Barnes BM. Elevated expression of protein biosynthesis genes in liver and muscle of hibernating black bears (Ursus americanus). Physiological genomics. 2009;37(2):108-18. Epub 2009/02/26. doi: 10.1152/physiolgenomics.90398.2008. PubMed PMID: 19240299.
5. Guerrero FD, Miller RJ, Rousseau ME, Sunkara S, Quackenbush J, Lee Y, Nene V. BmiGI: a database of cDNAs expressed in Boophilus microplus, the tropical/southern cattle tick. Insect biochemistry and molecular biology. 2005;35(6):585-95. Epub 2005/04/29. doi: 10.1016/j.ibmb.2005.01.020. PubMed PMID: 15857764.
6. Lee Y, Tsai J, Sunkara S, Karamycheva S, Pertea G, Sultana R, Antonescu V, Chan A, Cheung F, Quackenbush J. The TIGR Gene Indices: clustering and assembling EST and known genes and integration with eukaryotic genomes. Nucleic acids research. 2005;33(Database issue):D71-4. Epub 2004/12/21. doi: 10.1093/nar/gki064. PubMed PMID: 15608288; PubMed Central PMCID: PMC540018.

Agency
National Science Foundation (NSF)
Institute
Division of Integrative Organismal Systems (IOS)
Application #
0649614
Program Officer
Diane Jofuku Okamuro
Project Start
Project End
Budget Start
2007-09-01
Budget End
2012-08-31
Support Year
Fiscal Year
2006
Total Cost
$2,566,000
Indirect Cost
Name
Dana-Farber Cancer Institute
Department
Type
DUNS #
City
Boston
State
MA
Country
United States
Zip Code
02215