Co-PIs: Karin Dorman (Iowa State University), Shannon Schlueter (University of North Carolina - Charlotte), and Shailesh Lal (Oakland University)
Senior Personnel: Jon Duvick and Yasser El-Manzalawy (Iowa State University)
The premise of this project is that the scale of sequence and other data accumulation in plant genomics necessitates the development of novel, highly automated, scalable, comprehensive, and accurate approaches to genome annotation. The depth of transcript data accumulating for many plant species under numerous experimental conditions provides unprecedented evidence for the evaluation of all aspects of transcription, including precise mapping of transcription start sites as well as dominant and alternative splice sites. This project engages a team of experts in a wide range of fields, including genomics, molecular biology, bioinformatics, statistics, machine learning, high-performance computing, and software engineering, to work jointly toward a solution for accurately predicting the expressed protein-coding gene transcriptome from plant genome sequences. Successful completion of the project will result in the deployment of (1) software that implements the novel prediction algorithms, (2) visualization and data access portals, and (3) a cyberinfrastructure environment implementation of the developed tools for distributed computing, sharing of protocols, and analysis provenance recording. In the long run, the project seeks to explore the extent to which genomic biology can transition from a largely descriptive to a highly predictive science driven by quantitative measurements, with algorithms and computation as the domain-adapted language.
The project will generate standardized, accurate protein-coding gene structure annotation for 25 plant genomes spanning a wide range of the phylogenetic spectrum. Initial emphasis will be on improved annotation of recently sequenced genomes, which will benefit the entire community of researchers working on these important crops. The anticipated algorithms for transcriptome prediction will be essential to the analysis of the thousands of complete plant genome sequences likely to become available within the next few years. Through the development of reliable gold standard annotations and the dissemination of training and test sets for algorithmic development, a larger community of computational data analysts, in particular from the machine learning community, will be engaged. All software developed and data generated in this research are freely available through project Web sites, in particular www.plantgdb.org. The project's plan for integration of research and education will train a new generation of scientists to work on genomics data with the broad range of interdisciplinary approaches represented by the project team.