The maize genome sequence is the knowledge infrastructure for the next generation of plant molecular genetics, and will provide the foundation for improving maize and other cereal crops. A broad understanding of the genes present in maize would provide the identities, and eventually the map positions, of many of the genes that control agronomically important traits. However, the products of ongoing and future maize sequencing projects are collections of large contiguous nucleotide segments for which there is no a priori knowledge of content or function. High-throughput computational tools that can accurately identify genes within maize genomic sequence are therefore essential for annotating and understanding the maize genome. Computational gene prediction has been an active area of research for many years. It includes similarity-based approaches, which align expressed sequences such as ESTs and cDNAs to the genome, and predictive approaches, whose only biological inputs are genomic sequences.
A significant improvement in gene prediction accuracy has come from dual-genome prediction programs, which integrate traditional probability models like those underlying GENSCAN and FGENESH with information from alignments between two genomes. The essential idea is that functional sequences, such as protein-coding regions and splice sites, show different patterns of evolutionary conservation than sequences under little selective pressure, such as the central regions of introns. One of the most accurate dual-genome prediction programs is TWINSCAN. This project will improve gene prediction in maize by identifying a comprehensive "training set" of complete, annotated maize gene models and using these to optimize TWINSCAN to accurately identify maize genes in unannotated maize genome sequence.
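To make the dual-genome idea concrete, the sketch below shows one way alignment information can be reduced to a per-base "conservation sequence" for a prediction model to read alongside the target genome. It is a simplified illustration, not TWINSCAN's actual code: the function name, the input layout, and the handling of gaps are assumptions chosen for clarity. Each target base is labeled `|` (aligned and matching), `:` (aligned but mismatched, or opposite a gap in the second genome), or `.` (not covered by any alignment).

```python
def conservation_sequence(target_len, alignments):
    """Build a per-base conservation string for a target sequence.

    alignments is a list of (target_start, aligned_target, aligned_informant)
    tuples, where the two aligned strings have equal length and use '-' for
    gaps. This layout is a hypothetical convenience, not a real file format.
    """
    symbols = ["."] * target_len  # default: base is not aligned at all
    for start, tgt, inf in alignments:
        if len(tgt) != len(inf):
            raise ValueError("aligned strings must have equal length")
        pos = start
        for t, i in zip(tgt, inf):
            if t == "-":
                continue  # gap in the target: no target base is consumed
            if i != "-" and t.upper() == i.upper():
                symbols[pos] = "|"  # aligned and identical
            else:
                symbols[pos] = ":"  # aligned mismatch or gap in the informant
            pos += 1
    return "".join(symbols)


if __name__ == "__main__":
    # One local alignment covering target positions 2 through 5.
    print(conservation_sequence(10, [(2, "ACG-T", "ACTTT")]))
```

A model that sees long runs of `|` interrupted at roughly every third base can weigh that pattern as evidence for coding sequence, whereas runs of `.` or `:` are more typical of regions under little selective pressure.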
Maize-trained TWINSCAN will be thoroughly benchmarked and used to re-annotate available public maize genomic sequence. The results of the benchmarking and re-annotation will be transparent to the scientific community, and maize-trained TWINSCAN will be made publicly available through an open-source software agreement. Access information will be posted at www.maizegenome.org/.
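As an illustration of what benchmarking a gene predictor can involve, the sketch below computes exon-level sensitivity and specificity, two measures commonly reported for gene prediction programs. The project description does not state which metrics will be used, so the choice of metric, the function name, and the exact-match criterion are all assumptions. Here an exon counts as correct only if both of its boundaries match an annotated exon, and "specificity" follows the gene-prediction convention of meaning the fraction of predicted exons that are correct.

```python
def exon_accuracy(predicted, annotated):
    """Return (sensitivity, specificity) at the exon level.

    predicted and annotated are iterables of (start, end) coordinate pairs.
    An exon is a true positive only when both boundaries match exactly.
    """
    pred, ref = set(predicted), set(annotated)
    true_pos = len(pred & ref)
    # Sensitivity: fraction of annotated exons that were predicted exactly.
    sensitivity = true_pos / len(ref) if ref else 0.0
    # Specificity (gene-prediction usage): fraction of predictions that are correct.
    specificity = true_pos / len(pred) if pred else 0.0
    return sensitivity, specificity


if __name__ == "__main__":
    # Hypothetical coordinates, for illustration only.
    predicted = [(100, 200), (300, 400), (500, 650)]
    annotated = [(100, 200), (300, 400), (500, 600), (700, 800)]
    sn, sp = exon_accuracy(predicted, annotated)
    print(f"sensitivity={sn:.2f} specificity={sp:.2f}")
```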