As the Human Genome Project accelerates, and as other eukaryotic sequencing projects near or reach completion, the need for automatic annotation of genes and other genomic sequence features is becoming increasingly urgent. Considerable research has been devoted to developing programs that find genes in genomic sequence data from both eukaryotes and prokaryotes, and related efforts have been dedicated to finding signals such as splice sites, ribosome binding sites, terminators, promoters, and transcription factor binding sites. Programs that find genes by alignment have become a standard gene-finding tool, taking advantage of both the increasing quantity of genomic DNA sequence and large databases of expressed sequence tags (ESTs). The rapid outpouring of complete microbial genomes (16 thus far) has added tens of thousands of genes to the public databases, making gene finding by sequence comparison more effective than ever. However, experience with the most recently completed genomes shows that 30-40 percent of the genes in a new genome have no similarity to any previously catalogued gene. De novo gene prediction therefore remains a crucial tool, and continued research is needed to improve the computational techniques underlying gene finding methods.
This project will develop state-of-the-art gene finding systems that achieve several objectives: (1) integration of the best computational techniques for gene finding, including hidden Markov models (HMMs), interpolated Markov models (IMMs), and decision trees, into a single gene finding system; (2) development of new techniques for recognizing starts of translation, splice junctions, polyadenylation sites, and other sequence patterns, and their integration into the gene finder; (3) integration of sequence homology information into the gene finding algorithm; (4) development of a platform-independent display tool that shows all annotation, including alternative interpretations of a sequence; and (5) release of this software for free use by the genomics research community. The approach to developing these bioinformatics tools will be interdisciplinary, building on the PI's experience in computer science and his NIH-supported training in genomics.
Magoč, Tanja; Wood, Derrick; Salzberg, Steven L (2013) EDGE-pro: Estimated Degree of Gene Expression in Prokaryotic Genomes. Evol Bioinform Online 9:127-36
Schatz, Michael C; Phillippy, Adam M; Sommer, Daniel D et al. (2013) Hawkeye and AMOS: visualizing and assessing the quality of genome assemblies. Brief Bioinform 14:213-24
Salzberg, Steven L; Phillippy, Adam M; Zimin, Aleksey et al. (2012) GAGE: A critical evaluation of genome assemblies and assembly algorithms. Genome Res 22:557-67
Lipman, David; Flicek, Paul; Salzberg, Steven et al. (2011) Closure of the NCBI SRA and implications for the long-term future of genomics data storage. Genome Biol 12:402
Magoč, Tanja; Salzberg, Steven L (2011) FLASH: fast length adjustment of short reads to improve genome assemblies. Bioinformatics 27:2957-63
Shulaev, Vladimir; Sargent, Daniel J; Crowhurst, Ross N et al. (2011) The genome of woodland strawberry (Fragaria vesca). Nat Genet 43:109-16
Florea, Liliana; Souvorov, Alexander; Kalbfleisch, Theodore S et al. (2011) Genome assembly has a major impact on gene content: a comparison of annotation in two Bos taurus assemblies. PLoS One 6:e21400
Angiuoli, Samuel V; Salzberg, Steven L (2011) Mugsy: fast multiple alignment of closely related whole genomes. Bioinformatics 27:334-42
Brady, Arthur; Salzberg, Steven (2011) PhymmBL expanded: confidence scores, custom databases, parallelization and more. Nat Methods 8:367
Kim, Daehwan; Salzberg, Steven L (2011) TopHat-Fusion: an algorithm for discovery of novel fusion transcripts. Genome Biol 12:R72
Showing the most recent 10 out of 112 publications