The goal of the project is to build more accurate and powerful DNA sequence interpretation algorithms, drawing on the experience and ideas of the proven GeneMark and GeneMark.hmm methods. We plan to improve the quality of gene finding in prokaryotic genomes in terms of reliable and accurate prediction of gene starts and detection of frameshift sequencing errors. We also plan to develop an iterative machine-learning procedure for deriving all the models needed for precise gene prediction and annotation from totally anonymous prokaryotic sequences. For eukaryotic species, we will improve the accuracy of the ab initio method GeneMark.hmm by building more accurate models for splice sites and initiation/termination sites, and we will address the problem of accurately finding intergenic regions with polyadenylation sites and promoters. On the basis of GeneMark.hmm, we plan to develop an integrated gene finding approach by "projecting" pieces of diverse extrinsic evidence onto the DNA level, then translating them into DNA patterns and combining these patterns with statistical patterns of coding and non-coding DNA sequence within a generalized HMM. The most intriguing sources of this additional information are evolutionarily conserved regions in DNA sequences of closely related species, functional motifs in protein sequences, and protein sequence patterns reflecting three-dimensional structural motifs. All these newly developed methods, as well as several others mentioned in the proposal, will deal with anonymous DNA, for which interpretation is increasingly needed in the post-genomic era.

Agency
National Institutes of Health (NIH)
Institute
National Human Genome Research Institute (NHGRI)
Type
Research Project (R01)
Project #
5R01HG000783-09
Application #
6536458
Study Section
Genome Study Section (GNM)
Program Officer
Bonazzi, Vivien
Project Start
1993-03-15
Project End
2004-07-19
Budget Start
2002-07-01
Budget End
2004-07-19
Support Year
9
Fiscal Year
2002
Total Cost
$318,645
Indirect Cost
Name
Georgia Institute of Technology
Department
Biology
Type
Schools of Arts and Sciences
DUNS #
097394084
City
Atlanta
State
GA
Country
United States
Zip Code
30332
Lomsadze, Alexandre; Gemayel, Karl; Tang, Shiyuyun et al. (2018) Modeling leaderless transcription and atypical genes results in more accurate gene prediction in prokaryotes. Genome Res 28:1079-1089
Hoff, Katharina J; Lange, Simone; Lomsadze, Alexandre et al. (2016) BRAKER1: Unsupervised RNA-Seq-Based Genome Annotation with GeneMark-ET and AUGUSTUS. Bioinformatics 32:767-9
Tatusova, Tatiana; DiCuccio, Michael; Badretdin, Azat et al. (2016) NCBI prokaryotic genome annotation pipeline. Nucleic Acids Res 44:6614-24
Tang, Shiyuyun; Lomsadze, Alexandre; Borodovsky, Mark (2015) Identification of protein coding regions in RNA transcripts. Nucleic Acids Res 43:e78
Wu, G Albert; Prochnik, Simon; Jenkins, Jerry et al. (2014) Sequencing of diverse mandarin, pummelo and orange genomes reveals complex history of admixture during citrus domestication. Nat Biotechnol 32:656-62
Borodovsky, Mark; Lomsadze, Alex (2014) Gene identification in prokaryotic genomes, phages, metagenomes, and EST sequences with GeneMarkS suite. Curr Protoc Microbiol 32:Unit 1E.7.
Lomsadze, Alexandre; Burns, Paul D; Borodovsky, Mark (2014) Integration of mapped RNA-Seq reads into automatic training of eukaryotic gene finding algorithm. Nucleic Acids Res 42:e119
Burns, Paul D; Li, Yang; Ma, Jian et al. (2014) UnSplicer: mapping spliced RNA-Seq reads in compact genomes and filtering noisy splicing. Nucleic Acids Res 42:e25
Li, Yang; Li-Byarlay, Hongmei; Burns, Paul et al. (2013) TrueSight: a new algorithm for splice junction detection using RNA-seq. Nucleic Acids Res 41:e51
Antonov, Ivan; Baranov, Pavel; Borodovsky, Mark (2013) GeneTack database: genes with frameshifts in prokaryotic genomes and eukaryotic mRNA sequences. Nucleic Acids Res 41:D152-6

Showing the most recent 10 out of 48 publications