The general objective of the proposed project is to develop an improved computer algorithm for predicting gene locations in newly sequenced DNA. The problem is well known but remains far from solved. The new approach uses both splicing site information and coding/noncoding DNA sequence information in the form of stochastic models. The specific aims are:
1) Choose the most efficient type of nonstationary Markov chain model of protein coding regions (exons), based on statistical analysis of previously compiled learning sets of eukaryotic DNA and a goodness-of-fit test. Likewise, determine the most efficient type of ordinary Markov chain model of noncoding DNA sequences (introns) from analysis of the intron learning set.
2) Extract an improved set of parameters for calculating the discrimination energy (an estimate of the relative activity of a splicing site) from an expanded learning set of known splicing sites.
3) Combine and enhance the splicing site stochastic models and the models of coding/noncoding DNA sequences (joined in a Bayes-type algorithm that computes the coding potential of a DNA fragment) into a new multistage method for identifying gene locations.
4) After evaluating the method's accuracy, scaling the decision-making thresholds, improving computational performance, and creating an interactive environment for the method, make the software available to the scientific community.
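
The Bayes-type combination of coding and noncoding models described above can be illustrated with a minimal sketch. It is not the project's implementation. It assumes that "nonstationary" can be approximated by a three-periodic chain (one transition table per codon position), uses first-order chains with add-one smoothing, a uniform initial base distribution, and equal priors, and scores only a single reading frame. All function names and the toy sequences are hypothetical.

```python
import math

BASES = "ACGT"

def train_markov(seqs, periodic):
    """Estimate first-order transition probabilities with add-one smoothing.

    periodic=True keeps one table per codon position (assumed stand-in for a
    nonstationary coding model); periodic=False keeps a single table
    (ordinary Markov chain for noncoding DNA).
    """
    n_phases = 3 if periodic else 1
    counts = [{a: {b: 1.0 for b in BASES} for a in BASES}
              for _ in range(n_phases)]
    for seq in seqs:
        for i in range(1, len(seq)):
            counts[i % n_phases][seq[i - 1]][seq[i]] += 1.0
    model = []
    for table in counts:
        model.append({a: {b: table[a][b] / sum(table[a].values())
                          for b in BASES} for a in BASES})
    return model

def log_likelihood(seq, model):
    """Log P(seq | model), assuming a uniform first base and frame 0."""
    n_phases = len(model)
    ll = math.log(0.25)
    for i in range(1, len(seq)):
        ll += math.log(model[i % n_phases][seq[i - 1]][seq[i]])
    return ll

def coding_potential(seq, coding_model, noncoding_model, prior_coding=0.5):
    """Posterior P(coding | seq) from Bayes' rule over the two hypotheses."""
    log_c = log_likelihood(seq, coding_model) + math.log(prior_coding)
    log_n = log_likelihood(seq, noncoding_model) + math.log(1.0 - prior_coding)
    m = max(log_c, log_n)  # subtract the max to avoid underflow
    pc, pn = math.exp(log_c - m), math.exp(log_n - m)
    return pc / (pc + pn)

# Toy learning sets (made up for illustration only).
exons = ["ATGGCTGCTGCTGCTGCTGCT"]
introns = ["GTAAGTTTTTATTTTTATTTTTAG"]

coding_model = train_markov(exons, periodic=True)
noncoding_model = train_markov(introns, periodic=False)

exon_like = coding_potential("GCTGCTGCTGCT", coding_model, noncoding_model)
intron_like = coding_potential("TTTTATTTTT", coding_model, noncoding_model)
```

A fuller method would presumably score all reading frames on both strands, choose the model order by a goodness-of-fit test as aim 1 proposes, and combine the result with splicing site scores. None of that is attempted here.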

Agency
National Institutes of Health (NIH)
Institute
National Human Genome Research Institute (NHGRI)
Type
Research Project (R01)
Project #
5R01HG000783-03
Application #
2209036
Study Section
Genome Study Section (GNM)
Project Start
1993-03-15
Project End
1997-02-28
Budget Start
1995-03-01
Budget End
1997-02-28
Support Year
3
Fiscal Year
1995
Total Cost
Indirect Cost
Name
Georgia Institute of Technology
Department
Biology
Type
Schools of Arts and Sciences
DUNS #
097394084
City
Atlanta
State
GA
Country
United States
Zip Code
30332
Lomsadze, Alexandre; Gemayel, Karl; Tang, Shiyuyun et al. (2018) Modeling leaderless transcription and atypical genes results in more accurate gene prediction in prokaryotes. Genome Res 28:1079-1089
Hoff, Katharina J; Lange, Simone; Lomsadze, Alexandre et al. (2016) BRAKER1: Unsupervised RNA-Seq-Based Genome Annotation with GeneMark-ET and AUGUSTUS. Bioinformatics 32:767-9
Tatusova, Tatiana; DiCuccio, Michael; Badretdin, Azat et al. (2016) NCBI prokaryotic genome annotation pipeline. Nucleic Acids Res 44:6614-24
Tang, Shiyuyun; Lomsadze, Alexandre; Borodovsky, Mark (2015) Identification of protein coding regions in RNA transcripts. Nucleic Acids Res 43:e78
Wu, G Albert; Prochnik, Simon; Jenkins, Jerry et al. (2014) Sequencing of diverse mandarin, pummelo and orange genomes reveals complex history of admixture during citrus domestication. Nat Biotechnol 32:656-62
Borodovsky, Mark; Lomsadze, Alex (2014) Gene identification in prokaryotic genomes, phages, metagenomes, and EST sequences with GeneMarkS suite. Curr Protoc Microbiol 32:Unit 1E.7.
Lomsadze, Alexandre; Burns, Paul D; Borodovsky, Mark (2014) Integration of mapped RNA-Seq reads into automatic training of eukaryotic gene finding algorithm. Nucleic Acids Res 42:e119
Burns, Paul D; Li, Yang; Ma, Jian et al. (2014) UnSplicer: mapping spliced RNA-Seq reads in compact genomes and filtering noisy splicing. Nucleic Acids Res 42:e25
Li, Yang; Li-Byarlay, Hongmei; Burns, Paul et al. (2013) TrueSight: a new algorithm for splice junction detection using RNA-seq. Nucleic Acids Res 41:e51
Antonov, Ivan; Baranov, Pavel; Borodovsky, Mark (2013) GeneTack database: genes with frameshifts in prokaryotic genomes and eukaryotic mRNA sequences. Nucleic Acids Res 41:D152-6

Showing the most recent 10 out of 48 publications