We propose to extend the ab initio self-training algorithms for eukaryotic gene finding developed in the previous grant period in several important directions. First we will upgrade this algorithm to a multilevel data mining approach to allow construction of a consistent "genome- transcriptome-proteome" data structure at the early stages of a genome project. Here, we will compensate for an information deficit in various segments of experimental data (such as EST data) by unsupervised machine learning on existing and abundant data segments (an anonymous genomic sequence) with subsequent computational modeling of missing biological information (protein-coding genes and proteins). An important new feature of the self-training algorithm will be the utilization of protein level information to monitor and increase biological relevance of the models derived by the unsupervised iterative algorithm. Second, we will enhance the self-training algorithm developed earlier on a smaller scale and tested on fungal and other "compact" eukaryotic genomes (such as Caenorhabditis elegans and Drosophila melanogaster) to work with most complex eukaryotic genomes. At this higher level of complexity we see species with host genes occupying just a small fraction of genome which can be inhomogeneous in GC composition, populated with transposable elements and pseudogenes (besides animal genomes, genomes of some fungal pathogens as well as human parasites and their vectors fall into this category). Third, for the human microbiome containing bacterial, archaeal, viral and fungal species, situated at yet another end of the genome in homogeneity spectrum, we will develop improved algorithms and tools for ab initio gene identification. This work will be done in close contact with sequencing and annotation groups from leading genome centers both in the US and abroad.

Public Health Relevance

Rational systems biology, cancer cure, vaccine development, drug design, is impossible without understanding genomic DNA in human cell. Gene prediction is a cornerstone of biological interpretation of DNA sequence. The goal of this proposal is developing automatic and accurate gene prediction algorithms for the most complex genomic sequences important for human health.

National Institute of Health (NIH)
National Human Genome Research Institute (NHGRI)
Research Project (R01)
Project #
Application #
Study Section
Biodata Management and Analysis Study Section (BDMA)
Program Officer
Bonazzi, Vivien
Project Start
Project End
Budget Start
Budget End
Support Year
Fiscal Year
Total Cost
Indirect Cost
Georgia Institute of Technology
Engineering (All Types)
Schools of Engineering
United States
Zip Code
Lomsadze, Alexandre; Burns, Paul D; Borodovsky, Mark (2014) Integration of mapped RNA-Seq reads into automatic training of eukaryotic gene finding algorithm. Nucleic Acids Res 42:e119
Wu, G Albert; Prochnik, Simon; Jenkins, Jerry et al. (2014) Sequencing of diverse mandarin, pummelo and orange genomes reveals complex history of admixture during citrus domestication. Nat Biotechnol 32:656-62
Borodovsky, Mark; Lomsadze, Alex (2014) Gene identification in prokaryotic genomes, phages, metagenomes, and EST sequences with GeneMarkS suite. Curr Protoc Microbiol 32:Unit 1E.7.
Burns, Paul D; Li, Yang; Ma, Jian et al. (2014) UnSplicer: mapping spliced RNA-Seq reads in compact genomes and filtering noisy splicing. Nucleic Acids Res 42:e25
Li, Yang; Li-Byarlay, Hongmei; Burns, Paul et al. (2013) TrueSight: a new algorithm for splice junction detection using RNA-seq. Nucleic Acids Res 41:e51
Antonov, Ivan; Baranov, Pavel; Borodovsky, Mark (2013) GeneTack database: genes with frameshifts in prokaryotic genomes and eukaryotic mRNA sequences. Nucleic Acids Res 41:D152-6
Tang, Shiyuyun; Antonov, Ivan; Borodovsky, Mark (2013) MetaGeneTack: ab initio detection of frameshifts in metagenomic sequences. Bioinformatics 29:114-6
Antonov, Ivan; Coakley, Arthur; Atkins, John F et al. (2013) Identification of the nature of reading frame transitions observed in prokaryotic genomes. Nucleic Acids Res 41:6514-30
Borodovsky, Mark; Lomsadze, Alex (2011) Gene identification in prokaryotic genomes, phages, metagenomes, and EST sequences with GeneMarkS suite. Curr Protoc Bioinformatics Chapter 4:Unit 4.5.1-17
Borodovsky, Mark; Lomsadze, Alex (2011) Eukaryotic gene prediction using GeneMark.hmm-E and GeneMark-ES. Curr Protoc Bioinformatics Chapter 4:Unit 4.6.1-10

Showing the most recent 10 out of 43 publications