Gene Prediction by Markov Models and Complementary Methods

Borodovsky, Mark

Abstract

We propose to extend, in several important directions, the ab initio self-training algorithms for eukaryotic gene finding developed in the previous grant period. First, we will upgrade this algorithm to a multilevel data mining approach that allows construction of a consistent "genome-transcriptome-proteome" data structure at the early stages of a genome project. Here, we will compensate for an information deficit in some segments of experimental data (such as EST data) by unsupervised machine learning on existing and abundant data segments (anonymous genomic sequence), with subsequent computational modeling of the missing biological information (protein-coding genes and proteins). An important new feature of the self-training algorithm will be the use of protein-level information to monitor and increase the biological relevance of the models derived by the unsupervised iterative algorithm. Second, we will enhance the self-training algorithm, developed earlier on a smaller scale and tested on fungal and other "compact" eukaryotic genomes (such as Caenorhabditis elegans and Drosophila melanogaster), to work with the most complex eukaryotic genomes. At this higher level of complexity we see species in which host genes occupy just a small fraction of a genome that can be inhomogeneous in GC composition and populated with transposable elements and pseudogenes (besides animal genomes, genomes of some fungal pathogens as well as human parasites and their vectors fall into this category). Third, for the human microbiome, which contains bacterial, archaeal, viral and fungal species and is situated at yet another end of the genome inhomogeneity spectrum, we will develop improved algorithms and tools for ab initio gene identification. This work will be done in close contact with sequencing and annotation groups from leading genome centers both in the US and abroad.

Public Health Relevance

Rational systems biology, cancer cures, vaccine development, and drug design are impossible without understanding genomic DNA in the human cell. Gene prediction is a cornerstone of the biological interpretation of DNA sequence. The goal of this proposal is to develop automatic and accurate gene prediction algorithms for the most complex genomic sequences important for human health.

Funding Agency

Agency: National Institutes of Health (NIH)
Institute: National Human Genome Research Institute (NHGRI)
Type: Research Project (R01)
Project #: 3R01HG000783-16S2
Application #: 8909702
Study Section: Biodata Management and Analysis Study Section (BDMA)
Program Officer: Bonazzi, Vivien

Project Start: 1993-03-15
Project End: 2015-03-31
Budget Start: 2014-08-21
Budget End: 2015-03-31
Support Year: 16
Fiscal Year: 2014
Total Cost: $100,000
Indirect Cost: $35,856

Institution

Name: Georgia Institute of Technology
Department: Engineering (All Types)
Type: Schools of Engineering
DUNS #: 097394084

City: Atlanta
State: GA
Country: United States
Zip Code: 30332

Publications

Lomsadze, Alexandre; Gemayel, Karl; Tang, Shiyuyun et al. (2018) Modeling leaderless transcription and atypical genes results in more accurate gene prediction in prokaryotes. Genome Res 28:1079-1089

Hoff, Katharina J; Lange, Simone; Lomsadze, Alexandre et al. (2016) BRAKER1: Unsupervised RNA-Seq-Based Genome Annotation with GeneMark-ET and AUGUSTUS. Bioinformatics 32:767-9

Tatusova, Tatiana; DiCuccio, Michael; Badretdin, Azat et al. (2016) NCBI prokaryotic genome annotation pipeline. Nucleic Acids Res 44:6614-24

Tang, Shiyuyun; Lomsadze, Alexandre; Borodovsky, Mark (2015) Identification of protein coding regions in RNA transcripts. Nucleic Acids Res 43:e78

Wu, G Albert; Prochnik, Simon; Jenkins, Jerry et al. (2014) Sequencing of diverse mandarin, pummelo and orange genomes reveals complex history of admixture during citrus domestication. Nat Biotechnol 32:656-62

Borodovsky, Mark; Lomsadze, Alex (2014) Gene identification in prokaryotic genomes, phages, metagenomes, and EST sequences with GeneMarkS suite. Curr Protoc Microbiol 32:Unit 1E.7

Lomsadze, Alexandre; Burns, Paul D; Borodovsky, Mark (2014) Integration of mapped RNA-Seq reads into automatic training of eukaryotic gene finding algorithm. Nucleic Acids Res 42:e119

Burns, Paul D; Li, Yang; Ma, Jian et al. (2014) UnSplicer: mapping spliced RNA-Seq reads in compact genomes and filtering noisy splicing. Nucleic Acids Res 42:e25

Li, Yang; Li-Byarlay, Hongmei; Burns, Paul et al. (2013) TrueSight: a new algorithm for splice junction detection using RNA-seq. Nucleic Acids Res 41:e51

Antonov, Ivan; Baranov, Pavel; Borodovsky, Mark (2013) GeneTack database: genes with frameshifts in prokaryotic genomes and eukaryotic mRNA sequences. Nucleic Acids Res 41:D152-6

Showing the most recent 10 out of 48 publications
