We propose to extend the ab initio self-training algorithms for eukaryotic gene finding developed in the previous grant period in several important directions. First we will upgrade this algorithm to a multilevel data mining approach to allow construction of a consistent """"""""genome- transcriptome-proteome"""""""" data structure at the early stages of a genome project. Here, we will compensate for an information deficit in various segments of experimental data (such as EST data) by unsupervised machine learning on existing and abundant data segments (an anonymous genomic sequence) with subsequent computational modeling of missing biological information (protein-coding genes and proteins). An important new feature of the self-training algorithm will be the utilization of protein level information to monitor and increase biological relevance of the models derived by the unsupervised iterative algorithm. Second, we will enhance the self-training algorithm developed earlier on a smaller scale and tested on fungal and other """"""""compact"""""""" eukaryotic genomes (such as Caenorhabditis elegans and Drosophila melanogaster) to work with most complex eukaryotic genomes. At this higher level of complexity we see species with host genes occupying just a small fraction of genome which can be inhomogeneous in GC composition, populated with transposable elements and pseudogenes (besides animal genomes, genomes of some fungal pathogens as well as human parasites and their vectors fall into this category). Third, for the human microbiome containing bacterial, archaeal, viral and fungal species, situated at yet another end of the genome in homogeneity spectrum, we will develop improved algorithms and tools for ab initio gene identification. This work will be done in close contact with sequencing and annotation groups from leading genome centers both in the US and abroad.
Rational systems biology, cancer cure, vaccine development, drug design, is impossible without understanding genomic DNA in human cell. Gene prediction is a cornerstone of biological interpretation of DNA sequence. The goal of this proposal is developing automatic and accurate gene prediction algorithms for the most complex genomic sequences important for human health.
Showing the most recent 10 out of 48 publications