The goal of the project is to build more accurate and powerful DNA sequence interpretation algorithms, building on the experience and ideas of the proven GeneMark and GeneMark.hmm methods. We plan to improve the quality of gene finding in prokaryotic genomes, specifically the reliable and accurate prediction of gene starts and the detection of frameshift sequencing errors. We also plan to develop an iterative machine-learning procedure for deriving all the models needed for precise gene prediction and annotation from totally anonymous prokaryotic sequences. For eukaryotic species, we will improve the accuracy of the ab initio method GeneMark.hmm by building more accurate models of splice sites and initiation/termination sites, and we will address the problem of accurately finding intergenic regions containing polyadenylation sites and promoters. On the basis of GeneMark.hmm, we plan to develop an integrated gene-finding approach by "projecting" pieces of diverse extrinsic evidence onto the DNA level, then translating them into DNA patterns and combining these patterns with statistical patterns of coding and non-coding DNA sequence within a generalized HMM. The most intriguing sources of this additional information are evolutionarily conserved regions in the DNA sequences of closely related species, functional motifs in protein sequences, and protein sequence patterns reflecting three-dimensional structural motifs. All of these newly developed methods, as well as several others mentioned in the proposal, will deal with anonymous DNA, for which interpretation is increasingly needed in the post-genomic era.