The general objective of the proposed project is to develop an improved computer algorithm for predicting gene locations in newly sequenced DNA. The problem is well known but remains far from solved. The new approach uses both splicing-site and coding/noncoding DNA sequence information in the form of stochastic models. The project has four specific aims: 1) The most efficient type of nonstationary Markov chain model of protein-coding regions (exons) will be chosen by goodness-of-fit testing on previously compiled learning sets of eukaryotic DNA. Likewise, the most efficient type of ordinary Markov chain model of noncoding DNA sequences (introns) will be determined from analysis of the intron learning set. 2) An improved set of parameters for calculating the discrimination energy, which estimates the relative activity of a splicing site, will be extracted from an expanded learning set of known splicing sites. 3) The splicing-site stochastic models and the models of coding/noncoding DNA sequences (joined in a Bayes-type algorithm that computes the coding potential of a DNA fragment) will be combined and enhanced into a new multistage method for identifying gene locations. 4) After the method's accuracy has been evaluated, its decision thresholds scaled, its computational performance improved, and an interactive environment created for it, the software will be made available to the scientific community.
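To make the Bayes-type combination of coding and noncoding models in aim 3 more concrete, the following is a minimal sketch, not the project's actual algorithm. It assumes first-order chains, treats the nonstationary exon model as a three-periodic chain (one transition table per codon position), uses a uniform initial nucleotide distribution, equal priors over the three reading frames, and add-one pseudocounts. The abstract leaves the model type and order to be chosen by goodness-of-fit testing, so all of these choices, along with the function names `train_markov`, `log_likelihood`, and `coding_posterior`, are illustrative assumptions. Sequences are assumed to contain only A, C, G, and T.

```python
import math

ALPHABET = "ACGT"


def train_markov(seqs, period=1, pseudo=1.0):
    """Estimate first-order transition probabilities, one table per phase.

    period=1 gives an ordinary (homogeneous) chain, used here for introns.
    period=3 gives a phase-dependent chain whose transitions vary with codon
    position, used here as an assumed stand-in for the nonstationary exon model.
    Training sequences are assumed to start at phase 0.
    """
    counts = [
        {a: {b: pseudo for b in ALPHABET} for a in ALPHABET}
        for _ in range(period)
    ]
    for s in seqs:
        for i in range(1, len(s)):
            counts[i % period][s[i - 1]][s[i]] += 1
    # Normalize each row of counts into a probability distribution.
    return [
        {a: {b: row[b] / sum(row.values()) for b in ALPHABET}
         for a, row in phase.items()}
        for phase in counts
    ]


def log_likelihood(seq, tables, offset=0):
    """Log-probability of seq under the chain, starting at the given phase."""
    period = len(tables)
    ll = math.log(0.25)  # assumed uniform distribution for the first base
    for i in range(1, len(seq)):
        ll += math.log(tables[(i + offset) % period][seq[i - 1]][seq[i]])
    return ll


def coding_posterior(seq, coding, noncoding, prior_coding=0.5):
    """Posterior probability that seq is coding, summed over three frames."""
    frame_prior = prior_coding / 3  # assumed equal prior for each frame
    logs = [
        math.log(frame_prior) + log_likelihood(seq, coding, frame)
        for frame in range(3)
    ]
    logs.append(math.log(1 - prior_coding) + log_likelihood(seq, noncoding))
    # Subtract the maximum before exponentiating to avoid underflow.
    top = max(logs)
    weights = [math.exp(x - top) for x in logs]
    return sum(weights[:3]) / sum(weights)
```

With toy learning sets such as `coding = train_markov(exon_seqs, period=3)` and `noncoding = train_markov(intron_seqs, period=1)`, calling `coding_posterior(fragment, coding, noncoding)` returns a value between 0 and 1 that plays the role of the coding potential of a fragment. Comparing that value against a decision threshold corresponds to the threshold scaling mentioned in aim 4.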