A steadily increasing proportion of biomedical research is conducted on organisms for which genomic sequence is available. For many research questions, however, a genome is important primarily because of its protein products. Thus, a critical question in genome analysis is: What are the structures of all the genes and the exact amino acid sequences of their translation products? Despite significant contributions from experimental biology, high-throughput sequencing, and bioinformatics, we are still far from being able to answer this question accurately. The proposed research aims to improve gene-structure prediction by exploiting patterns of evolutionary conservation.
Aim 1 focuses on developing probability models for exploiting genomic homology to improve gene-structure prediction. A novel aspect of the proposed models is their use of "conservation sequence" to represent the degree and pattern of evolutionary conservation at each nucleotide in the genome to be annotated. A conservation sequence is a synthesis of potentially overlapping local alignments into one sequence. Our probability models build on the Hidden Markov Model (HMM) approach used in state-of-the-art gene-structure prediction systems.
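To make the idea of a conservation sequence concrete, the following is a minimal sketch of one way overlapping local alignments could be collapsed into a single per-nucleotide sequence. It is an illustration under stated assumptions, not the proposed models themselves: the three-symbol alphabet ('|' for an aligned match, ':' for an aligned mismatch or gap, '.' for unaligned), the rule that the highest-scoring alignment wins wherever alignments overlap, and the names LocalAlignment and conservation_sequence are all hypothetical choices made for this example.

```python
from typing import List, NamedTuple


class LocalAlignment(NamedTuple):
    """One local alignment between the target genome and an informant genome.

    Hypothetical record layout, assumed for illustration only.
    """
    start: int       # 0-based start coordinate in the target genome
    target: str      # aligned target row, '-' marks a gap
    informant: str   # aligned informant row, '-' marks a gap
    score: float     # alignment score; higher is better


def conservation_sequence(target_len: int,
                          alignments: List[LocalAlignment]) -> str:
    """Collapse possibly overlapping local alignments into one symbol per
    target nucleotide.

    Assumed alphabet: '|' match, ':' mismatch or informant gap, '.' unaligned.
    Assumed overlap rule: the highest-scoring alignment covering a position
    determines its symbol.
    """
    symbols = ['.'] * target_len
    # Visit alignments from best to worst so that better alignments
    # claim positions first and are never overwritten.
    for aln in sorted(alignments, key=lambda a: a.score, reverse=True):
        pos = aln.start
        for t, q in zip(aln.target, aln.informant):
            if t == '-':
                # Gap in the target row: no target nucleotide is consumed.
                continue
            if pos >= target_len:
                break
            if symbols[pos] == '.':
                symbols[pos] = '|' if t == q else ':'
            pos += 1
    return ''.join(symbols)


if __name__ == "__main__":
    alns = [
        LocalAlignment(start=2, target="ACGT", informant="ACTT", score=50.0),
        LocalAlignment(start=4, target="GTAA", informant="GTAA", score=30.0),
    ]
    print(conservation_sequence(10, alns))
```

A sequence of this kind has the same length as the target genome, so an HMM-based gene predictor could in principle treat it as a second observation track alongside the DNA sequence, with state-specific emission probabilities for each conservation symbol.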
Aim 2 focuses on developing probability models for improving gene prediction by exploiting cDNA and EST alignments. The fundamental approach is similar to that of Aim 1. The most important new question is how to combine information from transcript alignments with information from genomic homology in a way that does not count the same evidence twice.
Aim 3 focuses on analysis of vertebrate genomes using homology from multiple vertebrate genomes. The best method is expected to depend on the evolutionary distances among the genomes. Our investigations will focus on gene-structure prediction in human, with homology provided by (a) both mouse and pufferfish, and (b) both mouse and rat. Our gene-structure predictions will be provided to the research community through our web site and that of our collaborators at Ensembl. The ability to predict complete gene structures reliably would constitute significant progress in high-throughput biology. Potential biomedical applications include (1) identifying novel protein families that could serve as drug targets, and (2) accelerating positional cloning projects for the identification of disease-related genes.