Comparative sequence analysis plays an increasingly prominent role in genome annotation, drug target discovery, biomolecule engineering, and medical genetics. The gold standard of comparative sequence analysis is phylogenetic character analysis: construction of a multiple sequence alignment (MSA) of sequence family members, inference of a phylogenetic tree from the MSA, and analysis of reconstructed changes in sequence features of interest (active site residues, splice junctions, and so on). In this project, an automated system to facilitate evolutionary analysis will be developed, tested, and applied to unresolved issues in the evolution of spliceosomal introns. A software pipeline will be developed and tested to assemble sequence family data sets (sequences, MSAs, intron sites, trees) for eukaryotic nuclear protein-coding genes. Data sets produced in a NEXUS-based standard exchange format will be loaded into SPAN, a database and analysis system that will provide (i) a relational schema suitable for storing sequence family data sets; (ii) taxonomic query functions based on a comprehensive taxonomic hierarchy; (iii) reconstruction of evolutionary changes; (iv) query and retrieval based on tree topology, branch lengths, and reconstructions; and (v) explicit treatment of quality or uncertainty in sequence annotations and evolutionary inferences. Using the software pipeline and SPAN, a series of databases with 20 to 200 sequence families will be used to evaluate the role of targeted intron gain, and more generally the role of recent intron gain and loss events, in accounting for non-randomness in the distribution of introns in genes and genomes. After a refined estimate of the nucleotide preferences of intron gain is obtained from evolutionary reconstructions, the implications of targeted gain will be evaluated with respect to biases in intron phase frequencies, amino acid composition near intron sites, and protein structure near intron sites.
The proposed research will resolve long-standing issues concerning the evolutionary history of split genes, and the software systems developed will represent a major methodological advance with broad implications for bioinformatics.
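The abstract refers to biases in intron phase frequencies. As a minimal sketch of what that quantity means, and not the project's own code, the Python below assumes the usual convention that an intron's phase is the number of coding nucleotides preceding it, modulo 3 (phase 0 falls between codons, phases 1 and 2 fall within a codon). The function names `intron_phase` and `phase_frequencies`, the input convention, and the example offsets are all hypothetical.

```python
from collections import Counter


def intron_phase(cds_offset: int) -> int:
    """Return the phase (0, 1, or 2) of an intron.

    cds_offset is the number of coding nucleotides that precede the
    intron in the spliced coding sequence (an input convention assumed
    for this sketch, not taken from the project).
    """
    if cds_offset < 0:
        raise ValueError("cds_offset must be non-negative")
    return cds_offset % 3


def phase_frequencies(cds_offsets):
    """Return the fraction of introns falling in each phase."""
    counts = Counter(intron_phase(offset) for offset in cds_offsets)
    total = sum(counts.values())
    if total == 0:
        # No introns: report zero frequency for every phase.
        return {0: 0.0, 1: 0.0, 2: 0.0}
    return {phase: counts.get(phase, 0) / total for phase in (0, 1, 2)}


# Made-up offsets for illustration only; not data from the project.
example_offsets = [30, 61, 92, 120, 150, 211]
freqs = phase_frequencies(example_offsets)
```

A test of phase bias would compare such observed frequencies with those expected under a model of intron gain, which is the kind of comparison the abstract proposes to make using evolutionary reconstructions.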

Agency: National Institutes of Health (NIH)
Institute: National Library of Medicine (NLM)
Type: Research Project (R01)
Project #: 1R01LM007218-01A1
Application #: 6547213
Study Section: Genome Study Section (GNM)
Program Officer: Ye, Jane
Project Start: 2002-09-15
Project End: 2005-09-14
Budget Start: 2002-09-15
Budget End: 2003-09-14
Support Year: 1
Fiscal Year: 2002
Total Cost: $260,800
Indirect Cost:

Name: University of MD Biotechnology Institute
Department:
Type: Organized Research Units
DUNS #:
City: Baltimore
State: MD
Country: United States
Zip Code: 21202

De Kee, Danny W; Gopalan, Vivek; Stoltzfus, Arlin (2007) A sequence-based model accounts largely for the relationship of intron positions to protein structural features. Mol Biol Evol 24:2158-68
Hladish, Thomas; Gopalan, Vivek; Liang, Chengzhi et al. (2007) Bio::NEXUS: a Perl API for the NEXUS format for comparative biological data. BMC Bioinformatics 8:191
Stoltzfus, Arlin (2006) Mutation-biased adaptation in a protein NK model. Mol Biol Evol 23:1852-62
Gopalan, Vivek; Qiu, Wei-Gang; Chen, Michael Z et al. (2006) Nexplorer: phylogeny-based exploration of sequence family data. Bioinformatics 22:120-1
Qiu, Wei-Gang; Schisler, Nick; Stoltzfus, Arlin (2004) The evolutionary gain of spliceosomal introns: sequence and phase preferences. Mol Biol Evol 21:1252-63