Comparative sequence analysis plays an increasingly prominent role in genome annotation, drug target discovery, biomolecule engineering, and medical genetics. The gold standard of comparative sequence analysis is phylogenetic character analysis, which involves constructing a multiple sequence alignment (MSA) of sequence family members, inferring a phylogenetic tree from the MSA, and analyzing reconstructed changes in sequence features of interest (active site residues, splice junctions, and so on). In this project, an automated system to facilitate evolutionary analysis will be developed, tested, and applied to unresolved issues in the evolution of spliceosomal introns. A software pipeline to assemble sequence family data sets (sequences, MSAs, intron sites, trees) for eukaryotic nuclear protein-coding genes will be developed and tested. Data sets produced in a NEXUS-based standard exchange format will be loaded into SPAN, a database/analysis system that will provide (i) a relational schema suitable for storing sequence family data sets; (ii) taxonomic query functions based on a comprehensive taxonomic hierarchy; (iii) reconstruction of evolutionary changes; (iv) query and retrieval based on tree topology, branch lengths, and reconstructions; and (v) explicit treatment of quality or uncertainty in sequence annotations and evolutionary inferences. Using the software pipeline and SPAN, a series of databases, each containing 20 to 200 sequence families, will be used to evaluate the role of targeted intron gain, and more generally the role of recent events of intron gain and loss, in accounting for non-randomness in the distribution of introns in genes and genomes. After a refined estimate of the nucleotide preferences of intron gain has been obtained from evolutionary reconstructions, the implications of targeted gain will be evaluated with respect to biases in intron phase frequencies, amino acid composition near intron sites, and protein structure near intron sites.
The proposed research will resolve long-standing issues concerning the evolutionary history of split genes, and the software systems developed will represent a major methodological advance with broad implications for bioinformatics.
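As a rough illustration of what "reconstruction of evolutionary changes" can mean for intron sites, the sketch below applies Fitch parsimony to the presence or absence of a single intron site on a small tree. The abstract does not say which reconstruction method SPAN uses, so the choice of Fitch parsimony is an assumption, and the tree topology, taxon names, and presence/absence values are made up for the example.

```python
# Hypothetical illustration only: Fitch parsimony for one intron site
# (1 = intron present, 0 = intron absent). Not the project's actual method.
# A tree is a nested tuple (left, right); a leaf is a taxon name (str).

def fitch(tree, states):
    """Return (candidate_state_set, minimum_changes) for the subtree `tree`."""
    if isinstance(tree, str):
        # Leaf: the candidate set is just the observed state.
        return {states[tree]}, 0
    left_set, left_changes = fitch(tree[0], states)
    right_set, right_changes = fitch(tree[1], states)
    common = left_set & right_set
    if common:
        # Children agree on at least one state: no extra change needed here.
        return common, left_changes + right_changes
    # Children disagree: keep both possibilities and count one change.
    return left_set | right_set, left_changes + right_changes + 1

# Made-up topology and made-up presence/absence data.
tree = ((("human", "mouse"), "fly"), ("yeast", "plant"))
intron_present = {"human": 1, "mouse": 1, "fly": 0, "yeast": 0, "plant": 0}

root_set, n_changes = fitch(tree, intron_present)
print(root_set, n_changes)
```

For this toy input the root is reconstructed as lacking the intron and a single change is required, which is consistent with one gain on the branch leading to human and mouse. Fitch parsimony weights gain and loss equally; a study of targeted intron gain would likely need a model in which the two rates can differ.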
De Kee, Danny W; Gopalan, Vivek; Stoltzfus, Arlin (2007) A sequence-based model accounts largely for the relationship of intron positions to protein structural features. Mol Biol Evol 24:2158-68
Hladish, Thomas; Gopalan, Vivek; Liang, Chengzhi et al. (2007) Bio::NEXUS: a Perl API for the NEXUS format for comparative biological data. BMC Bioinformatics 8:191
Gopalan, Vivek; Qiu, Wei-Gang; Chen, Michael Z et al. (2006) Nexplorer: phylogeny-based exploration of sequence family data. Bioinformatics 22:120-1
Stoltzfus, Arlin (2006) Mutation-biased adaptation in a protein NK model. Mol Biol Evol 23:1852-62
Qiu, Wei-Gang; Schisler, Nick; Stoltzfus, Arlin (2004) The evolutionary gain of spliceosomal introns: sequence and phase preferences. Mol Biol Evol 21:1252-63