High-throughput analysis has become an essential tool in biomedical informatics and genomic research. One aspect of this research paradigm that has not been heavily examined is how errors generated in the early stages of these analyses propagate through downstream analyses. For many complex, high-throughput analyses of genome data, sequence alignment is among the very first steps. Alignment produces a hypothesized set of site homologies; however, these hypotheses are often taken as fact and used in all subsequent analyses as if they were correct. It is well known that alignments become less accurate as evolutionary divergence increases; how this error affects genome analyses in biomedical informatics is poorly understood. This proposal is a follow-up to a pilot study that examined the use of computational sequence simulation to study the effects of alignment error and its propagation through comparative and functional genomic sequence analysis.
The specific aims of this study are (1) to examine more thoroughly the effect of alignment errors on phylogenetic accuracy; (2) to examine the accuracy of ancestral sequence reconstruction methods and determine how errors in both the alignment and the phylogenetic hypothesis may affect these reconstructions; (3) to explore a novel approach for determining the composition of gene families that makes use of evolutionary principles, including phylogeny and ancestral sequence reconstruction; and (4) to improve a sequence simulation program by adding (a) better insertion/deletion models, including those based on both theory and empirical data; (b) more realistic simulation options for RNA, coding DNA and protein sequences; (c) full genome simulations, including segmental variation of mutation parameters and genome-scale processes; and (d) support for all major computational platforms, including Windows, Linux and Macintosh.