The primary focus this year was the assessment of substitution scoring systems for aligning protein profiles to one another. Pairwise protein sequence alignments are generally evaluated using scores defined as the sum of "substitution scores" for aligning amino acids to one another and "gap scores" for aligning runs of amino acids in one sequence to null characters inserted into the other. Protein "profiles" may be abstracted from multiple alignments of protein sequences, and substitution and gap scores have been generalized to the alignment of such profiles either to single sequences or to other profiles. Although there is widespread agreement on the general form substitution scores should take for profile-sequence alignment, little consensus has been reached on how best to construct profile-profile substitution scores, and a large number of such scoring systems have been proposed. We assessed a variety of these substitution scores using several sets of "gold standard" multiple alignments. For our evaluation, we calculated the probability that a profile column yields a higher substitution score when aligned to a related column than when aligned to an unrelated one. We also considered the same measure applied to sets of two or three adjacent columns. This simple approach had two advantages: it did not depend primarily upon the gold standard alignment columns with the weakest empirical support, and it did not require fitting gap and offset costs for each substitution scoring system studied. No substitution scoring system emerged as superior in all our tests, but two showed consistently strong behavior: a generalization of profile-sequence scores similar to those used in the Compass alignment program, and the recently proposed Bayesian Integral Log-odds (BILD) scores. A secondary focus was on issues related to the Dirichlet mixture model, which is used to analyze protein sequences.
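The evaluation measure described above, the probability that a related column pair outscores an unrelated one, can be estimated by comparing every related score with every unrelated score. The following is a minimal sketch of that idea, not the project's actual code: the function name, the list-of-floats inputs, and the choice to count ties as half a win are all assumptions made for illustration.

```python
from itertools import product


def prob_related_scores_higher(related_scores, unrelated_scores):
    """Estimate P(related score > unrelated score) over all score pairs.

    related_scores:   substitution scores for column pairs aligned in the
                      gold standard (hypothetical input format).
    unrelated_scores: substitution scores for column pairs not aligned there.
    Ties are counted as half a win; this is an assumption, not from the source.
    """
    if not related_scores or not unrelated_scores:
        raise ValueError("both score lists must be non-empty")

    wins = 0.0
    for r, u in product(related_scores, unrelated_scores):
        if r > u:
            wins += 1.0
        elif r == u:
            wins += 0.5
    return wins / (len(related_scores) * len(unrelated_scores))


if __name__ == "__main__":
    # Made-up scores, for illustration only.
    related = [2.0, 1.0, 3.0]
    unrelated = [0.0, 1.0, 2.5]
    print(prob_related_scores_higher(related, unrelated))
```

Because the measure depends only on pairwise score comparisons, it needs no gap or offset parameters. One plausible way to handle sets of two or three adjacent columns would be to pass in scores already combined over each set, though how such scores were combined is not stated here.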
The Dirichlet mixture model was introduced to protein sequence analysis by Haussler's group at UCSC. In brief, this model imagines that a particular position in a protein family is described by a multinomial distribution on the set of amino acids. Although the multinomial for a particular position may be unique, the study of many protein families reveals that certain regions of multinomial space are much more heavily populated than others. This general knowledge may be summarized by a "Dirichlet mixture prior", which is a probability density over multinomial space that lends itself to easy analysis. Our research on Dirichlet mixture priors this year centered on the question of how best to derive such priors from a set of multiple alignment data. Our previous work had applied the Minimum Description Length principle and a Gibbs sampling algorithm to this problem. Work begun this year applied the Dirichlet Process to this problem; preliminary results suggest that this approach leads to much improved mixtures with many more components.
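To illustrate why a Dirichlet mixture prior "lends itself to easy analysis", the sketch below computes the standard posterior mean estimate of a column's amino acid probabilities from observed counts. It is an illustrative sketch under stated assumptions, not the project's code: the function names and the toy three-letter alphabet in the usage example are hypothetical, and the mixture parameters are made up rather than fitted to alignment data.

```python
import math


def log_beta(alpha):
    """Log of the multivariate beta function B(alpha)."""
    return sum(math.lgamma(a) for a in alpha) - math.lgamma(sum(alpha))


def posterior_mean(counts, weights, alphas):
    """Posterior mean letter probabilities under a Dirichlet mixture prior.

    counts:  observed letter counts for one alignment column.
    weights: mixture weights q_j (positive, summing to 1).
    alphas:  one Dirichlet parameter vector per component.
    """
    # Posterior component weight is proportional to
    # q_j * B(alpha_j + counts) / B(alpha_j); computed in log space.
    log_w = []
    for q, alpha in zip(weights, alphas):
        updated = [a + n for a, n in zip(alpha, counts)]
        log_w.append(math.log(q) + log_beta(updated) - log_beta(alpha))
    shift = max(log_w)
    w = [math.exp(x - shift) for x in log_w]
    total = sum(w)
    w = [x / total for x in w]

    # Mix the per-component posterior means (alpha_ji + n_i) / (|alpha_j| + |n|).
    n_total = sum(counts)
    probs = [0.0] * len(counts)
    for w_j, alpha in zip(w, alphas):
        denom = sum(alpha) + n_total
        for i, (a, n) in enumerate(zip(alpha, counts)):
            probs[i] += w_j * (a + n) / denom
    return probs


if __name__ == "__main__":
    # Toy 3-letter alphabet with made-up parameters, for illustration only.
    print(posterior_mean([5, 0, 0], [0.5, 0.5], [[10, 1, 1], [1, 1, 10]]))
```

With no observed counts the estimate reduces to the prior mean, and as counts accumulate the components most consistent with the data dominate. A real application would use the 20-letter amino acid alphabet and a mixture derived from multiple alignment data.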

Support Year: 21
Fiscal Year: 2012
Total Cost: $260,305
Name: National Library of Medicine
Neuwald, Andrew F; Altschul, Stephen F (2016) Bayesian Top-Down Protein Sequence Alignment with Inferred Position-Specific Gap Penalties. PLoS Comput Biol 12:e1004936
Yu, Yi-Kuo; Capra, John A; Stojmirović, Aleksandar et al. (2015) Log-odds sequence logos. Bioinformatics 31:324-31
Nguyen, Viet-An; Boyd-Graber, Jordan; Altschul, Stephen F (2013) Dirichlet mixtures, the Dirichlet process, and the structure of protein space. J Comput Biol 20:1-18
Yu, Yi-Kuo; Altschul, Stephen F (2011) The complexity of the Dirichlet model for multiple alignment data. J Comput Biol 18:925-39
Ye, Xugang; Yu, Yi-Kuo; Altschul, Stephen F (2011) On the inference of Dirichlet mixture priors for protein sequence comparison. J Comput Biol 18:941-54
Ye, Xugang; Wang, Guoli; Altschul, Stephen F (2011) An assessment of substitution scores for protein profile-profile comparison. Bioinformatics 27:3356-63
Altschul, Stephen F; Wootton, John C; Zaslavsky, Elena et al. (2010) The construction and use of log-odds substitution scores for multiple sequence alignment. PLoS Comput Biol 6:e1000852
Ye, Xugang; Yu, Yi-Kuo; Altschul, Stephen F (2010) Compositional adjustment of Dirichlet mixture priors. J Comput Biol 17:1607-20