Work this year focused on scoring systems for multiple alignment. Most pairwise and multiple sequence alignment programs seek alignments with optimal scores. Central to defining such scores is the selection of a set of substitution scores for aligned amino acids or nucleotides. Substitution scores for local pairwise alignment are implicitly of log-odds form, comparing the probabilities of aligning two letters under models of relatedness and non-relatedness, and the best pairwise substitution scores are explicitly constructed in this way. We have developed ideas, based on the minimum description length principle, for extending this formalism to multiple alignments. Most simply, Bayesian methods can be used to derive "BILD" substitution scores from prior distributions describing columns of related letters. This approach has previously been used only to define scores for aligning individual sequences to sequence profiles, but it has much broader applicability. We have employed BILD scores in Gibbs sampling optimization procedures, and shown that they yield improved performance in constructing biologically accurate alignments. We have developed pilot programs for constructing gapped multiple alignments using BILD scores. Using artificial sequences, we have shown that these programs outperform earlier programs at detecting domain boundaries. We have also applied them to the recognition and annotation of DNA-binding domains in Apicomplexan proteins.
In related work, we have studied Dirichlet mixture models in the context of non-standard sequence composition. A Dirichlet mixture with M components over an alphabet of L letters has M*(L+1)-1 free parameters. If M = L/2, this is exactly as many free parameters as a symmetric pairwise substitution matrix has. While each Dirichlet mixture implies a unique such matrix, we have shown that multiple mixtures can map to the same matrix, and that some substitution matrices may not correspond to any Dirichlet mixture.
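The parameter count and the mapping from a Dirichlet mixture to a pairwise matrix can be illustrated with a minimal Python sketch. It rests on several assumptions that are not taken from the project itself: the column score is taken to be the log ratio of a column's marginal probability under a Dirichlet mixture prior to its probability under independent background frequencies, the symmetric matrix is counted as L(L+1)/2 distinct target frequencies minus one for normalization, and the four-letter mixture, its weights, and all function names are hypothetical values chosen only for illustration.

```python
import math

def mixture_free_parameters(num_components, alphabet_size):
    """M components with L Dirichlet parameters each, plus M - 1 free weights."""
    return num_components * (alphabet_size + 1) - 1

def symmetric_matrix_free_parameters(alphabet_size):
    """Distinct entries of a symmetric L x L target-frequency matrix, minus one
    because the frequencies sum to 1 (assumed parameterization)."""
    return alphabet_size * (alphabet_size + 1) // 2 - 1

def log_column_probability(counts, weights, alphas):
    """Log marginal probability of an ordered column with the given letter
    counts, integrating the letter probabilities over a Dirichlet mixture."""
    n = sum(counts)
    terms = []
    for weight, alpha in zip(weights, alphas):
        total = sum(alpha)
        log_p = math.lgamma(total) - math.lgamma(total + n)
        for count, a in zip(counts, alpha):
            log_p += math.lgamma(a + count) - math.lgamma(a)
        terms.append(math.log(weight) + log_p)
    peak = max(terms)  # log-sum-exp over components for numerical stability
    return peak + math.log(sum(math.exp(t - peak) for t in terms))

def column_log_odds_bits(counts, weights, alphas, background):
    """Log-odds score of a column, in bits: related model vs. background."""
    log_null = sum(c * math.log(p) for c, p in zip(counts, background) if c)
    log_related = log_column_probability(counts, weights, alphas)
    return (log_related - log_null) / math.log(2)

def implied_pairwise_scores(weights, alphas, background):
    """Pairwise matrix implied by the mixture: score every two-letter column."""
    size = len(background)
    scores = []
    for i in range(size):
        row = []
        for j in range(size):
            counts = [0] * size
            counts[i] += 1
            counts[j] += 1
            row.append(column_log_odds_bits(counts, weights, alphas, background))
        scores.append(row)
    return scores

# Hypothetical toy mixture over a 4-letter alphabet with M = L/2 = 2 components.
WEIGHTS = [0.5, 0.5]
ALPHAS = [[2.0, 1.0, 1.0, 2.0], [1.0, 2.0, 2.0, 1.0]]
# This toy mixture is symmetric, so its mean composition is uniform.
BACKGROUND = [0.25, 0.25, 0.25, 0.25]

PAIR_SCORES = implied_pairwise_scores(WEIGHTS, ALPHAS, BACKGROUND)
```

In this toy case, both counting functions return 9 for L = 4 and M = 2, and the implied matrix is symmetric with positive scores for identical letters. The same `column_log_odds_bits` function scores columns of any depth, which is the sense in which a single prior can serve pairwise and multiple alignment alike.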
A Dirichlet mixture for protein sequence analysis is generally constructed from a particular set of proteins, and therefore implies a particular background amino acid composition. Such a mixture is expected to be suboptimal for comparing proteins whose composition differs significantly from that background. We have described a sensible and efficient method for adjusting the parameters of a Dirichlet mixture so that they are consistent with any specified composition.
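The composition implied by a mixture can be made concrete with a short Python sketch. It does not reproduce the adjustment method described above. It only computes the background composition a mixture implies (the weighted average of its components' mean vectors) and measures how far that is from a target composition. The mixture, the target composition, and the function names are hypothetical, and relative entropy is used here merely as one convenient mismatch measure.

```python
import math

def implied_composition(weights, alphas):
    """Background letter frequencies implied by a Dirichlet mixture:
    the weight-averaged mean of each component."""
    size = len(alphas[0])
    return [
        sum(w * alpha[i] / sum(alpha) for w, alpha in zip(weights, alphas))
        for i in range(size)
    ]

def relative_entropy_bits(target, implied):
    """Relative entropy (in bits) of the target composition with respect to
    the composition implied by the mixture."""
    return sum(t * math.log2(t / q) for t, q in zip(target, implied) if t > 0)

# Hypothetical toy mixture over a 4-letter alphabet.
WEIGHTS = [0.5, 0.5]
ALPHAS = [[2.0, 1.0, 1.0, 2.0], [1.0, 2.0, 2.0, 1.0]]

# Hypothetical biased target composition, for illustration only.
TARGET = [0.4, 0.1, 0.1, 0.4]

IMPLIED = implied_composition(WEIGHTS, ALPHAS)
MISMATCH = relative_entropy_bits(TARGET, IMPLIED)
```

A mismatch of zero would mean the mixture already agrees with the target composition. A clearly positive value signals the kind of inconsistency that a compositional adjustment is meant to remove.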

Support Year: 19
Fiscal Year: 2010
Total Cost: $391,740
Name: National Library of Medicine
Neuwald, Andrew F; Aravind, L; Altschul, Stephen F (2018) Inferring joint sequence-structural determinants of protein functional specificity. Elife 7:
Neuwald, Andrew F; Altschul, Stephen F (2016) Inference of Functionally-Relevant N-acetyltransferase Residues Based on Statistical Correlations. PLoS Comput Biol 12:e1005294
Neuwald, Andrew F; Altschul, Stephen F (2016) Bayesian Top-Down Protein Sequence Alignment with Inferred Position-Specific Gap Penalties. PLoS Comput Biol 12:e1004936
Yu, Yi-Kuo; Capra, John A; Stojmirović, Aleksandar et al. (2015) Log-odds sequence logos. Bioinformatics 31:324-31
Nguyen, Viet-An; Boyd-Graber, Jordan; Altschul, Stephen F (2013) Dirichlet mixtures, the Dirichlet process, and the structure of protein space. J Comput Biol 20:1-18
Ye, Xugang; Yu, Yi-Kuo; Altschul, Stephen F (2011) On the inference of Dirichlet mixture priors for protein sequence comparison. J Comput Biol 18:941-54
Ye, Xugang; Wang, Guoli; Altschul, Stephen F (2011) An assessment of substitution scores for protein profile-profile comparison. Bioinformatics 27:3356-63
Yu, Yi-Kuo; Altschul, Stephen F (2011) The complexity of the Dirichlet model for multiple alignment data. J Comput Biol 18:925-39
Altschul, Stephen F; Wootton, John C; Zaslavsky, Elena et al. (2010) The construction and use of log-odds substitution scores for multiple sequence alignment. PLoS Comput Biol 6:e1000852
Ye, Xugang; Yu, Yi-Kuo; Altschul, Stephen F (2010) Compositional adjustment of Dirichlet mixture priors. J Comput Biol 17:1607-20