The new direction for this project, in collaboration with Dr. Andrew Neuwald of the Institute for Genome Sciences and Department of Biochemistry & Molecular Biology at the University of Maryland School of Medicine, continued throughout this year.
The first aim of the work was to develop an improved program for the multiple alignment of large numbers of sequences. The strategy has several central features: (i) It employs a top-down alignment strategy that first identifies regions shared by all the input sequences, and then realigns closely related subgroups. This is key to escaping suboptimal traps, in which a set S of closely related but misaligned sequences resists change: when a sequence X from S is realigned individually, the remaining misaligned sequences of S pull X back into misalignment; (ii) It uses a Bayesian statistical measure of alignment quality, based on the minimum description length principle and on Dirichlet mixture priors. This measure favors more biologically realistic alignments than does, for example, the ad hoc but widely used sum-of-the-pairs scoring system; (iii) It infers position-specific gap penalties that favor insertions or deletions (indels) within each sequence at alignment positions where indels are invoked in other sequences. This favors the placement of insertions between conserved blocks, which can be understood as making up the proteins' structural core. When applied to large datasets, the program we have developed produces, on average, more biologically accurate alignments than widely used programs that have been considered the state of the art. A paper describing this work was published.
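As a rough illustration of feature (ii), the sketch below scores a single alignment column by its log marginal probability under a Dirichlet mixture prior. It is not the program's actual scoring function: the two mixture components, their weights, and the function names are hypothetical, chosen only to show why a column of similar residues receives a higher probability (and so a shorter description length) than a column of unrelated residues.

```python
import math

AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"


def log_marginal(counts, alpha):
    """Log probability of the observed residues under one Dirichlet component.

    The residue frequencies are integrated out, leaving a ratio of gamma
    functions (the Dirichlet-multinomial likelihood of an ordered column).
    """
    total_alpha = sum(alpha.values())
    total_count = sum(counts.values())
    result = math.lgamma(total_alpha) - math.lgamma(total_alpha + total_count)
    for aa, n in counts.items():
        result += math.lgamma(alpha[aa] + n) - math.lgamma(alpha[aa])
    return result


def log_mixture_marginal(column, components):
    """Log probability of a column under a weighted mixture of Dirichlets."""
    counts = {}
    for aa in column:
        counts[aa] = counts.get(aa, 0) + 1
    terms = [math.log(w) + log_marginal(counts, alpha) for w, alpha in components]
    top = max(terms)  # log-sum-exp for numerical stability
    return top + math.log(sum(math.exp(t - top) for t in terms))


# Hypothetical two-component prior (assumed values, not fitted to real data):
# one component concentrated on aliphatic residues, one flat component.
hydrophobic = {aa: (1.0 if aa in "ILVM" else 0.05) for aa in AMINO_ACIDS}
flat = {aa: 1.0 for aa in AMINO_ACIDS}
COMPONENTS = [(0.5, hydrophobic), (0.5, flat)]

conserved = log_mixture_marginal("LLLLIVLL", COMPONENTS)
diverse = log_mixture_marginal("LDKGSTPA", COMPONENTS)
print(f"conserved column: {conserved:.2f}  diverse column: {diverse:.2f}")
```

Under these assumed parameters the conserved hydrophobic column is assigned a higher log probability than the diverse one. A sum-of-the-pairs score, by contrast, has no comparable probabilistic interpretation for the column as a whole.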
A second aim of this work is to extend the method described above to a hierarchical multiple alignment model. Such a model is based on the fact that large protein superfamilies frequently have diversified to fulfill distinct functional roles within different subfamilies. Each subfamily has distinct structural constraints, which yield distinct amino acid frequency vectors at particular positions characteristic of that subfamily. Although, within a subfamily, the amino acids at different positions may be independent, the changes in frequency vectors across multiple positions characteristic of each subfamily yield the appearance of correlation between positions when a simple, non-hierarchical model of a superfamily is constructed. Earlier approaches have modeled these apparent correlations directly, using pairwise coupling terms, but we model them by constructing an explicit hierarchical model, with individual sequences assigned to distinct nodes within the hierarchy. We have applied the Minimum Description Length principle to ensure that the hierarchical models we construct have statistical support and do not overfit the data. A paper describing the first stage of this work has been submitted for publication.
Work on a third aim of this project was launched this year. The hierarchical models constructed by our approach include the explicit description of a set of distinguishing positions characteristic of each node in the hierarchy. When mapped onto available three-dimensional structures, these distinguishing positions often cluster together in space, and can aid in the development of specific hypotheses for the biological mechanisms underlying the diversification of protein subfamilies. We have begun work on developing appropriate measures for the clustering of distinguishing positions, and on their statistical assessment.
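The apparent correlation described under the second aim can be shown with a small exact calculation. The sketch assumes two hypothetical subfamilies of equal size, each with two independent positions and made-up residue frequencies; the function names are illustrative and are not part of the project's software. Within each subfamily the mutual information between the two positions is zero, but pooling the subfamilies into a single flat model gives a clearly positive value.

```python
import math
from itertools import product


def joint_independent(p, q):
    """Joint distribution of two independent positions with marginals p and q."""
    return {(a, b): p[a] * q[b] for a, b in product(p, q)}


def mix(joints, weights):
    """Weighted mixture of joint distributions, i.e. pooling the subfamilies."""
    pooled = {}
    for joint, w in zip(joints, weights):
        for pair, prob in joint.items():
            pooled[pair] = pooled.get(pair, 0.0) + w * prob
    return pooled


def mutual_information(joint):
    """Mutual information, in bits, between the two positions."""
    pa, pb = {}, {}
    for (a, b), prob in joint.items():
        pa[a] = pa.get(a, 0.0) + prob
        pb[b] = pb.get(b, 0.0) + prob
    return sum(
        prob * math.log2(prob / (pa[a] * pb[b]))
        for (a, b), prob in joint.items()
        if prob > 0
    )


# Assumed frequencies: subfamily A prefers L at position 1 and D at position 2,
# subfamily B prefers V and E. Positions are independent within each subfamily.
subfamily_a = joint_independent({"L": 0.9, "V": 0.1}, {"D": 0.9, "E": 0.1})
subfamily_b = joint_independent({"L": 0.1, "V": 0.9}, {"D": 0.1, "E": 0.9})
superfamily = mix([subfamily_a, subfamily_b], [0.5, 0.5])

mi_a = mutual_information(subfamily_a)
mi_b = mutual_information(subfamily_b)
mi_pooled = mutual_information(superfamily)
print(f"within A: {mi_a:.3f}  within B: {mi_b:.3f}  pooled: {mi_pooled:.3f}")
```

With these assumed frequencies the pooled mutual information is about 0.32 bits, even though neither subfamily contains any dependence between the positions. Assigning sequences to separate nodes accounts for the same signal without pairwise coupling terms.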

Support Year: 25
Fiscal Year: 2016
Name: National Library of Medicine
Neuwald, Andrew F; Aravind, L; Altschul, Stephen F (2018) Inferring joint sequence-structural determinants of protein functional specificity. Elife 7:
Neuwald, Andrew F; Altschul, Stephen F (2016) Inference of Functionally-Relevant N-acetyltransferase Residues Based on Statistical Correlations. PLoS Comput Biol 12:e1005294
Neuwald, Andrew F; Altschul, Stephen F (2016) Bayesian Top-Down Protein Sequence Alignment with Inferred Position-Specific Gap Penalties. PLoS Comput Biol 12:e1004936
Yu, Yi-Kuo; Capra, John A; Stojmirović, Aleksandar et al. (2015) Log-odds sequence logos. Bioinformatics 31:324-31
Nguyen, Viet-An; Boyd-Graber, Jordan; Altschul, Stephen F (2013) Dirichlet mixtures, the Dirichlet process, and the structure of protein space. J Comput Biol 20:1-18
Yu, Yi-Kuo; Altschul, Stephen F (2011) The complexity of the Dirichlet model for multiple alignment data. J Comput Biol 18:925-39
Ye, Xugang; Yu, Yi-Kuo; Altschul, Stephen F (2011) On the inference of Dirichlet mixture priors for protein sequence comparison. J Comput Biol 18:941-54
Ye, Xugang; Wang, Guoli; Altschul, Stephen F (2011) An assessment of substitution scores for protein profile-profile comparison. Bioinformatics 27:3356-63
Altschul, Stephen F; Wootton, John C; Zaslavsky, Elena et al. (2010) The construction and use of log-odds substitution scores for multiple sequence alignment. PLoS Comput Biol 6:e1000852
Ye, Xugang; Yu, Yi-Kuo; Altschul, Stephen F (2010) Compositional adjustment of Dirichlet mixture priors. J Comput Biol 17:1607-20