The current direction of this project, in collaboration with Dr. Andrew Neuwald of the Institute for Genome Sciences and Department of Biochemistry & Molecular Biology at the University of Maryland School of Medicine, continued throughout this year. Previous focuses had been the development of an improved method for multiple alignment that could identify the common elements shared by large and diverse protein superfamilies, and the extension of this method to a hierarchical multiple alignment model. Such a model is based on the fact that large protein superfamilies have frequently diversified to fulfill distinct functional roles within different subfamilies. Each subfamily has distinct structural constraints, which yield distinct amino acid frequency vectors at particular positions characteristic of that subfamily. Although the amino acids at different positions may be independent within a subfamily, the changes in frequency vectors across multiple positions characteristic of each subfamily yield the appearance of correlation between positions when a simple, non-hierarchical model of a superfamily is constructed. Earlier approaches have modeled these apparent correlations directly, using pairwise coupling terms, but we model them by constructing an explicit hierarchical model, with individual sequences assigned to distinct nodes within the hierarchy. We applied the Minimum Description Length principle to ensure that the hierarchical models we construct do not overfit the data, but have statistical support.
This year the central focus of this project was the statistical assessment of the three-dimensional clustering of distinguished positions, identified as characteristic of various nodes in a hierarchy. Our approach, called Initial Cluster Analysis (ICA), seeks to determine whether a set of distinguished elements within a linear array is clustered significantly near the start of the array and, if so, what the most significant initial cluster of these elements is.
Abstractly, given a linear array of length L containing D '1's (the distinguished elements) and L-D '0's, ICA considers a generative model in which the '1's occur with particular and differing probabilities before and after a cut point X in the array. For any particular X it is relatively easy to calculate a likelihood Like(X) of the array of data, and one may optimize Like(X) by simply evaluating it for all possible X. However, the values of Like(X) for nearby values of X are highly correlated, to a degree that depends on a calculable density of independent trials Rho(X). Because Rho(X) is not constant but rather grows approximately as the reciprocal of X's distance from 0 or L, simply optimizing Like(X) inherently favors, a priori, small or large values of X. Therefore, if one's application suggests no such bias, choosing to optimize Like(X)/Rho(X) rather than Like(X) for a given array of '0's and '1's may be a better strategy; we refer to this approach as using flattened priors. ICA estimates the effective total number of independent trials implicit in either optimization, which it uses in calculating a p-value for the optimal X. This provides a mathematically principled way to define an optimal initial cluster of distinguished elements, balancing the claims of very short and dense clusters with those of longer but sparser clusters. We published ICA in the Journal of Computational Biology. To analyze real proteins using ICA, we ordered the residues within a protein by their physical distance from a point of reference, and used our previously developed hierarchical analysis to define a set of distinguished residues, characteristic of a protein family or subfamily. ICA then allows us to find sets of distinguished residues that are significantly clustered in three dimensions.
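The cut-point scan described above can be illustrated with a minimal Python sketch. It is not the published ICA implementation: the two-density Bernoulli likelihood with maximum-likelihood densities, the restriction to cuts whose initial segment is denser than the remainder, the function names `segment_ll` and `scan_cut_points`, and in particular the stand-in `rho = 1/X + 1/(L-X)` are all assumptions made for illustration. The estimate of the effective number of independent trials and the resulting p-value are not attempted here.

```python
import math

def segment_ll(k, n):
    """Max log-likelihood of k ones among n ordered Bernoulli trials (p = k/n)."""
    if n == 0 or k == 0 or k == n:
        return 0.0  # the 0 * log(0) terms are taken as 0
    p = k / n
    return k * math.log(p) + (n - k) * math.log(1.0 - p)

def scan_cut_points(array, flatten=False):
    """Score every cut point X in 1..L-1 and return (best_X, best_score).

    The score is the log-likelihood ratio of a two-density model (one density
    before X, another after) against a single-density model. With flatten=True
    the score is log(Like(X)/Rho(X)), using an ASSUMED stand-in for Rho(X).
    """
    L, D = len(array), sum(array)
    null_ll = segment_ll(D, L)
    best_x, best_score = None, -math.inf
    ones_before = 0
    for x in range(1, L):
        ones_before += array[x - 1]
        ones_after = D - ones_before
        # Assumption: only cuts with a denser initial segment count as
        # "initial clusters".
        if ones_before * (L - x) <= ones_after * x:
            continue
        score = segment_ll(ones_before, x) + segment_ll(ones_after, L - x) - null_ll
        if flatten:
            rho = 1.0 / x + 1.0 / (L - x)  # hypothetical form, not from the paper
            score -= math.log(rho)
        if score > best_score:
            best_x, best_score = x, score
    return best_x, best_score

# Four distinguished elements packed at the start of a 12-element array.
example = [1, 1, 1, 1, 0, 0, 0, 0, 0, 0, 0, 0]
print(scan_cut_points(example))
print(scan_cut_points(example, flatten=True))
```

Working in log space turns the ratio Like(X)/Rho(X) into a difference, so the flattened and unflattened scans differ only by the subtracted `log(rho)` term.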
Applying this approach to N-acetyltransferases, P-loop GTPases, RNA helicases, synaptojanin-superfamily phosphatases and nucleases, and thymine/uracil DNA glycosylases yielded results congruent with biochemical understanding of these proteins, and also revealed striking sequence-structural features overlooked by other methods. This work was published in eLife.
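The step that turns a protein structure into the linear array analyzed by ICA, ordering residues by distance from a reference point and marking the distinguished ones, could be sketched as follows. The function name `ordered_indicator`, the use of a single coordinate per residue (for example, one representative atom), and the toy coordinates are assumptions for illustration only; the choice of reference point is left to the application.

```python
import math

def ordered_indicator(coords, distinguished, reference):
    """Order residues by distance from `reference` and mark distinguished ones.

    coords:        dict mapping residue id -> (x, y, z), one point per residue
    distinguished: set of residue ids flagged by a prior analysis
    reference:     (x, y, z) point of reference
    Returns (ordered residue ids, list of 0/1 flags in the same order).
    """
    ordered = sorted(coords, key=lambda r: math.dist(coords[r], reference))
    flags = [1 if r in distinguished else 0 for r in ordered]
    return ordered, flags

# Toy example with made-up coordinates (not taken from any real structure).
toy_coords = {10: (0, 0, 1), 11: (0, 0, 5), 12: (2, 0, 0), 13: (0, 9, 0)}
order, flags = ordered_indicator(toy_coords, {10, 12}, (0, 0, 0))
print(order, flags)
```

The resulting list of flags is the array of '0's and '1's on which a cut-point scan would then be run.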

Support Year
27
Fiscal Year
2018
Name
National Library of Medicine
Neuwald, Andrew F; Aravind, L; Altschul, Stephen F (2018) Inferring joint sequence-structural determinants of protein functional specificity. Elife 7:
Neuwald, Andrew F; Altschul, Stephen F (2016) Inference of Functionally-Relevant N-acetyltransferase Residues Based on Statistical Correlations. PLoS Comput Biol 12:e1005294
Neuwald, Andrew F; Altschul, Stephen F (2016) Bayesian Top-Down Protein Sequence Alignment with Inferred Position-Specific Gap Penalties. PLoS Comput Biol 12:e1004936
Yu, Yi-Kuo; Capra, John A; Stojmirović, Aleksandar et al. (2015) Log-odds sequence logos. Bioinformatics 31:324-31
Nguyen, Viet-An; Boyd-Graber, Jordan; Altschul, Stephen F (2013) Dirichlet mixtures, the Dirichlet process, and the structure of protein space. J Comput Biol 20:1-18
Yu, Yi-Kuo; Altschul, Stephen F (2011) The complexity of the dirichlet model for multiple alignment data. J Comput Biol 18:925-39
Ye, Xugang; Yu, Yi-Kuo; Altschul, Stephen F (2011) On the inference of dirichlet mixture priors for protein sequence comparison. J Comput Biol 18:941-54
Ye, Xugang; Wang, Guoli; Altschul, Stephen F (2011) An assessment of substitution scores for protein profile-profile comparison. Bioinformatics 27:3356-63
Altschul, Stephen F; Wootton, John C; Zaslavsky, Elena et al. (2010) The construction and use of log-odds substitution scores for multiple sequence alignment. PLoS Comput Biol 6:e1000852
Ye, Xugang; Yu, Yi-Kuo; Altschul, Stephen F (2010) Compositional adjustment of Dirichlet mixture priors. J Comput Biol 17:1607-20