Work this year has focused on the Dirichlet mixture model used to analyze protein sequences. The Dirichlet mixture model was introduced to protein sequence analysis by Haussler's group at UCSC. In brief, this model imagines that a particular position in a protein family is described by a multinomial distribution on the set of amino acids. Although the multinomial for a particular position may be unique, the study of many protein families reveals that certain regions of multinomial space are much more heavily populated than others. This general knowledge may be summarized by a "Dirichlet mixture prior", which is a probability density over multinomial space that lends itself to easy analysis. Our research on Dirichlet and Dirichlet mixture priors has had three separate focuses. First, the set of all Dirichlet distributions, called the Dirichlet model D, can be used to describe a set of multiple alignment data, consisting most simply of n columns, each containing c letters. When these data are used to select a maximum-likelihood distribution or "theory" from the Dirichlet model, an important question is how many effectively independent theories D contains; the log of this number is called the model's complexity, or COMP(D). This complexity can be expressed as a multidimensional definite integral of the square root of the determinant of the Fisher information matrix for D. In the limit of large n and c, we have been able to derive an analytic, closed-form expression for COMP(D), namely (L/2) log(n) + ((L-1)/2) log(c) + A_L, where L is the size of the alphabet and A_L is a calculable constant dependent on L. Specifically, for protein sequences, A_20 = -30.093 bits. We have also described a Monte Carlo method for accurately calculating a small (i.e. <1 bit) correction to this formula for small c.
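As a rough illustration of the closed-form expression above, the following sketch evaluates the large-n, large-c approximation to COMP(D). It assumes base-2 logarithms, since A_20 is quoted in bits; the function and parameter names are hypothetical, and the small-c Monte Carlo correction is not included.

```python
import math

# Constant quoted in the text for the 20-letter amino acid alphabet, in bits.
A_20_BITS = -30.093


def dirichlet_model_complexity(n, c, alphabet_size=20, a_l=A_20_BITS):
    """Large-n, large-c approximation to COMP(D), in bits.

    n: number of alignment columns; c: letters per column.
    Assumes base-2 logarithms (an assumption made because A_L is given in
    bits) and omits the small-c correction mentioned in the text.
    The constant a_l must correspond to the chosen alphabet_size; only the
    value for alphabet_size = 20 is taken from the text.
    """
    if n <= 0 or c <= 0:
        raise ValueError("n and c must be positive")
    L = alphabet_size
    return (L / 2) * math.log2(n) + ((L - 1) / 2) * math.log2(c) + a_l
```

For example, under these assumptions, `dirichlet_model_complexity(1024, 16)` evaluates 10 x 10 + 9.5 x 4 - 30.093, or about 107.9 bits.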
Although we cannot extend our analysis to the Dirichlet mixtures appropriate for protein sequence analysis, heuristic arguments allow us to derive a plausible formula applicable to that case as well. Second, we examined the question of how best to infer a Dirichlet mixture from a set of multiple alignment data. The first issue that arises is how many components such a mixture should have. Using our formula for the complexity of a Dirichlet mixture model, we applied the Minimum Description Length principle to this problem. As a proof of principle, we showed that, with a sufficient amount of artificial data generated using a known Dirichlet mixture, we were able to converge on the correct number of components. Once the number of components is known, the problem remains how to infer the parameters of the Dirichlet mixture. We applied a Gibbs sampling approach to this problem. It had the advantage over the previously described EM approach of requiring optimizations in only one dimension, and it produced better results on the same data set. Finally, we described how to adjust a previously inferred Dirichlet mixture for use on a data set with non-standard amino acid composition. This was the culmination of work begun the year before. Each of these three projects resulted in an independent publication.
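The use of the Minimum Description Length principle to choose the number of mixture components can be sketched generically as follows. This is a minimal sketch, not the procedure from the publications: it assumes the description length of each candidate model is the negative log-likelihood of the data under its best-fitting theory plus the model's complexity, both in bits, and that the caller supplies both quantities. The function name and the numbers are hypothetical and purely illustrative.

```python
def select_by_mdl(candidates):
    """Return (best_k, total_bits) for the model with the shortest description.

    candidates maps a component count k to a pair
    (neg_log2_likelihood_bits, complexity_bits). Both values are inputs;
    this sketch does not compute likelihoods or complexities itself.
    """
    if not candidates:
        raise ValueError("no candidate models given")
    # Description length = data cost under the best theory + model complexity.
    totals = {k: nll + comp for k, (nll, comp) in candidates.items()}
    best_k = min(totals, key=totals.get)
    return best_k, totals[best_k]


# Made-up numbers for illustration only; not results from the study.
example = {
    1: (52000.0, 110.0),
    2: (51200.0, 230.0),
    3: (51150.0, 350.0),
}
best_k, total_bits = select_by_mdl(example)
```

With these made-up inputs, the two-component model wins: adding a third component improves the fit by 50 bits but costs 120 bits of extra complexity.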

Support Year: 20
Fiscal Year: 2011
Total Cost: $409,736
Name: National Library of Medicine
Altschul, Stephen F; Neuwald, Andrew F (2017) Initial Cluster Analysis. J Comput Biol :
Neuwald, Andrew F; Altschul, Stephen F (2016) Inference of Functionally-Relevant N-acetyltransferase Residues Based on Statistical Correlations. PLoS Comput Biol 12:e1005294
Neuwald, Andrew F; Altschul, Stephen F (2016) Bayesian Top-Down Protein Sequence Alignment with Inferred Position-Specific Gap Penalties. PLoS Comput Biol 12:e1004936
Yu, Yi-Kuo; Capra, John A; Stojmirović, Aleksandar et al. (2015) Log-odds sequence logos. Bioinformatics 31:324-31
Nguyen, Viet-An; Boyd-Graber, Jordan; Altschul, Stephen F (2013) Dirichlet mixtures, the Dirichlet process, and the structure of protein space. J Comput Biol 20:1-18
Yu, Yi-Kuo; Altschul, Stephen F (2011) The complexity of the Dirichlet model for multiple alignment data. J Comput Biol 18:925-39
Ye, Xugang; Yu, Yi-Kuo; Altschul, Stephen F (2011) On the inference of Dirichlet mixture priors for protein sequence comparison. J Comput Biol 18:941-54
Ye, Xugang; Wang, Guoli; Altschul, Stephen F (2011) An assessment of substitution scores for protein profile-profile comparison. Bioinformatics 27:3356-63
Altschul, Stephen F; Wootton, John C; Zaslavsky, Elena et al. (2010) The construction and use of log-odds substitution scores for multiple sequence alignment. PLoS Comput Biol 6:e1000852
Ye, Xugang; Yu, Yi-Kuo; Altschul, Stephen F (2010) Compositional adjustment of Dirichlet mixture priors. J Comput Biol 17:1607-20