The primary focus this year was on issues related to the Dirichlet mixture model, which is used to analyze protein sequences. The Dirichlet mixture model was introduced to protein sequence analysis by Haussler's group at UCSC. In brief, this model imagines that a particular position in a protein family is described by a multinomial distribution on the set of amino acids. Although the multinomial for a particular position may be unique, the study of many protein families reveals that certain regions of multinomial space are much more heavily populated than others. This general knowledge may be summarized by a Dirichlet mixture (DM) prior, which is a probability density over multinomial space that lends itself to easy analysis.

Our research on DM priors this year centered on the question of how best to derive them from a set of multiple alignment data, and in particular on the application of the Dirichlet Process (DP) to this problem. The DP applies to mixtures with an unknown number of components. It may be understood as a generalized prior over mixtures with an infinite number of components, but where any set of data is described by a finite number of these components. A DP is completely specified by a prior on the parameters of the underlying distribution (in our case, a Dirichlet distribution), and a hyperparameter that implicitly specifies a prior on the weights of the mixture components. The DP had not previously been applied to DMs, and we needed to develop several technical innovations for this case. In particular, it was most convenient to take the prior for the parameters of a Dirichlet distribution as, first, a very slowly decaying exponential distribution on the concentration parameter and, second, a Dirichlet distribution on its center-of-mass parameter vector. When applied to multiple alignment data, the DP yielded DMs with over five hundred components, which were fully justified by the Minimum Description Length principle.
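The way a DP hyperparameter implicitly sets a prior on component weights can be illustrated with the standard Chinese restaurant process view of the DP. The sketch below is not the inference procedure used in this work; it only samples how many components a given number of data items would occupy under the prior. The function name, the `alpha` parameter name, and the example values are assumptions chosen for illustration.

```python
import random


def crp_component_sizes(n_items, alpha, seed=0):
    """Sample component sizes from a Chinese restaurant process.

    With i items already assigned, the next item joins existing
    component k with probability size_k / (i + alpha) and opens a
    new component with probability alpha / (i + alpha).
    `alpha` is the DP hyperparameter (name assumed for this sketch).
    """
    rng = random.Random(seed)
    sizes = []
    for i in range(n_items):
        # Total unnormalized weight: i (existing items) + alpha (new component).
        r = rng.uniform(0.0, i + alpha)
        cumulative = 0.0
        for k, size in enumerate(sizes):
            cumulative += size
            if r < cumulative:
                sizes[k] += 1
                break
        else:
            # r fell in the final interval of width alpha: open a new component.
            sizes.append(1)
    return sizes


if __name__ == "__main__":
    # Illustrative values only: larger alpha tends to give more components.
    for alpha in (0.5, 5.0, 50.0):
        sizes = crp_component_sizes(1000, alpha)
        print(f"alpha={alpha}: {len(sizes)} occupied components")
```

Although the prior allows infinitely many components, any finite data set occupies only finitely many of them, and the hyperparameter controls how readily new components are opened.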
Previous heuristic optimization had not succeeded in finding DMs with more than about 35 components. An analysis of the probability landscape defined by the DMs with large numbers of components reveals that they are characterized by continuous ridges winding through amino acid multinomial space. The previous conceptual picture, of a small number of discrete probability hills in multinomial space, each corresponding to a distinct type of amino acid environment within proteins, had been strongly suggested by the DM metaphor, but proved to be ungrounded in the actual constraints on protein evolution.

A secondary focus this year has been preliminary work on adapting sequence "logos", as originally described by Tom Schneider, to the log-odds formalism implicit in the Dirichlet mixture perspective on protein multiple alignments. Sequence logos represent DNA or protein motifs captured by an input multiple alignment. A logo represents each position in the multiple alignment by a stack of letters whose individual heights are proportional to the observed letter frequencies, and whose aggregate height is proportional to the information at the position in question. Originally, "information" was defined as an entropy difference, perhaps with a correction for small sample size. The log-odds perspective suggests that it is better defined as relative entropy. For DNA motifs, uniform background frequencies imply that the alternative definition differs from the traditional one only by a constant offset. However, for protein motifs, the new definition yields logos that may differ substantially from the traditional ones, and that may enhance the recognition of biologically important positions.
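The two notions of per-position information can be compared with a minimal sketch. It assumes the simplest textbook forms: the entropy difference is log2 of the alphabet size minus the Shannon entropy of the observed frequencies, and the relative entropy is taken against a background distribution. It omits small-sample corrections and any pseudocounts or Dirichlet mixture regularization, so it does not reproduce the exact definitions used in this work. Under these simplified forms the two values coincide for a uniform background; any constant offset would come from details not modeled here. The function names and the example frequencies are hypothetical.

```python
import math


def entropy_information(freqs):
    """Entropy-difference information in bits: log2(alphabet size) - H(freqs)."""
    entropy = -sum(p * math.log2(p) for p in freqs.values() if p > 0)
    return math.log2(len(freqs)) - entropy


def relative_entropy_information(freqs, background):
    """Relative entropy in bits of the observed frequencies against a background."""
    return sum(
        p * math.log2(p / background[letter])
        for letter, p in freqs.items()
        if p > 0
    )


def letter_heights(freqs, information):
    """Logo letter heights: each letter's share of the stack is its frequency."""
    return {letter: p * information for letter, p in freqs.items()}


if __name__ == "__main__":
    # Made-up frequencies for one alignment column over a four-letter alphabet.
    column = {"A": 0.7, "C": 0.1, "G": 0.1, "T": 0.1}
    uniform = {letter: 0.25 for letter in column}
    skewed = {"A": 0.4, "C": 0.1, "G": 0.1, "T": 0.4}  # hypothetical background

    print("entropy difference:", entropy_information(column))
    print("relative entropy, uniform background:",
          relative_entropy_information(column, uniform))
    print("relative entropy, skewed background:",
          relative_entropy_information(column, skewed))
    print("letter heights:",
          letter_heights(column, relative_entropy_information(column, skewed)))
```

With a non-uniform background, as is typical for amino acids, a column dominated by a common letter scores lower than one dominated by a rare letter, which is why protein logos may change substantially under the relative-entropy definition.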
Nguyen, Viet-An; Boyd-Graber, Jordan; Altschul, Stephen F (2013) Dirichlet mixtures, the Dirichlet process, and the structure of protein space. J Comput Biol 20:1-18