Statistics of Sequence Comparison

Altschul, Stephen

Abstract

The primary focus of the project this year has been adapting sequence "logos", as originally described by Tom Schneider, to the log-odds formalism. Sequence logos represent DNA or protein motifs captured by an input multiple alignment. A logo represents each position in the multiple alignment by a stack of letters whose individual heights are proportional to the observed letter frequencies, and whose aggregate height is proportional to the "information" at the position in question. Originally, information was defined as an entropy difference, perhaps with a correction for small sample size. Alternative definitions of information have also been proposed and implemented. These include the relative entropy implied by the observed letter frequencies and the background frequencies attributable to chance.

It has long been recognized that any scoring system for local alignments, i.e. a system with negative expected score, is of the log-odds form log(Q/P), with implicit if not explicit "target frequencies" Q. These are the frequencies of aligned letters, among related sequences, that the scoring system is optimized to distinguish from chance, which is modeled by the probability P. For pairwise alignments, all popular local alignment scoring systems are constructed by explicitly specifying appropriate target frequencies Q. It was only in 2010, however, that a method for explicitly estimating target frequencies for multiple alignment columns was described in the bioinformatics literature.

This project has sought to compare the resulting multiple-alignment log-odds scores to previously proposed scores, by the criterion of their effectiveness in recognizing biologically important positions, and to make log-odds scores available to researchers through a public sequence-logo construction program. The frequencies Q for log-odds scores may be constructed in a variety of ways.
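As an illustration of the relative-entropy definition of information described above, the following minimal sketch computes the letter heights for one logo column. The function name `logo_column`, the uniform DNA background, and the absence of any small-sample correction are assumptions made for this example; it is not the project's logo program.

```python
import math
from collections import Counter

def logo_column(column, background):
    """Return (information_in_bits, {letter: height}) for one alignment column.

    Hypothetical helper: `column` is a string of aligned letters and
    `background` maps every alphabet letter to its chance frequency.
    The column is assumed to contain only letters found in `background`.
    """
    counts = Counter(column)
    n = len(column)
    freqs = {a: counts[a] / n for a in background}
    # Relative entropy (bits) of the observed frequencies vs. the background.
    info = sum(f * math.log2(f / background[a])
               for a, f in freqs.items() if f > 0)
    # Each letter's height is its frequency times the aggregate height.
    heights = {a: f * info for a, f in freqs.items() if f > 0}
    return info, heights

# Usage with an assumed uniform DNA background.
dna_background = {a: 0.25 for a in "ACGT"}
info, heights = logo_column("AAAAAAAA", dna_background)
# A fully conserved column carries log2(4) = 2 bits, all assigned to "A".
```

With this definition a column whose frequencies match the background has zero height, so such positions disappear from the logo.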
Among the simplest is normalized maximum likelihood (NML), in which Q is taken to be proportional to the likelihood of a column implied by a maximum-likelihood multinomial. It can be shown that NML log-odds scores are equivalent to relative-entropy scores, plus a correction term c(N) dependent on the number of observations N in the multiple alignment column. For DNA and protein multiple alignments, we have calculated c(N) explicitly for small N, and have derived an asymptotic formula of sufficient accuracy to be used for N where explicitly calculating c(N) becomes infeasible.

Although NML log-odds scores may be appropriate for DNA, in the protein alignment context they ignore prior knowledge concerning amino acid relationships. The alternative log-odds BILD scores, first described in 2010, are able to exploit this knowledge through their use of a Bayesian Dirichlet mixture prior, which describes multiple alignment columns of related protein sequences. It is notable that BILD scores are essentially equivalent to NML scores when "uninformative" Jeffreys priors are used.

Using an enzyme dataset, with active sites as a proxy for biologically important positions within proteins, we compared small-sample-size corrected and uncorrected entropy difference scores to NML scores and to BILD scores based on a recently developed Dirichlet mixture prior. In this comparison, log-odds scores proved superior to previously proposed multiple alignment scoring systems. Taking account of prior knowledge concerning amino acid relationships tended to raise the scores of all alignment positions, slightly favoring "important" positions. We implemented an online server to produce log-odds sequence logos. A paper describing this work has been submitted for publication.

A secondary focus of this project has been preliminary work on applying BILD scores to the automatic articulation of protein subfamilies.
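The relationship between NML log-odds scores and relative entropy can be sketched for small N as follows. The function names are hypothetical, and the example assumes the score is taken over the whole column, so that it equals N times the relative entropy minus log2 of the NML normalizing sum; the sign and scaling conventions used for c(N) in the project's own work may differ. The normalizer is computed by direct summation, which is practical only for small N, and the asymptotic formula mentioned above is not reproduced here.

```python
import math
from collections import Counter

def nml_normalizer(n, alphabet_size):
    """Sum, over all length-n columns, of the maximized multinomial likelihood.

    Uses the identity  sum = (n! / n^n) * [L-fold convolution of k^k / k!](n).
    Hypothetical helper intended only for small n.
    """
    # b[k] = k^k / k!, with 0^0 taken as 1 (Python evaluates 0**0 as 1).
    b = [k**k / math.factorial(k) for k in range(n + 1)]
    conv = [1.0] + [0.0] * n  # empty alphabet: only the empty column
    for _ in range(alphabet_size):
        conv = [sum(conv[j] * b[i - j] for j in range(i + 1))
                for i in range(n + 1)]
    return conv[n] * (math.factorial(n) / n**n)

def nml_log_odds(column, background):
    """NML log-odds score (bits) of a column: log2(Q/P) with Q the NML distribution."""
    counts = Counter(column)
    n = len(column)
    # Relative entropy of the observed frequencies vs. the background.
    rel_entropy = sum((c / n) * math.log2((c / n) / background[a])
                      for a, c in counts.items())
    # Assumed correction term: minus log2 of the normalizing sum.
    correction = -math.log2(nml_normalizer(n, len(background)))
    return n * rel_entropy + correction

# Usage with an assumed uniform DNA background.
dna_background = {a: 0.25 for a in "ACGT"}
score_conserved = nml_log_odds("AA", dna_background)
score_mixed = nml_log_odds("AC", dna_background)
```

Under these assumptions a single observation scores zero for DNA, since the normalizer for N = 1 equals the alphabet size and exactly cancels the 2 bits of relative entropy; the correction therefore penalizes apparent conservation that rests on very few sequences.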

Funding Agency

Agency: National Institutes of Health (NIH)
Institute: National Library of Medicine (NLM)
Type: Investigator-Initiated Intramural Research Projects (ZIA)
Project #: 1ZIALM000014-23
Application #: 8943213
Support Year: 23
Fiscal Year: 2014

Institution

Name: National Library of Medicine

Publications

Neuwald, Andrew F; Aravind, L; Altschul, Stephen F (2018) Inferring joint sequence-structural determinants of protein functional specificity. Elife 7:

Neuwald, Andrew F; Altschul, Stephen F (2016) Inference of Functionally-Relevant N-acetyltransferase Residues Based on Statistical Correlations. PLoS Comput Biol 12:e1005294

Neuwald, Andrew F; Altschul, Stephen F (2016) Bayesian Top-Down Protein Sequence Alignment with Inferred Position-Specific Gap Penalties. PLoS Comput Biol 12:e1004936

Yu, Yi-Kuo; Capra, John A; Stojmirović, Aleksandar et al. (2015) Log-odds sequence logos. Bioinformatics 31:324-31

Nguyen, Viet-An; Boyd-Graber, Jordan; Altschul, Stephen F (2013) Dirichlet mixtures, the Dirichlet process, and the structure of protein space. J Comput Biol 20:1-18

Yu, Yi-Kuo; Altschul, Stephen F (2011) The complexity of the Dirichlet model for multiple alignment data. J Comput Biol 18:925-39

Ye, Xugang; Yu, Yi-Kuo; Altschul, Stephen F (2011) On the inference of Dirichlet mixture priors for protein sequence comparison. J Comput Biol 18:941-54

Ye, Xugang; Wang, Guoli; Altschul, Stephen F (2011) An assessment of substitution scores for protein profile-profile comparison. Bioinformatics 27:3356-63

Altschul, Stephen F; Wootton, John C; Zaslavsky, Elena et al. (2010) The construction and use of log-odds substitution scores for multiple sequence alignment. PLoS Comput Biol 6:e1000852

Ye, Xugang; Yu, Yi-Kuo; Altschul, Stephen F (2010) Compositional adjustment of Dirichlet mixture priors. J Comput Biol 17:1607-20
