Work was completed this year on the sequence logo project, detailed in reports of previous years, and a paper describing this project was published. A new direction for this project was launched this year in collaboration with Dr. Andrew Neuwald of the Institute for Genome Sciences and Department of Biochemistry & Molecular Biology at the University of Maryland School of Medicine.
The first aim of the work was the development of an improved program for the multiple alignment of large numbers of sequences. The strategy we employed has several central features: (i) It employs a top-down alignment strategy that first identifies regions shared by all the input sequences, and then realigns closely related subgroups. This is key to escaping suboptimal traps, in which a set S of closely related but misaligned sequences resists change: when a sequence X from S is realigned individually, the remaining misaligned sequences of S pull X back into misalignment; (ii) It uses a Bayesian statistical measure of alignment quality, based on the minimum description length principle and on Dirichlet mixture priors. This measure favors more biologically realistic alignments than does, for example, the ad hoc but widely used sum-of-the-pairs scoring system; (iii) It infers position-specific gap penalties that favor insertions or deletions (indels) within each sequence at alignment positions in which indels are invoked in other sequences. This favors the placement of insertions between conserved blocks, which can be understood as making up the proteins' structural core. When applied to large datasets, the program we have developed runs significantly faster, and produces on average more biologically accurate alignments, than widely used programs that have been considered the state of the art. A paper describing this work has been submitted for publication.
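As an illustration only of the kind of quantity on which a measure based on Dirichlet mixture priors can build, the following Python sketch computes the log marginal probability of a single alignment column under a Dirichlet mixture, together with the posterior mean residue probabilities for that column. It is not the scoring function of the program described above: the four-letter alphabet, the two mixture components, and the names `WEIGHTS`, `ALPHAS`, `column_log_marginal` and `posterior_mean` are all assumptions made for the example, and neither the minimum description length terms nor the gap penalties are modeled.

```python
import math

# Hypothetical toy prior over a 4-letter alphabet (real protein priors use 20 letters).
# Component 0 favors conserved columns; component 1 favors mixed columns.
WEIGHTS = [0.5, 0.5]
ALPHAS = [[0.1, 0.1, 0.1, 0.1], [5.0, 5.0, 5.0, 5.0]]


def log_dirichlet_multinomial(counts, alpha):
    """Log probability of one ordered column with these counts under one component."""
    total_alpha = sum(alpha)
    total_count = sum(counts)
    log_p = math.lgamma(total_alpha) - math.lgamma(total_alpha + total_count)
    for n, a in zip(counts, alpha):
        log_p += math.lgamma(a + n) - math.lgamma(a)
    return log_p


def component_log_joint(counts, weights, alphas):
    """Log of (mixture weight * column probability) for each component."""
    return [math.log(q) + log_dirichlet_multinomial(counts, a)
            for q, a in zip(weights, alphas)]


def column_log_marginal(counts, weights=WEIGHTS, alphas=ALPHAS):
    """Log marginal probability of the column, summed over mixture components."""
    logs = component_log_joint(counts, weights, alphas)
    m = max(logs)  # log-sum-exp for numerical stability
    return m + math.log(sum(math.exp(x - m) for x in logs))


def posterior_mean(counts, weights=WEIGHTS, alphas=ALPHAS):
    """Posterior mean residue probabilities for the column."""
    logs = component_log_joint(counts, weights, alphas)
    m = max(logs)
    unnorm = [math.exp(x - m) for x in logs]
    z = sum(unnorm)
    post = [u / z for u in unnorm]  # posterior weight of each component
    total_count = sum(counts)
    probs = [0.0] * len(counts)
    for w, a in zip(post, alphas):
        denom = sum(a) + total_count
        for i, (n, ai) in enumerate(zip(counts, a)):
            probs[i] += w * (ai + n) / denom
    return probs


if __name__ == "__main__":
    conserved = [8, 0, 0, 0]  # eight sequences, all with the same residue
    mixed = [2, 2, 2, 2]      # eight sequences, residues evenly spread
    print(column_log_marginal(conserved), column_log_marginal(mixed))
    print(posterior_mean(conserved))
```

Under this toy prior the conserved column receives a higher log marginal probability than the evenly mixed one, and a column with no observations falls back to the prior mean. A full alignment measure would combine such column terms with a model of gaps, which this sketch leaves out.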
A second aim of this work is to extend the method described above to a multiple alignment model that describes phenotypically diverged sequences separately at those alignment positions that are statistically implicated in their divergence. Preliminary research in this direction has begun.

Support Year: 24
Fiscal Year: 2015
Name: National Library of Medicine
Neuwald, Andrew F; Aravind, L; Altschul, Stephen F (2018) Inferring joint sequence-structural determinants of protein functional specificity. Elife 7:
Neuwald, Andrew F; Altschul, Stephen F (2016) Inference of Functionally-Relevant N-acetyltransferase Residues Based on Statistical Correlations. PLoS Comput Biol 12:e1005294
Neuwald, Andrew F; Altschul, Stephen F (2016) Bayesian Top-Down Protein Sequence Alignment with Inferred Position-Specific Gap Penalties. PLoS Comput Biol 12:e1004936
Yu, Yi-Kuo; Capra, John A; Stojmirović, Aleksandar et al. (2015) Log-odds sequence logos. Bioinformatics 31:324-31
Nguyen, Viet-An; Boyd-Graber, Jordan; Altschul, Stephen F (2013) Dirichlet mixtures, the Dirichlet process, and the structure of protein space. J Comput Biol 20:1-18
Yu, Yi-Kuo; Altschul, Stephen F (2011) The complexity of the Dirichlet model for multiple alignment data. J Comput Biol 18:925-39
Ye, Xugang; Yu, Yi-Kuo; Altschul, Stephen F (2011) On the inference of Dirichlet mixture priors for protein sequence comparison. J Comput Biol 18:941-54
Ye, Xugang; Wang, Guoli; Altschul, Stephen F (2011) An assessment of substitution scores for protein profile-profile comparison. Bioinformatics 27:3356-63
Altschul, Stephen F; Wootton, John C; Zaslavsky, Elena et al. (2010) The construction and use of log-odds substitution scores for multiple sequence alignment. PLoS Comput Biol 6:e1000852
Ye, Xugang; Yu, Yi-Kuo; Altschul, Stephen F (2010) Compositional adjustment of Dirichlet mixture priors. J Comput Biol 17:1607-20