We are investigating segments of protein and nucleotide sequences whose compositional bias raises several challenges for computational analysis. We develop methods to help understand the structural, functional and evolutionary significance of these regions and their pathology. These sequences include local low-complexity regions or domains, such as conformationally mobile or intrinsically unstructured regions of proteins, and tandemly repeated sequences. Further problems arise from more widely distributed biases in amino acid content, which can reflect directional mutation pressures at the genomic level and constraints specific to protein or domain function. Low-complexity regions account for a large proportion of genome-encoded amino acids. They may contain homopolymeric tracts, mosaics of a few amino acids, or repeated patterns that are frequently subtle, including those typical of many non-globular domains and dynamic or intrinsically unstructured segments of proteins. We have developed mathematical definitions of compositional bias and algorithms to identify such regions, and to discover and analyze properties of these regions relevant to their structures, interactions, and evolution. Local regions of low complexity and tandemly repeated amino acid sequences occur in many proteins involved in cellular differentiation and embryonic development, RNA processing, transcriptional regulation, signal transduction and aspects of cellular and extracellular structural integrity. Protein segments are commonly non-globular, intrinsically unstructured, or conformationally mobile; however, knowledge of the molecular structures and dynamics of these domains is still very limited. They are largely intractable to investigation by crystallography and NMR, and they still account for less than 1% of the residues in three-dimensional structural databases.
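As a rough illustration of what identifying a low-complexity region can involve, the sketch below scores sliding windows of a protein sequence by the Shannon entropy of their residue composition and merges windows that fall below a cutoff. It is a simplified stand-in, not the definitions or algorithms developed in this project; the function names, the window length of 12 and the cutoff of 2.2 bits are all assumptions chosen for illustration.

```python
import math
from collections import Counter


def window_entropy(window: str) -> float:
    """Shannon entropy, in bits per residue, of a window's residue composition."""
    counts = Counter(window)
    n = len(window)
    # Sum of p * log2(1/p); a homopolymeric window scores exactly 0.0.
    return sum((c / n) * math.log2(n / c) for c in counts.values())


def low_complexity_regions(seq: str, window: int = 12, threshold: float = 2.2):
    """Return merged half-open (start, end) intervals covered by low-entropy windows.

    `window` and `threshold` are illustrative values, not tuned parameters.
    """
    regions = []
    for i in range(len(seq) - window + 1):
        if window_entropy(seq[i:i + window]) <= threshold:
            if regions and i <= regions[-1][1]:
                # Overlaps or touches the previous region: extend it.
                regions[-1] = (regions[-1][0], i + window)
            else:
                regions.append((i, i + window))
    return regions


if __name__ == "__main__":
    # Hypothetical sequence with a 20-residue glutamine tract in the middle.
    example = "MKTAYIAKQRQISFVKSHFSRQ" + "Q" * 20 + "LEERLGLIEVQAPILSRVGDGT"
    print(low_complexity_regions(example))
```

A composition-only score like this would catch homopolymeric tracts and mosaics of a few residues, but it is blind to residue order, so subtle repeated patterns would likely need a different measure.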
Current computer methods based on molecular mechanics and dynamics have given inconsistent results when applied to low-complexity amino acid sequences. Accordingly, we are experimenting with ab initio quantum chemical methods to investigate the ensembles of conformational states accessible to these regions of proteins. As specific examples to motivate this development, we are investigating amino acid sequence repeats of malaria parasites with possible roles in immune evasion and as components of malaria vaccines. A related problem is compositional bias that is not confined to local segments but is distributed across the entire genome or the proteins of an organism. This is shown, for example, by the biased codon and protein compositions encoded by very AT-rich or GC-rich genomes, including those of several important infectious disease organisms. Such variation and bias in genome-wide amino acid and nucleotide compositions raise problems for several commonly used sequence analysis algorithms. Accordingly, current research with Stephen Altschul and Yi-Kuo Yu is further developing the theoretical foundation and implementation of these algorithms, including an improved treatment of background frequencies.
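To see where background frequencies enter such algorithms, it may help to recall the standard log-odds form of a substitution score used in alignment scoring. This is the textbook expression, not the refined treatment under development here; the symbols $q_{ij}$, $p_i$, $p_j$ and $\lambda$ are notation introduced for illustration.

```latex
s_{ij} = \frac{1}{\lambda} \ln \frac{q_{ij}}{p_i \, p_j}
```

Here $q_{ij}$ is the target frequency with which residues $i$ and $j$ are aligned in true relationships, $p_i$ and $p_j$ are the background frequencies of the two residues, and $\lambda$ is a positive scale factor. Because the background frequencies appear in the denominator, scores derived under one composition may be poorly suited to sequences from strongly AT-rich or GC-rich genomes, whose amino acid frequencies can differ substantially.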

Support Year: 18
Fiscal Year: 2010
Total Cost: $293,805
Name: National Library of Medicine
Altschul, Stephen F; Wootton, John C; Zaslavsky, Elena et al. (2010) The construction and use of log-odds substitution scores for multiple sequence alignment. PLoS Comput Biol 6:e1000852