The goal of this project is to define and analyze, using computational methods, segments of protein and nucleotide sequences showing compositional bias and to understand their structural, functional and evolutionary significance, and their pathology. These sequences include local low complexity regions or domains, including conformationally mobile or intrinsically unstructured regions of proteins, tandemly repeated sequences, and also more generally distributed amino acid content bias. The latter can reflect directional mutation pressures at the genomic level and constraints specific to protein or domain function. Low complexity regions comprise a large proportion of the genome-encoded amino acids, and may contain homopolymeric tracts or mosaics of a few amino acids, or repeated patterns, frequently subtle, including those typical of many non-globular domains. New mathematical definitions and algorithms continue to be developed to identify regions of compositional bias, and to discover and analyze properties of these regions relevant to their structures, interactions, biological functions, and evolution. Strong background bias is shown by proteins encoded by very AT-rich or GC-rich genomes, which include those of several important infectious disease organisms; these raise problems for sequence alignment algorithms, which are being addressed. Local regions of low complexity and tandemly repeated amino acid sequences occur in many proteins involved in cellular differentiation and embryonic development, RNA processing, transcriptional regulation, signal transduction and aspects of cellular and extracellular structural integrity. Experimental data indicate that low complexity segments of proteins are generally non-globular, intrinsically unstructured, or conformationally mobile; however, knowledge of the molecular structures and dynamics of these domains is still very limited. They are generally relatively intractable to investigation by crystallography and NMR, and they account for less than 1% of the residues in current structural databases. Hence, mathematically rigorous sequence analysis provides a primary methodology for gaining insights into their biology, and for raising questions to be investigated experimentally. These methods are also valuable, for both nucleotide and amino acid sequences, in detecting and eliminating some artifacts in sequence database searches and alignment analysis.
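
As an illustration of what identifying a low complexity region can mean in practice, the following minimal sketch flags windows of a sequence whose residue composition has low Shannon entropy. It is not the project's algorithm: the function names, the window length of 12, and the entropy threshold of 2.2 bits are assumptions chosen only for demonstration, and a plain entropy cutoff stands in for the more rigorous complexity definitions the abstract refers to.

```python
import math
from collections import Counter


def shannon_entropy(segment: str) -> float:
    """Shannon entropy, in bits per residue, of a segment's composition."""
    n = len(segment)
    counts = Counter(segment)
    # p * log2(1/p) summed over the residue types present in the segment.
    return sum((c / n) * math.log2(n / c) for c in counts.values())


def low_complexity_regions(seq: str, window: int = 12, threshold: float = 2.2):
    """Return merged half-open (start, end) intervals covered by low-entropy windows.

    `window` and `threshold` are illustrative defaults (assumptions), not
    values taken from the project.
    """
    regions = []
    for i in range(len(seq) - window + 1):
        if shannon_entropy(seq[i:i + window]) <= threshold:
            start, end = i, i + window
            if regions and start <= regions[-1][1]:
                # Overlaps or touches the previous region: extend it.
                regions[-1] = (regions[-1][0], max(regions[-1][1], end))
            else:
                regions.append((start, end))
    return regions


if __name__ == "__main__":
    # Hypothetical sequence: a glutamine tract flanked by varied residues.
    example = "ACDEFGHIKLMN" + "Q" * 20 + "PRSTVWYACDEF"
    for start, end in low_complexity_regions(example, threshold=1.0):
        print(start, end, example[start:end])
```

A homopolymeric tract has entropy 0, while a window in which every residue differs has the maximum entropy for its length, so lowering the threshold restricts the output to more strongly biased segments.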

Agency
National Institutes of Health (NIH)
Institute
National Library of Medicine (NLM)
Type
Intramural Research (Z01)
Project #
1Z01LM000025-12
Application #
6988450
Study Section
(CBB)
Support Year
12
Fiscal Year
2004
Name
National Library of Medicine
Country
United States
Altschul, Stephen F; Wootton, John C; Gertz, E Michael et al. (2005) Protein database searches using compositionally adjusted substitution matrices. FEBS J 272:5101-9
Wan, Honghui; Li, Lugang; Federhen, Scott et al. (2003) Discovering simple regions in biological sequences associated with scoring schemes. J Comput Biol 10:171-85
Yu, Yi-Kuo; Wootton, John C; Altschul, Stephen F (2003) The compositional adjustment of amino acid substitution matrices. Proc Natl Acad Sci U S A 100:15688-93
Sonnhammer, E L; Wootton, J C (2001) Integrated graphical analysis of protein sequence features predicted from sequence composition. Proteins 45:262-73
Wan, H; Wootton, J C (2000) A global compositional complexity measure for biological sequences: AT-rich and GC-rich genomes encode less complex proteins. Comput Chem 24:71-94