The goal of this project is to define, classify and analyze computationally the segments of protein and nucleotide sequences that show compositional bias or improbably low compositional complexity. In protein sequences, these include the abundant residue clusters of predominantly one or a few amino acid types, which commonly contain homopolymeric tracts or mosaics of these, aperiodic patterns and sections of low-period repeats. Other common examples include long non-globular domains. The abundance of biased segments in both amino acid and nucleotide sequence databases has been determined, and their properties are being related to evidence of biological functions.
A. Methods: Different formal definitions of local compositional complexity were used to identify low-complexity segments without bias, at different levels of stringency. Algorithms were refined to (a) select segments for further study, (b) filter out non-informative segments prior to database searches, and (c) discover and analyze regions in which compositional bias is present in periodically spaced rather than contiguous residues. New methods for automated classification and neighboring of low-complexity sequences have been developed.
B. Abundance and biological properties: Approximately 25% of the residues in protein databases are in compositionally biased segments (including some known long non-globular regions) and approximately 55% of proteins contain one or more such segments. Interspersed low-complexity sequences are particularly abundant in many eukaryotic proteins crucial in morphogenesis and embryonic development, RNA processing, transcriptional regulation, signal transduction and aspects of cellular and extracellular structural integrity. The limited structural information available for low-complexity regions of proteins indicates that they are generally non-globular and polymorphic or mobile.
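As a rough illustration of what a local compositional complexity measure and a pre-search filter might look like, the sketch below scores sliding windows by Shannon entropy and masks residues in low-scoring windows. This is not the project's own algorithm: the entropy measure, the function names, the window length of 12 and the threshold of 2.2 bits are all assumptions chosen for illustration only.

```python
import math
from collections import Counter


def shannon_entropy(window: str) -> float:
    """Shannon entropy (bits per residue) of the residue composition of a window."""
    counts = Counter(window)
    n = len(window)
    return -sum((c / n) * math.log2(c / n) for c in counts.values())


def low_complexity_mask(seq: str, window: int = 12, threshold: float = 2.2) -> list:
    """Flag every residue covered by at least one window at or below the threshold.

    `window` and `threshold` are illustrative values, not taken from the source.
    """
    mask = [False] * len(seq)
    for start in range(len(seq) - window + 1):
        if shannon_entropy(seq[start:start + window]) <= threshold:
            for i in range(start, start + window):
                mask[i] = True
    return mask


def mask_sequence(seq: str, window: int = 12, threshold: float = 2.2,
                  mask_char: str = "X") -> str:
    """Replace flagged residues with `mask_char`, as a filter before a database search."""
    flags = low_complexity_mask(seq, window, threshold)
    return "".join(mask_char if flagged else residue
                   for residue, flagged in zip(seq, flags))


if __name__ == "__main__":
    # Hypothetical sequence: a polyglutamine tract flanked by more varied residues.
    example = "ACDEFGHIKLMN" + "Q" * 12 + "ACDEFGHIKLMN"
    print(mask_sequence(example))
```

Raising the threshold makes the filter more aggressive and lowering it makes it more conservative, which loosely mirrors the idea of identifying segments at different levels of stringency. A contiguous-window score of this kind would not detect bias in periodically spaced residues, which needs a separate treatment.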
Significance of project: The project is highlighting the high abundance and biological importance of low-complexity protein segments. Knowledge of their molecular structure and dynamics is beginning to emerge in a few cases, but these are a minority. This is a priority area for future research. The methods recently developed to analyze nucleotide sequences are revealing many new and intricate compositional features. These methods are valuable in eliminating many artefacts in sequence database searches and alignment analysis.
Altschul, Stephen F; Wootton, John C; Gertz, E Michael et al. (2005) Protein database searches using compositionally adjusted substitution matrices. FEBS J 272:5101-9
Wan, Honghui; Li, Lugang; Federhen, Scott et al. (2003) Discovering simple regions in biological sequences associated with scoring schemes. J Comput Biol 10:171-85
Yu, Yi-Kuo; Wootton, John C; Altschul, Stephen F (2003) The compositional adjustment of amino acid substitution matrices. Proc Natl Acad Sci U S A 100:15688-93
Sonnhammer, E L; Wootton, J C (2001) Integrated graphical analysis of protein sequence features predicted from sequence composition. Proteins 45:262-73
Wan, H; Wootton, J C (2000) A global compositional complexity measure for biological sequences: AT-rich and GC-rich genomes encode less complex proteins. Comput Chem 24:71-94