The long-term goals of our research are: (a) to develop more sensitive and reliable methods for exploiting sequence and structure information through similarity searching; and (b) to better understand the biophysical constraints on protein folding that can be identified from protein sequence information. Although similarity searching is now routinely used to characterize sequences and annotate genomes, the most widely used methods focus on speed at the expense of sensitivity and statistical accuracy. We believe that more flexible algorithms, with more accurate statistical estimates, can provide new biological insights about the structure, function, and evolutionary history of protein and DNA sequences. Over the next five years, our specific aims are: (1) To improve the FASTA programs by providing better performance on parallel (Beowulf) clusters, using vector-parallel instruction sets, and providing more accurate statistics. (2) To develop evolutionarily calibrated DNA sequence comparison algorithms using rapid initial seeding, followed by extension using context-dependent scoring matrices. The goal is to develop heuristic approaches with well-understood evolutionary horizons. (3) To develop improved strategies for identifying repeated sequences in proteins by combining optimal local alignment strategies with appropriate scoring matrices and gap penalties. (4) To develop accurate statistical estimates for profile:sequence and profile:profile similarity searches. Profile:profile comparison programs with accurate statistical estimates should substantially reduce the sensitivity gap between sequence and structure comparison. Profile:profile comparisons will both be far more useful and allow us to explore fundamental questions about how easy it is for new protein families to emerge. (5) To examine local sequence constraints in proteins, using each family as an independent observation.
We believe that much of the literature on the global properties of protein sequences fails to distinguish between correlations that reflect genuine biophysical constraints and correlations that reflect shared evolutionary history. We will also search for clear examples of convergent evolution, that is, similar functions carried out by clearly non-homologous proteins. Accurate statistical estimates for searches with real protein sequences, and profiles from real protein families, can fundamentally change the inference of homology from statistically significant similarity. Because of inaccurate statistical estimates, similarity searching is often considered a tool for generating hypotheses about homology, which must be confirmed experimentally. When the statistical estimates are highly accurate, it may become possible to define homology in terms of statistically significant similarity.

Agency
National Institutes of Health (NIH)
Institute
National Library of Medicine (NLM)
Type
Research Project (R01)
Project #
5R01LM004969-19
Application #
7409662
Study Section
Biomedical Library and Informatics Review Committee (BLR)
Program Officer
Ye, Jane
Project Start
1988-08-01
Project End
2011-04-30
Budget Start
2008-05-01
Budget End
2009-04-30
Support Year
19
Fiscal Year
2008
Total Cost
$321,293
Indirect Cost
Name
University of Virginia
Department
Biochemistry
Type
Schools of Medicine
DUNS #
065391526
City
Charlottesville
State
VA
Country
United States
Zip Code
22904
Pearson, William R; Mackey, Aaron J (2017) Using SQL Databases for Sequence Similarity Searching and Analysis. Curr Protoc Bioinformatics 59:9.4.1-9.4.22
Pearson, William R (2016) Finding Protein and Nucleotide Similarities with FASTA. Curr Protoc Bioinformatics 53:3.9.1-25
Triant, Deborah A; Pearson, William R (2015) Most partial domains in proteins are alignment and annotation artifacts. Genome Biol 16:99
Pearson, William R (2013) An introduction to sequence similarity ("homology") searching. Curr Protoc Bioinformatics Chapter 3:Unit3.1
Pearson, William R (2013) Selecting the Right Similarity-Scoring Matrix. Curr Protoc Bioinformatics 43:3.5.1-9
Mills, Lauren J; Pearson, William R (2013) Adjusting scoring matrices to correct overextended alignments. Bioinformatics 29:3007-13
Holliday, Gemma L; Andreini, Claudia; Fischer, Julia D et al. (2012) MACiE: exploring the diversity of biochemical reactions. Nucleic Acids Res 40:D783-9
Li, Weizhong; McWilliam, Hamish; Goujon, Mickael et al. (2012) PSI-Search: iterative HOE-reduced profile SSEARCH searching. Bioinformatics 28:1650-1
Gonzalez, Mileidy W; Pearson, William R (2010) RefProtDom: a protein database with improved domain boundaries and homology relationships. Bioinformatics 26:2361-2
Sierk, Michael L; Smoot, Michael E; Bass, Ellen J et al. (2010) Improving pairwise sequence alignment accuracy using near-optimal protein sequence alignments. BMC Bioinformatics 11:146

Showing the most recent 10 out of 29 publications