The long-term goal of our research is to develop better methods for identifying distantly related protein and DNA sequences, and to exploit our ability to detect distant homologies to explore the duplication, fusion, and other processes responsible for increases in protein diversity. Although similarity searching is now a routine first step in the characterization of newly determined sequences, we believe that additional improvements in similarity searching methods will allow investigators to look back more deeply in evolutionary time. Moreover, as complete protein sequence sets become available for more organisms, sequence information can be exploited more effectively for functional genomics and traditional biochemical problems. The availability of complete genome sequences, combined with reliable and sensitive sequence comparison algorithms, also allows us to test hypotheses about the possible emergence of novel proteins over the past 200-1,200 million years. Over the next five years, our specific aims are:
(1) To extend the average look-back time provided by protein sequence similarity searching. We propose improvements to the scoring methods and statistical analysis of similarity scores that seek to push back the protein similarity-search horizon 1.5- to 2-fold, to more than 2,000 million years for most protein families.
(2) To develop a higher performance, more flexible and user-friendly FASTA package.
(3) To study repeated domains in proteins. We will develop more quantitative methods for identifying both simple sequence and long-period repeats in proteins. We will characterize the fraction of repeat-containing proteins in proteomes, characterize the fraction of domain-structured proteins that are not internally repetitive, and ask whether these proteins duplicate or diverge with patterns that differ from "normal" single domain proteins.
(4) To explore genome-scale protein evolution and to identify potential "novel" or "young" protein families or domains. Over the next 2-4 years, more than six genomes that have diverged in the last 400 million years (an evolutionary distance sufficiently short that we should be able to identify all protein homologs) will become available. We will compare complete genomes, searching for newly emergent sequences.
(5) To develop and characterize unified methods for the simultaneous construction of alignments and phylogenies over multiple sequences. We will also develop standalone tree-based alignment heuristics capable of rapidly aligning large numbers of sequences.

Agency
National Institutes of Health (NIH)
Institute
National Library of Medicine (NLM)
Type
Research Project (R01)
Project #
2R01LM004969-12
Application #
6130119
Study Section
Biomedical Library and Informatics Review Committee (BLR)
Program Officer
Ye, Jane
Project Start
1988-08-01
Project End
2005-04-30
Budget Start
2000-05-01
Budget End
2001-04-30
Support Year
12
Fiscal Year
2000
Total Cost
$348,838
Indirect Cost
Name
University of Virginia
Department
Biochemistry
Type
Schools of Medicine
DUNS #
001910777
City
Charlottesville
State
VA
Country
United States
Zip Code
22904
Pearson, William R; Mackey, Aaron J (2017) Using SQL Databases for Sequence Similarity Searching and Analysis. Curr Protoc Bioinformatics 59:9.4.1-9.4.22
Pearson, William R (2016) Finding Protein and Nucleotide Similarities with FASTA. Curr Protoc Bioinformatics 53:3.9.1-25
Triant, Deborah A; Pearson, William R (2015) Most partial domains in proteins are alignment and annotation artifacts. Genome Biol 16:99
Pearson, William R (2013) An introduction to sequence similarity ("homology") searching. Curr Protoc Bioinformatics Chapter 3:Unit3.1
Pearson, William R (2013) Selecting the Right Similarity-Scoring Matrix. Curr Protoc Bioinformatics 43:3.5.1-9
Mills, Lauren J; Pearson, William R (2013) Adjusting scoring matrices to correct overextended alignments. Bioinformatics 29:3007-13
Li, Weizhong; McWilliam, Hamish; Goujon, Mickael et al. (2012) PSI-Search: iterative HOE-reduced profile SSEARCH searching. Bioinformatics 28:1650-1
Holliday, Gemma L; Andreini, Claudia; Fischer, Julia D et al. (2012) MACiE: exploring the diversity of biochemical reactions. Nucleic Acids Res 40:D783-9
Gonzalez, Mileidy W; Pearson, William R (2010) RefProtDom: a protein database with improved domain boundaries and homology relationships. Bioinformatics 26:2361-2
Sierk, Michael L; Smoot, Michael E; Bass, Ellen J et al. (2010) Improving pairwise sequence alignment accuracy using near-optimal protein sequence alignments. BMC Bioinformatics 11:146

Showing the most recent 10 out of 29 publications