The development of rapid methods for molecular cloning, DNA sequencing, and protein and DNA sequence comparison has revolutionized the practice of molecular biology. Newly determined sequences are routinely compared against large sequence databases, and increasingly, inferences about structure are based on sequence similarity. During the last grant period, we improved the sensitivity of the FASTA algorithm and implemented a general platform for protein and DNA sequence comparison on Intel hypercube parallel computers. With improvements in comparison algorithms and computer hardware, time, or computational expense, is no longer a significant factor in protein sequence comparison. As a result, we propose to shift our emphasis from improving the speed of protein sequence comparison to improving the quality of the comparison, by examining approaches to improve the sensitivity, selectivity, or amount of information that can be inferred from a sequence similarity score. To improve the quality of sequence comparison, we will consider corrections for pairwise similarity scores that may provide greater selectivity. These corrections will be based on empirical measurements of the distribution of protein similarity scores obtained from large-scale inter-library comparisons using the hypercube computer. In addition, we will develop a new method for classifying members of protein sequence superfamilies, the "club" algorithm. We will also examine the use of the hypercube parallel computer for simultaneously constructing multiple alignments and evolutionary trees using an algorithm developed by Sankoff (1973). A second multiple alignment approach will also be developed further to provide a general platform for heuristic alignment that can use a variety of functions for measuring the quality of an alignment.
As sequence comparison becomes more routine and sequence databases grow, more investigators are tempted to infer structural similarity from sequence similarity. The basis for such an inference is very weak. We propose to examine the hypothesis that some local protein sequence similarities are due to common tertiary structure rather than common ancestry by comparing the sequences in the protein crystal-structure database and examining sequence alignments with high similarity scores in the absence of known homology. These sequences will then be compared at the structural level to determine whether structural similarity can be detected from sequence similarity in the absence of common ancestry. We also plan to examine methods for aligning and finding local similarities in very long DNA sequences (>200,000 nt). Some of the methods used in the FASTA and LFASTA programs can be applied to this problem, but more sophisticated management of similar regions is required than is currently provided. In DNA sequence comparison (in contrast to protein sequence comparison), speed is still of paramount importance, and the LFASTA approach may be able to speed up comparisons by several orders of magnitude.
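As an illustration of correcting pairwise similarity scores against an empirical score distribution, the sketch below scores a query against a small library and rescales each raw score by the library mean and standard deviation. It is a minimal sketch and not the project's method: the Smith-Waterman-style scoring function, the match, mismatch, and gap values, and the names `local_score` and `z_scores` are all assumptions chosen for illustration.

```python
import statistics


def local_score(a, b, match=2, mismatch=-1, gap=-2):
    """Best local alignment score (Smith-Waterman style, linear gap penalty).

    The scoring parameters are illustrative assumptions, not values from the abstract.
    """
    prev = [0] * (len(b) + 1)  # previous row of the dynamic-programming matrix
    best = 0
    for ca in a:
        cur = [0]
        for j, cb in enumerate(b, start=1):
            diag = prev[j - 1] + (match if ca == cb else mismatch)
            # A local alignment may restart anywhere, so cells never drop below 0.
            cur.append(max(0, diag, prev[j] + gap, cur[j - 1] + gap))
            best = max(best, cur[j])
        prev = cur
    return best


def z_scores(query, library):
    """Rescale raw scores by the empirical mean and standard deviation of the library."""
    scores = [local_score(query, seq) for seq in library]
    mean = statistics.mean(scores)
    sd = statistics.pstdev(scores)
    if sd == 0:
        return [0.0] * len(scores)
    return [(s - mean) / sd for s in scores]


if __name__ == "__main__":
    library = ["ACGTACGT", "TTTTTTTT", "GGGGCCCC", "ACGTTTTT"]
    for seq, z in zip(library, z_scores("ACGTACGT", library)):
        print(seq, round(z, 2))
```

A real correction would be estimated from a much larger library and would likely also account for sequence length and composition; the rescaling here only shows the general idea of judging a score relative to the scores of unrelated sequences.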

Agency: National Institutes of Health (NIH)
Institute: National Library of Medicine (NLM)
Type: Research Project (R01)
Project #: 5R01LM004969-05
Application #: 3374092
Study Section: Biomedical Library and Informatics Review Committee (BLR)
Project Start: 1988-08-01
Project End: 1995-07-31
Budget Start: 1992-08-01
Budget End: 1993-07-31
Support Year: 5
Fiscal Year: 1992
Total Cost:
Indirect Cost:
Name: University of Virginia
Department:
Type: Schools of Medicine
DUNS #: 001910777
City: Charlottesville
State: VA
Country: United States
Zip Code: 22904
Pearson, William R; Mackey, Aaron J (2017) Using SQL Databases for Sequence Similarity Searching and Analysis. Curr Protoc Bioinformatics 59:9.4.1-9.4.22
Pearson, William R (2016) Finding Protein and Nucleotide Similarities with FASTA. Curr Protoc Bioinformatics 53:3.9.1-25
Triant, Deborah A; Pearson, William R (2015) Most partial domains in proteins are alignment and annotation artifacts. Genome Biol 16:99
Pearson, William R (2013) An introduction to sequence similarity ("homology") searching. Curr Protoc Bioinformatics Chapter 3:Unit3.1
Pearson, William R (2013) Selecting the Right Similarity-Scoring Matrix. Curr Protoc Bioinformatics 43:3.5.1-9
Mills, Lauren J; Pearson, William R (2013) Adjusting scoring matrices to correct overextended alignments. Bioinformatics 29:3007-13
Li, Weizhong; McWilliam, Hamish; Goujon, Mickael et al. (2012) PSI-Search: iterative HOE-reduced profile SSEARCH searching. Bioinformatics 28:1650-1
Holliday, Gemma L; Andreini, Claudia; Fischer, Julia D et al. (2012) MACiE: exploring the diversity of biochemical reactions. Nucleic Acids Res 40:D783-9
Gonzalez, Mileidy W; Pearson, William R (2010) RefProtDom: a protein database with improved domain boundaries and homology relationships. Bioinformatics 26:2361-2
Sierk, Michael L; Smoot, Michael E; Bass, Ellen J et al. (2010) Improving pairwise sequence alignment accuracy using near-optimal protein sequence alignments. BMC Bioinformatics 11:146

Showing the most recent 10 out of 29 publications