This subproject is one of many research subprojects utilizing the resources provided by a Center grant funded by NIH/NCRR. The subproject and investigator (PI) may have received primary funding from another NIH source, and thus could be represented in other CRISP entries. The institution listed is for the Center, which is not necessarily the institution for the investigator. The structure of a protein is often a key to its function. However, significant time and cost are required to determine the structure of a protein by experimental methods such as X-ray crystallography or nuclear magnetic resonance (NMR). There are currently fewer than 50,000 protein structures deposited in the Protein Data Bank (PDB), of which about 80% are redundant. In contrast, genomic sequencing efforts such as the Human Genome Project have populated protein sequence databases with well over 5 million sequences. As the gap between known sequences and experimentally determined structures widens, computational methods capable of predicting the structure and function of proteins will play an increasing role in protein annotation studies. The ultimate goal of the research described in this proposal is to develop a new protein sequence homology detection method that leverages the growing body of protein sequence data in ways that existing methods do not. The increased sensitivity in recognizing relationships between amino acid sequences will be achieved by applying intermediate sequence search strategies and profile-profile techniques. To date, progress in this area has been limited by the lack of the computational resources needed to perform the transitive profile-profile search. We propose to utilize the TeraGrid to develop and test the first intermediate profile-profile algorithm for detecting protein sequence similarities.
The algorithm constructs a sequence profile for the input amino acid sequence (the target) and uses it to transitively search the database of all representative profiles for sequences in nr. In the transitive search, the matches found in the first round of comparison are used as new queries against the database, and the process is repeated iteratively with each new set of matches. The similarity between the target profile and a profile from the database is thus established through the intermediate sequences. Our project will be carried out in two stages: 1. In the first stage we will generate the set of representative alignment profiles for sequences from the non-redundant protein sequence database nr. 2. In the second stage we will deploy and test our algorithm.
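The transitive search described above can be illustrated with a minimal sketch. This is not the proposed algorithm: the names `transitive_search`, `shared_kmers`, and `max_rounds` are hypothetical, plain strings stand in for profiles, and a shared k-mer count is assumed as a toy substitute for a real profile-profile comparison. It shows only the control flow, in which each round's matches become the next round's queries and every hit records the intermediates that link it to the target.

```python
from typing import Callable, Dict, List, TypeVar

P = TypeVar("P")  # placeholder for whatever object represents a profile


def transitive_search(
    target: P,
    database: Dict[str, P],
    is_match: Callable[[P, P], bool],
    max_rounds: int = 3,
) -> Dict[str, List[str]]:
    """Map each database entry reached from the target to the chain of
    intermediate entry ids linking it to the target (empty for direct hits)."""
    chains: Dict[str, List[str]] = {}
    # Each query is (profile, ids of the intermediates that led to it).
    queries = [(target, [])]
    for _ in range(max_rounds):
        next_queries = []
        for query, path in queries:
            for entry_id, profile in database.items():
                if entry_id in chains:
                    continue  # already reached by an earlier comparison
                if is_match(query, profile):
                    chains[entry_id] = path
                    # The new match becomes a query in the next round.
                    next_queries.append((profile, path + [entry_id]))
        if not next_queries:
            break  # no new matches, so nothing is left to expand
        queries = next_queries
    return chains


def shared_kmers(a: str, b: str, k: int = 3, min_shared: int = 2) -> bool:
    """Toy stand-in for profile-profile scoring (an assumption, not the
    proposal's method): match if the strings share enough k-mers."""
    kmers_a = {a[i:i + k] for i in range(len(a) - k + 1)}
    kmers_b = {b[i:i + k] for i in range(len(b) - k + 1)}
    return len(kmers_a & kmers_b) >= min_shared
```

In this sketch, an entry that shares nothing with the target can still be reached if both overlap with a common intermediate entry, which is the effect the intermediate sequence strategy relies on. The cost grows with the number of queries per round, which is consistent with the abstract's point that the transitive search is computationally demanding.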