Identification of protein families and the sequence motifs typical of them is increasingly important as sequence information from the Human Genome Project becomes available. Our work in this area centers on the profile analysis technique, a method for describing sequence patterns as weight matrices and matching sequences to them using dynamic programming. We have developed a new technique for calculating sequence-based profiles that we call "evolutionary profiles." Briefly, this method fits an explicit evolutionary model to each aligned position in a group of aligned sequences, determining the best evolutionary distance for each of the twenty amino acid residues. A finite mixture model is then calculated in which each of the twenty possible ancestral residues is weighted by its probability of giving rise to the observed distribution at the given evolutionary distance. Preliminary testing has shown this method to be superior to the earlier "average profile" method, as determined by cross-validation using the receiver-operating characteristic as a metric. Work has continued on extension of the PCOMPLIB code on the Intel Paragon. The current version supports both sequence and profile comparisons with user-selectable scoring systems and gap penalties. Further work on the length normalization function has resulted in an improved system for calculating the significance of comparisons, based on a model of the comparison as a Gaussian extreme value process. We have also been able to increase the performance of the code by a factor of three using a hardware-specific optimization developed by SDSC Senior Fellow Dr. Larry Carter. In the next year we will continue to work in these two main areas. We will complete the testing and validation of the evolutionary profile methods and make this code and, more importantly, a web-based server available to the community.
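The per-position mixture calculation described above can be illustrated with a minimal Python sketch. It is not the project's implementation: the substitution model here is a simple equal-input model with a uniform background, chosen only to keep the example self-contained, and the names `transition_prob`, `best_distance`, and `evolutionary_profile_column`, the grid search over distances, and the log-odds scoring are all assumptions made for illustration.

```python
import math

AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"
# Assumption: uniform background frequencies (real profiles use observed ones).
BACKGROUND = {aa: 1.0 / len(AMINO_ACIDS) for aa in AMINO_ACIDS}


def transition_prob(a, b, t):
    """Illustrative equal-input model: the ancestor a is kept with probability
    exp(-t); otherwise the residue is redrawn from the background."""
    decay = math.exp(-t)
    return decay * (1.0 if a == b else 0.0) + (1.0 - decay) * BACKGROUND[b]


def log_likelihood(ancestor, counts, t):
    """Log-probability of the observed residue counts given one ancestor."""
    return sum(n * math.log(transition_prob(ancestor, b, t))
               for b, n in counts.items())


def best_distance(ancestor, counts, grid):
    """Grid search for the distance that best explains the column."""
    return max(grid, key=lambda t: log_likelihood(ancestor, counts, t))


def evolutionary_profile_column(column, grid=None):
    """Return (mixture weights, log-odds scores) for one aligned column."""
    if grid is None:
        grid = [0.05 * k for k in range(1, 101)]  # distances 0.05 .. 5.0
    counts = {}
    for residue in column:
        if residue in BACKGROUND:  # skip gaps and unknown characters
            counts[residue] = counts.get(residue, 0) + 1

    # Fit the best distance for each of the twenty candidate ancestors.
    fits = {}
    for a in AMINO_ACIDS:
        t = best_distance(a, counts, grid)
        fits[a] = (t, log_likelihood(a, counts, t))

    # Weight each ancestor by prior * likelihood, normalized to sum to 1.
    max_ll = max(ll for _, ll in fits.values())
    raw = {a: BACKGROUND[a] * math.exp(ll - max_ll)
           for a, (_, ll) in fits.items()}
    total = sum(raw.values())
    weights = {a: w / total for a, w in raw.items()}

    # Profile score: log-odds of the mixture prediction against background.
    scores = {}
    for b in AMINO_ACIDS:
        p = sum(weights[a] * transition_prob(a, b, fits[a][0])
                for a in AMINO_ACIDS)
        scores[b] = math.log(p / BACKGROUND[b])
    return weights, scores


weights, scores = evolutionary_profile_column("LLLLIV")
```

For a column dominated by leucine, nearly all of the mixture weight falls on the leucine ancestor, and the resulting column scores leucine above background and unobserved residues below it. A realistic version would replace the equal-input model with an empirical amino acid substitution model.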
We will begin to extend this work to homology-based modeling of proteins, in which it can be used to combine sequence and structural information, and we will investigate the usefulness of applying "evolutionary scaling" to predict the pattern of conserved residues in a protein over greater evolutionary distances. We will also continue to work with the PCOMPLIB code, making public versions of this code available by summer 1996. We also expect to have an MPI-compliant version of this code running on the CRAY T3D within the next several months. In new work, we will begin to investigate the possibility of integrating the sequence motifs learned by the MEME program (see below) with the profile comparison code. Gribskov, M., and Veretnik, S. "Identification of Sequence Patterns with Profile Analysis," Methods in Enzymology, in press.