This subproject is one of many research subprojects utilizing the resources provided by a Center grant funded by NIH/NCRR. The subproject and investigator (PI) may have received primary funding from another NIH source, and thus could be represented in other CRISP entries. The institution listed is for the Center, which is not necessarily the institution for the investigator.

In the past year we have developed an algorithm that identifies the residues in biological macromolecules that confer specificity of interaction on the members of a paralogous family, that is, a family of molecules that carry out the same function with or upon different partners or substrates. For example, each paralogous tRNA interacts selectively with only one of twenty different aminoacyl-tRNA synthetases. Each paralogous serine protease cleaves peptide bonds involving only particular amino acids, and each paralogous heterotrimeric G binding protein binds only a specific receptor and activates only particular kinases or other members of specific signaling pathways. The algorithm identifies the sequence features that confer this specificity on individual molecules. In applying it, we have discovered that the analysis identifies not only sequence elements that confer the desired unique properties but also ensembles of co-evolving sequence elements, which we describe below. We believe these co-evolving ensembles to be an important part of the mechanism by which biological macromolecules fine-tune their specificity of action.

The identification of sequence elements that confer specificity of action on biological macromolecules is achieved by dividing sequence residues into three categories, the second of which is the focus of our research:
1. Highly conserved sequence residues essential to the structure and activity of the entire homologous family of macromolecules.
2. Highly circumscribed sequence residues that maintain the specificity of the activity within the paralogous subfamilies.
3. Sequence residues that may vary freely.

We assign residues to these three categories based on the amounts of two different kinds of entropy associated with each sequence residue in a multiple sequence alignment.

The first is the family relative entropy, calculated at a particular position in the alignment over all of the sequences in the alignment (that is, all the sequences in the family). The family relative entropy achieves its highest value when all of the sequences in an alignment have the same kind of residue at that position and that kind of residue is rare compared to other possible residues. The family relative entropy is computed as: sum(i) {pi*log2(pi/qi)} where pi is the fraction of residue type i in a particular position of the alignment and qi is the fraction of residue type i expected in random sequence. qi is usually taken as the fraction of each residue type in an appropriate sequence database.

The second kind of entropy considered is the group cross entropy. The group cross entropy achieves its highest value when only a single kind of residue is found within the group and a different single kind of residue is found in the rest of the sequences in the alignment. It is computed as: sum(i) {(pi-qi)*log2(pi/qi)} where pi is the fraction of residue type i in a particular position of the alignment for sequences in the predefined group and qi is the fraction of residue type i in that position for sequences not in the predefined group. This form of the cross entropy is symmetric and hence usable as a distance measure in various clustering procedures.

Category 1 residues are those that have a high family relative entropy and a low group cross entropy for all groups. Category 2 residues are those that have a low family relative entropy and a high group cross entropy for at least one of the predefined groups.
Category 3 residues are those where both the family relative entropy and the group cross entropy are low. We generally define a high entropy score as a normalized Z score of 3 or greater, although for some analyses a value as low as 2 can be useful. (The normalized Z score is the raw score minus the average score, divided by the standard deviation of the scores.) Note that the underlying entropy values are not normally distributed and thus the Z scores should not be used for inferring statistical significance.

The analysis is not specific to either protein or nucleic acid sequences. This allowed me to check that the methods work on a biologically important system where the answers were already known from extensive experimental work as well as previous analysis. I applied the new analysis to the same system of 67 tRNAs that I had analyzed earlier with the initial, simple counting models (McClain and Nicholas, 1987). The experimental work confirming the correctness of this earlier analysis is reviewed in McClain (1995). The new, information-based methods provided the same answers with a substantially improved signal-to-noise ratio (Nicholas, 1999). I presented this as an invited talk at the 'Emerging Sources of RNA Information' workshop in December of 1998. The improved signal-to-noise ratio will make it easier to select which answers to test by laboratory experiment.
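
The two entropy scores and the category assignment described above can be illustrated with a minimal Python sketch. It is not the implementation used in this work. It assumes an RNA alphabet, a uniform background for qi (the text notes that qi is usually taken from a sequence database), and a pseudocount added to every residue count so that log2 stays finite when a residue type is absent. All function names and the pseudocount value are hypothetical. The group cross entropy is written in the non-negative symmetric form, sum(i) {(pi-qi)*log2(pi/qi)}, so that larger values mean greater divergence.

```python
import math
from collections import Counter
from statistics import mean, pstdev

ALPHABET = "ACGU"   # assumption: RNA; substitute the 20 amino acids for proteins
PSEUDOCOUNT = 0.5   # assumption: avoids log2(0) for unobserved residue types


def column_fractions(column, alphabet=ALPHABET, pseudocount=PSEUDOCOUNT):
    """Fraction of each residue type in one alignment column (with pseudocounts)."""
    counts = Counter(c for c in column if c in alphabet)  # gaps are ignored
    total = sum(counts.values()) + pseudocount * len(alphabet)
    return {a: (counts[a] + pseudocount) / total for a in alphabet}


def family_relative_entropy(column, background):
    """sum(i) {pi*log2(pi/qi)}: pi from the whole column, qi from the background."""
    p = column_fractions(column)
    return sum(p[a] * math.log2(p[a] / background[a]) for a in p)


def group_cross_entropy(in_group, out_group):
    """sum(i) {(pi-qi)*log2(pi/qi)}: pi inside the group, qi in the remaining sequences."""
    p = column_fractions(in_group)
    q = column_fractions(out_group)
    return sum((p[a] - q[a]) * math.log2(p[a] / q[a]) for a in p)


def z_scores(values):
    """Normalized Z score: (raw score - average score) / standard deviation."""
    mu, sigma = mean(values), pstdev(values)
    return [(v - mu) / sigma if sigma > 0 else 0.0 for v in values]


def categorize(family_z, group_zs, threshold=3.0):
    """Map Z scores for one position to category 1, 2, or 3 as defined in the text."""
    high_family = family_z >= threshold
    high_group = any(z >= threshold for z in group_zs)
    if high_family and not high_group:
        return 1
    if high_group and not high_family:
        return 2
    if not high_family and not high_group:
        return 3
    return None  # both scores high: not covered by the three definitions


# Toy usage: one alignment position, split into a predefined group and the rest.
uniform_background = {a: 1 / len(ALPHABET) for a in ALPHABET}  # assumption
print(family_relative_entropy("AAAAAAAA", uniform_background))
print(group_cross_entropy("AAAA", "GGGG"))
```

In a real analysis the Z scores would be computed over all positions of the alignment before calling categorize, and the threshold could be lowered to 2 as noted above.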