This subproject is one of many research subprojects utilizing the resources provided by a Center grant funded by NIH/NCRR. The subproject and investigator (PI) may have received primary funding from another NIH source, and thus could be represented in other CRISP entries. The institution listed is for the Center, which is not necessarily the institution for the investigator.

In the past year we have developed an algorithm that identifies the residues in biological macromolecules that confer the necessary specificity of interaction on the members of a paralogous family of molecules that carry out the same function with or upon different and distinct partners or substrates. For example, each paralogous tRNA interacts selectively with only one of twenty different aminoacyl tRNA synthetases. Each paralogous serine protease cleaves peptide bonds involving only particular amino acids, and each paralogous heterotrimeric G binding protein binds only a specific receptor and activates only particular kinases or other members of specific signaling pathways. The algorithm we have developed identifies the sequence features that confer this specificity on individual molecules. In applying this analysis, we have discovered that it not only identifies sequence elements that confer the desired unique properties but also identifies ensembles of co-evolving sequence elements, which we describe below. We believe these co-evolving ensembles to be an important part of the mechanism by which biological macromolecules fine-tune their specificity of action.

The identification of sequence elements that confer specificity of action on biological macromolecules is achieved by dividing sequence residues into three categories, the second of which is the focus of our research:

 1. Highly conserved sequence residues essential to the structure and activity of the entire homologous family of macromolecules.
 2. Highly circumscribed sequence residues that maintain the specificity of the activity within the paralogous subfamilies.
 3. Sequence residues that may vary freely.

We assign residues to these three categories based on the amounts of two different kinds of entropy associated with each sequence residue in a multiple sequence alignment. The first is the family relative entropy, the entropy calculated at a particular position in the alignment over all of the sequences in the alignment (all the sequences in the family). The family relative entropy achieves its highest value when all of the sequences in an alignment have the same kind of residue at that position in the alignment and that kind of residue is rare compared to other possible residues. The family relative entropy is computed as:

    sum(i) {pi*log2(pi/qi)}

where pi is the fraction of residue type i in a particular position of the alignment and qi is the fraction of residue type i expected in random sequence. qi is usually taken as the fractions of residue types in an appropriate sequence database.

The second kind of entropy considered is the group cross entropy. The group cross entropy achieves its highest value when only a single kind of residue is found within the group and a different single kind of residue is found in the rest of the sequences in the alignment. It is computed as:

    sum(i) {(pi-qi)*log2(pi/qi)}

where pi is the fraction of residue type i in a particular position of the alignment for sequences in the predefined group and qi is the fraction of residue type i in a particular position of the alignment for sequences not in the predefined group. This form of the cross entropy is symmetric and hence usable as a distance measure in various clustering procedures.

Category 1 residues are those that have a high family relative entropy and a low group cross entropy for all groups. Category 2 residues are those that have a low family relative entropy and a high group cross entropy for at least one of the predefined groups.
Category 3 residues are those where both the family relative entropy and the group cross entropy are low. We generally define a high entropy score to be a normalized Z score of 3 or greater, although for some analyses a value as low as 2 can be useful. (The normalized Z score is the raw score minus the average score, divided by the standard deviation of the scores.) Note that the underlying entropy values are not normally distributed, and thus the Z scores should not be used for inferring statistical significance.

The analysis is not specific to either protein or nucleic acid sequences. This allowed me to check that the methods would work on a biologically important system where the answers were already known from extensive experimental work as well as previous analysis. I applied the new analysis to the same system of 67 tRNAs that I had analyzed earlier with the initial, simple counting models (McClain and Nicholas, 1987). The experimental work confirming the correctness of this earlier analysis is reviewed in McClain (1995). The new, information-based methods provided the same answers with a substantially improved signal-to-noise ratio (Nicholas, 1999). I presented this as an invited talk at the 'Emerging Sources of RNA Information' workshop in December of 1998. The improved signal-to-noise ratio will make it easier to select which answers to test by laboratory experiment.
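
As an illustration only, the two entropy scores and the three-way classification described above can be sketched in Python. This is a minimal sketch under stated assumptions, not the code used in the project. All function names are hypothetical. The small pseudocount that keeps the logarithms finite when a residue type is absent is an assumption, since the text does not say how zero fractions are handled. The use of the population standard deviation in the Z score is likewise an assumption. The group cross entropy is written here as sum(i) {(pi-qi)*log2(pi/qi)}, the sign convention under which the score is non-negative and largest when the group and the remaining sequences carry different residues.

```python
import math
from collections import Counter

PSEUDOCOUNT = 0.01  # assumption: smoothing for zero counts is not specified in the text


def frequencies(residues, alphabet, pseudocount=PSEUDOCOUNT):
    """Smoothed fraction of each residue type; symbols outside the alphabet (e.g. gaps) are ignored."""
    counts = Counter(r for r in residues if r in alphabet)
    total = sum(counts.values()) + pseudocount * len(alphabet)
    return {a: (counts[a] + pseudocount) / total for a in alphabet}


def family_relative_entropy(column, background):
    """sum(i) pi*log2(pi/qi) for one alignment column over the whole family.

    background maps each residue type to its expected fraction qi (all qi > 0).
    """
    p = frequencies(column, list(background))
    return sum(p[a] * math.log2(p[a] / q) for a, q in background.items())


def group_cross_entropy(column, in_group, alphabet):
    """Symmetric score sum(i) (pi-qi)*log2(pi/qi) for one column.

    in_group is a list of booleans, parallel to column, marking group members.
    """
    inside = [r for r, g in zip(column, in_group) if g]
    outside = [r for r, g in zip(column, in_group) if not g]
    p = frequencies(inside, alphabet)
    q = frequencies(outside, alphabet)
    return sum((p[a] - q[a]) * math.log2(p[a] / q[a]) for a in alphabet)


def z_scores(values):
    """(raw score - average) / standard deviation (population SD, an assumption)."""
    mean = sum(values) / len(values)
    sd = math.sqrt(sum((v - mean) ** 2 for v in values) / len(values))
    return [(v - mean) / sd if sd > 0 else 0.0 for v in values]


def categorize(family_z, group_zs, threshold=3.0):
    """Assign category 1, 2 or 3 to one position from its Z scores.

    group_zs holds one group cross entropy Z score per predefined group.
    Returns None when both scores are high, a case the three categories do not cover.
    """
    family_high = family_z >= threshold
    any_group_high = any(z >= threshold for z in group_zs)
    if family_high and not any_group_high:
        return 1
    if not family_high and any_group_high:
        return 2
    if not family_high and not any_group_high:
        return 3
    return None
```

In use, each score would be computed for every position of the alignment, the Z scores taken over all positions, and `categorize` applied position by position. Lowering `threshold` to 2 corresponds to the looser cutoff mentioned in the text.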