This subproject is one of many research subprojects utilizing the resources provided by a Center grant funded by NIH/NCRR. The subproject and investigator (PI) may have received primary funding from another NIH source, and thus could be represented in other CRISP entries. The institution listed is for the Center, which is not necessarily the institution for the investigator.

Seven focus areas in the realm of protein structure have been identified for application of the language analogy approach. These focus areas are: protein folding, conformational changes, protein-protein interactions, protein/gene networks and pathways, prediction and segmentation of secondary structure and repetitive folds, protein family classification, and genome comparison. The ultimate goal is to develop linguistic models for each area that are capable of advancing our understanding of it.

The protocol followed in this process consists of several steps. The first step is to use existing 'benchmark' datasets, or to define new datasets, suitable for training and testing these models. In the second step, existing approaches in the focus areas, if available, are studied as controls, and a scheme is designed for evaluating the language model approaches and comparing them to the other existing approaches. The third step is to implement our language approach. This implementation initially needs to meet one or both of two requirements: (i) the system has to perform as well as or better than existing systems, as defined in step 2, and/or (ii) it needs to provide interpretable biological hypotheses. For example, a neural network might be the best-performing algorithm in a classification task, but the underlying features that produce this performance can be unclear. A language-based approach that has lower performance but allows the researcher to analyze the types of features that result in successful classification can be used to build hypotheses about the fundamental building blocks of protein sequence language. The final step in the protocol is to design and carry out experiments that specifically test these hypotheses.

The following systems have been chosen as experimental test cases for the language models: G protein-coupled receptors (GPCRs) such as rhodopsin and metabotropic glutamate receptors; epidermal growth factor receptor; viral tailspike protein; the virus infection process; and peptide n-grams. For each of the seven focus areas, we are working to identify or develop benchmark datasets for training and testing of linguistic models. Students and postdoctoral fellows participate in all aspects of the projects.
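To make the notion of peptide n-grams concrete, the following is a minimal sketch, assuming an n-gram is simply a window of n consecutive amino acid residues in a protein sequence, in direct analogy to word n-grams in text. The function name `peptide_ngrams` and the toy sequence are hypothetical illustrations and are not taken from the project's software or datasets.

```python
from collections import Counter


def peptide_ngrams(sequence: str, n: int = 3) -> Counter:
    """Count overlapping windows of n consecutive residues in a sequence."""
    sequence = sequence.upper()
    # Slide a window of width n along the sequence, one residue at a time.
    # Sequences shorter than n yield an empty counter.
    return Counter(sequence[i:i + n] for i in range(len(sequence) - n + 1))


# Hypothetical toy sequence, not drawn from any benchmark dataset.
toy_sequence = "MKTAYIAKQRMKTAY"
counts = peptide_ngrams(toy_sequence, n=3)

# List the most frequent 3-grams with their counts.
for ngram, count in counts.most_common(3):
    print(ngram, count)
```

Counts of this kind are directly inspectable, so a researcher can see which short residue patterns distinguish one set of sequences from another, which is the sort of interpretability that requirement (ii) of the protocol asks for.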