The Human Genome Project and related genome projects have stirred great hopes for improving our understanding and treatment of diseases. Central to this process is the automated detection of functional motifs and the classification of protein sequences into families and/or subfamilies. Conventional approaches to protein sequence classification usually employ sequence alignment methods; other methods depend on the choice of features included in the training sets and on the accuracy and availability of data. We propose an alignment-independent classification approach based on a search engine technology that has been used successfully to classify medical records. Each protein is represented by a multidimensional vector whose elements refer to the protein's most discriminative n-grams (sequences of n amino acids). Preliminary studies on G protein-coupled receptors (GPCRs) showed that a simple Naive Bayes classifier, using straightforward n-gram feature selection in its preprocessing, can outperform existing classifiers, including support vector machines, on previously investigated, standardized GPCR sequence data subsets. Jackknife tests applied to the Protein Information Resource (PIR) Protein Sequence Database (PSD) and to the Pfam database (DB) of protein families showed that approximately 70% of the protein sequences are classified correctly. More significantly, the most discriminative n-grams in a given protein family appear to have a functional or structural role, as suggested by their comparison with the sequence motifs known to be conserved or active in existing DBs and by examination of the three-dimensional structure of representative members of the family.
Encouraged by these results, we propose to pursue the following specific aims: (1) develop a new computational tool for protein sequence analysis and protein classification based on n-gram distributions, (2) build a comprehensive DB of protein families based on n-gram distributions and investigate the relationships between this DB and the leading protein classification DBs, (3) determine the functional significance of the top-ranking n-grams, and (4) develop a Java-based toolkit that provides an easy-to-use yet flexible web interface for researchers from various backgrounds. The expected deliverables are the methodology and software for classification without alignment (CWA); a new database of classified proteins based on CWA; and an online server and GUI that will deliver the database and data mining tools to the scientific community in a user-friendly environment.
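
As a rough illustration of the alignment-free idea described in the abstract, the sketch below counts overlapping n-grams in each sequence and classifies with a multinomial Naive Bayes model. It is only a minimal sketch under stated assumptions, not the project's software: the toy sequences, the family labels `famA` and `famB`, the choice n = 3, the add-one smoothing, and the names `ngram_counts` and `NgramNaiveBayes` are all hypothetical, and the discriminative n-gram feature selection step mentioned in the abstract is omitted.

```python
import math
from collections import Counter, defaultdict


def ngram_counts(sequence, n=3):
    """Count overlapping n-grams (windows of n residues) in a sequence."""
    return Counter(sequence[i:i + n] for i in range(len(sequence) - n + 1))


class NgramNaiveBayes:
    """Multinomial Naive Bayes over n-gram counts (add-one smoothing assumed)."""

    def __init__(self, n=3):
        self.n = n
        self.family_counts = defaultdict(Counter)  # family -> pooled n-gram counts
        self.family_sizes = Counter()              # family -> number of training sequences
        self.vocabulary = set()                    # every n-gram seen in training

    def fit(self, sequences, families):
        for seq, fam in zip(sequences, families):
            counts = ngram_counts(seq, self.n)
            self.family_counts[fam].update(counts)
            self.family_sizes[fam] += 1
            self.vocabulary.update(counts)
        return self

    def log_scores(self, sequence):
        """Return log prior plus log likelihood of the sequence for each family."""
        total_seqs = sum(self.family_sizes.values())
        vocab_size = len(self.vocabulary)
        query = ngram_counts(sequence, self.n)
        scores = {}
        for fam, counts in self.family_counts.items():
            total = sum(counts.values())
            score = math.log(self.family_sizes[fam] / total_seqs)
            for gram, k in query.items():
                if gram not in self.vocabulary:
                    continue  # n-grams never seen in training carry no evidence
                score += k * math.log((counts[gram] + 1) / (total + vocab_size))
            scores[fam] = score
        return scores

    def predict(self, sequence):
        scores = self.log_scores(sequence)
        return max(scores, key=scores.get)


# Toy data: made-up sequences and family labels, for illustration only.
train_seqs = ["MKTAYIAKQR", "MKTAYLAKQR", "GAVLIPFWGA", "GAVLIPFYGA"]
train_fams = ["famA", "famA", "famB", "famB"]

model = NgramNaiveBayes(n=3).fit(train_seqs, train_fams)
print(model.predict("MKTAYIAKQQ"))
```

Because no alignment is computed, training and classification cost grows with sequence length and vocabulary size only; a fuller version would presumably rank n-grams by how well they discriminate between families and keep only the top-ranking ones as features.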

National Institutes of Health (NIH)
National Library of Medicine (NLM)
Research Project (R01)
Study Section: Biomedical Library and Informatics Review Committee (BLR)
Program Officer: Ye, Jane
University of Pittsburgh
Schools of Medicine
United States
Koes, David R; Vries, John K (2017) Evaluating amber force fields using computed NMR chemical shifts. Proteins 85:1944-1956
Koes, David R; Vries, John K (2017) Error assessment in molecular dynamics trajectories using computed NMR chemical shifts. Comput Theor Chem 1099:152-166
Dutta, Anindita; Krieger, James; Lee, Ji Young et al. (2015) Cooperative Dynamics of Intact AMPA and NMDA Glutamate Receptors: Similarities and Subfamily-Specific Differences. Structure 23:1692-1704
Dutta, Arpana; Altenbach, Christian; Mangahas, Sheryll et al. (2014) Differential dynamics of extracellular and cytoplasmic domains in denatured states of rhodopsin. Biochemistry 53:7160-9
Huang, Grace T; Cunningham, Kathryn I; Benos, Panayiotis V et al. (2013) Spectral clustering strategies for heterogeneous disease expression data. Pac Symp Biocomput :212-23
Coronnello, Claudia; Benos, Panayiotis V (2013) ComiR: Combinatorial microRNA target prediction tool. Nucleic Acids Res 41:W159-64
Zomot, Elia; Bahar, Ivet (2013) Intracellular gating in an inward-facing state of aspartate transporter Glt(Ph) is regulated by the movements of the helical hairpin HP2. J Biol Chem 288:8231-7
Schlattner, Uwe; Tokarska-Schlattner, Malgorzata; Ramirez, Sacnicte et al. (2013) Dual function of mitochondrial Nm23-H4 protein in phosphotransfer and intermembrane lipid transfer: a cardiolipin-dependent switch. J Biol Chem 288:111-21
Kshirsagar, Meghana; Carbonell, Jaime; Klein-Seetharaman, Judith (2013) Multitask learning for host-pathogen protein interactions. Bioinformatics 29:i217-26
Jain, Shilpa; Kapetanaki, Maria G; Raghavachari, Nalini et al. (2013) Expression of regulatory platelet microRNAs in patients with sickle cell disease. PLoS One 8:e60932

Showing the most recent 10 out of 77 publications