Alignment-independent Classification of Proteins

Bahar, Ivet

Abstract

The Human Genome Project and related genome projects have stirred great hopes for improving our understanding and treatment of diseases. Central to this process is the automated detection of functional motifs and the classification of protein sequences into families and/or subfamilies. Conventional approaches to protein sequence classification usually employ sequence alignment methods; other methods depend on the choice of the features included in the training sets, and on the accuracy and availability of data. We propose an alignment-independent classification approach based on a search engine technology that has been successfully used in classifying medical records. Each protein is represented by a multidimensional vector, the elements of which refer to the protein's most discriminative eta-grams (sequences of eta amino acids). Preliminary studies on G protein-coupled receptors (GPCRs) showed that a simple Naive Bayes classifier using straightforward eta-gram feature selection in its preprocessing can outperform existing classifiers, including support vector machines, on previously investigated, standardized GPCR sequence data subsets. Jackknife tests applied to the Protein Information Resource (PIR) Protein Sequence Database (PSD) and to the Pfam database (DB) of protein families showed that approximately 70% of the protein sequences are classified correctly. More significantly, the most discriminative eta-grams in a given protein family appear to have a functional or structural role, as suggested by their comparison with the sequence motifs known to be conserved or active in existing DBs and by the examination of the three-dimensional structure of representative members of the family.
Encouraged by these results, we propose to pursue the following specific aims: (1) develop a new computational tool for protein sequence analysis and protein classification based on eta-gram distributions, (2) build a comprehensive DB of protein families based on eta-gram distributions and investigate the relationships between this DB and the leading protein classification DBs, (3) determine the functional significance of the top-ranking eta-grams, and (4) develop a Java-based toolkit that will provide an easy-to-use yet flexible web interface to researchers from various backgrounds. The expected deliverables are the methodology and software for classification without alignment (CWA); a new database of classified proteins, based on CWA; and an on-line server and GUI that will deliver the database and data mining tools to the scientific community in a user-friendly environment.
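
To make the idea in the abstract concrete, the following is a minimal Python sketch of alignment-free classification: each sequence is reduced to counts of its overlapping eta-grams, and a multinomial Naive Bayes model with add-one smoothing picks the most likely family. It is an illustration under stated assumptions, not the project's CWA software. The names `ngrams`, `NgramNaiveBayes`, and `leave_one_out_accuracy` are hypothetical, the default length of 3 is an arbitrary choice, the step that selects only the most discriminative eta-grams is omitted, and the jackknife test is assumed here to mean leave-one-out evaluation.

```python
import math
from collections import Counter


def ngrams(seq, n=3):
    """Return all overlapping substrings of length n in a sequence."""
    return [seq[i:i + n] for i in range(len(seq) - n + 1)]


class NgramNaiveBayes:
    """Multinomial Naive Bayes over n-gram counts (hypothetical helper).

    Assumption: every n-gram is used as a feature; the abstract's
    'most discriminative' feature selection is not implemented here.
    """

    def __init__(self, n=3):
        self.n = n

    def fit(self, sequences, labels):
        self.class_counts = Counter(labels)
        # One n-gram counter per family label.
        self.gram_counts = {label: Counter() for label in self.class_counts}
        for seq, label in zip(sequences, labels):
            self.gram_counts[label].update(ngrams(seq, self.n))
        self.vocab = set()
        for counts in self.gram_counts.values():
            self.vocab.update(counts)
        self.totals = {lab: sum(c.values()) for lab, c in self.gram_counts.items()}
        return self

    def predict(self, seq):
        n_train = sum(self.class_counts.values())
        vocab_size = len(self.vocab)
        best_label, best_score = None, -math.inf
        for label, class_count in self.class_counts.items():
            # Log prior plus smoothed log likelihood of each known n-gram.
            score = math.log(class_count / n_train)
            for gram in ngrams(seq, self.n):
                if gram not in self.vocab:
                    continue  # skip n-grams never seen in training
                count = self.gram_counts[label][gram]
                score += math.log((count + 1) / (self.totals[label] + vocab_size))
            if score > best_score:
                best_label, best_score = label, score
        return best_label


def leave_one_out_accuracy(sequences, labels, n=3):
    """Hold out each sequence in turn, train on the rest, and score it."""
    correct = 0
    for i in range(len(sequences)):
        train_seqs = sequences[:i] + sequences[i + 1:]
        train_labels = labels[:i] + labels[i + 1:]
        model = NgramNaiveBayes(n).fit(train_seqs, train_labels)
        correct += model.predict(sequences[i]) == labels[i]
    return correct / len(sequences)
```

Called as `NgramNaiveBayes(3).fit(sequences, labels).predict(query)`, the sketch needs only two parallel lists: sequence strings and their family labels. No alignment is computed at any point; the only sequence operation is sliding a fixed-length window, which is what makes this style of classifier cheap to apply to large databases.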

Funding Agency

Agency: National Institutes of Health (NIH)
Institute: National Library of Medicine (NLM)
Type: Research Project (R01)
Project #: 5R01LM007994-03
Application #: 7050072
Study Section: Biomedical Library and Informatics Review Committee (BLR)
Program Officer: Ye, Jane

Project Start: 2004-05-01
Project End: 2008-04-30
Budget Start: 2006-05-01
Budget End: 2007-04-30
Support Year: 3
Fiscal Year: 2006
Total Cost: $272,070
Indirect Cost

Institution

Name: University of Pittsburgh
Department: Genetics
Type: Schools of Medicine
DUNS #: 004514360

City: Pittsburgh
State: PA
Country: United States
Zip Code: 15213

Related projects


NIH 2011 R01 LM	Bridging Sequence Patterns and Structural Dynamics Bahar, Ivet / University of Pittsburgh	$463,922
NIH 2010 R01 LM	Bridging Sequence Patterns and Structural Dynamics Bahar, Ivet / University of Pittsburgh	$473,694
NIH 2009 R01 LM	Bridging Sequence Patterns and Structural Dynamics Bahar, Ivet / University of Pittsburgh	$471,102
NIH 2008 R01 LM	Bridging Sequence Patterns and Structural Dynamics Bahar, Ivet / University of Pittsburgh	$457,754
NIH 2007 R01 LM	Alignment-independent Classification of Proteins Bahar, Ivet / University of Pittsburgh	$264,180
NIH 2006 R01 LM	Alignment-independent Classification of Proteins Bahar, Ivet / University of Pittsburgh	$272,070
NIH 2005 R01 LM	Alignment-independent Classification of Proteins Bahar, Ivet / University of Pittsburgh	$278,617
NIH 2004 R01 LM	Alignment-independent Classification of Proteins Bahar, Ivet / University of Pittsburgh	$278,711

Publications

Koes, David R; Vries, John K (2017) Evaluating amber force fields using computed NMR chemical shifts. Proteins 85:1944-1956

Koes, David R; Vries, John K (2017) Error assessment in molecular dynamics trajectories using computed NMR chemical shifts. Comput Theor Chem 1099:152-166

Dutta, Anindita; Krieger, James; Lee, Ji Young et al. (2015) Cooperative Dynamics of Intact AMPA and NMDA Glutamate Receptors: Similarities and Subfamily-Specific Differences. Structure 23:1692-1704

Dutta, Arpana; Altenbach, Christian; Mangahas, Sheryll et al. (2014) Differential dynamics of extracellular and cytoplasmic domains in denatured States of rhodopsin. Biochemistry 53:7160-9

Huang, Grace T; Cunningham, Kathryn I; Benos, Panayiotis V et al. (2013) Spectral clustering strategies for heterogeneous disease expression data. Pac Symp Biocomput :212-23

Coronnello, Claudia; Benos, Panayiotis V (2013) ComiR: Combinatorial microRNA target prediction tool. Nucleic Acids Res 41:W159-64

Zomot, Elia; Bahar, Ivet (2013) Intracellular gating in an inward-facing state of aspartate transporter Glt(Ph) is regulated by the movements of the helical hairpin HP2. J Biol Chem 288:8231-7

Schlattner, Uwe; Tokarska-Schlattner, Malgorzata; Ramirez, Sacnicte et al. (2013) Dual function of mitochondrial Nm23-H4 protein in phosphotransfer and intermembrane lipid transfer: a cardiolipin-dependent switch. J Biol Chem 288:111-21

Kshirsagar, Meghana; Carbonell, Jaime; Klein-Seetharaman, Judith (2013) Multitask learning for host-pathogen protein interactions. Bioinformatics 29:i217-26

Jain, Shilpa; Kapetanaki, Maria G; Raghavachari, Nalini et al. (2013) Expression of regulatory platelet microRNAs in patients with sickle cell disease. PLoS One 8:e60932

Showing the most recent 10 out of 77 publications
