The Human Genome Project and related genome projects have stirred great hopes for improving our understanding and treatment of diseases. Central to this process is the automated detection of functional motifs and the classification of protein sequences into families and/or subfamilies. Conventional approaches to protein sequence classification usually employ sequence alignment methods; other methods depend on the choice of features included in the training sets, and on the accuracy and availability of data. We propose an alignment-independent classification approach based on a search engine technology that has been used successfully to classify medical records. Each protein is represented by a multidimensional vector, the elements of which refer to the protein's most discriminative n-grams (sequences of n amino acids). Preliminary studies on G protein-coupled receptors (GPCRs) showed that a simple Naive Bayes classifier using straightforward n-gram feature selection in its preprocessing can outperform existing classifiers, including support vector machines, on previously investigated, standardized GPCR sequence data subsets. Jackknife tests applied to the Protein Information Resource (PIR) Protein Sequence Database (PSD) and to the Pfam database (DB) of protein families showed that approximately 70% of the protein sequences are classified correctly. More significantly, the most discriminative n-grams in a given protein family appear to have a functional or structural role, as suggested by their comparison with the sequence motifs known to be conserved or active in existing DBs and by the examination of the three-dimensional structure of representative members of the family.
Encouraged by these results, we propose to pursue the following specific aims: (1) develop a new computational tool for protein sequence analysis and protein classification based on n-gram distributions, (2) build a comprehensive DB of protein families based on n-gram distributions and investigate the relationships between this DB and the leading protein classification DBs, (3) determine the functional significance of the top-ranking n-grams, and (4) develop a Java-based toolkit that will provide an easy-to-use yet flexible web interface to researchers from various backgrounds. The expected deliverables are the methodology and software for classification without alignment (CWA); a new database of classified proteins, based on CWA; and an online server and GUI that will deliver the database and data mining tools to the scientific community in a user-friendly environment.
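The general idea of classifying proteins from short amino acid subsequences without alignment can be illustrated with a minimal sketch. It is not the project's software: the class name `NgramNaiveBayes`, the helper `ngrams`, the family labels, and the toy sequences are all hypothetical, the subsequence length is fixed at 3 for illustration, and the feature selection step that keeps only the most discriminative subsequences is omitted. The sketch uses a multinomial Naive Bayes model with add-one smoothing, which is one common variant and only an assumption here.

```python
import math
from collections import Counter, defaultdict


def ngrams(sequence, n=3):
    """Return all overlapping length-n substrings of a sequence."""
    return [sequence[i:i + n] for i in range(len(sequence) - n + 1)]


class NgramNaiveBayes:
    """Multinomial Naive Bayes over substring counts (add-one smoothing).

    Hypothetical illustration only; no feature selection is performed.
    """

    def __init__(self, n=3):
        self.n = n

    def fit(self, sequences, labels):
        self.class_counts = Counter(labels)       # sequences per family
        self.gram_counts = defaultdict(Counter)   # substring counts per family
        for seq, label in zip(sequences, labels):
            self.gram_counts[label].update(ngrams(seq, self.n))
        self.vocab = set()
        for counts in self.gram_counts.values():
            self.vocab.update(counts)
        return self

    def predict(self, sequence):
        total = sum(self.class_counts.values())
        vocab_size = len(self.vocab)
        best_label, best_score = None, -math.inf
        for label, n_seqs in self.class_counts.items():
            counts = self.gram_counts[label]
            denom = sum(counts.values()) + vocab_size
            score = math.log(n_seqs / total)      # log prior of the family
            for gram in ngrams(sequence, self.n):
                if gram in self.vocab:            # skip substrings unseen in training
                    score += math.log((counts[gram] + 1) / denom)
            if score > best_score:
                best_label, best_score = label, score
        return best_label


# Toy, made-up sequences and family labels (not real proteins).
train_seqs = ["MKTAYIAKQR", "MKTAYLAKQR", "GGSWWPLLGG", "GGSWWPLVGG"]
train_labels = ["famA", "famA", "famB", "famB"]

model = NgramNaiveBayes(n=3).fit(train_seqs, train_labels)
prediction_a = model.predict("MKTAYIAKQK")
prediction_b = model.predict("GGSWWPLLGA")
```

A jackknife-style evaluation, as mentioned in the abstract, would refit such a model with one sequence held out at a time and predict the held-out sequence; that loop is left out to keep the sketch short.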