Precise knowledge of the subcellular localization of proteins is essential in systems biology research because most cellular processes are spatially constrained within the cell. This spatial context is needed to understand the roles of proteins involved in the intracellular cross-talk and cell signaling associated with disease pathways that span subcellular boundaries. Experimentally determined localizations are available for only about 1% of the proteins in the UniProt database. Computational methods can complement experimental efforts by predicting the localization of the many proteins whose localization is unknown. Existing computational methods have limited scope and applicability, and hence are not suitable for proteome-wide prediction of localizations. Moreover, the reliability of these predictions is questionable due to the lack of any experimental validation. In this project, we propose the development of a comprehensive system that will enable us to create accurate and comprehensive catalogs of the subcellular and suborganellar proteomes of all sequenced genomes of animal species. This system is based on our recently published computational method, ngLOC, which uses 'n-gram' peptides (fixed-length subsequences of proteins) to build accurate Bayesian models for classification into subcellular and suborganellar classes. Additionally, ngLOC is well suited to proteome-wide prediction and to predicting proteins localized to multiple organelles. Based on the ngLOC approach, we propose to develop a new method that uses advanced computational concepts such as semi-supervised learning, hierarchical Bayesian classification and ensemble approaches, and that implements substitution matrices to compare n-gram homology. All of these methods have proven successful in other domains and hence are expected to substantially improve the accuracy of our method.
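The n-gram idea described above can be illustrated with a minimal sketch. This is not the ngLOC implementation: it is a generic multinomial naive Bayes classifier over overlapping peptide n-grams, and the class name `NGramNaiveBayes`, the default n-gram length of 3, and the additive smoothing constant `alpha` are all assumptions made for illustration.

```python
import math
from collections import Counter, defaultdict


def ngrams(sequence, n):
    """Return every overlapping length-n subsequence of a protein sequence."""
    return [sequence[i:i + n] for i in range(len(sequence) - n + 1)]


class NGramNaiveBayes:
    """Toy multinomial naive Bayes over peptide n-grams (illustrative only)."""

    def __init__(self, n=3, alpha=1.0):
        self.n = n                          # n-gram length (assumed default)
        self.alpha = alpha                  # additive smoothing (assumed)
        self.counts = defaultdict(Counter)  # location -> n-gram counts
        self.class_sizes = Counter()        # location -> number of proteins
        self.vocabulary = set()             # all n-grams seen in training

    def fit(self, sequences, locations):
        """Count n-grams per localization class from labeled sequences."""
        for seq, loc in zip(sequences, locations):
            grams = ngrams(seq, self.n)
            self.counts[loc].update(grams)
            self.class_sizes[loc] += 1
            self.vocabulary.update(grams)
        return self

    def log_scores(self, sequence):
        """Return the smoothed log prior plus log likelihood for each class."""
        total_proteins = sum(self.class_sizes.values())
        vocab_size = len(self.vocabulary)
        scores = {}
        for loc, gram_counts in self.counts.items():
            total_grams = sum(gram_counts.values())
            denominator = total_grams + self.alpha * vocab_size
            score = math.log(self.class_sizes[loc] / total_proteins)
            for gram in ngrams(sequence, self.n):
                score += math.log((gram_counts[gram] + self.alpha) / denominator)
            scores[loc] = score
        return scores

    def predict(self, sequence):
        """Return the class with the highest log score."""
        scores = self.log_scores(sequence)
        return max(scores, key=scores.get)
```

A model would be trained with `NGramNaiveBayes().fit(sequences, locations)` on sequences with known localizations and queried with `predict(sequence)`. The hierarchical, semi-supervised, and substitution-matrix extensions proposed above are not represented in this sketch.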
A set of 400 human proteins whose localizations are predicted by our new method will be experimentally tested in human normal and cancer cell lines, using GFP-fusion and expression followed by visualization by confocal microscopy. This step will allow us to determine the prediction accuracy of our method at each score threshold for each organelle. Using optimal score thresholds, proteome-wide predictions will be carried out and detailed catalogs of experimentally known and predicted subcellular and suborganellar proteomes will be generated for all sequenced genomes of animal species. Additionally, a standalone software package for the improved method will be developed and released to the research community under the General Public License (GPL). A web server will be developed to make predictions online and to provide access to the cataloged data and to the software produced in this project. In summary, the proposed comprehensive system will deliver a 'gold-standard' dataset of experimentally established localizations, a novel methodology for prediction, experimental validation of predicted localizations, and a public web server for making predictions and accessing the datasets and the software tool developed in this project. These resources will be valuable to the biomedical research community in advancing the many facets of systems biology research.
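The per-organelle threshold analysis mentioned above could be tabulated along the following lines. This is only a sketch under stated assumptions: the record layout `(predicted_location, score, observed_location)`, the function names, and the rule of picking the lowest threshold that reaches a target accuracy are hypothetical and are not taken from the project.

```python
from collections import defaultdict


def accuracy_by_threshold(records, thresholds):
    """Tabulate accuracy per predicted location and score threshold.

    records: iterable of (predicted_location, score, observed_location).
    Returns {location: {threshold: (accuracy, n_predictions)}}, where accuracy
    is the fraction of predictions scoring >= threshold that matched the
    experimentally observed location, or None if no prediction passes.
    """
    by_location = defaultdict(list)
    for predicted, score, observed in records:
        by_location[predicted].append((score, predicted == observed))

    table = {}
    for location, scored in by_location.items():
        table[location] = {}
        for t in thresholds:
            kept = [correct for score, correct in scored if score >= t]
            accuracy = sum(kept) / len(kept) if kept else None
            table[location][t] = (accuracy, len(kept))
    return table


def pick_threshold(per_threshold, target_accuracy):
    """Return the lowest threshold whose accuracy meets the target, or None."""
    for t in sorted(per_threshold):
        accuracy, _ = per_threshold[t]
        if accuracy is not None and accuracy >= target_accuracy:
            return t
    return None
```

Choosing the lowest qualifying threshold keeps as many predictions as possible for each organelle while still meeting the accuracy target; other selection rules are equally plausible.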

Public Health Relevance

Proteins are synthesized in the cytoplasm of a cell, but are destined to localize into specific subcellular compartment(s) to carry out their intended functions. A number of human diseases are caused by mislocalization of proteins to unintended subcellular locations, resulting in functional interference with a vital cellular process. The current project proposes a comprehensive system that uses computational and experimental approaches to accurately determine the subcellular localization of proteins and to generate detailed catalogs of subcellular proteomes for all sequenced genomes of animal species. The outcomes of this project will help advance our understanding of protein localization and function and, consequently, of the causative factors for many human diseases.

National Institutes of Health (NIH)
National Institute of General Medical Sciences (NIGMS)
Research Project (R01)
Study Section
Biodata Management and Analysis Study Section (BDMA)
Program Officer
Hagan, Ann A
University of Nebraska Medical Center
Schools of Medicine
United States
Shen, Ru; Guda, Chittibabu (2014) Applied graph-mining algorithms to study biomolecular interaction networks. Biomed Res Int 2014:439476
Srinivasan, Satish M; Vural, Suleyman; King, Brian R et al. (2013) Mining for class-specific motifs in protein sequence classification. BMC Bioinformatics 14:96
Mohammed, Akram; Guda, Chittibabu (2011) Computational Approaches for Automated Classification of Enzyme Sequences. J Proteomics Bioinform 4:147-152