Precise knowledge of the subcellular localization of proteins is important in systems biology research because most cellular processes are spatially constrained within the cell. This spatial context is essential for understanding the roles of proteins involved in the intracellular cross-talk and cell signaling associated with disease pathways that span subcellular boundaries. Experimentally determined localizations are available for only about 1% of the proteins in the UniProt database. Computational methods can complement experimental efforts by predicting the localization of the many proteins whose localization is unknown. Existing computational methods have limited scope and applicability, and hence are not suitable for proteome-wide prediction of localizations. Moreover, the reliability of their predictions is questionable because they lack experimental validation. In this project, we propose to develop a comprehensive system that will enable us to create accurate and comprehensive catalogs of the subcellular and suborganellar proteomes of all sequenced genomes of animal species. This system is based on our recently published computational method, ngLOC, which uses 'n-gram' peptides (fixed-length subsequences of proteins) to build accurate Bayesian models for classification into subcellular and suborganellar classes. Additionally, ngLOC is well suited to proteome-wide prediction and to predicting proteins localized to multiple organelles. Building on the ngLOC approach, we propose to develop a new method that uses advanced computational concepts such as semi-supervised learning, hierarchical Bayesian classification, and ensemble approaches, and that implements substitution matrices to compare n-gram homology. All of these methods have proven successful in other domains and hence are expected to substantially improve the accuracy of our method.
A set of 400 human proteins whose localizations are predicted by our new method will be experimentally tested in normal and cancer human cell lines, using GFP fusion and expression followed by visualization under a confocal microscope. This step will allow us to determine the prediction accuracy of our method at each score threshold for each organelle. Using optimal score thresholds, proteome-wide predictions will be carried out, and detailed catalogs of experimentally known and predicted subcellular and suborganellar proteomes will be generated for all sequenced genomes of animal species. Additionally, a standalone software package for the improved method will be developed and released to the research community under the General Public License (GPL). A web server will be developed to run predictions online and to provide access to the cataloged data and the software produced in this project. In summary, the proposed comprehensive system will deliver a 'gold-standard' dataset of experimentally established localizations, a novel methodology for prediction, experimental validation of predicted localizations, and a public web server for running predictions and for accessing the datasets and the software tool developed in this project. These resources will be valuable to the biomedical research community in advancing the many facets of systems biology research.
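As a rough illustration of the n-gram idea described above, the following is a minimal sketch of a naive Bayes classifier over fixed-length peptide subsequences. It is not the ngLOC implementation: the class name `NgramNaiveBayes`, the parameters `n` and `alpha`, the Laplace smoothing scheme, and the toy sequences and labels are all assumptions made for illustration only.

```python
import math
from collections import Counter, defaultdict


def ngrams(seq, n):
    """Return all overlapping fixed-length subsequences (n-grams) of a sequence."""
    return [seq[i:i + n] for i in range(len(seq) - n + 1)]


class NgramNaiveBayes:
    """Hypothetical n-gram naive Bayes classifier (illustrative, not ngLOC itself)."""

    def __init__(self, n=3, alpha=1.0):
        self.n = n                            # n-gram length (assumed value)
        self.alpha = alpha                    # Laplace smoothing constant (assumed)
        self.counts = defaultdict(Counter)    # class -> n-gram counts
        self.class_totals = Counter()         # class -> number of training sequences
        self.vocab = set()                    # all n-grams seen in training

    def fit(self, sequences, labels):
        for seq, label in zip(sequences, labels):
            grams = ngrams(seq, self.n)
            self.counts[label].update(grams)
            self.class_totals[label] += 1
            self.vocab.update(grams)
        return self

    def log_scores(self, seq):
        """Log prior plus summed smoothed log-likelihoods of the query's n-grams."""
        total_seqs = sum(self.class_totals.values())
        vocab_size = len(self.vocab)
        scores = {}
        for label, n_seqs in self.class_totals.items():
            gram_total = sum(self.counts[label].values())
            score = math.log(n_seqs / total_seqs)
            for g in ngrams(seq, self.n):
                num = self.counts[label][g] + self.alpha
                den = gram_total + self.alpha * vocab_size
                score += math.log(num / den)
            scores[label] = score
        return scores

    def predict(self, seq):
        scores = self.log_scores(seq)
        return max(scores, key=scores.get)


# Toy, invented sequences: not real localization signals.
train_seqs = ["MKKLLAAKKLL", "MKKLAAKKLLA", "DEDEGSDEDEG", "GSDEDEDEGSD"]
train_labels = ["mitochondrion", "mitochondrion", "nucleus", "nucleus"]
model = NgramNaiveBayes(n=3).fit(train_seqs, train_labels)
print(model.predict("KKLLAAKK"))
```

In a sketch like this, the per-class log scores returned by `log_scores` are the quantities to which a score threshold could be applied when estimating prediction accuracy per organelle.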

Public Health Relevance

Proteins are synthesized in the cytoplasm of a cell, but are destined to localize to specific subcellular compartment(s) to carry out their intended functions. A number of human diseases are caused by mislocalization of proteins to unintended subcellular locations, resulting in functional interference with a vital cellular process. The current project proposes a comprehensive system that uses computational and experimental approaches to accurately determine the subcellular localization of proteins and to generate detailed catalogs of subcellular proteomes for all sequenced genomes of animal species. The outcomes of this project will help advance our understanding of protein localization and function and, consequently, of the causative factors for many human diseases.

Agency: National Institutes of Health (NIH)
Institute: National Institute of General Medical Sciences (NIGMS)
Type: Research Project (R01)
Project #: 5R01GM086533-05
Application #: 8331447
Study Section: Biodata Management and Analysis Study Section (BDMA)
Program Officer: Hagan, Ann A
Project Start: 2009-09-01
Project End: 2014-08-31
Budget Start: 2012-09-01
Budget End: 2013-08-31
Support Year: 5
Fiscal Year: 2012
Total Cost: $218,318
Indirect Cost: $71,303

Name: University of Nebraska Medical Center
Department: Genetics
Type: Schools of Medicine
DUNS #: 168559177
City: Omaha
State: NE
Country: United States
Zip Code: 68198
Shen, Ru; Guda, Chittibabu (2014) Applied graph-mining algorithms to study biomolecular interaction networks. Biomed Res Int 2014:439476
Srinivasan, Satish M; Vural, Suleyman; King, Brian R et al. (2013) Mining for class-specific motifs in protein sequence classification. BMC Bioinformatics 14:96
Mohammed, Akram; Guda, Chittibabu (2011) Computational Approaches for Automated Classification of Enzyme Sequences. J Proteomics Bioinform 4:147-152