Deep learning for protein subcellular/sub-organelle localizations and localization motifs

Xu, Dong

Abstract

Eukaryotic cells have diverse cellular components, including subcellular organelles and sub-organelle compartments. The accurate targeting of proteins to these cellular components is crucial in establishing and maintaining cellular organization and function. Mis-localization of proteins is often associated with metabolic disorders and diseases. However, the vast majority of proteins lack subcellular/sub-organelle localization annotation. Compared with experimental methods, computational prediction of protein localization provides an efficient and effective way to support proteome annotation and experimental design. The current prediction tools for protein localization have significant room for improvement. In addition, no tool can predict localization at sub-organelle resolution or identify internal localization signals. Deep learning, a cutting-edge technology in machine learning, presents a new opportunity for this classical bioinformatics problem. The recent availability of high-throughput localization data also makes it possible to train deep-learning models well. The PI's lab has demonstrated some success on a special case, i.e., predicting mitochondrial localizations for plants using deep learning. In this project, the PI proposes to develop new methods and a standalone toolkit for accurate and scalable protein localization prediction at the subcellular and sub-organelle levels, as well as for characterization of localization motifs (including novel internal motifs). The general approach is to design a semi-supervised deep-learning method that uses both annotated protein sequences with known localization and unannotated protein sequences as training data. Through an unsupervised deep-learning approach, a general representation of protein sequences will be learned that characterizes both local and global features of protein sequences.
By visualizing and characterizing the deep-learning models, novel, interpretable protein sequence patterns will be predicted as putative targeting peptides and compared with known localization signals. We will also use the methods to be developed, together with the unsupervised models to be trained on all protein sequences, as a general framework for other sequence-based prediction problems that predict the label of a protein and the key residues contributing to that label. We will make the platform highly customizable and apply it to three applications: ubiquitination protein prediction, enzyme EC number prediction, and protein family/subfamily classification. The innovative contributions to protein sequence-based analyses and predictions include: (1) using raw amino acid sequences as training inputs without feature engineering; (2) utilizing the huge amount of unannotated data in unsupervised deep learning to characterize a general protein feature representation; (3) identifying potential targeting signals (especially internal motifs) by decoding the trained deep-learning models, augmented with sophisticated attention mechanisms; (4) detecting multiple-organelle targeting and sub-organelle localizations by a novel hierarchical multi-label architecture; and (5) combining features from different data sources by a multiplicative fused CNN model.
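
As a rough illustration of contributions (1) and (3), the sketch below one-hot encodes a raw amino acid sequence and applies a simple softmax attention over residue positions to rank the positions that contribute most. It is only a minimal sketch, not the project's actual model: the function names, the example sequence, and the hand-set scoring vector (which simply favors R and K) are assumptions made for illustration, and in a trained model the scores would be learned rather than fixed.

```python
import math

AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"  # the 20 standard residues
AA_INDEX = {aa: i for i, aa in enumerate(AMINO_ACIDS)}


def one_hot_encode(sequence):
    """Encode a raw sequence as one 20-dimensional one-hot row per residue.

    Residues outside the 20 standard letters (e.g. 'X') become all-zero rows.
    """
    rows = []
    for residue in sequence.upper():
        row = [0.0] * len(AMINO_ACIDS)
        idx = AA_INDEX.get(residue)
        if idx is not None:
            row[idx] = 1.0
        rows.append(row)
    return rows


def softmax(scores):
    """Numerically stable softmax over a non-empty list of floats."""
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def attention_pool(encoded, scoring_vector):
    """Return (per-position attention weights, attention-weighted average row).

    Each position is scored by a dot product with scoring_vector; the scores
    are normalized with softmax so the weights sum to 1.
    """
    if not encoded:
        raise ValueError("sequence must contain at least one residue")
    scores = [sum(x * w for x, w in zip(row, scoring_vector)) for row in encoded]
    weights = softmax(scores)
    pooled = [
        sum(weights[i] * encoded[i][d] for i in range(len(encoded)))
        for d in range(len(scoring_vector))
    ]
    return weights, pooled


def top_positions(weights, k=3):
    """Indices of the k positions with the largest attention weights."""
    return sorted(range(len(weights)), key=lambda i: weights[i], reverse=True)[:k]


# Hypothetical, hand-set scoring vector: emphasize arginine (R) and lysine (K).
scoring_vector = [0.0] * len(AMINO_ACIDS)
scoring_vector[AA_INDEX["R"]] = 2.0
scoring_vector[AA_INDEX["K"]] = 2.0

sequence = "MLSRKAAG"  # made-up example sequence, not real data
weights, pooled = attention_pool(one_hot_encode(sequence), scoring_vector)
for pos in top_positions(weights, k=2):
    print(pos, sequence[pos], round(weights[pos], 3))
```

Reading the attention weights back onto sequence positions is one simple way to surface candidate residues for inspection; whether such positions correspond to real targeting signals would still need to be checked against known localization signals.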

Public Health Relevance

A protein typically has a well-defined localization in a cell to perform its function, and mis-localization of proteins is often associated with metabolic disorders and diseases. Protein localization prediction can provide valuable information for understanding disease mechanisms and designing treatments. To address the limitations of current computational methods, this project will apply cutting-edge deep-learning methods to deliver new computational tools with improved accuracy and detailed sub-organelle prediction for protein localization.

Funding Agency

Agency: National Institutes of Health (NIH)
Institute: National Library of Medicine (NLM)
Type: Exploratory/Developmental Grants (R21)
Project #: 1R21LM012790-01A1
Application #: 9600437
Study Section: Biomedical Library and Informatics Review Committee (BLR)
Program Officer: Ye, Jane

Project Start: 2018-09-01
Project End: 2020-08-31
Budget Start: 2018-09-01
Budget End: 2019-08-31
Support Year: 1
Fiscal Year: 2018

Institution

Name: University of Missouri-Columbia
Department: Biostatistics & Other Math Sci
Type: Biomed Engr/Col Engr/Engr Sta
DUNS #: 153890272

City: Columbia
State: MO
Country: United States
Zip Code: 65211

Related projects


NIH 2019 R21 LM	Deep learning for protein subcellular/sub-organelle localizations and localization motifs Xu, Dong / University of Missouri-Columbia
NIH 2018 R21 LM	Deep learning for protein subcellular/sub-organelle localizations and localization motifs Xu, Dong / University of Missouri-Columbia
