Eukaryotic cells have diverse cellular components, including subcellular organelles and sub-organelle compartments. The accurate targeting of proteins to these cellular components is crucial in establishing and maintaining cellular organization and function. Mis-localization of proteins is often associated with metabolic disorders and diseases. However, the vast majority of proteins lack subcellular/sub-organelle localization annotation. Compared with experimental methods, computational prediction of protein localization provides an efficient and effective way to annotate proteomes and guide experimental design. The current prediction tools for protein localization have significant room for improvement. In addition, no existing tool predicts localization at sub-organelle resolution or identifies internal localization signals. Deep learning, as the cutting-edge technology in machine learning, presents a new opportunity for this classical bioinformatics problem. Recently available high-throughput localization data also provide a suitable training resource for deep-learning models. The PI's lab has demonstrated some success on a special case, i.e., predicting mitochondrial localizations for plants using deep learning. In this project, the PI proposes to develop new methods and a standalone toolkit for accurate and scalable protein localization prediction at the subcellular and sub-organelle levels, as well as for characterization of localization motifs (including novel internal motifs). The general approach is to design a semi-supervised deep-learning method that uses both annotated protein sequences with known localization and unannotated protein sequences as training data. An unsupervised deep-learning approach will be used to learn a general representation of protein sequences that characterizes both their local and global features.
By visualizing and characterizing the deep-learning models, novel, interpretable protein sequence patterns will be predicted as putative targeting peptides and compared with known localization signals. We will also use the methods to be developed, together with the unsupervised models to be trained on all protein sequences, as a general framework for other sequence-based prediction problems that predict the label of a protein and the key residues contributing to that label. We will make the platform highly customizable and apply it to three applications: ubiquitination protein prediction, enzyme EC number prediction, and protein family/subfamily classification. The innovative contributions to protein sequence-based analyses and predictions include: (1) using raw amino acid sequences as training inputs without feature engineering; (2) utilizing the huge amount of unannotated data through unsupervised deep learning to learn a general protein feature representation; (3) identifying potential targeting signals (especially internal motifs) by decoding the trained deep-learning models, augmented with sophisticated attention mechanisms; (4) detecting multiple-organelle targeting and sub-organelle localizations by a novel hierarchical multi-label architecture; and (5) combining features from different data sources by a multiplicative fused CNN model.
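As a minimal illustration of contribution (1), the sketch below shows one common way a raw amino acid sequence can be turned into a numeric input without hand-crafted features: one-hot encoding over the 20 standard residues. It is an assumption for illustration only, not necessarily the encoding the project will adopt. The function name `one_hot_encode`, the fixed length `max_len`, and the choices to truncate, zero-pad, and map non-standard residues to all-zero rows are hypothetical.

```python
# The 20 standard amino acids in one-letter code.
# The ordering is an arbitrary choice made for this sketch.
AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"
AA_INDEX = {aa: i for i, aa in enumerate(AMINO_ACIDS)}


def one_hot_encode(sequence, max_len):
    """Return a max_len x 20 matrix (list of lists) for a protein sequence.

    Assumptions of this sketch: sequences longer than max_len are truncated,
    shorter ones are zero-padded, and residues outside the 20 standard
    letters (e.g. X, B, Z) become all-zero rows.
    """
    matrix = [[0] * len(AMINO_ACIDS) for _ in range(max_len)]
    for pos, residue in enumerate(sequence.upper()[:max_len]):
        idx = AA_INDEX.get(residue)
        if idx is not None:
            matrix[pos][idx] = 1
    return matrix


# Example: a 4-residue toy sequence padded to length 6.
encoded = one_hot_encode("MKTX", max_len=6)
```

Each row of `encoded` corresponds to one sequence position, so a convolutional or attention-based model can consume the matrix directly and learn its own features from it.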
A protein typically has a well-defined localization in a cell to perform its function, and mis-localization of proteins is often associated with metabolic disorders and diseases. Protein localization prediction can provide valuable information for understanding disease mechanisms and designing treatments. To address the limitations of current computational methods, this project will apply cutting-edge deep-learning methods to deliver new computational methods and tools for protein localization with improved accuracy and detailed sub-organelle prediction.