Protein sequence is a language of a living system, which encodes protein structure and function critical for the survival of any organism. Therefore, understanding how protein sequence describes function and structure is a fundamental problem in biological research. Yet, the traditional interpretation of protein sequences is based on either manual identification of sub-sequence patterns or some arbitrary dissection of a sequence into subsequences of fixed size; neither approach can accurately recognize all of the semantic components in a protein sequence that are relevant to its structure and function. In this project, powerful artificial intelligence methods, based on deep learning, will be designed such that they can automatically map protein sequences into high-level semantic features that are meaningful when related to protein structure and function. This will not only improve the accuracy of predicting protein structural and functional properties, but also provide a new way of representing and interpreting proteins' biological function, transforming how protein data are interpreted. The impact of the basic research will be broadened through open source software dissemination to other researchers, seminars on deep learning and bioinformatics, student training, involvement of minority and female students, publications, presentations, workshops, and outreach activities for high school students, as well as thoughtfully crafted communication with the Missouri state legislature and other members of the general public.
During the research, novel deep one-dimensional (1D), 2D, and 3D convolutional neural networks will be developed to translate protein sequences or structures of arbitrary size into high-level features, guided by the objective of improving the prediction of multiple residue-wise local structural/functional properties (secondary structures, solvent accessibility, torsion angles, disorder, contact maps, disulfide bonds, beta-sheet pairings, and protein functional sites) as well as global properties such as folds. The 1D convolutional neural network for interpreting protein sequence data will also be supplemented by long short-term memory (LSTM) networks. The comprehensive deep learning models will be trained by innovative multi-task learning and transfer learning to enhance prediction performance. The 1D, 2D, and 3D convolutional networks will be further integrated to improve the accuracy of analyzing protein sequence, structure, and function. The 1D and 3D convolutional neural networks are completely original, and the new 2D convolutional architecture is more comprehensive and versatile than existing approaches. In addition to advancing the classic protein prediction tasks through the novel deep learning architectures, the hidden features automatically extracted by the deep learning models will provide a new semantic representation of proteins, which will likely transform various protein bioinformatics tasks such as classification, clustering, comparison, and ranking. The URL of this project is: http://calla.rnet.missouri.edu/cheng/nsf_deepbioinfo.html .
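As a rough illustration of the idea described above, the following minimal sketch shows how a 1D convolution can map a protein sequence of arbitrary length to per-residue features that are shared by several prediction tasks. It is not the project's architecture: the layer sizes, the random untrained weights, the two example tasks (3-class secondary structure and 2-class solvent accessibility), and all function and parameter names are assumptions made for illustration only.

```python
import math
import random

AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"  # the 20 standard residues

def one_hot(sequence):
    """Encode each residue as a 20-dimensional indicator vector."""
    return [[1.0 if aa == res else 0.0 for aa in AMINO_ACIDS] for res in sequence]

def conv1d_same(inputs, kernels, width):
    """1D convolution with zero padding and ReLU; output length equals input length.

    inputs:  L x C_in nested lists
    kernels: C_out x width x C_in weights (width is assumed to be odd)
    returns: L x C_out feature vectors, one per residue
    """
    half = width // 2
    c_in = len(inputs[0])
    zero = [0.0] * c_in
    padded = [zero] * half + inputs + [zero] * half
    features = []
    for i in range(len(inputs)):
        window = padded[i:i + width]
        row = []
        for kernel in kernels:
            s = sum(kernel[j][c] * window[j][c]
                    for j in range(width) for c in range(c_in))
            row.append(max(0.0, s))  # ReLU
        features.append(row)
    return features

def softmax(scores):
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def linear_head(features, weights):
    """Per-residue linear layer plus softmax; weights is n_classes x C."""
    return [softmax([sum(w * f for w, f in zip(w_row, feat)) for w_row in weights])
            for feat in features]

def random_weights(rng, *shape):
    """Nested lists of small random weights (hypothetical, untrained)."""
    if len(shape) == 1:
        return [rng.uniform(-0.1, 0.1) for _ in range(shape[0])]
    return [random_weights(rng, *shape[1:]) for _ in range(shape[0])]

def predict(sequence, params):
    """Multi-task forward pass: one shared feature extractor, two task heads."""
    shared = conv1d_same(one_hot(sequence), params["conv"], params["width"])
    return {
        "secondary_structure": linear_head(shared, params["ss_head"]),
        "solvent_accessibility": linear_head(shared, params["sa_head"]),
    }

rng = random.Random(0)
params = {
    "width": 5,                                # odd window of 5 residues
    "conv": random_weights(rng, 8, 5, 20),     # 8 hidden features
    "ss_head": random_weights(rng, 3, 8),      # helix / strand / coil
    "sa_head": random_weights(rng, 2, 8),      # buried / exposed
}
```

Calling `predict("MKTAYIAKQR", params)` returns one probability vector per residue for each task, and the same parameters work unchanged on a sequence of any other length. In a trained multi-task model the shared convolutional weights would be updated by the losses of all tasks together; training is omitted here.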
This award reflects NSF's statutory mission and has been deemed worthy of support through evaluation using the Foundation's intellectual merit and broader impacts review criteria.