Proteins are the working molecules of the cell, playing crucial roles in almost all cellular activities. Elucidating the biological function of proteins is therefore fundamental to modern molecular biology, biochemistry, medical science, and drug development. In the post-genomics era, when vast quantities of genomic and proteomic data await biological interpretation, computational function prediction methods must improve substantially to achieve the scale and reliability required for practical use by experimental biologists. Computational prediction is valuable in biological studies both for designing experiments and for interpreting experimental data. In this project, a comprehensive framework for protein function prediction will be built that effectively integrates the various protein features that are indicative of function. In addition, a web-based portal will be developed to provide biologists with easy-to-access function prediction, visualization, and analysis tools, as well as pre-computed genome function annotations. The project will train the next generation of interdisciplinary students through course work and direct involvement in research. Interdisciplinary proteomics approaches will be taught through local and national workshops.
The framework will integrate several different types of state-of-the-art deep neural networks. Multiple relationships among proteins, including physical similarities and proteomics data similarities, will be represented as similarity graphs centered on the target proteins, and functional inference will be performed on these graphs using deep convolutional neural networks. Among the protein features to be considered, the framework will incorporate three-dimensional structure similarity of proteins, measured through encoded local protein structures that are detected from protein sequence information using a deep convolutional neural network. The developed methods will be used for functional analysis of the photosynthesis and nitrogen fixation pathways of the photosynthetic cyanobacterium Cyanothece ATCC51142, which provides a promising platform for light-driven biofuel production. Proteins involved in the photosynthesis and nitrogen fixation cycles will be experimentally identified using a new protein complex profiling method that combines chromatographic separation techniques with quantitative mass spectrometry. The developed prediction methods will then be applied to determine the function of these proteins, and the predicted functions will be validated against the expressed proteome. All project outputs will be available at http://kiharalab.org/software.php
This award reflects NSF's statutory mission and has been deemed worthy of support through evaluation using the Foundation's intellectual merit and broader impacts review criteria.