Mass spectrometry (MS) data are high-dimensional and are used for large-scale systems biology proteomics. Current state-of-the-art mass spectrometers can generate thousands of spectra from a single organism and experiment. These high-dimensional data are processed using database searches and de novo algorithms with varying degrees of success. The overarching objective of this study is to develop, test, integrate, and evaluate novel image-processing and deep-learning algorithms that will allow us to deduce and identify reliable peptide sequences in a definitive and quantitative fashion. Our long-term goal is to improve the identification of MS-based proteomics data using novel and scalable algorithms. The objective of this proposal is to investigate, design, and implement machine-learning and deep-learning algorithms for the identification of peptides from MS data. Because deep learning is very good at discovering intricate structures in high-dimensional data, it is well suited to exploring dark proteomics data and to deducing peptides more accurately. We predict that integrating these methods with traditional numerical algorithms will lead to a multimodal, fusion-based approach for an optimized and accurate peptide deduction system for large-scale MS data. Further, we will design and implement data augmentation, memory-efficient indexing, and high-performance computing (HPC) techniques to achieve these outcomes with shorter computational time. This new line of investigation is significant because it has the potential to advance the long-stalled effort to increase the accuracy, reliability, and reproducibility of MS data analysis and search tools. The proximate expected outcome of this work is a novel set of deep-learning and image-processing tools that will provide much better insight into MS-based proteomics data.
The results will have an immediate, important positive impact because the proposed research tasks will lay the groundwork for a new class of algorithms and will provide rapid, high-throughput, sensitive, reproducible, and reliable tools for MS-based proteomics.
The proposed research is relevant to public health because mass spectrometry (MS) based proteomics allows systematic analysis of thousands of proteins, with the promise of discovering new protein biomarkers for different disease conditions and of better understanding human systems biology. Because of the high dimensionality of the big data generated by MS instruments, efficient, accurate, and reproducible tools are required to mine and analyze the data; developing such tools is the subject of this proposal. Such high-performance tools will be instrumental in elucidating the microbiome, which affects virtually all aspects of human health. Therefore, this proposal is relevant to NIH's broader mission of supporting fundamental and innovative research strategies that can become the basis for protecting and improving human health.