Proteins form the major building blocks of cells, and protein-protein interactions provide information about cell functions. Characterizing these interactions is made possible through mass spectrometry (MS), a technique that breaks down complex biological samples into much simpler ions and measures their individual masses. Computer algorithms can then be used to interpret the output of MS experiments. The advent of tandem mass spectrometry (MS/MS), also known as shotgun proteomics, greatly increased the speed with which researchers can execute proteomics experiments, which in turn has enabled the creation of massive databases containing millions of known spectra. This research will create novel algorithms that can quickly identify which proteins exist in a biological sample by comparing unknown spectra from the sample against entire libraries of known spectra. Ultimately, this project will make it easier to understand the molecular basis of disease and will support personalized medicine and the identification of new drugs for currently incurable diseases.

The goal of this project is to develop novel methods for protein characterization from MS/MS experiment results that will provide increased spectral match effectiveness while scaling to search the largest existing protein databases and beyond. The key computational component in shotgun proteomics is matching MS/MS spectra against theoretical spectra or actual spectra in spectral databases to identify possible peptides (protein sections). In essence, given a translation of the spectra to points in Euclidean space and a chosen proximity function, the algorithmic component of the search is a nearest neighbor search algorithm. Due to the large size of spectral databases, the problem has traditionally been solved through a variety of approximate nearest neighbor search methods and a combination of vector space and probabilistic proximity measures, which are often not scalable and lead to missed spectral matches. This project aims to address these limitations in two ways. First, it will develop novel filtering-based exact nearest neighbor search methods for the shifted dot-product proximity measure, which has recently been shown to outperform alternatives by accounting for spectral post-translational modifications while searching for matches. The proposed filtering-based methods prune much of the search space by eliminating potential candidates without computing their proximity to the query, based on their composition and on theoretical properties of the proximity measure. Second, the project will develop effective decomposition techniques for the inherently irregular computation requirements of the proposed pruning-based search, which will enable distributed methods to search the largest proteomics databases of today and beyond. The project will result in the dissemination of the developed methods to the large computational genomics community and will involve research education of underrepresented undergraduate students.
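
The search problem described above can be illustrated with a minimal sketch. It is not the project's method: it uses the plain normalized dot product (cosine similarity) rather than the shifted dot product, and a generic Cauchy-Schwarz bound as the pruning rule, which abandons a candidate before its full proximity is computed. The function names (bin_spectrum, build_library, exact_nearest), the fixed-width binning, and the eps tolerance are all assumptions made for illustration.

```python
import math

def bin_spectrum(peaks, n_bins, bin_width=1.0):
    """Map (m/z, intensity) peaks to a dense vector by summing intensity per bin.
    Fixed-width binning is an assumed, simplified translation to Euclidean space."""
    vec = [0.0] * n_bins
    for mz, intensity in peaks:
        b = math.floor(mz / bin_width)
        if 0 <= b < n_bins:
            vec[b] += intensity
    return vec

def normalize(vec):
    """Scale a vector to unit Euclidean length (zero vectors are returned unchanged)."""
    norm = math.sqrt(sum(x * x for x in vec))
    return [x / norm for x in vec] if norm > 0 else list(vec)

def suffix_norms(vec):
    """out[k] is the Euclidean norm of vec[k:]; out[len(vec)] is 0."""
    out = [0.0] * (len(vec) + 1)
    acc = 0.0
    for k in range(len(vec) - 1, -1, -1):
        acc += vec[k] * vec[k]
        out[k] = math.sqrt(acc)
    return out

def build_library(vectors):
    """Precompute the unit vector and its suffix norms for every library spectrum."""
    library = []
    for v in vectors:
        u = normalize(v)
        library.append((u, suffix_norms(u)))
    return library

def exact_nearest(query, library, eps=1e-12):
    """Return (index, similarity, pruned_count) of the most similar library vector.

    After k dimensions, the remaining contribution is at most
    ||q[k:]|| * ||c[k:]|| (Cauchy-Schwarz). If the partial sum plus that bound
    cannot reach the best similarity so far, the candidate is dropped early.
    All vectors are assumed to have the same length as the query.
    """
    q = normalize(query)
    q_suf = suffix_norms(q)
    best_idx, best_sim, pruned = -1, -1.0, 0
    for idx, (c, c_suf) in enumerate(library):
        partial, dropped = 0.0, False
        for k in range(len(q)):
            # eps keeps the filter conservative under floating-point rounding
            if partial + q_suf[k] * c_suf[k] < best_sim - eps:
                dropped = True
                break
            partial += q[k] * c[k]
        if dropped:
            pruned += 1
        elif partial > best_sim:
            best_idx, best_sim = idx, partial
    return best_idx, best_sim, pruned

# Toy example with four already-binned library spectra
vectors = [[1, 0, 0, 0], [0, 1, 1, 0], [0, 0, 1, 1], [1, 1, 0, 0]]
library = build_library(vectors)
idx, sim, pruned = exact_nearest([0, 1, 1, 0.1], library)
print(idx, round(sim, 4), pruned)
```

Because the bound never underestimates the true similarity (up to the eps tolerance), the result matches a brute-force scan, so the search stays exact while skipping part of the work. A filter for the shifted dot product would need a bound derived from that measure's own properties, which this sketch does not attempt.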

This award reflects NSF's statutory mission and has been deemed worthy of support through evaluation using the Foundation's intellectual merit and broader impacts review criteria.

Agency: National Science Foundation (NSF)
Institute: Division of Information and Intelligent Systems (IIS)
Application #: 2002321
Program Officer: Sylvia Spengler
Project Start:
Project End:
Budget Start: 2019-08-17
Budget End: 2022-01-31
Support Year:
Fiscal Year: 2020
Total Cost: $141,988
Indirect Cost:

Name: Santa Clara University
Department:
Type:
DUNS #:
City: Santa Clara
State: CA
Country: United States
Zip Code: 95050