Proteins are the primary functional molecules in living cells, and tandem mass spectrometry provides the most efficient means of studying proteins in a high-throughput fashion. This proposal aims to use state-of-the-art methods from the fields of machine learning, statistics, and natural language processing to improve our ability to make sense of large tandem mass spectrometry data sets. Our project will focus on three key problems in the analysis of such data: (1) facilitating the use of previously annotated spectra to annotate new spectra, by creating a hybrid search scheme that compares an observed spectrum to a database composed of both theoretical spectra and previously annotated spectra; (2) enabling the efficient and accurate detection of peptides containing post-translational modifications and sequence variants; and (3) detecting sets of peptide species that are co-fragmented in the mass spectrometer and hence give rise to complex mixture spectra. Each of these aims will improve the ability of mass spectrometrists to efficiently and accurately identify and quantify proteins in complex mixtures. To increase the impact of our work, we will continue to make all of our tools available as free software.
The applications of mass spectrometry, and its promise for improving human health, are numerous; they include an increased understanding of disease phenotypes and the molecular mechanisms that underlie them, and vastly more sensitive and specific diagnostic and prognostic screens. However, making optimal use of mass spectrometry data requires sophisticated computational methods. This project will develop and apply novel statistical and machine learning methods for interpreting mass spectra.