Proteins are the primary functional molecules in living cells, and tandem mass spectrometry provides the most efficient means of studying proteins in a high-throughput fashion. This proposal aims to use state-of-the-art methods from machine learning, statistics and natural language processing to improve our ability to make sense of large tandem mass spectrometry data sets. The core of the proposal is a type of probabilistic model, known as a dynamic Bayesian network, that allows us to reason efficiently and accurately about complex sequential data sets. This modeling framework leverages a large body of related work from the fields of natural language processing and speech recognition. Much of this prior work has not yet been exploited by computational biologists, so the proposal represents a valuable cross-fertilization across disciplines. More specifically, this project employs a collection of cooperating dynamic Bayesian networks to jointly model an entire mass spectrometry experiment. Whereas most existing methods for analyzing mass spectrometry data divide the analysis of an experiment into a series of small independent subtasks, the proposed unified model considers all of the available data jointly. This approach can thus exploit valuable dependencies among spectra and along various dimensions of the data. Dynamic Bayesian networks also provide a rigorous framework for performing inference from a combination of observed data and qualitative expert knowledge. The project is divided into five aims, each of which concerns a particular type of mass spectrometry experiment.
These experiments involve (1) identifying all of the proteins in a given complex biological sample using a standard mass spectrometry protocol; (2) identifying proteins using a modified protocol in which the mass spectrometer samples the data in a systematic, rather than data-dependent, fashion, with the goal of identifying lower-abundance proteins; (3) quantifying the relative abundance of proteins within or between biological samples; (4) identifying post-translationally modified proteins or proteins that contain sequence variation; and (5) performing targeted quantification of a specified set of proteins, such as proteins in a pathway of interest or protein biomarkers. The methods described in this proposal have the potential to dramatically improve our ability to draw conclusions from, and formulate hypotheses on the basis of, high-throughput shotgun proteomics experiments. Experiments like the ones described above can, for example, identify proteins involved in fundamental disease processes, identify previously unknown protein isoforms, or quantify the responses of proteins to environmental stressors or disease states.

Public Health Relevance

The applications of mass spectrometry and its promise for improving human health are numerous, including an increased understanding of disease phenotypes and the molecular mechanisms that underlie them, and vastly more sensitive and specific diagnostic and prognostic screens. However, making optimal use of mass spectrometry data requires sophisticated computational methods. This project will develop and apply novel statistical and machine learning methods for interpreting mass spectra.

National Institutes of Health (NIH)
Research Project (R01)
Study Section: Biodata Management and Analysis Study Section (BDMA)
Program Officer: Brazhnik, Paul
University of Washington
Schools of Medicine
United States
Halloran, John T; Bilmes, Jeff A; Noble, William S (2014) Learning Peptide-Spectrum Alignment Models for Tandem Mass Spectrometry. Uncertain Artif Intell 30:320-329
Noble, William Stafford; MacCoss, Michael J (2012) Computational and statistical analysis of protein mass spectrometry data. PLoS Comput Biol 8:e1002296