Proteins are the primary functional molecules in living cells, and tandem mass spectrometry provides the most efficient means of studying proteins in a high-throughput fashion. The proposal aims to use state-of-the-art methods from the fields of machine learning, statistics and natural language processing to improve our ability to make sense of large tandem mass spectrometry data sets. The core of the proposal is a type of probabilistic model, known as a dynamic Bayesian network, that allows us to reason efficiently and accurately about complex sequential data sets. This modeling framework leverages a large body of related work from the fields of natural language processing and speech recognition. Much of this prior work has not yet been exploited by computational biologists, so the proposal represents a valuable cross-fertilization across disciplines. More specifically, this project employs a collection of cooperating dynamic Bayesian networks to jointly model an entire mass spectrometry experiment. Unlike most existing methods for analyzing mass spectrometry data, which tend to divide the analysis of an experiment into a series of small independent subtasks, the proposed unified model considers all of the available data jointly. This approach can thus exploit valuable dependencies among spectra and along various dimensions of the data. Dynamic Bayesian networks also provide a rigorous framework for performing inference from a combination of observed data and qualitative expert knowledge. The project is divided into five aims, each of which concerns a particular type of mass spectrometry experiment.
These experiments involve (1) identifying all of the proteins in a given complex biological sample using a standard mass spectrometry protocol; (2) identifying proteins using a modified protocol in which the mass spectrometer samples the data in a systematic, rather than data-dependent, fashion, with the goal of identifying lower-abundance proteins; (3) quantifying the relative abundance of proteins within or between biological samples; (4) identifying post-translationally modified proteins or proteins that contain sequence variation; and (5) performing targeted quantification of a specified set of proteins, such as proteins in a pathway of interest or protein biomarkers. The methods described in this proposal have the potential to dramatically improve our ability to draw conclusions from, and formulate hypotheses on the basis of, high-throughput shotgun proteomics experiments. Experiments like the ones described above can, for example, identify proteins involved in fundamental disease processes, identify previously unknown protein isoforms, or quantify the responses of proteins to environmental stressors or disease states.

Public Health Relevance

The applications of mass spectrometry and its promises for improvements of human health are numerous, including an increased understanding of disease phenotypes and the molecular mechanisms that underlie them, and vastly more sensitive and specific diagnostic and prognostic screens. However, making optimal use of mass spectrometry data requires sophisticated computational methods. This project will develop and apply novel statistical and machine learning methods for interpreting mass spectra.

National Institutes of Health (NIH)
National Institute of General Medical Sciences (NIGMS)
Research Project (R01)
Study Section
Biodata Management and Analysis Study Section (BDMA)
Program Officer
Brazhnik, Paul
University of Washington
Schools of Medicine
United States
Wang, Shengjie; Halloran, John T; Bilmes, Jeff A et al. (2016) Faster and more accurate graphical model identification of tandem mass spectra using trellises. Bioinformatics 32:i322-i331
Halloran, John T; Bilmes, Jeff A; Noble, William S (2016) Dynamic Bayesian Network for Accurate Detection of Peptides from Tandem Mass Spectra. J Proteome Res 15:2749-59
Kertesz-Farkas, Attila; Keich, Uri; Noble, William Stafford (2015) Tandem Mass Spectrum Identification via Cascaded Search. J Proteome Res 14:3027-38
Ting, Ying S; Egertson, Jarrett D; Payne, Samuel H et al. (2015) Peptide-Centric Proteome Analysis: An Alternative Strategy for the Analysis of Tandem Mass Spectrometry Data. Mol Cell Proteomics 14:2301-7
Noble, William Stafford (2015) Mass spectrometrists should search only for peptides they care about. Nat Methods 12:605-8
Eng, Jimmy K; Hoopmann, Michael R; Jahan, Tahmina A et al. (2015) A deeper look into Comet--implementation and features. J Am Soc Mass Spectrom 26:1865-74
Keich, Uri; Noble, William Stafford (2015) On the importance of well-calibrated scores for identifying shotgun proteomics spectra. J Proteome Res 14:1147-60
Halloran, John T; Bilmes, Jeff A; Noble, William S (2014) Learning Peptide-Spectrum Alignment Models for Tandem Mass Spectrometry. Uncertain Artif Intell 30:320-329
Granholm, Viktor; Navarro, José Fernández; Noble, William Stafford et al. (2013) Determining the calibration of confidence estimation procedures for unique peptides in shotgun proteomics. J Proteomics 80:123-31
Granholm, Viktor; Noble, William Stafford; Käll, Lukas (2012) A cross-validation scheme for machine learning algorithms in shotgun proteomics. BMC Bioinformatics 13 Suppl 16:S3

Showing the most recent 10 out of 18 publications