One way to accelerate the understanding of the molecular basis of cancer is through the application of robust, quantitative proteomic technologies and corresponding computational methodologies. Mass spectrometry-based measurement technology for peptides (LC-MS/MS) is rapidly advancing, and there is a pressing need to further develop the corresponding bioinformatics analysis techniques that infer proteins from the peptide spectra. The Latent Dirichlet Allocation (LDA) for Protein Inference in Quantitative Proteomics research project will adapt LDA, an established method of topic modeling from text mining, to the problem of protein inference. Advances in protein inference will be of great utility and interest in cancer clinical proteomics studies. Successfully deploying these methods will directly increase the ability of proteomics to augment cancer research in many important areas such as biomarker discovery, pathogenesis, and patient-specific tumor therapies.
Two specific aims in support of these goals will be undertaken during the proposed project:
* Aim 1. Investigate how to best apply latent Dirichlet allocation modeling techniques previously used in text mining to the problem of protein inference. Areas to explore include the application of biological and domain knowledge constraints to the model as well as parameter optimization techniques. Tune and evaluate the approach in terms of accuracy, sensitivity, and specificity on a set of simulated protein-peptide fragment data with various amounts of noise and errors in the peptide reading process. Further evaluation and validation will be performed using LC-MS/MS data produced from proteomic laboratory standards, which provide a known solution for complex real-world data samples.
* Aim 2. Demonstrate the utility of the latent Dirichlet allocation-based protein inference techniques by application to experimental cancer data. Publicly available data from a head and neck squamous cell carcinoma (SCC) study at the Vanderbilt-Ingram Cancer Center will be used, allowing results obtained with LDA to be compared with those from current standard techniques in terms of prediction overlap, differences, and confidence levels.
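As an illustration of the evaluation named in Aim 1, the sketch below scores an inferred protein set against a known solution, such as a simulated data set or a laboratory standard. It is a minimal sketch and not the project's method: the function name `evaluate_inference`, the protein identifiers, and the assumption that both the inferred set and the true set are drawn from a fixed list of candidate proteins are all hypothetical.

```python
def evaluate_inference(inferred, truth, candidates):
    """Score an inferred protein set against a known ground truth.

    Assumes (hypothetically) that `inferred` and `truth` are both
    subsets of `candidates`, the full list of proteins considered.
    """
    inferred, truth, candidates = set(inferred), set(truth), set(candidates)

    tp = len(inferred & truth)               # present and reported
    fp = len(inferred - truth)               # reported but absent
    fn = len(truth - inferred)               # present but missed
    tn = len(candidates - inferred - truth)  # absent and not reported

    return {
        "sensitivity": tp / (tp + fn) if tp + fn else 0.0,
        "specificity": tn / (tn + fp) if tn + fp else 0.0,
        "accuracy": (tp + tn) / len(candidates) if candidates else 0.0,
    }


# Hypothetical example: five candidate proteins, three truly present.
scores = evaluate_inference(
    inferred={"P1", "P2", "P4"},
    truth={"P1", "P2", "P3"},
    candidates={"P1", "P2", "P3", "P4", "P5"},
)
print(scores)
```

In this toy case two of the three true proteins are recovered and one of the two absent proteins is wrongly reported, so sensitivity is 2/3, specificity is 1/2, and accuracy is 3/5.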

Public Health Relevance

This project will improve the ability of scientists and researchers to accurately and comprehensively analyze the protein content of biological specimens, including patient medical samples. These improvements will accelerate progress in personalized medicine for cancer, chronic diseases such as diabetes mellitus, and other disease areas by providing a new perspective and detailed view into biological processes necessary for health.

National Institutes of Health (NIH)
National Cancer Institute (NCI)
Exploratory/Developmental Grants (R21)
Study Section
Special Emphasis Panel (ZCA1-RPRB-7 (O1))
Program Officer
Li, Jerry
Oregon Health and Science University
Biostatistics & Other Math Sci
Schools of Medicine
United States