The problem: High-throughput biomedical data from biomarker profiling studies aimed at early detection of diseases such as lung cancer are accumulating rapidly. Although many popular machine learning methods have been applied to such high-dimensional datasets, no single method has consistently outperformed the others. Moreover, scientists need to address two related tasks simultaneously, disease prediction and biomarker discovery, using the same data and tools. One way to meet this need, as undertaken in this project, is to find the most accurate classifier for the disease from a given set of profiles and present the discriminative markers used in that model to the scientist for further verification. The large space of possible models, coupled with the small sample size of the data, makes it hard to estimate predictive accuracy reliably.

The solution: This project will develop, evaluate and refine novel Bayesian Rule Learning (BRL) methods that are algorithmically efficient, yield parsimonious models and accurately estimate predictive uncertainty from sparse biomedical datasets. BRL methods use a Bayesian score to evaluate rule models, thereby quantifying the uncertainty in the validity of each rule. This novel technique combines the mathematical rigor of Bayesian network learning with rule-based modeling, and it opens up a hitherto underexplored area of fundamental research in informatics involving such hybrid methodologies. Rules enable modular representation of knowledge and collaboration with scientists, because it is easier to present the model and extract markers both visually and computationally. Rule-based inference is also simpler and more tractable. The Bayesian approach enables prior knowledge to be incorporated and evaluated in a continual fashion with a human in the loop. Such human-in-the-loop evaluation is very important for refining both tools and models.

The specific aims: This project will test the hypothesis that the BRL methods developed and extended herein produce more accurate and parsimonious models for disease state prediction than other state-of-the-art machine learning methods. The project evaluates BRL methods and models using existing proteomic datasets for three diverse diseases: the rare, neurodegenerative Amyotrophic Lateral Sclerosis (ALS), and the two most common cancers in the world, lung and breast cancer. Experimental verification will be performed using a new set of retrospectively collected breast cancer serum samples to evaluate model generalizability.

The significance: This project will produce: (1) a novel biomedical data mining tool for analyzing data from biomarker profiling studies of any disease, (2) methodological insights into the applicability of this tool and current machine learning methods for such tasks, and (3) new data for research on the early detection of breast cancer. It has the potential to help develop new diagnostic tests for early detection of ALS, lung cancer and breast cancer. It also lays a firm foundation for building modeling frameworks that incorporate both prior knowledge and data, providing the technological capability to combine evidence from multiple, heterogeneous sources.

Public Health Relevance

This project will develop much-needed data mining methods for analyzing the spate of datasets arising from high-throughput technologies for molecular biomarker profiling. It will generate new experimental data for early detection of breast cancer, and has the potential to help create new diagnostic screening tools for three diverse diseases: lung and breast cancer, two of the most common cancers in the world, and the rare, neurodegenerative Amyotrophic Lateral Sclerosis.

National Institutes of Health (NIH)
National Library of Medicine (NLM)
Research Project (R01)
Study Section
Special Emphasis Panel (ZLM1-ZH-C (01))
Program Officer
Ye, Jane
University of Pittsburgh
Schools of Medicine
United States
Dutta-Moscato, Joyeeta; Gopalakrishnan, Vanathi; Lotze, Michael T et al. (2014) Creating a pipeline of talent for informatics: STEM initiative for high school students in computer science, biology, and biomedical informatics. J Pathol Inform 5:12
Grover, Himanshu; Wallstrom, Garrick; Wu, Christine C et al. (2013) Context-sensitive Markov models for peptide scoring and identification from tandem mass spectrometry. OMICS 17:94-105