Few biomarkers derived from genome-scale data have translated into improved clinical classification of cancer subtypes, despite the wealth of available genome-wide studies and the numerous statistical algorithms applied to them. This widespread shortcoming derives from the pervasive use of off-the-shelf algorithms and machine learning techniques developed for image classification and language processing, which are naive to the underlying biology of the system. Furthermore, for genome-wide data, the number of samples is often small relative to the number of candidate biomarkers, resulting in variable accuracy on independent test data despite high accuracy in the samples used for discovery, which contributes to the failure of clinical biomarkers. This problem, the so-called curse of dimensionality, is further exacerbated by the prohibitive cost of dramatically increasing sample size and by the stratification of patients into smaller subgroups for personalized and precision medicine. Disease phenotypes arise from distinct and specific perturbations in selected networks and pathways defined by the interactions of their molecular constituents. In cancer, these perturbations may reside in the topology and state of gene regulatory networks, in cell signaling activity, or in metabolic conditions. We hypothesize that by leveraging such prior knowledge of cancer biology we will be able to reduce model complexity and build mechanistically justified predictive models. To pursue this hypothesis, we will develop an analytical framework that embeds mechanistic constraints derived from network biology into the statistical learning process itself. Hence, this application will develop a novel suite of statistical learning algorithms that embed (Aim 1) gene expression regulatory networks, (Aim 2) cell signaling activity, and (Aim 3) metabolism to classify breast and prostate cancer.
Throughout the study we will work closely with clinical collaborators to ensure that our methods improve on current predictive and prognostic models. Because we will also generate mechanistic classifiers based on gene expression measurements obtained from clinical assays that are already commercially available (i.e., MammaPrint and Decipher), our models and predictors will be readily available for clinical translation. Our mechanism-driven classifiers will have both greater accuracy and greater interpretability than classifiers developed without regard for the underlying biology of the disease. Furthermore, embedding biological mechanisms in the classifiers will facilitate the identification of alternative therapeutic targets specific to each cancer subtype, potentially improving patient prognosis and health outcomes. Finally, the substantial curation of molecular pathways and biological networks we will carry out in the project will provide a powerful resource for future studies, and the methodologies we will develop will be applicable to other cancers and other human diseases, such as neurodegenerative disorders, heart disease, and diabetes.

Public Health Relevance

This study will use mechanistic biological knowledge to implement a suite of statistical algorithms for classifying cancer patients, which will also reveal the biological reasoning behind the classification decision rules. We will apply these novel methods to develop improved biomarkers for the clinical stratification of breast and prostate cancer patients, with the ultimate goal of facilitating the selection of appropriate treatment. To achieve this goal we will directly improve the existing commercial tests offered to patients. Because our biomarkers will be directly implementable in these assays, they have the potential to reach the bedside quickly.

National Institutes of Health (NIH)
National Cancer Institute (NCI)
Research Project (R01)
Study Section
Cancer Biomarkers Study Section (CBSS)
Program Officer
Li, Jerry
Johns Hopkins University
Internal Medicine/Medicine
Schools of Medicine
United States
Dinalankara, Wikum; Ke, Qian; Xu, Yiran et al. (2018) Digitizing omics profiles by divergence from a baseline. Proc Natl Acad Sci U S A 115:4545-4552
Kearney, Paul; Boniface, J Jay; Price, Nathan D et al. (2018) The building blocks of successful translation of proteomics to the clinic. Curr Opin Biotechnol 51:123-129
Gandy, Lisa M; Gumm, Jordan; Fertig, Benjamin et al. (2017) Synthesizer: Expediting synthesis studies from context-free data with information retrieval techniques. PLoS One 12:e0175860
Marchionni, Luigi; Hayashi, Masamichi; Guida, Elisa et al. (2017) MicroRNA expression profiling of Xp11 renal cell carcinoma. Hum Pathol 67:18-29