Opening the Black Box of Machine Learning Models

Lee, Su-In

Abstract

Biomedical data is vastly increasing in quantity, scope, and generality, expanding opportunities to discover novel biological processes and clinically translatable outcomes. Machine learning (ML), a key technology in modern biology that addresses these changing dynamics, aims to infer meaningful interactions among variables by learning their statistical relationships from data consisting of measurements on variables across samples. Accurate inference of such interactions from big biological data can lead to novel biological discoveries, therapeutic targets, and predictive models for patient outcomes. However, a greatly increased hypothesis space, complex dependencies among variables, and complex ?black-box? ML models pose complex, open challenges. To meet these challenges, we have been developing innovative, rigorous, and principled ML techniques to infer reliable, accurate, and interpretable statistical relationships in various kinds of biological network inference problems, pushing the boundaries of both ML and biology. Fundamental limitations of current ML techniques leave many future opportunities to translate inferred statistical relationships into biological knowledge, as exemplified in a standard biomarker discovery problem ? an extremely important problem for precision medicine. Biomarker discovery using high-throughput molecular data (e.g., gene expression data) has significantly advanced our knowledge of molecular biology and genetics. The current approach attempts to find a set of features (e.g., gene expression levels) that best predict a phenotype and use the selected features, or molecular markers, to determine the molecular basis for the phenotype. However, the low success rates of replication in independent data and of reaching clinical practice indicate three challenges posed by current ML approach. First, high-dimensionality, hidden variables, and feature correlations create a discrepancy between predictability (i.e., statistical associations) and true biological interactions; we need new feature selection criteria to make the model better explain rather than simply predict phenotypes. Second, complex models (e.g., deep learning or ensemble models) can more accurately describe intricate relationships between genes and phenotypes than simpler, linear models, but they lack interpretability. Third, analyzing observational data without conducting interventional experiments does not prove causal relations. To address these problems, we propose an integrated machine learning methodology for learning interpretable models from data that will: 1) select interpretable features likely to provide meaningful phenotype explanations, 2) make interpretable predictions by estimating the importance of each feature to a prediction, and 3) iteratively validate and refine predictions through interventional experiments. For each challenge, we will develop a generalizable ML framework that focuses on different aspects of model interpretability and will therefore be applicable to any formerly intractable, high-impact healthcare problems. We will also demonstrate the effectiveness of each ML framework for a wide range of topics, from basic science to disease biology to bedside applications.

Public Health Relevance

The development of effective computational methods that can extract meaningful and interpretable signals from noisy, big data has become an integral part of biomedical research, which aims to discover novel biological processes and clinically translatable outcomes. The proposed research seeks to radically shift the current paradigm in data-driven discovery from ?learning a statistical model that best fits specific training data? to ?learning an explainable model? for a wide range of topics, from basic science to disease biology to bedside applications. Successful completion of this project will result in novel biological discoveries, therapeutic targets, predictive models for patient outcomes, and powerful computational frameworks generalizable to critical problems in various diseases.

Funding Agency

Agency: National Institute of Health (NIH)
Institute: National Institute of General Medical Sciences (NIGMS)
Type: Unknown (R35)
Project #: 5R35GM128638-02
Application #: 9733308
Study Section: Special Emphasis Panel (ZGM1)
Program Officer: Brazhnik, Paul

Project Start: 2018-07-01
Project End: 2023-06-30
Budget Start: 2019-07-01
Budget End: 2020-06-30
Support Year: 2
Fiscal Year: 2019
Total Cost
Indirect Cost

Institution

Name: University of Washington
Department: Genetics
Type: Schools of Medicine
DUNS #: 605799469

City: Seattle
State: WA
Country: United States
Zip Code: 98195

Related projects


NIH 2020 R35 GM	Opening the Black Box of Machine Learning Models Lee, Su-In / University of Washington
NIH 2019 R35 GM	Opening the Black Box of Machine Learning Models Lee, Su-In / University of Washington
NIH 2018 R35 GM	Opening the Black Box of Machine Learning Models Lee, Su-In / University of Washington

Comments

Be the first to comment on this grant

Recent in Grantomics:

Recently viewed grants:

Recently added grants: