Biomedical data is vastly increasing in quantity, scope, and generality, expanding opportunities to discover novel biological processes and clinically translatable outcomes. Machine learning (ML), a key technology in modern biology that addresses these changing dynamics, aims to infer meaningful interactions among variables by learning their statistical relationships from data consisting of measurements on variables across samples. Accurate inference of such interactions from big biological data can lead to novel biological discoveries, therapeutic targets, and predictive models for patient outcomes. However, a greatly increased hypothesis space, complex dependencies among variables, and complex ?black-box? ML models pose complex, open challenges. To meet these challenges, we have been developing innovative, rigorous, and principled ML techniques to infer reliable, accurate, and interpretable statistical relationships in various kinds of biological network inference problems, pushing the boundaries of both ML and biology. Fundamental limitations of current ML techniques leave many future opportunities to translate inferred statistical relationships into biological knowledge, as exemplified in a standard biomarker discovery problem ? an extremely important problem for precision medicine. Biomarker discovery using high-throughput molecular data (e.g., gene expression data) has significantly advanced our knowledge of molecular biology and genetics. The current approach attempts to find a set of features (e.g., gene expression levels) that best predict a phenotype and use the selected features, or molecular markers, to determine the molecular basis for the phenotype. However, the low success rates of replication in independent data and of reaching clinical practice indicate three challenges posed by current ML approach. First, high-dimensionality, hidden variables, and feature correlations create a discrepancy between predictability (i.e., statistical associations) and true biological interactions; we need new feature selection criteria to make the model better explain rather than simply predict phenotypes. Second, complex models (e.g., deep learning or ensemble models) can more accurately describe intricate relationships between genes and phenotypes than simpler, linear models, but they lack interpretability. Third, analyzing observational data without conducting interventional experiments does not prove causal relations. To address these problems, we propose an integrated machine learning methodology for learning interpretable models from data that will: 1) select interpretable features likely to provide meaningful phenotype explanations, 2) make interpretable predictions by estimating the importance of each feature to a prediction, and 3) iteratively validate and refine predictions through interventional experiments. For each challenge, we will develop a generalizable ML framework that focuses on different aspects of model interpretability and will therefore be applicable to any formerly intractable, high-impact healthcare problems. We will also demonstrate the effectiveness of each ML framework for a wide range of topics, from basic science to disease biology to bedside applications.

Public Health Relevance

The development of effective computational methods that can extract meaningful and interpretable signals from noisy, big data has become an integral part of biomedical research, which aims to discover novel biological processes and clinically translatable outcomes. The proposed research seeks to radically shift the current paradigm in data-driven discovery from ?learning a statistical model that best fits specific training data? to ?learning an explainable model? for a wide range of topics, from basic science to disease biology to bedside applications. Successful completion of this project will result in novel biological discoveries, therapeutic targets, predictive models for patient outcomes, and powerful computational frameworks generalizable to critical problems in various diseases.

Agency
National Institute of Health (NIH)
Institute
National Institute of General Medical Sciences (NIGMS)
Type
Unknown (R35)
Project #
1R35GM128638-01
Application #
9573854
Study Section
Special Emphasis Panel (ZGM1)
Program Officer
Brazhnik, Paul
Project Start
2018-07-01
Project End
2023-06-30
Budget Start
2018-07-01
Budget End
2019-06-30
Support Year
1
Fiscal Year
2018
Total Cost
Indirect Cost
Name
University of Washington
Department
Genetics
Type
Schools of Medicine
DUNS #
605799469
City
Seattle
State
WA
Country
United States
Zip Code
98195