Biologists can now gather complete sets of gene expression measurements, together with the concentrations of targeted proteins, from specific tissues. The presence and concentrations of these molecules serve as features for determining diagnostic patterns for particular states of development or disease. The approach to biomarker identification taken in this research seeks a set of features (here, gene expression levels) that best predicts an outcome (protein levels observed in the condition); a minimal version of this setup is sketched below. The identified features, the biomarkers, can help determine the molecular basis of the condition. Unfortunately, false-positive biomarkers are very common, as evidenced by low rates of replication in independent data sets and, consequently, by how rarely such markers reach applications such as clinical diagnostics. We seek to radically shift the current paradigm in biomarker discovery by resolving fundamental problems with the current approach: we will use novel, theoretically well-founded machine learning (ML) methods to learn interpretable models from data, and follow this with systematic experimental validation in model organisms. Our disease model is Alzheimer's disease (AD), an urgent national and international research priority. Amyloid plaques and neurofibrillary tangles are the hallmarks of AD; their building blocks are amyloid-beta and tau proteins, respectively. These proteins can be measured accurately in human brain tissue, as can global gene expression values. At present, we lack an understanding of which genes affect the formation of plaques and tangles, or of the protective and pathological responses to these toxic peptides.
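To make the setup above concrete, the following is a minimal, hypothetical sketch of the standard approach: sparse regression that selects genes whose expression best predicts a protein-level outcome. All variable names and data are illustrative placeholders, not the project's actual pipeline.

    # Standard biomarker-discovery setup: select genes whose expression
    # levels best predict a protein-level outcome. Data are simulated;
    # this is an illustration, not the project's method.
    import numpy as np
    from sklearn.linear_model import LassoCV

    rng = np.random.default_rng(0)
    n_samples, n_genes = 200, 1000          # expression profiles from tissue samples
    expression = rng.standard_normal((n_samples, n_genes))
    # Pretend a handful of genes truly drive the protein level (e.g., tau).
    true_drivers = [10, 42, 99]
    protein_level = (expression[:, true_drivers].sum(axis=1)
                     + 0.5 * rng.standard_normal(n_samples))

    # Sparse linear regression: nonzero coefficients are candidate biomarkers.
    model = LassoCV(cv=5).fit(expression, protein_level)
    candidates = np.flatnonzero(model.coef_)
    print("candidate biomarker genes:", candidates)

Because expression features are highly correlated, such a model often selects genes merely associated with the true drivers, which is exactly the replication problem described next.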
Biomarker discovery using high-throughput molecular data (e.g., gene expression data) has significantly advanced our knowledge of molecular biology and genetics. The current approach attempts to find a set of features (e.g., gene expression levels) that best predicts a phenotype, and then uses the selected features, the molecular markers, to determine the molecular basis of the phenotype. However, low rates of replication in independent data indicate three fundamental problems with this approach. First, high dimensionality, hidden variables, and feature correlations create a discrepancy between predictability (i.e., statistical association) and true biological interaction; we need new feature selection criteria that make models explain, rather than simply predict, phenotypes. Second, complex models (e.g., deep learning or ensemble models) can describe the intricate relationships between genes and phenotypes more accurately than simpler, linear models, but they lack interpretability. Third, analyzing observational data without conducting interventional experiments cannot establish causal relations.

To address these problems, we propose an integrated machine learning methodology for learning interpretable models from data by 1) selecting interpretable features, 2) making interpretable predictions, and 3) validating and refining predictions through interventional experiments. This approach has the following aims:

1. Develop the NEBULA (network-based unsupervised feature learning) framework to learn, from publicly available multi-omic data sets, interpretable features that are likely to provide meaningful phenotype explanations (see the illustrative sketch below).

2. Develop a unified framework, called SHAP (SHapley Additive exPlanations), to interpret the predictions of complex models by estimating the importance of each feature to a particular prediction (see the usage sketch below).

3. Validate and refine predictions through high-throughput gene-knockdown assays in powerful nematode models of proteotoxicity.

For further information, see the project website: http://suinlee.cs.washington.edu/projects/im3.
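The abstract does not specify NEBULA's algorithm. As one loose, hypothetical illustration of what "network-based unsupervised feature learning" can mean, the sketch below collapses individual genes into module-level features by averaging expression over connected components of a gene-gene interaction network; every name here is an assumption for illustration only.

    # Hypothetical illustration of network-based feature learning (aim 1):
    # derive module-level features from a gene network, since a module of
    # interacting genes is easier to interpret than one gene's coefficient.
    # This is NOT the actual NEBULA algorithm.
    import numpy as np
    import networkx as nx

    def module_features(expression, edges, n_genes):
        """expression: (samples, genes) array; edges: (gene_i, gene_j) pairs."""
        g = nx.Graph()
        g.add_nodes_from(range(n_genes))
        g.add_edges_from(edges)
        modules = [sorted(c) for c in nx.connected_components(g) if len(c) > 1]
        # Each learned feature is the mean expression of one network module.
        features = np.column_stack([expression[:, m].mean(axis=1) for m in modules])
        return features, modules

For aim 2, SHAP is available as the open-source `shap` Python package; a minimal usage sketch follows, with a placeholder model and simulated data standing in for gene features and a phenotype.

    # Per-prediction feature importances with the `shap` package (aim 2).
    # X and y are simulated placeholders, not project data.
    import numpy as np
    import shap
    import xgboost

    rng = np.random.default_rng(0)
    X = rng.standard_normal((200, 50))                 # placeholder gene features
    y = X[:, 0] * X[:, 1] + rng.standard_normal(200)   # a nonlinear phenotype

    model = xgboost.XGBRegressor(n_estimators=100).fit(X, y)
    explainer = shap.TreeExplainer(model)
    shap_values = explainer.shap_values(X)  # shape: (samples, features)
    # shap_values[i, j] estimates feature j's additive contribution to
    # sample i's prediction, even for complex, nonlinear models.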
This award reflects NSF's statutory mission and has been deemed worthy of support through evaluation using the Foundation's intellectual merit and broader impacts review criteria.