Alzheimer's disease (AD) is an urgent national and international research priority. Amyloid plaques and neurofibrillary tangles are the hallmarks of AD. Their building blocks are amyloid-β (Aβ) and tau, respectively. At present, we lack an understanding of the set of genes that affect the formation of plaques and tangles, along with the protective and pathological responses to these toxic peptides. Biologists are now gathering gene expression data and Aβ and tau measures from human brain tissues. The current approach attempts to find a set of features (here, gene expression levels) that best predict an outcome (Aβ or tau level). The identified features, or biomarkers, can help determine the molecular basis for plaques and tangles. Unfortunately, false positive biomarkers are very common, as evidenced by low rates of replication in independent data and by the small fraction (less than 1%) that reach clinical practice. We seek to radically shift the current paradigm in biomarker discovery by resolving three fundamental problems with the current approach, using novel, theoretically well-founded machine learning (ML) methods to learn interpretable models from data.
Aim 1. Learn an interpretable feature representation from publicly available, high-throughput brain data. High dimensionality, hidden variables, and complex feature correlations create a discrepancy between predictability (i.e., observed statistical associations) and true biological interactions. To increase the chance of identifying true positive biomarkers, we need new feature selection criteria that learn a model that explains the outcome rather than simply predicting it. To do so, our proposed ML algorithms will identify the genes that are likely to give a meaningful explanation of the outcome (Aβ or tau level) by inferring, from many existing brain datasets, both the functions of genes in the cellular processes contributing to AD and the gene interaction network.
Aim 2. Make interpretable predictions using a unified framework to explain model predictions. Due to disease heterogeneity, complex models (e.g., deep learning or ensemble models) often describe relationships between genes and an outcome more accurately than simpler, linear models, but they lack interpretability. We will develop a novel ML framework that interprets complex model predictions by estimating the importance of each feature to a specific prediction. These estimates will identify features of high importance for each individual as personalized markers, and will be used to classify subjects.
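To make the idea of per-prediction feature importance concrete, the sketch below computes exact Shapley values for a single prediction of a small nonlinear model. Shapley values are used here only as one well-known attribution scheme; the abstract does not say that the proposed framework takes this form. The model `toy_model`, the three "genes", the all-zero `baseline`, and the `subject` values are hypothetical and chosen purely for illustration.

```python
from itertools import combinations
from math import factorial


def shapley_values(predict, x, baseline):
    """Attribute predict(x) - predict(baseline) to each feature of x.

    Exact computation over all feature subsets, so it is only practical
    for a handful of features. Features outside a subset are set to
    their baseline value (an assumption of this sketch).
    """
    n = len(x)
    phi = [0.0] * n
    for i in range(n):
        others = [j for j in range(n) if j != i]
        for k in range(n):
            # Shapley weight for subsets of size k that exclude feature i
            weight = factorial(k) * factorial(n - k - 1) / factorial(n)
            for subset in combinations(others, k):
                without_i = [x[j] if j in subset else baseline[j] for j in range(n)]
                with_i = list(without_i)
                with_i[i] = x[i]
                phi[i] += weight * (predict(with_i) - predict(without_i))
    return phi


def toy_model(g):
    # Hypothetical outcome model: genes 0 and 1 interact, gene 2 acts additively.
    return 2.0 * g[0] * g[1] + 0.5 * g[2]


baseline = [0.0, 0.0, 0.0]  # hypothetical reference expression profile
subject = [1.0, 1.0, 2.0]   # hypothetical expression levels for one individual
phi = shapley_values(toy_model, subject, baseline)
print(phi)
```

In this toy case each of the three genes receives an attribution of approximately 1.0, and the attributions sum to the difference between the subject's prediction and the baseline prediction. The interaction between genes 0 and 1 is split evenly between them, which a single global coefficient per gene could not express.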
Aim 3. Validate the identified candidate biomarkers using powerful worm models of AD. Analyzing observational data without performing interventional experiments cannot prove causal relationships. In collaboration with co-I Matt Kaeberlein, we will use powerful nematode models of AD to test our hypotheses on the role of certain genes as disease modifiers, and we will develop a new way to refine the models based on this knowledge. Successful completion of this project will reveal the previously unknown molecular basis for Aβ and tau levels, suggest potential therapeutic targets, and yield general ML techniques widely applicable to many other data science problems.

Public Health Relevance

In the United States alone, someone receives an Alzheimer's diagnosis every 66 seconds, and the disease has become the 6th leading cause of death in this country. Alzheimer's disease (AD) currently has no cure, no prevention, and no treatment to reverse or halt its deadly progression. The recent, rapid growth of gene expression data from human brain tissues holds great promise for identifying therapeutic targets, but extremely low success rates in identifying true positive biomarkers indicate fundamental problems with the computational approach currently in use. We seek to revolutionize the way we identify drug targets by developing novel machine learning techniques that extract meaningful and interpretable signals from large, noisy data, combined with biological validation in an animal model of AD.

National Institutes of Health (NIH)
National Institute on Aging (NIA)
Research Project (R01)
Study Section
Genomics, Computational Biology and Technology Study Section (GCAT)
Program Officer
Petanceska, Suzana
University of Washington
Biostatistics & Other Math Sci
Biomed Engr/Col Engr/Engr Sta
United States