With ageing populations worldwide, neurodegenerative diseases are placing an ever-increasing burden on long-term well-being, healthcare costs and family life. Despite decades of research and enormous investment, no disease-modifying treatment is available for the most common of these diseases: Alzheimer's disease (AD). Most of these efforts, so far unsuccessful, have focused on one potential cause of AD: amyloid-β aggregation. Combining population-scale data collection, human genetics and machine learning provides a way forward to uncover and characterize new causal cellular processes involved in AD. This would provide an array of potential therapeutic targets, increasing the chance that one will be more easily modulated than the amyloid-β pathway. AD-specific genomic datasets of unprecedented scale are being actively collected: whole genome sequencing (WGS) from ~20k individuals; gene expression (RNA-seq) and epigenomics (ATAC-seq, histone ChIP-seq) from >1000 post-mortem AD brains; and single-cell transcriptomes and similar modalities in peripheral and brain-resident innate immune cells (which we and others have shown to be AD-relevant). Effectively integrating these diverse data to better understand AD represents a substantial computational challenge, both in terms of data scale and analysis complexity. This proposal leverages state-of-the-art deep learning (DL) and machine learning (ML), combined with human genetic analyses, to address this challenge. We will train DL models to predict epigenomic signals and RNA splicing from genomic sequence, enabling in silico mutagenesis to estimate the functional impact (a "delta score") of any genetic variant. The delta scores will be used in genetic analyses that distinguish causal associations: cellular changes that drive AD pathogenesis rather than downstream or side effects of disease. Delta scores will aid in associating both rare and common variants with AD.
To achieve sufficient power, rare variants must be aggregated (e.g. across a gene); delta scores will allow many likely non-functional (particularly non-coding) variants to be filtered out. Most common variants identified in AD genome-wide association studies (GWAS) are merely correlated with the causal variant through linkage disequilibrium (LD). Delta scores, combined with trans-ethnic GWAS, will enable estimation of the likely causal variant(s). These analyses will highlight variants and genes involved in AD. However, genes do not operate in a vacuum, so robust probabilistic ML will be used to learn cell-type- and disease-specific gene regulatory networks from sorted bulk and single-cell RNA-seq. The detected networks will be integrated with our genetic findings to discover network neighborhoods and pathways especially enriched in AD variants. Such pathways will be prime candidates for future functional and therapeutic studies of AD.
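
As an illustration only, the delta score idea and its use in rare-variant aggregation can be sketched in a few lines of Python. This is a minimal sketch under stated assumptions, not the proposal's method: `toy_model` is a hypothetical linear stand-in for a trained DL sequence model, and all function names, the weight matrix and the filtering threshold are invented for the example.

```python
import numpy as np

BASES = "ACGT"

def one_hot(seq):
    """Encode a DNA string as a (length, 4) one-hot matrix (columns: A, C, G, T)."""
    idx = [BASES.index(b) for b in seq]
    out = np.zeros((len(seq), 4))
    out[np.arange(len(seq)), idx] = 1.0
    return out

def toy_model(seq, weights):
    """Hypothetical stand-in for a trained sequence-to-signal model.

    ASSUMPTION: a simple linear score; a real model would be a deep network
    predicting e.g. ATAC-seq signal or splicing from sequence.
    """
    return float((one_hot(seq) * weights).sum())

def delta_score(ref_seq, pos, alt, weights):
    """In silico mutagenesis: prediction for the alternate allele minus the reference."""
    alt_seq = ref_seq[:pos] + alt + ref_seq[pos + 1:]
    return toy_model(alt_seq, weights) - toy_model(ref_seq, weights)

def filtered_burden(genotypes, deltas, threshold):
    """Per-individual count of rare alleles at variants with |delta| >= threshold.

    genotypes: (n_individuals, n_variants) array of alternate-allele counts.
    deltas:    (n_variants,) array of delta scores.
    ASSUMPTION: a hard threshold is one simple way to drop likely
    non-functional variants before aggregating over a gene.
    """
    keep = np.abs(np.asarray(deltas)) >= threshold
    return np.asarray(genotypes)[:, keep].sum(axis=1)

# Toy weights: only position 2 matters (C raises the signal, T lowers it).
weights = np.zeros((5, 4))
weights[2] = [0.0, 1.0, 0.0, -1.0]

ref = "ACGTA"
deltas = [delta_score(ref, 2, "C", weights),   # variant predicted to raise signal
          delta_score(ref, 0, "G", weights),   # variant with no predicted effect
          delta_score(ref, 2, "T", weights)]   # variant predicted to lower signal

genotypes = np.array([[1, 0, 2],
                      [0, 1, 0]])
burden = filtered_burden(genotypes, deltas, threshold=0.5)
```

Here the second variant is excluded from the burden count because its predicted effect is zero, so only carriers of predicted-functional alleles contribute to the gene-level score.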

Public Health Relevance

The goal of this research is to use machine learning algorithms to work out which genetic differences in the genomes of Alzheimer's disease (AD) patients might have caused their disease. We will do this by learning computational models of how genes are controlled by genetic sequence and by other genes. The proposed study will discover which genes, pathways and molecular mechanisms are involved in AD, which will provide novel therapeutic targets for AD patients.

National Institutes of Health (NIH)
National Institute on Aging (NIA)
Research Project--Cooperative Agreements (U01)
Study Section
Special Emphasis Panel (ZAG1)
Program Officer
Miller, Marilyn
Icahn School of Medicine at Mount Sinai
Schools of Medicine
New York
United States