Genomics, EHRs, GPUs, and Next Generation Computational Statistics

Sobel, Eric

Abstract

The future challenges of statistical genetics are enormous. Data sets continue to grow; studies with 10^6 cases and 10^7 markers have become feasible, but current algorithms and software do not scale to this size. We need to rethink and rebuild many of our statistical analysis techniques and tools to scale effectively. In addition, health data will soon be commonly collected from mobile and wearable devices, dramatically increasing its volume and utility. Precision health and predictive medicine raise the stakes even further. Concurrently, the nature of computing is rapidly changing. To take advantage of hardware advances, particularly ubiquitous parallel computing, new statistical approaches and algorithms and new programming paradigms must be brought online. This renewal proposal targets the application of state-of-the-art statistical techniques and tools to develop genetic analysis algorithms that can scale to studies with millions of subjects, such as the US Department of Veterans Affairs' Million Veteran Program (MVP) and the UK Biobank. Biobank-scale data sets have many benefits, particularly the potential power to detect the subtle effects of each of the many genes involved in common diseases. Another benefit is that these data sets can be more representative of the populace by including large numbers of people from multiple ancestries, different social strata, and all sexes. Analyzing these massive data sets effectively and efficiently requires advances in the current statistical genetics tools. Effective statistical analysis takes many forms: algorithms that converge in fewer iterations, powerful statistics that accommodate all available data, and computational methods that take advantage of massively parallel computing hardware such as graphics processing units (GPUs) and other coprocessors.
We will deliver algorithms that can directly handle biobank-scale data sets for many computationally challenging statistical genetics tasks, including genome-wide association studies (GWAS) with trait data from electronic health records (EHRs). More generally, our algorithm focus will benefit all scientific fields driven by computational statistics and high-dimensional optimization. Of course, for statistical algorithm development to be immediately useful, it must be accompanied by fast, easy-to-use software. We will promptly deliver open-source software that (1) enables interactive and reproducible analyses with informative intermediate results, (2) provides quality graphics, (3) scales to big data analytics, (4) embraces parallel and distributed computing, (5) adapts to rapid hardware evolution, (6) allows cloud computing, and (7) fosters easy communication between clinicians, geneticists, statisticians, and computer scientists. Recent breakthroughs in computer languages bring all these goals within reach. Our overall objective is the design and construction of state-of-the-art statistical genetics algorithms and software for modern, massive genetic and EHR data. Numerical accuracy, computational efficiency, and software sustainability are our priorities. We will deliver a unified, cross-platform, high-level, reproducible, interactive analysis environment that is fast and efficient even for biobank-scale data sets.

Public Health Relevance

The Human Genome Project and electronic health records (EHRs) have dramatically increased the amount of genetic data. In fact, our ability to collect genetic information has far outstripped our ability to make use of this information in understanding the basis of human disease and diversity. Our aim is to design, develop, and distribute new, more efficient statistical and computational approaches that make full use of the vast amount of genetic data, and thus improve genetic researchers' ability to map and characterize genes that lead to human diseases and to trait variation.

Funding Agency

Agency: National Institutes of Health (NIH)
Institute: National Human Genome Research Institute (NHGRI)
Type: Research Project (R01)
Project #: 2R01HG006139-08A1
Application #: 10051250
Study Section: Biostatistical Methods and Research Design Study Section (BMRD)
Program Officer: Sofia, Heidi J

Project Start: 2011-08-26
Project End: 2024-06-30
Budget Start: 2020-09-17
Budget End: 2021-06-30
Support Year: 8
Fiscal Year: 2020
Total Cost
Indirect Cost

Institution

Name: University of California Los Angeles
Department: Genetics
Type: Schools of Medicine
DUNS #: 092530369

City: Los Angeles
State: CA
Country: United States
Zip Code: 90095

Related projects


NIH 2020 R01 HG	Genomics, EHRs, GPUs, and Next Generation Computational Statistics Sobel, Eric / University of California Los Angeles
NIH 2017 R01 HG	Genomics, GPUs, and Next Generation Computational Statistics Sobel, Eric / University of California Los Angeles
NIH 2016 R01 HG	Genomics, GPUs, and Next Generation Computational Statistics Sobel, Eric / University of California Los Angeles
NIH 2015 R01 HG	Genomics, GPUs, and Next Generation Computational Statistics Sobel, Eric / University of California Los Angeles
NIH 2014 R01 HG	Genomics, GPUs, and Next Generation Computational Statistics Sobel, Eric / University of California Los Angeles	$349,937
NIH 2013 R01 HG	Genomics, GPUs, and Next Generation Computational Statistics Sobel, Eric / University of California Los Angeles	$341,953
NIH 2012 R01 HG	Genomics, GPUs, and Next Generation Computational Statistics Sobel, Eric / University of California Los Angeles	$359,174
NIH 2011 R01 HG	Genomics, GPUs, and Next Generation Computational Statistics Sobel, Eric / University of California Los Angeles	$359,971

Publications

Suchard, Marc A; Lemey, Philippe; Baele, Guy et al. (2018) Bayesian phylogenetic and phylodynamic data integration using BEAST 1.10. Virus Evol 4:vey016

Ho, Lam Si Tung; Xu, Jason; Crawford, Forrest W et al. (2018) Birth/birth-death processes and their computable transition probabilities with biological applications. J Math Biol 76:911-944

Tolkoff, Max R; Alfaro, Michael E; Baele, Guy et al. (2018) Phylogenetic Factor Analysis. Syst Biol 67:384-399

Crawford, Forrest W; Ho, Lam Si Tung; Suchard, Marc A (2018) Computational methods for birth-death processes. Wiley Interdiscip Rev Comput Stat 10:

Cybis, Gabriela B; Sinsheimer, Janet S; Bedford, Trevor et al. (2018) Bayesian nonparametric clustering in phylogenetics: modeling antigenic evolution in influenza. Stat Med 37:195-206

Dudas, Gytis; Carvalho, Luiz Max; Bedford, Trevor et al. (2017) Virus genomes reveal factors that spread and sustained the Ebola epidemic. Nature 544:309-315

Keys, Kevin L; Chen, Gary K; Lange, Kenneth (2017) Iterative hard thresholding for model selection in genome-wide association studies. Genet Epidemiol 41:756-768

Baele, Guy; Lemey, Philippe; Rambaut, Andrew et al. (2017) Adaptive MCMC in Bayesian phylogenetics: an application to analyzing partitioned data in BEAST. Bioinformatics 33:1798-1805

Zhang, Yiwen; Zhou, Hua; Zhou, Jin et al. (2017) Regression Models For Multivariate Count Data. J Comput Graph Stat 26:1-13

Baele, Guy; Suchard, Marc A; Rambaut, Andrew et al. (2017) Emerging Concepts of Data Integration in Pathogen Phylodynamics. Syst Biol 66:e47-e65

Showing the most recent 10 out of 85 publications
