The future challenges of statistical genetics are enormous. Data sets continue to grow; studies with 10^6 cases and 10^7 markers have become feasible, but current algorithms and software do not scale to this size. We need to rethink and rebuild many of our statistical analysis techniques and tools to scale effectively. In addition, health data will soon be commonly collected from mobile and wearable devices, dramatically increasing its volume and utility. Precision health and predictive medicine raise the stakes even further. Concurrently, the nature of computing is rapidly changing. To take advantage of hardware advances, particularly ubiquitous parallel computing, new statistical approaches and algorithms and new programming paradigms must be brought online. This renewal proposal targets the application of state-of-the-art statistical techniques and tools to develop genetic analysis algorithms that can scale to studies with millions of subjects, such as the US Department of Veterans Affairs' Million Veteran Program (MVP) and the UK Biobank. Biobank-scale data sets have many benefits, particularly the potential power to detect the subtle effects of each of the many genes involved in common diseases. Another benefit is that these data sets can be more representative of the populace by including large numbers of people from multiple ancestries, different social strata, and all sexes. Analyzing these massive data sets effectively and efficiently requires advances in current statistical genetics tools. Effective statistical analysis takes many forms: algorithms that converge in fewer iterations, powerful statistics that accommodate all available data, and computational methods that take advantage of massively parallel computing hardware such as graphics processing units (GPUs) and other coprocessors.
We will deliver algorithms that can directly handle biobank-scale data sets for many computationally challenging statistical genetics tasks, including genome-wide association studies (GWAS) with trait data from electronic health records (EHRs). More generally, our algorithmic focus will benefit all scientific fields driven by computational statistics and high-dimensional optimization. Of course, for statistical algorithm development to be immediately useful, it must be accompanied by fast, easy-to-use software. We will promptly deliver open-source software that (1) enables interactive and reproducible analyses with informative intermediate results, (2) provides quality graphics, (3) scales to big data analytics, (4) embraces parallel and distributed computing, (5) adapts to rapid hardware evolution, (6) allows cloud computing, and (7) fosters easy communication between clinicians, geneticists, statisticians, and computer scientists. Recent breakthroughs in computer languages bring all these goals within reach. Our overall objective is the design and construction of state-of-the-art statistical genetics algorithms and software for modern, massive genetic and EHR data. Numerical accuracy, computational efficiency, and software sustainability are our priorities. We will deliver a unified, cross-platform, high-level, reproducible, interactive analysis environment that is fast and efficient even for biobank-scale data sets.

Public Health Relevance

The Human Genome Project and electronic health records (EHRs) have dramatically increased the amount of genetic data. In fact, our ability to collect genetic information has far outstripped our ability to make use of this information in understanding the basis of human disease and diversity. Our aim is to design, develop, and distribute new, more efficient statistical and computational approaches that make full use of the vast amount of genetic data, and thus improve genetic researchers' ability to map and characterize genes that lead to human diseases and to trait variation.

Agency: National Institutes of Health (NIH)
Institute: National Human Genome Research Institute (NHGRI)
Type: Research Project (R01)
Project #: 2R01HG006139-08A1
Application #: 10051250
Study Section: Biostatistical Methods and Research Design Study Section (BMRD)
Program Officer: Sofia, Heidi J
Project Start: 2011-08-26
Project End: 2024-06-30
Budget Start: 2020-09-17
Budget End: 2021-06-30
Support Year: 8
Fiscal Year: 2020
Total Cost:
Indirect Cost:

Name: University of California Los Angeles
Department: Genetics
Type: Schools of Medicine
DUNS #: 092530369
City: Los Angeles
State: CA
Country: United States
Zip Code: 90095

Suchard, Marc A; Lemey, Philippe; Baele, Guy et al. (2018) Bayesian phylogenetic and phylodynamic data integration using BEAST 1.10. Virus Evol 4:vey016
Ho, Lam Si Tung; Xu, Jason; Crawford, Forrest W et al. (2018) Birth/birth-death processes and their computable transition probabilities with biological applications. J Math Biol 76:911-944
Tolkoff, Max R; Alfaro, Michael E; Baele, Guy et al. (2018) Phylogenetic Factor Analysis. Syst Biol 67:384-399
Crawford, Forrest W; Ho, Lam Si Tung; Suchard, Marc A (2018) Computational methods for birth-death processes. Wiley Interdiscip Rev Comput Stat 10:
Cybis, Gabriela B; Sinsheimer, Janet S; Bedford, Trevor et al. (2018) Bayesian nonparametric clustering in phylogenetics: modeling antigenic evolution in influenza. Stat Med 37:195-206
Dudas, Gytis; Carvalho, Luiz Max; Bedford, Trevor et al. (2017) Virus genomes reveal factors that spread and sustained the Ebola epidemic. Nature 544:309-315
Keys, Kevin L; Chen, Gary K; Lange, Kenneth (2017) Iterative hard thresholding for model selection in genome-wide association studies. Genet Epidemiol 41:756-768
Baele, Guy; Lemey, Philippe; Rambaut, Andrew et al. (2017) Adaptive MCMC in Bayesian phylogenetics: an application to analyzing partitioned data in BEAST. Bioinformatics 33:1798-1805
Zhang, Yiwen; Zhou, Hua; Zhou, Jin et al. (2017) Regression Models For Multivariate Count Data. J Comput Graph Stat 26:1-13
Baele, Guy; Suchard, Marc A; Rambaut, Andrew et al. (2017) Emerging Concepts of Data Integration in Pathogen Phylodynamics. Syst Biol 66:e47-e65

Showing the most recent 10 out of 85 publications