Genomics, GPUs, and Next Generation Computational Statistics

Sobel, Eric

Abstract

With the size of genetic data sets and their computational demands growing exponentially, concerns are rising about whether traditional statistical approaches and standard CPUs can deliver the needed analytical and computing power. Parallel computing has been touted for several years, but massively parallel CPU computers are enormously expensive and limited to a few national centers. Graphics processing unit (GPU) and many integrated core (MIC) coprocessors offer a far cheaper and more distributed solution. Each GPU or MIC card can run hundreds of computational threads simultaneously, and several cards fit inside a desktop computer. Today, almost all new laptop and desktop computers are equipped with multiple CPU cores and a GPU coprocessor. Thus, cheap hardware currently exists that promises a hundred-fold speedup of many basic computational procedures. Appropriate algorithm design and software development are the main hurdles hindering the exploitation of GPUs and MICs. This proposal targets this weak link in the chain of modern computing. By demonstrating the advantages of massively parallel processing on a few genetic problems, and by distributing general low-level software libraries for these and many other problems, we hope to catalyze the use of GPUs and MICs in genetics. The specific projects include: use of RNA-seq data for the discovery and analysis of isoforms, pedigree-informed genotype imputation, and analysis of pathogens' phenotype evolution. High-dimensional optimization is a common thread enabling these applications. We will pursue a promising new technique for optimization that is particularly well adapted to high dimensions and parallelization: proximal distance algorithms. This approach avoids major pitfalls of current state-of-the-art methods, especially shrinkage, which distorts parameter estimates and model selection.
Implementation of our demonstration projects on GPUs and MICs will require the production of subroutines of considerable general value in computational statistics. We intend to release our toolbox libraries to the open source community, including C/C++, Fortran, and R software wrappers. This may lead to a multiplier effect that will improve the computing climate in many disciplines throughout the health and physical sciences. All other application programs produced under this proposal will be freely distributed to the scientific community. Our record of producing and distributing usable parallel software with superior documentation shows our commitment to this philosophy.
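
The shrinkage issue mentioned in the abstract can be illustrated with a small sketch. It is not taken from the project's software; the function names `soft_threshold` and `project_sparse` and the toy numbers are assumptions made here for illustration. Soft thresholding, the proximal map of an L1 (lasso-style) penalty, pulls every surviving estimate toward zero, whereas projecting onto the set of vectors with at most k nonzero entries, one kind of set projection that a constraint-based method can use, leaves the retained estimates unchanged.

```python
def soft_threshold(x, lam):
    """Proximal map of lam * ||x||_1: zero small entries, shrink the rest by lam."""
    out = []
    for v in x:
        if abs(v) <= lam:
            out.append(0.0)
        elif v > 0:
            out.append(v - lam)
        else:
            out.append(v + lam)
    return out


def project_sparse(x, k):
    """Euclidean projection onto vectors with at most k nonzeros:
    keep the k largest-magnitude entries unchanged, zero the others."""
    order = sorted(range(len(x)), key=lambda i: abs(x[i]), reverse=True)
    keep = set(order[:k])
    return [v if i in keep else 0.0 for i, v in enumerate(x)]


# Toy effect-size estimates (hypothetical values, for illustration only).
estimates = [3.0, -0.2, 1.5, 0.1]

shrunk = soft_threshold(estimates, 0.5)   # large effects are biased toward zero
projected = project_sparse(estimates, 2)  # large effects are kept as estimated

print(shrunk)      # [2.5, 0.0, 1.0, 0.0]
print(projected)   # [3.0, 0.0, 1.5, 0.0]
```

Both operations select the same two entries here, but only the penalty-based version changes their magnitudes, which is the distortion of parameter estimates the abstract refers to.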

Public Health Relevance

The human genome project and its offshoots have dramatically increased the amount of genetic data. In fact, our ability to collect genetic information has currently far outstripped our ability to make use of this information in understanding the basis of disease and human diversity. Our aim is to develop, implement, and freely distribute new, more efficient computational and statistical approaches that make full use of the vast amount of genetic data, and thus improve genetic researchers' ability to map and characterize genes that lead to human diseases and to trait variation.

Funding Agency

Agency: National Institutes of Health (NIH)
Institute: National Human Genome Research Institute (NHGRI)
Type: Research Project (R01)
Project #: 5R01HG006139-07
Application #: 9322875
Study Section: Biodata Management and Analysis Study Section (BDMA)
Program Officer: Sofia, Heidi J

Project Start: 2011-08-26
Project End: 2019-06-30
Budget Start: 2017-07-01
Budget End: 2019-06-30
Support Year: 7
Fiscal Year: 2017
Total Cost
Indirect Cost

Institution

Name: University of California Los Angeles
Department: Genetics
Type: Schools of Medicine
DUNS #: 092530369

City: Los Angeles
State: CA
Country: United States
Zip Code: 90095

Related projects


NIH 2020 R01 HG	Genomics, EHRs, GPUs, and Next Generation Computational Statistics Sobel, Eric / University of California Los Angeles
NIH 2017 R01 HG	Genomics, GPUs, and Next Generation Computational Statistics Sobel, Eric / University of California Los Angeles
NIH 2016 R01 HG	Genomics, GPUs, and Next Generation Computational Statistics Sobel, Eric / University of California Los Angeles
NIH 2015 R01 HG	Genomics, GPUs, and Next Generation Computational Statistics Sobel, Eric / University of California Los Angeles
NIH 2014 R01 HG	Genomics GPUs and next generation computational statistics Sobel, Eric / University of California Los Angeles	$349,937
NIH 2013 R01 HG	Genomics GPUs and next generation computational statistics Sobel, Eric / University of California Los Angeles	$341,953
NIH 2012 R01 HG	Genomics GPUs and next generation computational statistics Sobel, Eric / University of California Los Angeles	$359,174
NIH 2011 R01 HG	Genomics GPUs and next generation computational statistics Sobel, Eric / University of California Los Angeles	$359,971

Publications

Suchard, Marc A; Lemey, Philippe; Baele, Guy et al. (2018) Bayesian phylogenetic and phylodynamic data integration using BEAST 1.10. Virus Evol 4:vey016

Ho, Lam Si Tung; Xu, Jason; Crawford, Forrest W et al. (2018) Birth/birth-death processes and their computable transition probabilities with biological applications. J Math Biol 76:911-944

Tolkoff, Max R; Alfaro, Michael E; Baele, Guy et al. (2018) Phylogenetic Factor Analysis. Syst Biol 67:384-399

Crawford, Forrest W; Ho, Lam Si Tung; Suchard, Marc A (2018) Computational methods for birth-death processes. Wiley Interdiscip Rev Comput Stat 10:

Cybis, Gabriela B; Sinsheimer, Janet S; Bedford, Trevor et al. (2018) Bayesian nonparametric clustering in phylogenetics: modeling antigenic evolution in influenza. Stat Med 37:195-206

Dudas, Gytis; Carvalho, Luiz Max; Bedford, Trevor et al. (2017) Virus genomes reveal factors that spread and sustained the Ebola epidemic. Nature 544:309-315

Keys, Kevin L; Chen, Gary K; Lange, Kenneth (2017) Iterative hard thresholding for model selection in genome-wide association studies. Genet Epidemiol 41:756-768

Baele, Guy; Lemey, Philippe; Rambaut, Andrew et al. (2017) Adaptive MCMC in Bayesian phylogenetics: an application to analyzing partitioned data in BEAST. Bioinformatics 33:1798-1805

Zhang, Yiwen; Zhou, Hua; Zhou, Jin et al. (2017) Regression Models For Multivariate Count Data. J Comput Graph Stat 26:1-13

Baele, Guy; Suchard, Marc A; Rambaut, Andrew et al. (2017) Emerging Concepts of Data Integration in Pathogen Phylodynamics. Syst Biol 66:e47-e65

Showing the most recent 10 out of 85 publications
