With the size of genetic data sets and their computational demands growing exponentially, concerns are rising about whether traditional statistical approaches and standard CPUs can deliver the needed analytical and computing power. Parallel computing has been touted for several years, but massively parallel CPU computers are enormously expensive and limited to a few national centers. Graphics processing unit (GPU) and many integrated core (MIC) coprocessors offer a far cheaper and more distributed solution. Each GPU or MIC card can run hundreds of computational threads simultaneously, and several cards fit inside a desktop computer. Today, almost all new laptop and desktop computers are equipped with multiple CPU cores and a GPU coprocessor. Thus, cheap hardware currently exists that promises a hundred-fold speedup of many basic computational procedures. Appropriate algorithm design and software development remain the main hurdle hindering the exploitation of GPUs and MICs. This proposal targets this weak link in the chain of modern computing. By demonstrating the advantages of massively parallel processing on a few genetic problems, and by distributing general low-level software libraries for these and many other problems, we hope to catalyze the use of GPUs and MICs in genetics. The specific projects include: use of RNA-seq data for the discovery and analysis of isoforms, pedigree-informed genotype imputation, and analysis of pathogens' phenotype evolution. High-dimensional optimization is a common thread enabling these applications. We will pursue a promising new optimization technique that is particularly well adapted to high dimensions and parallelization, the proximal distance algorithm. This procedure avoids major pitfalls of current state-of-the-art methods, especially shrinkage, which distorts parameter estimates and model selection.
Implementation of our demonstration projects on GPUs and MICs will require the production of subroutines of considerable general value in computational statistics. We intend to release our toolbox libraries to the open source community, including C/C++, Fortran, and R software wrappers. This may lead to a multiplier effect that will improve the computing climate in many disciplines throughout the health and physical sciences. All other application programs produced under this proposal will be freely distributed to the scientific community. Our record of producing and distributing usable parallel software with superior documentation shows our commitment to this philosophy.

Public Health Relevance

The Human Genome Project and its offshoots have dramatically increased the amount of genetic data. In fact, our ability to collect genetic information has far outstripped our ability to use this information to understand the basis of disease and human diversity. Our aim is to develop, implement, and freely distribute new, more efficient computational and statistical approaches that make full use of the vast amount of genetic data, and thus improve genetic researchers' ability to map and characterize genes that lead to human diseases and to trait variation.

Agency
National Institutes of Health (NIH)
Institute
National Human Genome Research Institute (NHGRI)
Type
Research Project (R01)
Project #
5R01HG006139-07
Application #
9322875
Study Section
Biodata Management and Analysis Study Section (BDMA)
Program Officer
Sofia, Heidi J
Project Start
2011-08-26
Project End
2019-06-30
Budget Start
2017-07-01
Budget End
2019-06-30
Support Year
7
Fiscal Year
2017
Name
University of California Los Angeles
Department
Genetics
Type
Schools of Medicine
DUNS #
092530369
City
Los Angeles
State
CA
Country
United States
Zip Code
90095
Suchard, Marc A; Lemey, Philippe; Baele, Guy et al. (2018) Bayesian phylogenetic and phylodynamic data integration using BEAST 1.10. Virus Evol 4:vey016
Ho, Lam Si Tung; Xu, Jason; Crawford, Forrest W et al. (2018) Birth/birth-death processes and their computable transition probabilities with biological applications. J Math Biol 76:911-944
Tolkoff, Max R; Alfaro, Michael E; Baele, Guy et al. (2018) Phylogenetic Factor Analysis. Syst Biol 67:384-399
Crawford, Forrest W; Ho, Lam Si Tung; Suchard, Marc A (2018) Computational methods for birth-death processes. Wiley Interdiscip Rev Comput Stat 10:
Cybis, Gabriela B; Sinsheimer, Janet S; Bedford, Trevor et al. (2018) Bayesian nonparametric clustering in phylogenetics: modeling antigenic evolution in influenza. Stat Med 37:195-206
Dudas, Gytis; Carvalho, Luiz Max; Bedford, Trevor et al. (2017) Virus genomes reveal factors that spread and sustained the Ebola epidemic. Nature 544:309-315
Keys, Kevin L; Chen, Gary K; Lange, Kenneth (2017) Iterative hard thresholding for model selection in genome-wide association studies. Genet Epidemiol 41:756-768
Baele, Guy; Lemey, Philippe; Rambaut, Andrew et al. (2017) Adaptive MCMC in Bayesian phylogenetics: an application to analyzing partitioned data in BEAST. Bioinformatics 33:1798-1805
Zhang, Yiwen; Zhou, Hua; Zhou, Jin et al. (2017) Regression Models For Multivariate Count Data. J Comput Graph Stat 26:1-13
Baele, Guy; Suchard, Marc A; Rambaut, Andrew et al. (2017) Emerging Concepts of Data Integration in Pathogen Phylodynamics. Syst Biol 66:e47-e65

Showing the most recent 10 out of 85 publications