Large-scale population biobanks around the world, the disease-focused NHGRI Genome Sequencing Program (GSP), and the United States' All of Us Precision Medicine Initiative project will generate massive genomic datasets combined with disease outcomes and other health measurements. These genomic studies will identify genomic variants relevant to health and disease. However, the significance of each association in the context of all possible associations identified will remain unclear if the data are analyzed separately. There is a growing recognition that most traits are polygenic. In addition, it is increasingly appreciated that pleiotropy is pervasive. Due to privacy concerns, it is challenging to share all genotype and phenotype data. Methods that can perform inference on summary-level data, e.g., p-values, effect size estimates, and allele frequencies, will facilitate our understanding of the genetics of human diseases and health. Here, we propose to develop software for large-scale inference of the genetics of lifestyle measures, biomarkers, and common and rare diseases. Achieving this goal requires expertise in medical and population genetics, statistical methods development, and management of large-scale databases. The project has three main objectives. First, we will create the Global Biobank Engine: a powerful, interactive web platform for inference of the genetics of lifestyle measures, biomarkers, and common and rare diseases. We will expand its features by implementing quality control visualizations and methods for flagging variants and phenotypes. We will add tools for study design that use empirical data to estimate statistical power, and create a flexible framework for statistical models that jointly analyze multiple phenotypes while controlling for false-positive and false-negative findings. Second, we will improve Global Biobank Engine performance, scalability, and accessibility to support future population biobanks and targeted common and rare disease studies.
We will create a hosted, secure, and cost-effective cloud-based community resource, and design a database system that reduces the loading time for genetic association studies from hours to minutes and allows statistical algorithms to be streamed directly to the genetic data. Third, we will improve genomic interpretation, visualization, and data sharing to dramatically increase the rate of translational discoveries by implementing novel analysis methods. We will support new variant annotation methods and integrate coding and non-coding information, including data from large-scale epigenomics studies, for variant- and gene-level inference. We will implement new Bayesian statistical models written in probabilistic programming languages, as well as sparse canonical correlation analysis and truncated singular value decomposition. PI Rivas and his team have ample experience with NIH-funded consortia, and they are dedicated to the overall mission of NIH and its funded investigators to uncover new knowledge that will lead to better health for everyone.
The goal of this project is to empower and democratize the discovery and inference process for genetics research of biomarkers, lifestyle measures, and common and rare diseases around the world by developing the Global Biobank Engine. Of particular importance is ensuring we have powerful tools and statistical methods for analyzing summary statistic data from population biobanks and disease-focused genome sequencing programs. Achieving this goal requires expertise across many domains of knowledge, including medical and population genomics, algorithm development for disease mapping, and large-scale databases.