Statistical and computational analysis in whole genome sequencing studies.

Wong, Wing

Abstract

This project will investigate several issues arising from the statistical and computational analysis of whole genome sequencing (WGS) based genomics studies. In the area of data management in WGS studies, we address the rapidly increasing cost associated with the transfer and storage of the massive files for the sequence reads and their associated quality scores. We will develop data compression methods to achieve severalfold further compression beyond current standards, with minimal incurred error. In the area of secondary analysis, we will develop new statistical learning methods to improve variant quality score recalibration and to filter out unreliable calls. This will improve the reliability of the key information provided by WGS data, namely the variant calls indicating the locations where the genome differs from the reference and the nature of the differences. We will study methods for case-control studies based on WGS. In particular, we will develop statistical models that integrate information from multiple types of variants to obtain more powerful tests of association. We will apply the methods developed in this aim to the analysis of WGS data from a study on abdominal aortic aneurysm. Finally, we will address selected new questions associated with population-scale WGS projects. Several national programs have recently been initiated to generate WGS data for hundreds of thousands of individuals with longitudinal medical records. The availability of such comprehensive data on a population scale will open up a rich frontier for genome medicine and will pose many new challenges for statistical analysis. We will formulate some of these new challenges and develop the statistical methods needed to meet them.

Public Health Relevance

The research in this project concerns the design and implementation of statistical and computational methods for the analysis of data from whole genome sequencing studies. Methods will be developed for sequence quality score compression, variant call filtering, case-control association analysis, and mega-cohort analysis based on whole genome sequencing.

Funding Agency

Agency: National Institutes of Health (NIH)
Institute: National Human Genome Research Institute (NHGRI)
Type: Research Project (R01)
Project #: 1R01HG007834-01
Application #: 8750827
Study Section: Biostatistical Methods and Research Design Study Section (BMRD)
Program Officer: Brooks, Lisa

Project Start: 2014-09-22
Project End: 2017-06-30
Budget Start: 2014-09-22
Budget End: 2015-06-30
Support Year: 1
Fiscal Year: 2014
Total Cost: $300,000
Indirect Cost: $94,741

Institution

Name: Stanford University
Department: Biostatistics & Other Math Sci
Type: Schools of Arts and Sciences
DUNS #: 009214214

City: Stanford
State: CA
Country: United States
Zip Code: 94305

Related projects


NIH 2016 R01 HG	Statistical and computational analysis in whole genome sequencing studies. Wong, Wing H. / Stanford University
NIH 2015 R01 HG	Statistical and computational analysis in whole genome sequencing studies. Wong, Wing H. / Stanford University
NIH 2014 R01 HG	Statistical and computational analysis in whole genome sequencing studies. Wong, Wing H. / Stanford University	$300,000

Publications

Zamanighomi, Mahdi; Lin, Zhixiang; Daley, Timothy et al. (2018) Unsupervised clustering and epigenetic classification of single cells. Nat Commun 9:2410

Daley, Timothy P; Lin, Zhixiang; Lin, Xueqiu et al. (2018) CRISPhieRmix: a hierarchical mixture model for CRISPR pooled screens. Genome Biol 19:159

Zhou, Bo; Arthur, Joseph G; Ho, Steve S et al. (2018) Extensive and deep sequencing of the Venter/HuRef genome for developing and benchmarking genome analysis tools. Sci Data 5:180261

Duren, Zhana; Chen, Xi; Zamanighomi, Mahdi et al. (2018) Integrative analysis of single-cell genomics data by coupled nonnegative matrix factorizations. Proc Natl Acad Sci U S A 115:7723-7728

Afshar, Pegah Tootoonchi; Wong, Wing Hung (2017) COSINE: non-seeding method for mapping long noisy sequences. Nucleic Acids Res 45:e132

Carter, Ava C; Chang, Howard Y; Church, George et al. (2017) Challenges and recommendations for epigenomics in precision health. Nat Biotechnol 35:1128-1132

Duren, Zhana; Chen, Xi; Jiang, Rui et al. (2017) Modeling gene regulation from paired expression and chromatin accessibility data. Proc Natl Acad Sci U S A 114:E4914-E4923

Chen, Xi; Yang, Hong; Wong, Wing Hung (2017) Phased Genome Sequencing Through Chromosome Sorting. Methods Mol Biol 1551:171-188

Sahraeian, Sayed Mohammad Ebrahim; Mohiyuddin, Marghoob; Sebra, Robert et al. (2017) Gaining comprehensive biological insight into the transcriptome by performing a broad-spectrum RNA-seq analysis. Nat Commun 8:59

Wu, Mengmeng; Lin, Zhixiang; Ma, Shining et al. (2017) Simultaneous inference of phenotype-associated genes and relevant tissues from GWAS data via Bayesian integration of multiple tissue-specific gene networks. J Mol Cell Biol 9:436-452

Showing the most recent 10 out of 19 publications