Statistical Models for Genome Sequencing and Association

Ji, Hanlee; Zhang, Nancy

Abstract

In the next few years, high-throughput short-read sequencing will become the de facto method for profiling genome variation. With experimental platforms moving beyond the proof-of-principle stage, large multi-sample studies are underway at the Stanford Genome Technology Center, with a focus on profiling mutations in cancers and in evolving virus populations. Current methods for DNA variant detection are mostly designed for the analysis of DNA from normal samples, and lack power for the analysis of genetically heterogeneous cell populations such as tumors and viruses. The goal of this proposal is to develop statistical models and methods for detecting mutations and estimating their prevalence in genetically heterogeneous samples, and to derive fast, analytic approaches for estimating their significance and power. Methods will also be developed for the aggregation of genetic profiles across multiple samples in the search for mutation hotspots associated with clinical outcome.
Our specific aims are:

1. Develop statistical models for calling single nucleotide polymorphisms/mutations, copy number changes, and structural variants in genetically heterogeneous samples. Derive fast, simulation-free methods to estimate the false discovery rates of detection schemes under these models.

2. Develop a statistical framework for aggregating mutation profiles across samples. Most current studies group mutations into genes or exons, or use arbitrary binning schemes. We propose a new approach to this problem by modeling the mutation profile across patients as aligned point processes. We will extend our work on multi-sample scan statistics to develop a genome-wide, variable-window-width adaptive test for identifying genomic regions where the occurrence of mutations is associated with a given phenotype. This framework can potentially also be applied to genetic association studies with rare variants.

The PI, Dr. Nancy R. Zhang, was trained in mathematics (BA), computer science (MS), and statistics (PhD) and, as a faculty member in the Department of Statistics at Stanford University, has focused on the statistical analysis of DNA copy number and other types of genome-wide profiling data. Much of her published work addresses the issues of cross-sample and cross-platform aggregation and multiple-testing control in genome profiling studies. At the heart of this proposal is the collaboration with Dr. Hanlee Ji, an assistant professor in the Department of Medicine and senior associate director at the Stanford Genome Technology Center. This proposal is a timely response to the growing need for a statistical data analysis platform for genome resequencing at Stanford and in the larger scientific community. Public, open-source software will be made available for all of the developed methods.

Public Health Relevance

In this project, Dr. Zhang and her research team will design and implement statistical methods for detecting genomic variants in data produced by massively parallel sequencing technologies. The methods proposed focus on achieving high sensitivity in clinical DNA samples, which may be contaminated or derived from genetically heterogeneous populations (e.g. viruses and tumors). They will also develop rigorous means to estimate and control the error of these detection schemes, which will allow such studies to be compared and evaluated in a systematic way.

Funding Agency

Agency: National Institutes of Health (NIH)
Institute: National Human Genome Research Institute (NHGRI)
Type: Research Project (R01)
Project #: 5R01HG006137-03
Application #: 8523046
Study Section: Genomics, Computational Biology and Technology Study Section (GCAT)
Program Officer: Brooks, Lisa

Project Start: 2011-07-06
Project End: 2014-04-30
Budget Start: 2013-05-01
Budget End: 2014-04-30
Support Year: 3
Fiscal Year: 2013
Total Cost: $215,645
Indirect Cost: $79,161

Institution

Name: Stanford University
Department: Biostatistics & Other Math Sci
Type: Schools of Arts and Sciences
DUNS #: 009214214

City: Stanford
State: CA
Country: United States
Zip Code: 94305

Related projects


NIH 2020 R01 HG	Single Cell Transcriptomic and Genetic Diversity by Single Molecule Long Read Sequencing	Zhang, Nancy R.; Ji, Hanlee P. / University of Pennsylvania
NIH 2019 R01 HG	Genomic and Cellular Variation from Single Molecules to Single Cells	Zhang, Nancy R.; Ji, Hanlee P. / University of Pennsylvania
NIH 2018 R01 HG	Genomic and Cellular Variation from Single Molecules to Single Cells	Zhang, Nancy R.; Ji, Hanlee P. / University of Pennsylvania
NIH 2017 R01 HG	Genomic and Cellular Variation from Single Molecules to Single Cells	Zhang, Nancy R.; Ji, Hanlee P. / University of Pennsylvania
NIH 2016 R01 HG	Statistical Models and Analysis of Complex Genomic Variation in Clonal Mixtures	Zhang, Nancy R.; Ji, Hanlee P. / University of Pennsylvania
NIH 2015 R01 HG	Statistical Models and Analysis of Complex Genomic Variation in Clonal Mixtures	Zhang, Nancy R.; Ji, Hanlee P. / University of Pennsylvania
NIH 2014 R01 HG	Statistical Models and Analysis of Complex Genomic Variation in Clonal Mixtures	Zhang, Nancy R.; Ji, Hanlee / University of Pennsylvania	$258,774
NIH 2013 R01 HG	Statistical Models for Genome Sequencing and Association	Ji, Hanlee; Zhang, Nancy R. / Stanford University	$215,645
NIH 2012 R01 HG	Statistical Models for Genome Sequencing and Association	Ji, Hanlee; Zhang, Nancy R. / Stanford University	$215,498
NIH 2011 R01 HG	Statistical Models for Genome Sequencing and Association	Zhang, Nancy R.; Ji, Hanlee / Stanford University	$216,462

Publications

Zhou, Zilu; Wang, Weixin; Wang, Li-San et al. (2018) Integrative DNA copy number detection and genotyping from sequencing and array-based platforms. Bioinformatics 34:2349-2355

Xia, Li Charlie; Ai, Dongmei; Lee, Hojoon et al. (2018) SVEngine: an efficient and versatile simulator of genome structural variations with features of cancer clonal evolution. Gigascience 7:

Urrutia, Eugene; Chen, Hao; Zhou, Zilu et al. (2018) Integrative pipeline for profiling DNA copy number and inferring tumor phylogeny. Bioinformatics 34:2126-2128

Wang, Jingshu; Huang, Mo; Torre, Eduardo et al. (2018) Gene expression distribution deconvolution in single-cell RNA sequencing. Proc Natl Acad Sci U S A 115:E6437-E6446

Zhang, Hanrui; Zhang, Nancy R; Li, Mingyao et al. (2018) First Giant Steps Toward a Cell Atlas of Atherosclerosis. Circ Res 122:1632-1634

Huang, Mo; Wang, Jingshu; Torre, Eduardo et al. (2018) SAVER: gene expression recovery for single-cell RNA sequencing. Nat Methods 15:539-542

Ai, Dongmei; Huang, Ruocheng; Wen, Jin et al. (2017) Integrated metagenomic data analysis demonstrates that a loss of diversity in oral microbiota is associated with periodontitis. BMC Genomics 18:1041

Chen, Hao; Jiang, Yuchao; Maxwell, Kara N et al. (2017) Allele-specific copy number estimation by whole exome sequencing. Ann Appl Stat 11:1169-1192

Jiang, Yuchao; Zhang, Nancy R; Li, Mingyao (2017) SCALE: modeling allele-specific gene expression by single-cell RNA sequencing. Genome Biol 18:74

Lau, Billy T; Ji, Hanlee P (2017) Single molecule counting and assessment of random molecular tagging errors with transposable giga-scale error-correcting barcodes. BMC Genomics 18:745

Showing the most recent 10 out of 38 publications
