In the next few years, high-throughput short-read sequencing will become the de facto method for profiling genome variation. With experimental platforms moving beyond the proof-of-principle stage, large multi-sample studies are underway at the Stanford Genome Technology Center, with a focus on profiling mutations in cancers and in evolving virus populations. Current methods for DNA variant detection are mostly designed for the analysis of DNA from normal samples, and lack power for the analysis of genetically heterogeneous cell populations such as tumors and viruses. The goal of this proposal is to develop statistical models and methods for detecting mutations and estimating their prevalence in genetically heterogeneous samples, and to derive fast, analytic approaches for estimating the significance and power of these methods. Methods will also be developed for aggregating genetic profiles across multiple samples in the search for mutation hotspots associated with clinical outcome.
Our specific aims are:
1. Develop statistical models for calling single nucleotide polymorphisms/mutations, copy number changes, and structural variants in genetically heterogeneous samples, and derive fast, simulation-free methods to estimate the false discovery rates of detection schemes under these models.
2. Develop a statistical framework for aggregating mutation profiles across samples. Most current studies group mutations into genes or exons, or use arbitrary binning schemes. We propose a new approach that models the mutation profiles across patients as aligned point processes. We will extend our work on multi-sample scan statistics to develop a genome-wide, adaptive, variable-window-width test for identifying genomic regions where the occurrence of mutations is associated with a given phenotype. This framework can potentially also be applied to genetic association studies with rare variants.

The PI, Dr. Nancy R. Zhang, was trained in mathematics (BA), computer science (MS), and statistics (PhD) and, as a faculty member in the Department of Statistics at Stanford University, has focused on the statistical analysis of DNA copy number and other types of genome-wide profiling data. Much of her published work addresses cross-sample and cross-platform aggregation and multiple-testing control in genome profiling studies. At the heart of this proposal is the collaboration with Dr. Hanlee Ji, an assistant professor in the Department of Medicine and senior associate director at the Stanford Genome Technology Center. This proposal is a timely response to the growing need for a statistical data analysis platform for genome resequencing at Stanford and in the larger scientific community. Public, open-source software will be made available for all of the developed methods.
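To make the first aim concrete, the following is a minimal sketch of one simple way to call low-prevalence mutations while controlling the false discovery rate without simulation. It is not the proposal's model: it assumes a fixed per-base sequencing error rate (`error_rate`, a hypothetical value), independent reads, an exact binomial tail test at each site, and the standard Benjamini-Hochberg procedure across sites. The function names and the example read counts are illustrative only.

```python
from math import comb


def binom_sf(k, n, p):
    """Exact P(X >= k) for X ~ Binomial(n, p)."""
    if k <= 0:
        return 1.0
    return sum(comb(n, i) * p**i * (1 - p) ** (n - i) for i in range(k, n + 1))


def benjamini_hochberg(pvalues, alpha=0.05):
    """Return booleans marking hypotheses rejected at FDR level alpha."""
    m = len(pvalues)
    order = sorted(range(m), key=lambda i: pvalues[i])
    # Find the largest 1-based rank r with p_(r) <= alpha * r / m.
    cutoff = 0
    for rank, idx in enumerate(order, start=1):
        if pvalues[idx] <= alpha * rank / m:
            cutoff = rank
    rejected = [False] * m
    for idx in order[:cutoff]:
        rejected[idx] = True
    return rejected


def call_variants(sites, error_rate=0.01, alpha=0.05):
    """sites: list of (alt_reads, depth) pairs, one per genomic position.

    Assumption: under the null, alternate-allele reads arise only from
    sequencing error at a fixed, known rate (hypothetical simplification).
    """
    pvals = [binom_sf(alt, depth, error_rate) for alt, depth in sites]
    calls = benjamini_hochberg(pvals, alpha)
    # Variant allele fraction: a crude proxy for mutation prevalence.
    fractions = [alt / depth if depth else 0.0 for alt, depth in sites]
    return pvals, calls, fractions


# Illustrative (made-up) read counts at five positions.
sites = [(1, 200), (12, 200), (0, 150), (30, 300), (3, 250)]
pvals, calls, fractions = call_variants(sites)
for (alt, depth), p, called, frac in zip(sites, pvals, calls, fractions):
    print(f"alt={alt:3d} depth={depth:3d} p={p:.3g} called={called} fraction={frac:.3f}")
```

Because both the binomial tail and the Benjamini-Hochberg threshold are computed in closed form, no resampling is needed. A realistic model for heterogeneous samples would have to account for site-specific error rates, copy number, and sample purity, which this sketch ignores.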

Public Health Relevance

In this project, Dr. Zhang and her research team will design and implement statistical methods for detecting genomic variants in data produced by massively parallel sequencing technologies. The proposed methods focus on achieving high sensitivity in clinical DNA samples, which may be contaminated or derived from genetically heterogeneous populations (e.g., viruses and tumors). The team will also develop rigorous means to estimate and control the error rates of these detection schemes, which will allow such studies to be compared and evaluated in a systematic way.

Agency: National Institutes of Health (NIH)
Institute: National Human Genome Research Institute (NHGRI)
Type: Research Project (R01)
Project #: 5R01HG006137-02
Application #: 8296508
Study Section: Genomics, Computational Biology and Technology Study Section (GCAT)
Program Officer: Brooks, Lisa
Project Start: 2011-07-06
Project End: 2014-04-30
Budget Start: 2012-05-01
Budget End: 2013-04-30
Support Year: 2
Fiscal Year: 2012
Total Cost: $215,498
Indirect Cost: $79,107

Name: Stanford University
Department: Biostatistics & Other Math Sci
Type: Schools of Arts and Sciences
DUNS #: 009214214
City: Stanford
State: CA
Country: United States
Zip Code: 94305
Zhang, Hanrui; Zhang, Nancy R; Li, Mingyao et al. (2018) First Giant Steps Toward a Cell Atlas of Atherosclerosis. Circ Res 122:1632-1634
Huang, Mo; Wang, Jingshu; Torre, Eduardo et al. (2018) SAVER: gene expression recovery for single-cell RNA sequencing. Nat Methods 15:539-542
Zhou, Zilu; Wang, Weixin; Wang, Li-San et al. (2018) Integrative DNA copy number detection and genotyping from sequencing and array-based platforms. Bioinformatics 34:2349-2355
Xia, Li Charlie; Ai, Dongmei; Lee, Hojoon et al. (2018) SVEngine: an efficient and versatile simulator of genome structural variations with features of cancer clonal evolution. Gigascience 7:
Urrutia, Eugene; Chen, Hao; Zhou, Zilu et al. (2018) Integrative pipeline for profiling DNA copy number and inferring tumor phylogeny. Bioinformatics 34:2126-2128
Wang, Jingshu; Huang, Mo; Torre, Eduardo et al. (2018) Gene expression distribution deconvolution in single-cell RNA sequencing. Proc Natl Acad Sci U S A 115:E6437-E6446
Ai, Dongmei; Huang, Ruocheng; Wen, Jin et al. (2017) Integrated metagenomic data analysis demonstrates that a loss of diversity in oral microbiota is associated with periodontitis. BMC Genomics 18:1041
Chen, Hao; Jiang, Yuchao; Maxwell, Kara N et al. (2017) Allele-specific copy number estimation by whole exome sequencing. Ann Appl Stat 11:1169-1192
Jiang, Yuchao; Zhang, Nancy R; Li, Mingyao (2017) SCALE: modeling allele-specific gene expression by single-cell RNA sequencing. Genome Biol 18:74
Lau, Billy T; Ji, Hanlee P (2017) Single molecule counting and assessment of random molecular tagging errors with transposable giga-scale error-correcting barcodes. BMC Genomics 18:745

Showing the most recent 10 out of 38 publications