In the next few years, high-throughput short-read sequencing will become the de facto method for profiling genome variation. With experimental platforms moving beyond the proof-of-principle stage, large multi-sample studies are underway at the Stanford Genome Technology Center, with a focus on profiling mutations in cancers and in evolving virus populations. Current methods for DNA variant detection are mostly designed for DNA from normal samples and lack power when applied to genetically heterogeneous cell populations such as tumors and viruses. The goal of this proposal is to develop statistical models and methods for detecting mutations and estimating their prevalence in genetically heterogeneous samples, and to derive fast, analytic approaches for estimating the significance and power of these methods. Methods will also be developed for aggregating genetic profiles across multiple samples in the search for mutation hotspots associated with clinical outcome.
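As a rough illustration of what detecting a mutation and estimating its prevalence can mean at a single genomic position, the sketch below tests whether the number of variant-supporting reads exceeds what sequencing error alone would produce, and reports the observed variant allele fraction. It is not the model proposed here: the binomial error model, the function names `binomial_tail` and `call_site`, and the default `error_rate` and `alpha` values are all assumptions chosen for illustration.

```python
from math import exp, lgamma, log


def binomial_tail(k, n, p):
    """Return P(X >= k) for X ~ Binomial(n, p), with 0 < p < 1.

    Each term is computed in log space so that deep coverage (large n)
    does not overflow the binomial coefficient.
    """
    total = 0.0
    for i in range(k, n + 1):
        log_pmf = (
            lgamma(n + 1) - lgamma(i + 1) - lgamma(n - i + 1)
            + i * log(p) + (n - i) * log(1 - p)
        )
        total += exp(log_pmf)
    return total


def call_site(alt_reads, depth, error_rate=0.01, alpha=1e-6):
    """Test one site against a pure sequencing-error null (an assumed model).

    alt_reads:  reads supporting the variant allele
    depth:      total reads covering the site
    error_rate: assumed per-read probability of a spurious variant base
    alpha:      assumed per-site significance threshold
    """
    p_value = binomial_tail(alt_reads, depth, error_rate)
    return {
        "p_value": p_value,
        # Naive prevalence estimate: the observed variant allele fraction.
        "allele_fraction": alt_reads / depth,
        "called": p_value < alpha,
    }


if __name__ == "__main__":
    # A variant carried by roughly 5% of reads at 1000x coverage.
    print(call_site(alt_reads=50, depth=1000))
    # A read count in line with the assumed 1% error rate.
    print(call_site(alt_reads=10, depth=1000))
```

A per-site test of this kind ignores base quality, strand bias, and the multiple-testing burden of scanning a whole genome, which is part of why analytic control of false discovery rates matters in this setting.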
Our specific aims are:
1. Develop statistical models for calling single-nucleotide polymorphisms and mutations, copy number changes, and structural variants in genetically heterogeneous samples. Derive fast, simulation-free methods to estimate the false discovery rates of detection schemes under these models.
2. Develop a statistical framework for aggregating mutation profiles across samples. Most current studies group mutations into genes or exons, or use arbitrary binning schemes. We propose a new approach to this problem that models the mutation profiles across patients as aligned point processes. We will extend our work on multi-sample scan statistics to develop a genome-wide adaptive test with variable window width for identifying genomic regions where the occurrence of mutations is associated with a given phenotype. This framework can potentially also be applied to genetic association studies with rare variants.
The PI, Dr. Nancy R. Zhang, was trained in mathematics (BA), computer science (MS), and statistics (PhD), and, as a faculty member in the Department of Statistics at Stanford University, has focused on the statistical analysis of DNA copy number and other types of genome-wide profiling data. Much of her published work addresses cross-sample and cross-platform aggregation and multiple-testing control in genome profiling studies. At the heart of this proposal is the collaboration with Dr. Hanlee Ji, an assistant professor in the Department of Medicine and senior associate director at the Stanford Genome Technology Center. This proposal is a timely response to the growing need for a statistical data analysis platform for genome resequencing at Stanford and in the larger scientific community. Open-source software will be made publicly available for all of the developed methods.

Public Health Relevance

In this project, Dr. Zhang and her research team will design and implement statistical methods for detecting genomic variants in data produced by massively parallel sequencing technologies. The proposed methods focus on achieving high sensitivity in clinical DNA samples, which may be contaminated or derived from genetically heterogeneous populations (e.g., viruses and tumors). The team will also develop rigorous means to estimate and control the error rates of these detection schemes, which will allow such studies to be compared and evaluated in a systematic way.

National Institutes of Health (NIH)
National Human Genome Research Institute (NHGRI)
Research Project (R01)
Study Section
Genomics, Computational Biology and Technology Study Section (GCAT)
Program Officer
Brooks, Lisa
Stanford University
Biostatistics & Other Math Sci
Schools of Arts and Sciences
United States
Andor, Noemi; Graham, Trevor A; Jansen, Marnix et al. (2016) Pan-cancer analysis of the extent and consequences of intratumor heterogeneity. Nat Med 22:105-13
Zheng, Grace X Y; Lau, Billy T; Schnall-Levin, Michael et al. (2016) Haplotyping germline and cancer genomes with high-throughput linked-read sequencing. Nat Biotechnol 34:303-11
Jiang, Yuchao; Qiu, Yu; Minn, Andy J et al. (2016) Assessing intratumor heterogeneity and tracking longitudinal and spatial clonal evolutionary history by next-generation sequencing. Proc Natl Acad Sci U S A 113:E5528-37
Chen, Hao; Bell, John M; Zavala, Nicolas A et al. (2015) Allele-specific copy number profiling by next-generation DNA sequencing. Nucleic Acids Res 43:e23
Xia, Li C; Ai, Dongmei; Cram, Jacob A et al. (2015) Statistical significance approximation in local trend analysis of high-throughput time-series data using the theory of Markov chains. BMC Bioinformatics 16:301
Jiang, Yuchao; Oldridge, Derek A; Diskin, Sharon J et al. (2015) CODEX: a normalization and copy number variation detection method for whole exome sequencing. Nucleic Acids Res 43:e39
Cushing, Anna; Kamali, Amanda; Winters, Mark et al. (2015) Emergence of Hemagglutinin Mutations During the Course of Influenza Infection. Sci Rep 5:16178
Hopmans, Erik S; Natsoulis, Georges; Bell, John M et al. (2014) A programmable method for massively parallel targeted sequencing. Nucleic Acids Res 42:e88
Nadauld, Lincoln D; Garcia, Sarah; Natsoulis, Georges et al. (2014) Metastatic tumor evolution and organoid modeling implicate TGFBR2 as a cancer driver in diffuse gastric cancer. Genome Biol 15:428
Xia, Li C; Ai, Dongmei; Cram, Jacob et al. (2013) Efficient statistical significance approximation for local similarity analysis of high-throughput time series data. Bioinformatics 29:230-7
