Next-generation DNA sequencing (NGS) approaches are widely used to study human diseases and identify causative genetic variants. Increasingly, NGS methods are being used to define biologically relevant clonal mixtures, a frequently observed phenomenon in human disease. One example is the tumor cell subpopulations found in cancer. Within a single tumor, and as is clearly evident in metastatic tumor sites, cancer cell clonal populations exist that are genetically distinct and carry their own unique sets of somatic variants. A similar phenomenon occurs in viral infection, where multiple viral quasispecies are harbored within an infected individual; each quasispecies has its own unique set of genetic variants. One can quantitatively measure the expansion or shrinkage of clonal populations through changes in the allelic representation of clonal variants. Specific cellular phenotypes are attributable to unique clonal variants, and changes in their representation can indicate evolutionary processes. This is frequently the case for drug resistance in cancer and viral infections. Thus, clonal genetic variation has major implications for the pathogenesis of human disease and is increasingly being tested as a longitudinal indicator of disease progression and treatment resistance. The general availability of whole genome and deep targeted resequencing provides an opportunity to systematically analyze heterogeneous DNA mixtures that have different clonal components. However, in many cases the genetic variant of interest is present at a very small proportion (<5%), which makes the delineation of these clonal variants exceedingly difficult. Many widely employed NGS analysis methods are optimized for detecting normal diploid genome variation and are not optimal for delineating genomic variants from complex clonal mixtures.
Some genomic DNA variant classes, such as genomic rearrangements, are extremely difficult to detect in the context of clonal mixtures. To improve the assessment of clonal variation and the evolution of specific clonal populations, we will develop innovative models and robust, sensitive statistical procedures. These methods will enable one to deconvolute genomic variation in clonal mixtures and to consider clonal alterations through time and space. We will focus on improving the delineation of complex variations, such as genomic rearrangements and other structural variations, in genetic mixtures. To develop our methods, we will use heterogeneous DNA sequence data sets with in silico spike-in variants and determine the lowest threshold of detection that we can achieve with the best sensitivity and specificity. Subsequently, we will test these methods on NGS data sets from clinical samples, delineate clonal populations based on unique variants, and consider quantitative changes in allelic representation, as seen in clonal expansion. These samples will be subject to whole genome and targeted resequencing. Cancer-relevant samples will include tumors with matched normal, primary, and metastatic DNA. We will consider viral quasispecies for a set of clinical samples where we have matched viral nucleic acid samples obtained longitudinally over the course of infection from a single individual. As a final milestone, we will release our methods as open source software for the biomedical research community.
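
The spike-in evaluation described above can be illustrated with a small simulation. The following is a minimal sketch under stated assumptions, not the project's actual method: it assumes independent reads, a single uniform per-read error rate, and a simple binomial read-count threshold as the variant caller. The function names (`min_alt_reads`, `simulate`) and every parameter value are hypothetical choices made for illustration.

```python
import math
import random


def min_alt_reads(depth, error_rate, alpha):
    """Smallest alt-read count k such that P(X >= k) < alpha when
    X ~ Binomial(depth, error_rate), i.e. under sequencing error alone."""
    cdf = 0.0
    for k in range(depth + 1):
        # At this point cdf == P(X <= k - 1), so 1 - cdf == P(X >= k).
        if 1.0 - cdf < alpha:
            return k
        cdf += math.comb(depth, k) * error_rate**k * (1 - error_rate) ** (depth - k)
    return depth + 1


def simulate(depth, vaf, error_rate, n_sites, alpha, rng):
    """Spike a variant at allele fraction `vaf` into `n_sites` sites and
    compare against `n_sites` error-only sites.
    Returns (threshold, sensitivity, specificity)."""
    threshold = min_alt_reads(depth, error_rate, alpha)

    def alt_count(p):
        # Number of reads (out of `depth`) that show the alternate allele.
        return sum(rng.random() < p for _ in range(depth))

    # Assumed read model: a variant read is miscalled as reference with
    # probability error_rate, and a reference read is miscalled as the
    # alternate allele with the same probability.
    p_spiked = vaf * (1 - error_rate) + (1 - vaf) * error_rate
    true_pos = sum(alt_count(p_spiked) >= threshold for _ in range(n_sites))
    false_pos = sum(alt_count(error_rate) >= threshold for _ in range(n_sites))
    return threshold, true_pos / n_sites, 1 - false_pos / n_sites


if __name__ == "__main__":
    rng = random.Random(0)
    for vaf in (0.01, 0.02, 0.05):
        thr, sens, spec = simulate(
            depth=500, vaf=vaf, error_rate=0.005, n_sites=200, alpha=1e-3, rng=rng
        )
        print(f"VAF={vaf:.2f} threshold={thr} sensitivity={sens:.2f} specificity={spec:.2f}")
```

Sweeping `vaf` downward at a fixed `depth` gives a rough picture of where sensitivity starts to fall off for a given error rate. A realistic evaluation would spike variants into real read data and would need to handle structural variants, which this sketch does not attempt.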

Public Health Relevance

Breakthroughs in DNA sequencing technologies are having a major impact on the study of human diseases, and these methods are increasingly being applied to improve diagnosis and treatment. A hallmark of diseases such as cancer and viral infections is their genetic complexity. For example, even within a single patient, a cancer or viral infection is not homogeneous in its genetic composition but rather contains smaller populations that have unique genetic changes. As a result, these disease states are genetic mixtures, and determining the most important genetic changes is complicated and difficult. We will develop methods and approaches that improve the analysis and detection of disease-related genetic changes from mixtures, with direct application in cancer and viral infections.

National Institutes of Health (NIH)
National Human Genome Research Institute (NHGRI)
Research Project (R01)
Study Section
Genomics, Computational Biology and Technology Study Section (GCAT)
Program Officer
Brooks, Lisa
University of Pennsylvania
Biostatistics & Other Math Sci
Schools of Arts and Sciences
United States
Zhou, Zilu; Wang, Weixin; Wang, Li-San et al. (2018) Integrative DNA copy number detection and genotyping from sequencing and array-based platforms. Bioinformatics 34:2349-2355
Xia, Li Charlie; Ai, Dongmei; Lee, Hojoon et al. (2018) SVEngine: an efficient and versatile simulator of genome structural variations with features of cancer clonal evolution. Gigascience 7:
Urrutia, Eugene; Chen, Hao; Zhou, Zilu et al. (2018) Integrative pipeline for profiling DNA copy number and inferring tumor phylogeny. Bioinformatics 34:2126-2128
Wang, Jingshu; Huang, Mo; Torre, Eduardo et al. (2018) Gene expression distribution deconvolution in single-cell RNA sequencing. Proc Natl Acad Sci U S A 115:E6437-E6446
Zhang, Hanrui; Zhang, Nancy R; Li, Mingyao et al. (2018) First Giant Steps Toward a Cell Atlas of Atherosclerosis. Circ Res 122:1632-1634
Huang, Mo; Wang, Jingshu; Torre, Eduardo et al. (2018) SAVER: gene expression recovery for single-cell RNA sequencing. Nat Methods 15:539-542
Ai, Dongmei; Huang, Ruocheng; Wen, Jin et al. (2017) Integrated metagenomic data analysis demonstrates that a loss of diversity in oral microbiota is associated with periodontitis. BMC Genomics 18:1041
Chen, Hao; Jiang, Yuchao; Maxwell, Kara N et al. (2017) Allele-specific copy number estimation by whole exome sequencing. Ann Appl Stat 11:1169-1192
Jiang, Yuchao; Zhang, Nancy R; Li, Mingyao (2017) SCALE: modeling allele-specific gene expression by single-cell RNA sequencing. Genome Biol 18:74
Lau, Billy T; Ji, Hanlee P (2017) Single molecule counting and assessment of random molecular tagging errors with transposable giga-scale error-correcting barcodes. BMC Genomics 18:745

Showing the most recent 10 out of 38 publications