Next-generation DNA sequencing (NGS) approaches are widely used in studying human diseases and identifying causative genetic variants. Increasingly, NGS methods are being used to define biologically relevant clonal mixtures, a frequently observed phenomenon in human disease. Examples of clonal mixtures in human disease include the tumor cell subpopulations that make up a cancer. Within a single tumor, and most evidently across metastatic tumor sites, cancer cell clonal populations exist that are genetically distinct and carry their own unique sets of somatic variants. A similar phenomenon occurs in viral infection, where multiple viral quasispecies are harbored within an infected individual; each quasispecies has its own unique set of genetic variants. One can quantitatively measure the expansion or shrinkage of clonal populations from changes in the allelic representation of clonal variants. Specific cellular phenotypes are attributable to the unique clonal variants, and changes in their representation can be indicators of evolutionary processes. This is frequently the case for drug resistance in cancer and viral infections. Thus, clonal genetic variation has major implications for the pathogenesis of human disease and is increasingly being tested as a longitudinal indicator of disease progression and treatment resistance. The general availability of whole genome and deep targeted resequencing provides an opportunity to conduct systematic analysis of heterogeneous DNA mixtures that have different clonal components. However, in many cases the genetic variant of interest is present at very small proportions (<5%), and this makes the delineation of these clonal variants exceedingly difficult. Many of the widely employed NGS analysis methods are optimized for detecting normal diploid genome variation. These approaches are not optimal for delineating genomic variants from complex clonal mixtures.
Some genomic DNA variant classes, such as genomic rearrangements, are extremely difficult to detect in the context of clonal mixtures. To improve the assessment of clonal variation and the evolution of specific clonal populations, we will develop innovative models and robust, sensitive statistical procedures. These methods will enable one to deconvolute genomic variation in clonal mixtures and consider clonal alterations through time and space. We will focus on improving the delineation of complex variations such as genomic rearrangements and other structural variations in genetic mixtures. To develop our methods, we will use heterogeneous DNA sequence data sets with in silico spike-in variants and determine the lowest threshold of detection that we can achieve with the best sensitivity and specificity. Subsequently, we will test these methods on NGS data sets from clinical samples, delineate clonal populations based on unique variants and consider quantitative changes in allelic representation as seen in clonal expansion. These samples will be subjected to whole genome and targeted resequencing. Cancer-relevant samples will include tumors with matched normal, primary and metastatic DNA. We will consider viral quasispecies for a set of clinical samples where we have matched viral nucleic acid samples obtained longitudinally over the course of infection from a single individual. As a final milestone, we will release our methods as open source software for the biomedical research community.
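
The in silico spike-in evaluation described above can be illustrated with a small simulation. The sketch below is not the statistical model proposed in this project; it swaps in a plain one-sided binomial test against a fixed per-site error rate, purely to show how sensitivity and specificity at a given variant allele fraction might be estimated. All names (`binom_upper_tail`, `simulate_alt_reads`, `evaluate`) and all parameter values (read depth of 1000, error rate of 0.5%, significance level of 0.001) are hypothetical assumptions chosen for illustration.

```python
import math
import random


def binom_upper_tail(k, n, p):
    """Return P(X >= k) for X ~ Binomial(n, p), summing terms in log space."""
    if k <= 0:
        return 1.0
    log_p, log_q = math.log(p), math.log1p(-p)
    total = 0.0
    for i in range(k, n + 1):
        log_pmf = (math.lgamma(n + 1) - math.lgamma(i + 1)
                   - math.lgamma(n - i + 1) + i * log_p + (n - i) * log_q)
        total += math.exp(log_pmf)
    return min(total, 1.0)


def simulate_alt_reads(rng, depth, vaf, error_rate):
    """Draw the number of variant-supporting reads at one site.

    Simplifying assumption: every sequencing error mimics the variant allele,
    so a read supports the variant with probability vaf + (1 - vaf) * error_rate.
    """
    p_alt = vaf + (1.0 - vaf) * error_rate
    return sum(rng.random() < p_alt for _ in range(depth))


def evaluate(vaf, depth=1000, error_rate=0.005, alpha=1e-3,
             n_spiked=100, n_clean=500, seed=0):
    """Estimate sensitivity and specificity for spike-ins at fraction `vaf`."""
    rng = random.Random(seed)
    # Smallest variant read count whose binomial tail under the error-only
    # null hypothesis falls below alpha; sites at or above it are called.
    min_alt = next(k for k in range(depth + 2)
                   if binom_upper_tail(k, depth, error_rate) < alpha)
    true_pos = sum(simulate_alt_reads(rng, depth, vaf, error_rate) >= min_alt
                   for _ in range(n_spiked))
    false_pos = sum(simulate_alt_reads(rng, depth, 0.0, error_rate) >= min_alt
                    for _ in range(n_clean))
    return true_pos / n_spiked, 1.0 - false_pos / n_clean, min_alt


if __name__ == "__main__":
    for vaf in (0.01, 0.02, 0.05):
        sensitivity, specificity, min_alt = evaluate(vaf)
        print(f"VAF={vaf:.2f} min_alt_reads={min_alt} "
              f"sensitivity={sensitivity:.2f} specificity={specificity:.3f}")
```

Sweeping the allele fraction downward in such a simulation gives a rough picture of where detection begins to fail for a given depth and error rate. Real data would need site-specific error models and would not fit the independence assumptions made here.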

Public Health Relevance

Breakthroughs in DNA sequencing technologies are having a major impact on the study of human diseases, and these methods are increasingly being applied to improve diagnosis and treatment. A hallmark of diseases such as cancer and viral infections is their genetic complexity. For example, even within a single patient, a cancer or a viral infection is not homogeneous in its genetic composition, but rather contains smaller populations that have unique genetic changes. As a result, these disease states are genetic mixtures, and determining the most important genetic changes is complicated and difficult. We will develop methods and approaches that will improve the analysis and detection of disease-related genetic changes from mixtures, with direct application in cancer and viral infections.

National Institutes of Health (NIH)
National Human Genome Research Institute (NHGRI)
Research Project (R01)
Study Section: Genomics, Computational Biology and Technology Study Section (GCAT)
Program Officer: Brooks, Lisa
University of Pennsylvania
Biostatistics & Other Math Sci
Schools of Arts and Sciences
United States
Nadauld, Lincoln D; Garcia, Sarah; Natsoulis, Georges et al. (2014) Metastatic tumor evolution and organoid modeling implicate TGFBR2 as a cancer driver in diffuse gastric cancer. Genome Biol 15:428
Cushing, Anna; Flaherty, Patrick; Hopmans, Erik et al. (2013) RVD: a command-line program for ultrasensitive rare single nucleotide variant detection using targeted next-generation DNA resequencing. BMC Res Notes 6:206
Muralidharan, Omkar; Natsoulis, Georges; Bell, John et al. (2012) A cross-sample statistical model for SNP detection in short-read sequencing data. Nucleic Acids Res 40:e5