A central challenge in cancer genomics is to accurately detect somatic mutations in heterogeneous tumors and to determine precisely which fraction of cancer cells harbor these mutations and at what frequency. A deeper understanding of the biological principles behind cancer evolution is central to the discovery of new cancer therapies. However, despite the tremendous advances in sequencing technologies over the last twenty years, the most widely used computational approaches and biotechnologies still do not provide enough context to fully resolve the clonal structure of a tumor, owing to a combination of low resolution, high cost, and prohibitive sample requirements. 10X Genomics has recently developed a new technology, called "linked reads", that addresses some of these limitations by providing long-range phasing information at low cost and with minimal sample requirements. However, for these data to achieve their full potential and benefit the whole cancer research community, new computational tools must be developed that combine novel algorithms for next-generation sequencing data analysis with the long-range information stored in the linked reads. We propose to overcome these challenges by developing a new variant caller that combines the long-range information in the linked reads with powerful colored de Bruijn graph data structures to accurately discover and phase inherited and somatic variants (SNVs and indels) in tumor-normal paired sequencing data. The colored de Bruijn graph approach will exploit the full information in the data by jointly analyzing reads from the tumor and the normal samples. This will reduce the false-discovery rate of alignment-based variant callers when detecting longer insertions and deletions, without sacrificing the additional variant calling power provided by the assembly method.
Furthermore, the linked-read data will allow phasing of the variants and improve determination of subclonal structure by directly observing which variants are present on the same molecule, and therefore within the same subclone. We will develop and carefully test our novel variant calling framework using a combination of synthetic and genuine datasets designed to assess its variant calling abilities across diverse sequencing conditions, tumor clonalities, and sequencing platforms.
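The two core ideas above, joint tumor-normal analysis in a colored de Bruijn graph and phasing through shared molecules, can be illustrated with a minimal Python sketch. It is not the proposed software: the function names, the toy reads, and the barcode labels are all hypothetical, the graph is reduced to a k-mer-to-color table with implicit edges, and each barcode is assumed to tag exactly one molecule, which real linked-read data does not guarantee.

```python
from collections import defaultdict


def kmers(seq, k):
    """Yield every k-mer of seq, left to right."""
    for i in range(len(seq) - k + 1):
        yield seq[i:i + k]


def build_colored_graph(samples, k):
    """Map each k-mer to the set of sample labels ("colors") that contain it.

    Edges are implicit: two k-mers are adjacent when the (k-1)-suffix of one
    equals the (k-1)-prefix of the other.
    """
    colors = defaultdict(set)
    for label, reads in samples.items():
        for read in reads:
            for kmer in kmers(read, k):
                colors[kmer].add(label)
    return colors


def tumor_specific_kmers(colors, tumor="tumor"):
    """Return k-mers seen only in the tumor sample (candidate somatic signal)."""
    return {kmer for kmer, labels in colors.items() if labels == {tumor}}


def shared_barcodes(barcodes_by_variant, a, b):
    """Return barcodes supporting both variants.

    Assumption: one barcode corresponds to one molecule, so any shared
    barcode is evidence that the two variants lie on the same molecule.
    """
    return barcodes_by_variant[a] & barcodes_by_variant[b]


if __name__ == "__main__":
    # Toy reads differing by a single base (hypothetical data).
    samples = {"normal": ["ACGTACGA"], "tumor": ["ACGTTCGA"]}
    colors = build_colored_graph(samples, k=4)
    print(sorted(tumor_specific_kmers(colors)))

    # Hypothetical variant identifiers and barcodes.
    support = {"v1": {"BC1", "BC2"}, "v2": {"BC2", "BC3"}}
    print(shared_barcodes(support, "v1", "v2"))
```

In this toy setting, k-mers colored only by the tumor mark a path that diverges from the normal sample, and a barcode shared by two variants suggests they belong to the same haplotype and subclone. A practical caller would also need to handle sequencing errors, k-mer coverage thresholds, reverse complements, and barcode collisions.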
The goal of this proposal is to develop a novel, reliable, well-tested, and well-documented open-source software package that accurately identifies and phases germline and somatic variants in highly heterogeneous tumors using paired tumor-normal linked-read sequencing. This method will substantially improve the resolution at which researchers can study subclonal populations of cells in tumors, and may lead to the development of new targeted therapies that directly address tumor evolution.