Although overt and latent viral infections are widespread in human populations, only a few viruses have been linked to tumorigenesis. While this may be the result of coevolution between humans and viruses, it appears more likely that the low number reflects the difficulty of establishing causal relations between viral infection and cancer. For the few confirmed oncoviruses (HPV, HBV, HCV, EBV, etc.), only some infected patients develop cancers, with variable clinical course and presentation. We hypothesize that additional essential events in the host or the viruses are required for the cancers to initiate/progress and that many more cancers than we currently know have a viral connection. A wealth of cancer sequence data is being produced by multiple large-scale cancer projects, such as The Cancer Genome Atlas (TCGA), International Cancer Genome Consortium (ICGC), Therapeutically Applicable Research to Generate Effective Treatments (TARGET), and Pediatric Cancer Genome Project (PCGP). To take advantage of these valuable data sets for testing the above hypothesis, we propose to develop a set of computational methods and analysis strategies to characterize the viromes, genomes, and transcriptomes in different cancers with known clinical features. In particular, we will first build a computational pipeline for the identification and characterization of viruses in cancer. This effort will be augmented by developing statistical approaches to establish the association among virus characteristics, host genetic alterations, and clinical features (Aim 1). This pipeline will first be applied to detect and characterize viruses in cancer types such as cervical cancer, head and neck cancer, and hepatocellular carcinoma, all of which are known to be associated with viral etiology.
Beyond confirming the links between these cancers and their known oncoviruses, HPV and HBV/HCV respectively, we will aim for a thorough characterization of all genomic and transcriptomic changes in both host and virus. Combined with clinical features of the cancers, we expect to establish associations between such changes and the status of the cancers (Aim 2). Taking a step further, we will also use this validated pipeline to systematically analyze sequence data from cancer types with some initial evidence of viral involvement from animal-model and epidemiological studies. The aim is to achieve more sensitive detection of cancer-causing viruses missed by traditional approaches and to establish the statistical association between viral infection and tumor formation using uniform, high-quality data from a large number of tumor samples (Aim 3). The successful analysis of the viral and host genes, transcriptomes, and genomes of over 6,000 cancer cases from many cancer types already sequenced by several major efforts will produce, for the first time, a state-of-the-art knowledge base of the cancer virome. We anticipate that new pathogenic viruses and/or subtypes will be discovered and that more cancers will be explained by viruses. This would lead to a paradigm shift for cancer prevention and treatment. Finally, both the pipeline and the results from this project will be made publicly available, facilitating analysis and interpretation by the research community to better discover and understand viruses in cancer.

Public Health Relevance

Viral infections are currently estimated to cause 15-20% of all human cancers. The computational tools and analysis strategies described in this proposal will enable efficient and cost-effective discovery of known and novel viruses relevant to various cancer types using available large-scale sequencing data. In turn, this will accelerate the overall understanding of virus infection in human cancers and improve cancer prevention and treatment.

National Institutes of Health (NIH)
National Cancer Institute (NCI)
Research Project (R01)
Study Section: Cancer Genetics Study Section (CG)
Program Officer: Li, Jerry
Washington University
Internal Medicine/Medicine
Schools of Medicine
Saint Louis
United States
Huang, Kuan-Lin; Mashl, R Jay; Wu, Yige et al. (2018) Pathogenic Germline Variants in 10,389 Adult Cancers. Cell 173:355-370.e14
Jayasinghe, Reyka G; Cao, Song; Gao, Qingsong et al. (2018) Systematic Analysis of Splice-Site-Creating Mutations in Cancer. Cell Rep 23:270-281.e3
Gao, Qingsong; Liang, Wen-Wei; Foltz, Steven M et al. (2018) Driver Fusions and Their Implications in the Development and Treatment of Human Cancers. Cell Rep 23:227-238.e3
Bailey, Matthew H; Tokheim, Collin; Porta-Pardo, Eduard et al. (2018) Comprehensive Characterization of Cancer Driver Genes and Mutations. Cell 173:371-385.e18
Sengupta, Sohini; Sun, Sam Q; Huang, Kuan-Lin et al. (2018) Integrative omics analyses broaden treatment targets in human cancer. Genome Med 10:60
Mashl, R Jay; Scott, Adam D; Huang, Kuan-Lin et al. (2017) GenomeVIP: a cloud platform for genomic variant discovery and interpretation. Genome Res 27:1450-1459
Foltz, Steven M; Liang, Wen-Wei; Xie, Mingchao et al. (2017) MIRMMR: binary classification of microsatellite instability using methylation and mutations. Bioinformatics 33:3799-3801
Wyczalkowski, Matthew A; Wylie, Kristine M; Cao, Song et al. (2017) BreakPoint Surveyor: a pipeline for structural variant visualization. Bioinformatics 33:3121-3122
Cao, Song; Wendl, Michael C; Wyczalkowski, Matthew A et al. (2016) Divergent viral presentation among human tumors and adjacent normal tissues. Sci Rep 6:28294
Niu, Beifang; Scott, Adam D; Sengupta, Sohini et al. (2016) Protein-structure-guided discovery of functional mutations across 19 cancer types. Nat Genet 48:827-837
