Although overt and latent viral infections are widespread in human populations, only a few viruses have been linked to tumorigenesis. While this may be the result of coevolution between humans and viruses, it appears more likely that the low number reflects the difficulty of establishing causal relations between viral infection and cancer. Even for the few confirmed oncoviruses (HPV, HBV, HCV, EBV, etc.), only some infected patients develop cancers, and these vary in clinical course and presentation. We hypothesize that additional essential events in the host or the virus are required for these cancers to initiate or progress, and that many more cancers than currently known have a viral connection. A wealth of cancer sequence data is being produced by multiple large-scale cancer projects, such as The Cancer Genome Atlas (TCGA), the International Cancer Genome Consortium (ICGC), Therapeutically Applicable Research to Generate Effective Treatments (TARGET), and the Pediatric Cancer Genome Project (PCGP). To take advantage of these valuable data sets in testing the above hypothesis, we propose to develop a set of computational methods and analysis strategies to characterize the viromes, genomes, and transcriptomes of different cancers with known clinical features. In particular, we will first build a computational pipeline for the identification and characterization of viruses in cancer. This effort will be augmented by developing statistical approaches to establish associations among virus characteristics, host genetic alterations, and clinical features (Aim 1). This pipeline will first be applied to detect and characterize viruses in cancer types such as cervical cancer, head and neck cancer, and hepatocellular carcinoma, all of which are known to be associated with a viral etiology.
Beyond confirming the links between these cancers and their known oncoviruses, HPV and HBV/HCV, respectively, we will aim at a thorough characterization of all genomic and transcriptomic changes in both host and virus. By combining them with the clinical features of the cancers, we expect to establish associations between such changes and the status of the cancers (Aim 2). Taking a step further, we will also use this validated pipeline to systematically analyze sequence data from cancer types for which animal-model and epidemiological studies provide some initial evidence of viral involvement. The aims are to detect, with greater sensitivity, cancer-causing viruses missed by traditional approaches and to establish the statistical association between viral infection and tumor formation using uniform, high-quality data from a large number of tumor samples (Aim 3). The successful analysis of the viral and host genes, transcriptomes, and genomes of over 6,000 cancer cases from many cancer types, already sequenced by several major efforts, will produce, for the first time, a state-of-the-art knowledge base of the cancer virome. We anticipate that new pathogenic viruses and/or subtypes will be discovered and that more cancers will be explained by viruses. This would lead to a paradigm shift in cancer prevention and treatment. Finally, both the pipeline and the results from this project will be made publicly available, helping the research community to analyze and interpret these data and to better discover and understand viruses in cancer.
Viral infections are currently estimated to cause 15-20% of all human cancers. The computational tools and analysis strategies described in this proposal will enable efficient and cost-effective detection of known viruses and discovery of novel viruses relevant to various cancer types using available large-scale sequencing data. In turn, this will accelerate the overall understanding of viral infection in human cancers and improve cancer prevention and treatment.