Our goal is to build an infrastructure to discover novel viruses associated with human cancer from next-generation sequencing data, using a sequence-based computational subtraction approach that we developed. This proposed project responds to the ARRA Research and Research Infrastructure Grand Opportunities RFA on "Identifying Potential Viral Signatures in Large Scale Studies of Germline and Somatic Changes in Cancer Genomes Pilot Program". Large-scale genome projects, including The Cancer Genome Atlas (TCGA) as well as studies of germ-line genetic correlations with cancer, are now moving towards the application of ultra-high-throughput next-generation sequencing approaches, both on the cDNA level ("RNA-seq") and the DNA level, especially whole genome sequencing ("WGS").

The study is focused on the computational analysis of large next-generation sequencing data sets for virus discovery. Specifically, we plan to build an infrastructure to apply sequence-based computational subtraction, a method developed jointly by the PI and co-investigator, to evaluate the presence of novel non-human nucleic acid sequences in databases generated by these large-scale cancer genome projects. This approach starts with the assumption that virally-induced cancers contain both human and viral nucleic acids, and that subtraction of the human genome from cancer-derived sequences will leave residual candidate non-human and potentially viral sequences. First, we will build a software pipeline for computational subtraction-based data analysis and candidate pathogen sequence discovery. Second, we will apply this pipeline to the incoming flood of next-generation sequencing data from TCGA and other large-scale data sets. Third, we will experimentally test the non-human sequences that we identify for their presence in validation cohorts for the cancers in which they were discovered. Fourth, we will use the validation data to circle back and improve the quality of our computational pipeline.

In the long run, we anticipate that we can build a sustainable pipeline that could be supported as either an academic or an industrial effort. Identification of a novel infectious agent associated with human cancer would have immediate preventive, diagnostic and therapeutic significance. The infrastructure that we develop in this two-year pilot project will lay the groundwork for discovering additional cancer-associated pathogens in the future, by analyzing the ever-increasing quantities of next-generation cancer sequencing data.
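The subtraction idea described above (remove reads that match the human genome, keep the residue as candidate non-human sequence) can be illustrated with a toy sketch. This is not the authors' pipeline: it assumes exact k-mer matching against a small in-memory reference, whereas a production system would use sequence aligners and a full human reference. All function names, the value of `k`, and the `max_human_fraction` threshold are hypothetical choices made for illustration.

```python
from typing import Iterable, List, Set

# Complement table; any unexpected symbol is treated as N (assumption).
_COMPLEMENT = {"A": "T", "C": "G", "G": "C", "T": "A", "N": "N"}


def reverse_complement(seq: str) -> str:
    """Return the reverse complement of a DNA sequence."""
    return "".join(_COMPLEMENT.get(base, "N") for base in reversed(seq.upper()))


def kmers(seq: str, k: int) -> Set[str]:
    """Return the set of all length-k substrings of seq."""
    seq = seq.upper()
    return {seq[i:i + k] for i in range(len(seq) - k + 1)}


def build_human_index(reference_seqs: Iterable[str], k: int) -> Set[str]:
    """Index every k-mer of the reference on both strands."""
    index: Set[str] = set()
    for ref in reference_seqs:
        index |= kmers(ref, k)
        index |= kmers(reverse_complement(ref), k)
    return index


def human_fraction(read: str, index: Set[str], k: int) -> float:
    """Fraction of the read's distinct k-mers that occur in the human index."""
    read_kmers = kmers(read, k)
    if not read_kmers:
        # Reads shorter than k cannot be scored; report 0.0 so they are kept.
        return 0.0
    return len(read_kmers & index) / len(read_kmers)


def subtract_human(
    reads: Iterable[str],
    index: Set[str],
    k: int,
    max_human_fraction: float = 0.5,
) -> List[str]:
    """Keep only reads whose human k-mer fraction is at or below the threshold."""
    return [r for r in reads if human_fraction(r, index, k) <= max_human_fraction]


# Toy data: a made-up "human" reference and three reads.
reference = "ACGTACGTTAGCCGATAGGCTTAACG"
human_read = "GTTAGCCGATAGG"                    # taken from the reference
human_read_rc = reverse_complement(human_read)  # same locus, opposite strand
candidate_read = "TTTTTTTTGGGGGGGGCCCC"         # shares no 8-mer with the reference

index = build_human_index([reference], k=8)
residual = subtract_human([human_read, human_read_rc, candidate_read], index, k=8)
```

In this sketch, `residual` holds the reads that survive subtraction. In the workflow the abstract describes, such residual sequences would be the candidates carried forward to experimental testing in validation cohorts.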

Public Health Relevance

Viruses are among the major causes of human cancer. Discovering these viruses can lead to major improvements in public health, because virally induced cancers can be prevented by vaccination. In recent years, hepatitis B vaccination has led to a dramatic decrease in the occurrence of liver cancer, and human papillomavirus vaccination has been shown to decrease the rates of cervical carcinoma. Genome analysis and sequencing technologies are being used to discover the causes of human cancer, in projects such as The Cancer Genome Atlas (TCGA). These technologies can also lead to the discovery of new viruses. Therefore, the National Cancer Institute is investing funds from the American Recovery and Reinvestment Act of 2009 to support the discovery of new viruses in data from cancer genome projects such as TCGA. Our proposal responds to the National Cancer Institute request entitled "Identifying Potential Viral Signatures in Large Scale Studies of Germline and Somatic Changes in Cancer Genomes Pilot Program". We have developed a powerful computational approach to compare DNA and RNA sequences from cancer, or from cancer patients, to the normal human genome. Sequences that are unique to cancers, or to cancer patients, may represent novel cancer-causing viruses. In this plan, we will build a stable software infrastructure to perform this sequence comparison, apply this infrastructure to data from large-scale cancer genome projects, test candidate sequences for whether they are likely to represent viruses, and then continue to improve the software infrastructure. This effort will enable discovery of viruses by the entire cancer research community.

Agency: National Institutes of Health (NIH)
Institute: National Cancer Institute (NCI)
Type: High Impact Research and Research Infrastructure Programs (RC2)
Project #: 1RC2CA148317-01
Application #: 7856252
Study Section: Special Emphasis Panel (ZCA1-GRB-I (O9))
Program Officer: Lee, Jerry S
Project Start: 2009-09-30
Project End: 2011-08-31
Budget Start: 2009-09-30
Budget End: 2010-09-29
Support Year: 1
Fiscal Year: 2009
Total Cost: $765,137
Indirect Cost:

Name: Dana-Farber Cancer Institute
Department:
Type:
DUNS #: 076580745
City: Boston
State: MA
Country: United States
Zip Code: 02215

Arvey, Aaron; Ojesina, Akinyemi I; Pedamallu, Chandra Sekhar et al. (2015) The tumor virus landscape of AIDS-related lymphomas. Blood 125:e14-22
Bhatt, Ami S; Manzo, Veronica E; Pedamallu, Chandra Sekhar et al. (2014) In search of a candidate pathogen for giant cell arteritis: sequencing-based characterization of the giant cell arteritis microbiome. Arthritis Rheumatol 66:1939-44
Kostic, Aleksandar D; Chun, Eunyoung; Robertson, Lauren et al. (2013) Fusobacterium nucleatum potentiates intestinal tumorigenesis and modulates the tumor-immune microenvironment. Cell Host Microbe 14:207-15
Bhatt, Ami S; Freeman, Samuel S; Herrera, Alex F et al. (2013) Sequence-based discovery of Bradyrhizobium enterica in cord colitis syndrome. N Engl J Med 369:517-28
Kostic, Aleksandar D; Gevers, Dirk; Pedamallu, Chandra Sekhar et al. (2012) Genomic analysis identifies association of Fusobacterium with colorectal carcinoma. Genome Res 22:292-8
Cesarman, Ethel (2011) Gammaherpesvirus and lymphoproliferative disorders in immunocompromised patients. Cancer Lett 305:163-74
Kostic, Aleksandar D; Ojesina, Akinyemi I; Pedamallu, Chandra Sekhar et al. (2011) PathSeq: software to identify or discover microbes by deep sequencing of human tissue. Nat Biotechnol 29:393-6