Cancer genomes typically harbor a substantial number of somatic mutations. Relatively few driver mutations actually alter the function of proteins in tumor cells, whereas most mutations are considered to be functionally neutral passenger mutations. Over the past decade, the search for cancer driver mutations has focused on coding regions, and several mutational significance algorithms have been developed for coding mutations. The contribution of mutations in noncoding regulatory regions to tumor formation remains largely unknown, and current mutational significance algorithms are not designed to detect driver mutations in noncoding regions because of biological differences between coding and noncoding mutations. The emerging availability of large whole-genome sequencing datasets (e.g., PCAWG and HMF datasets) creates an ample opportunity to develop new mutational significance algorithms that are specifically designed for the interpretation of noncoding regions. Recently, we developed a new statistical approach that identifies driver mutations in coding regions based on the nucleotide context. Critically, consideration of the nucleotide context around mutations does not require prior knowledge of the functional consequences associated with these mutations. Hence, we hypothesize that generalizing our nucleotide context model to noncoding regions will uncover novel noncoding driver mutations that cannot be detected using the mutational significance approaches currently available. For this purpose, we will develop a statistical framework that incorporates the biological differences between coding and noncoding mutations and that is specifically designed to detect driver mutations in noncoding regions.
Specifically, we will model the context-dependent distribution of passenger mutations, model and accurately partition the background mutation rate, model the sequence composition of the reference genome, and account for coverage fluctuation. We will then combine these statistical components by computing the product of their underlying probabilities, treating the components as independent. We will derive a significance p-value using a Monte Carlo simulation approach and control the false discovery rate (FDR) to correct for multiple hypothesis testing. This strategy will allow us to accurately estimate the significance of somatic mutations in noncoding genomic regions. We will next apply this statistical framework to whole-genome sequencing data of 5,523 tumor patients, thereby deriving a comprehensive list of candidate driver mutations in noncoding regions. Finally, we will investigate whether noncoding mutations are overrepresented in transcription factor binding sites, regulate gene expression levels, induce alternative splicing, or affect epigenomic states. Upon the completion of this project, we will have developed and applied a statistical framework for discovery of significant somatic mutations in noncoding regions, and defined the mutational landscape of the noncoding cancer genome. All aspects of the methods developed and applied in this project will be made open source and developed on an online platform.
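
As a rough illustration of the statistical steps named above (a product of independent probabilities, a Monte Carlo p-value, and FDR correction), the following minimal Python sketch may help. It is not the project's implementation. The function names, the per-site Bernoulli null model, and the choice of the Benjamini-Hochberg procedure are all assumptions made for illustration; the abstract does not specify which FDR procedure or simulation scheme is used.

```python
import math
import random


def log_joint_probability(component_probs):
    """Log of the product of component probabilities.

    Assumes the components are independent, so their joint probability
    is the plain product (computed in log space to avoid underflow).
    """
    return sum(math.log(p) for p in component_probs)


def monte_carlo_p_value(observed_count, site_probs, n_sim=10000, seed=0):
    """Empirical p-value for the mutation count in one region.

    Hypothetical null model: each entry of site_probs is the background
    probability that one site is mutated, and sites mutate independently.
    The p-value is the share of simulated counts >= the observed count.
    """
    rng = random.Random(seed)
    hits = 0
    for _ in range(n_sim):
        simulated = sum(1 for p in site_probs if rng.random() < p)
        if simulated >= observed_count:
            hits += 1
    # Add-one correction keeps the estimate strictly positive.
    return (hits + 1) / (n_sim + 1)


def benjamini_hochberg(p_values):
    """Benjamini-Hochberg adjusted p-values (q-values), in input order."""
    m = len(p_values)
    order = sorted(range(m), key=lambda i: p_values[i])
    q_values = [0.0] * m
    running_min = 1.0
    # Walk from the largest p-value down, enforcing monotone q-values.
    for rank in range(m, 0, -1):
        i = order[rank - 1]
        running_min = min(running_min, p_values[i] * m / rank)
        q_values[i] = running_min
    return q_values


if __name__ == "__main__":
    # Toy data: three regions of 200 sites with a uniform background rate.
    background = [0.01] * 200
    observed_counts = [2, 3, 12]
    p_vals = [monte_carlo_p_value(c, background, n_sim=2000) for c in observed_counts]
    for count, q in zip(observed_counts, benjamini_hochberg(p_vals)):
        print(count, q)
```

In this toy setup, a region whose observed count far exceeds what the background rate predicts receives a small q-value, while regions with counts near expectation do not. A real analysis would replace the uniform background with context-, composition-, and coverage-aware probabilities.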

Public Health Relevance

While coding cancer driver mutations have been characterized in detail over the past decade, the contribution of noncoding mutations to tumor formation remains largely unknown, apart from a few examples (e.g., mutations in the TERT promoter). Recently, large-scale whole-genome sequencing datasets have been made available, but a major bottleneck for the biological and clinical interpretation of these cancer whole-genome cohorts is the lack of statistical models that identify driver mutations in noncoding regions. We developed a new statistical approach that characterizes driver mutations based on their surrounding nucleotide context in coding regions, and herein we propose a concrete plan to generalize our computational model to noncoding regions, apply our model to aggregated whole-genome sequencing data of 5,523 tumor patients (PCAWG, HMF datasets), and define the noncoding driver and passenger mutational landscape for biological discovery and focused clinical application.

Agency
National Institutes of Health (NIH)
Institute
National Cancer Institute (NCI)
Type
Exploratory/Developmental Grants (R21)
Project #
1R21CA242861-01
Application #
9825986
Study Section
Special Emphasis Panel (ZCA1)
Program Officer
Miller, David J
Project Start
2019-07-01
Project End
2021-06-30
Budget Start
2019-07-01
Budget End
2020-06-30
Support Year
1
Fiscal Year
2019
Total Cost
Indirect Cost
Name
Dana-Farber Cancer Institute
Department
Type
DUNS #
076580745
City
Boston
State
MA
Country
United States
Zip Code
02215