The goals of this proposal are to develop novel statistical tools and a software package for performing mutational signature deconvolution in cancer samples. Mutational signatures are patterns of co-occurring mutations that can reveal insights into a cancer's etiology and evolution. Currently, non-negative matrix factorization (NMF) is the "gold standard" for mutational signature deconvolution. However, NMF has several deficiencies: it cannot (1) easily characterize patterns within the flanking sequence beyond the trinucleotide context, (2) simultaneously characterize patterns across several genomic features, or (3) predict mutational signatures of new samples given a previously trained model. In this proposal, we will develop a novel discrete Bayesian hierarchical model that characterizes mutational signatures in tumor sequencing data and overcomes these limitations of NMF. Models of this type are commonly used in text mining applications to infer topics by examining co-occurring word counts across documents. Our model will be able to characterize information about the flanking sequence far beyond the trinucleotide context, incorporate information from other genomic features such as strand or region, and predict signatures in single samples. Importantly, unlike with NMF, the inclusion of extra genomic features in our clustering algorithm will not result in loss of power for discovery, and the additional information will aid in prediction of mutational signatures in targeted sequencing data. We will also develop an R/Bioconductor package for data preprocessing, inference, and visualization, which will streamline mutational signature analysis for researchers. Both NMF and our novel model will be available in the package so users can compare and contrast the different computational approaches for mutational signature inference.
This package will also be able to interface with several existing projects from the Informatics Technology for Cancer Research (ITCR) program. Finally, we will generate reference mutational signatures by analyzing a large-scale cancer exome sequencing dataset from The Cancer Genome Atlas (TCGA); these reference signatures can be used to predict mutational signatures in single samples in clinical workflows. Overall, our model will be of great interest to the cancer community because it will provide greater insight into mutational signature patterns and will be useful in clinical settings where mutational signature inference is performed on single samples.
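
To make the idea of predicting signatures in a single sample concrete, the following is a minimal sketch and not the proposed model. It assumes a set of reference signatures has already been learned and is held fixed, treats one sample's mutation counts as draws from a mixture of those signatures, and estimates the mixture weights with a plain EM loop (the kind of "fold-in" step used with topic models). The function name `estimate_exposures`, the four mutation categories, and the two toy signatures are all hypothetical and chosen only for illustration; real analyses typically use 96 or more mutation categories.

```python
def estimate_exposures(counts, signatures, n_iter=200):
    """Estimate signature weights for one sample with signatures held fixed.

    counts:     list of mutation counts, one per mutation category
    signatures: list of signatures, each a list of category probabilities
    Returns a list of weights (one per signature) that sums to 1.
    """
    k = len(signatures)
    theta = [1.0 / k] * k  # start from uniform weights
    for _ in range(n_iter):
        new_theta = [0.0] * k
        for m, c in enumerate(counts):
            if c == 0:
                continue
            # E-step: responsibility of each signature for category m
            joint = [theta[j] * signatures[j][m] for j in range(k)]
            norm = sum(joint)
            if norm == 0.0:
                continue  # no signature can explain this category
            for j in range(k):
                new_theta[j] += c * joint[j] / norm
        # M-step: renormalize expected counts into weights
        total = sum(new_theta)
        theta = [t / total for t in new_theta]
    return theta


# Hypothetical reference signatures over four toy mutation categories.
reference_signatures = [
    [0.7, 0.1, 0.1, 0.1],  # toy signature A
    [0.1, 0.1, 0.1, 0.7],  # toy signature B
]

# Toy counts for one sample, constructed as a 75% / 25% mix of A and B.
sample_counts = [110, 20, 20, 50]

exposures = estimate_exposures(sample_counts, reference_signatures)
print([round(w, 3) for w in exposures])
```

Because the signatures stay fixed, no other samples are needed at prediction time, which is the property the abstract contrasts with refitting NMF on a full cohort.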

Public Health Relevance

Chemicals and biological processes can cause the mutations that are observed in human tumors. Understanding the patterns of mutations (i.e., mutational signatures) in tumors that have undergone DNA sequencing can reveal insights about different mutagenic processes and how tumors develop. Because current computational methods such as non-negative matrix factorization (NMF) do not completely characterize these mutational patterns and cannot predict signatures in single samples, we will develop a novel computational method and corresponding R package that better characterize mutational patterns in tumors and can be used to predict signatures in single clinical samples.

National Institutes of Health (NIH)
National Cancer Institute (NCI)
Exploratory/Developmental Grants (R21)
Study Section: Special Emphasis Panel (ZCA1)
Program Officer: Miller, David J
Boston University
Internal Medicine/Medicine
Schools of Medicine
United States