Carcinogenesis, the progression of normal cells to malignant cancer, arises from the hallmark capabilities of cancer, which cells acquire through somatic mutations in driver genes that confer a selective advantage for cellular proliferation and, potentially, metastasis. A major motivation for modern cancer genomics studies is to decipher the genetic architecture of cancer by discovering new driver genes. The most widely used approaches to predicting and prioritizing driver genes are based on statistics of mutation frequencies. Several methods have been proposed to identify genes with an excessive number of somatic mutations [9-11], known as significantly mutated genes. I propose to address two major limitations of this approach. First, these methods lack sufficient statistical power given the amount of sequencing data currently available [15]. I will improve statistical power by developing a machine learning method that integrates the diverse cancer genomics information currently available. Second, there is little objective clarity about the true effectiveness of these methods [11, 14], because there is no agreed-upon gold standard of driver genes beyond a few well-known drivers. I will develop a framework for comparing the effectiveness of driver gene prediction methods in the absence of a gold standard. Identifying cancer driver genes both effectively and efficiently is of great importance to science funding policy for cancer genomics.

Public Health Relevance

Large sequencing studies have revolutionized our ability to characterize the genetic architecture of cancer. However, effectively integrating this stream of big data to identify specific driver genes has remained difficult. My proposed research project aims to develop an integrative machine learning method that leverages diverse features in cancer genomics to improve predictions of cancer driver genes, and to apply a principled approach to evaluating the performance of any such method.

National Institutes of Health (NIH)
National Cancer Institute (NCI)
Predoctoral Individual National Research Service Award (F31)
Study Section
Special Emphasis Panel (ZRG1)
Program Officer
Mcguirl, Michele
Johns Hopkins University
Biostatistics & Other Math Sci
Biomed Engr/Col Engr/Engr Sta
United States
Reiter, Johannes G; Makohon-Moore, Alvin P; Gerold, Jeffrey M et al. (2018) Minimal functional driver gene heterogeneity among untreated metastases. Science 361:1033-1037
Ng, Patrick Kwok-Shing; Li, Jun; Jeong, Kang Jin et al. (2018) Systematic Functional Annotation of Somatic Mutations in Cancer. Cancer Cell 33:450-462.e10
Cai, Binghuang; Li, Biao; Kiga, Nikki et al. (2017) Matching phenotypes to whole genomes: Lessons learned from four iterations of the personal genome project community challenges. Hum Mutat 38:1266-1276
Tokheim, Collin J; Papadopoulos, Nickolas; Kinzler, Kenneth W et al. (2016) Evaluating the evaluation of cancer driver genes. Proc Natl Acad Sci U S A 113:14330-14335
Tokheim, Collin; Bhattacharya, Rohit; Niknafs, Noushin et al. (2016) Exome-Scale Discovery of Hotspot Mutation Regions in Human Cancer Using 3D Protein Structure. Cancer Res 76:3719-3731