With advances in technology, the cancer research enterprise is rapidly becoming data-intensive and data-driven. One example is the explosion of biotechnologies and the generation of massive genetic and genomic data, such as whole genome sequencing data. Another example is health informatics, which allows rapid availability of large administrative health care databases, such as electronic medical records and Medicare claims data. Cancer data science has become increasingly important in cancer research. Indeed, massive data provide unprecedented opportunities for new discoveries in cancer. This Project aims to develop and apply statistical and computational methods for the analysis of massive and complex genetic and genomic data, together with epidemiological and clinical data, in the population and medical sciences of cancer research. Our ultimate goal is to use rich data sources to understand cancer etiology, risk, and prognosis, and to discover new effective strategies for cancer prevention, intervention, and treatment. It has become increasingly evident that the shortage of methods suitable for analyzing massive data is a bottleneck in translating rich information into meaningful knowledge. There is a pressing need to develop statistical and computational methods for massive cancer data to bridge the technology and information transfer gap and to accelerate innovations in cancer prevention and treatment. This Project aims to narrow this gap. Specifically, to advance genetic and genomic cancer epidemiology, we will develop statistical and computational methods for (a) analysis of whole genome sequencing association studies; (b) integrative analysis of genetic, genomic, and environmental data; (c) study of gene-environment interactions; and (d) risk prediction using whole genome genetic and genomic data and environmental data.
To advance cancer genomic medicine, we will develop statistical and computational methods for integrative analysis of genetic, genomic, and clinical data to understand cancer prognosis and advance precision medicine using (a) data from genetic epidemiological cohort studies; and (b) data from genetic epidemiological cohort studies combined with administrative databases such as electronic medical records and Medicare claims data. We have assembled a strong, collaborative, interdisciplinary team of biostatisticians, computational biologists, health informaticians, genetic epidemiologists, and clinical scientists. We will apply the proposed methods to lung, breast, and nasopharynx cancer genetic epidemiological and clinical studies. We will develop open-access, user-friendly software to be distributed to the research community, and open online educational modules for training cancer researchers in using the methods developed in this Project.

Public Health Relevance

Analytic methods, such as statistical and computational methods, that can handle the complexities of big cancer data play a pivotal role in capitalizing more fully on such data. They will enable cancer researchers to extract knowledge from massive, complex, and diverse data in a timely and effective manner, gain insights into cancer etiology, risk, and prognosis, and develop new strategies to reduce the cancer burden and improve patient care.

National Institutes of Health (NIH)
National Cancer Institute (NCI)
Unknown (R35)
Study Section
Special Emphasis Panel (ZCA1-GRB-I (M1))
Program Officer
Chen, Huann-Sheng
Harvard University
Biostatistics & Other Math Sci
Schools of Public Health
United States
Chen, Han; Cade, Brian E; Gleason, Kevin J et al. (2018) Multiethnic Meta-Analysis Identifies RAI1 as a Possible Obstructive Sleep Apnea-related Quantitative Trait Locus in Men. Am J Respir Cell Mol Biol 58:391-401
Li, Yafang; Xiao, Xiangjun; Han, Younghun et al. (2018) Genome-wide interaction study of smoking behavior and non-small cell lung cancer risk in Caucasian population. Carcinogenesis 39:336-346
Xia, Yin; Cai, Tianxi; Cai, T Tony (2018) Multiple Testing of Submatrices of a Precision Matrix with Applications to Identification of Between Pathway Interactions. J Am Stat Assoc 113:328-339
Domenyuk, Valeriy; Gatalica, Zoran; Santhanam, Radhika et al. (2018) Poly-ligand profiling differentiates trastuzumab-treated breast cancer patients according to their outcomes. Nat Commun 9:1219
Barfield, Richard; Feng, Helian; Gusev, Alexander et al. (2018) Transcriptome-wide association studies accounting for colocalization using Egger regression. Genet Epidemiol 42:418-433
Liu, Zhonghua; Lin, Xihong (2018) Multiple phenotype association tests using summary statistics in genome-wide association studies. Biometrics 74:165-175
Lopes-Ramos, Camila M; Kuijjer, Marieke L; Ogino, Shuji et al. (2018) Gene Regulatory Network Analysis Identifies Sex-Linked Differences in Colon Cancer Drug Metabolism. Cancer Res 78:5538-5547
Sinnott, Jennifer A; Cai, Tianxi (2018) Pathway aggregation for survival prediction via multiple kernel learning. Stat Med 37:2501-2515
Sun, Ryan; Carroll, Raymond J; Christiani, David C et al. (2018) Testing for gene-environment interaction under exposure misspecification. Biometrics 74:653-662
Antonelli, Joseph; Cefalu, Matthew; Palmer, Nathan et al. (2018) Doubly robust matching estimators for high dimensional confounding adjustment. Biometrics :
