Randomized clinical trials (RCTs) are the gold standard for evaluating treatments for cancer, a disease that imposes immense health and economic burdens worldwide. However, the practical considerations that allow an RCT to be conducted typically require a relatively small sample size and restricted eligibility criteria, so that the study has inadequate power to generalize treatment effects to elderly patients or other under-represented patient populations. On the other hand, massive real-world data (RWD) are increasingly captured by population-based databases and registries, such as Surveillance, Epidemiology, and End Results (SEER), SEER-Medicare, and the National Cancer Database (NCDB), which have much broader demographic and clinical diversity than RCT cohorts. Treatment evaluation using causal inference methods and RWD that were not collected purely for research purposes is now frequently performed but is fraught with limitations such as confounding due to lack of randomization. In fact, in analyses of matched RCT and RWD studies with the same treatment comparisons, the agreement between RCT and RWD findings is often low. Although several national organizations and regulatory agencies have advocated using RWD to complement RCTs, methods that integrate these two potentially complementary data sources and achieve better treatment evaluation than the use of a single data source alone have yet to be developed. This proposal is motivated by the PIs' collaborative work to study the safety and efficacy of treatment strategies for elderly non-small cell lung cancer (NSCLC) and esophageal cancer patients by integrating data from multiple sources: RCTs from NCI cooperative groups and real-world databases (e.g., SEER, SEER-Medicare, and NCDB).
The objective of this project is to develop new statistical methods for integrative analyses of RCTs and RWD that improve the generalizability of RCT findings to more diverse real-world patients and under-studied populations and increase estimation efficiency, while avoiding the confounding bias inherent in RWD.
In Aim 1, we develop methods for statistical analysis of RCT data to compare chemoradiotherapy patterns for real-world and elderly NSCLC patients by leveraging the baseline covariates of comparable patients from SEER, for whom both the timing of chemotherapy and radiation and the outcome are missing.
Aims 2 and 3 focus on settings in which both RCT and RWD provide comparable covariate, treatment, and outcome information.
In Aim 2, we develop improved analyses of RCT data to evaluate trimodality therapy versus surgery alone for real-world and elderly esophageal cancer patients by exploiting the large sample size and predictive power offered by NCDB/SEER-Medicare.
In Aim 3, we develop new efficient and data-adaptive methods to estimate individualized treatment effects of adjuvant chemotherapy versus observation, possibly modified by age and tumor size, for stage IB resected NSCLC patients by integrating RCT and NCDB data.

Public Health Relevance

The proposed research is closely in line with the 21st Century Cures Act, passed in 2016, which placed additional focus on the use of big real-world data to support decision making and precision medicine. The availability of multiple data sources, namely randomized clinical trials (RCTs) and real-world databases, presents unique and novel opportunities for medical research, because the knowledge that can be acquired from integrative analyses cannot be obtained from any single-source analysis alone. Our effort is important for bridging RCTs and the vast real-world databases and registries arising from clinical practice, in order to better understand how treatments work for real-world and under-studied patient populations that fall outside relatively narrow RCT eligibility criteria, and to provide accurate and reliable evidence for patient-centered care.

National Institutes of Health (NIH)
National Institute on Aging (NIA)
Research Project (R01)
Study Section
Biostatistical Methods and Research Design Study Section (BMRD)
Program Officer
Salive, Marcel
North Carolina State University Raleigh
Biostatistics & Other Math Sci
Schools of Arts and Sciences
United States