A reliable and precise prognosis is fundamental for successful disease management and treatment selection. More aggressive intervention can be given to patients who are at high risk of early disease onset, while patients who are unlikely to respond to one treatment should be considered for alternative options. With the rapid advancement of technology, a wide range of biological and genomic markers have emerged as potential tools for improving the prediction of disease and treatment outcomes, and may lead to personalized, tailored medicine. New technologies such as DNA sequencing and microarrays are generating detailed data with exponentially increasing dimensionality and complexity. These data present unprecedented opportunities and great challenges for accurately predicting clinical outcomes. To take full advantage of such data, this proposal aims to develop statistical approaches to efficiently construct and evaluate prognostic tools for disease risk assessment and treatment selection.
Specifically, in Aim 1, we will develop accurate risk prediction models by incorporating complex interactive effects via a kernel machine regression framework. We will also provide non-parametric procedures for assessing the predictive performance of the resulting models.
In Aim 2, we propose inference procedures for absolute risks and prediction performance of new markers using two-phase studies.
In Aim 3, we develop systematic procedures for identifying subgroups of patients who may or may not benefit from a new treatment using patient-level baseline marker information.
In Aim 4, we focus on high-dimensional regression and develop regularized resampling methods to construct confidence intervals and hypothesis testing procedures for regression coefficients and the prediction performance of estimated models.
To increase the practical impact of our research, in addition to creating software for public use, we will apply the proposed procedures to predict individual risk of developing (i) rheumatoid arthritis among women using the Nurses' Health Study (NHS); (ii) cardiovascular disease (CVD) among diabetic patients using the NHS and the Health Professionals Follow-up Study; (iii) AIDS-defining events among HIV-infected patients using a large immunogenetic study; and (iv) coronary heart disease (CHD) or stroke using the Women's Health Initiative (WHI) study. We also plan to develop algorithms to identify cases of various autoimmune diseases using electronic medical record (EMR) data from two large hospitals in Boston. The identified cases will be used for subsequent genetic case-control studies of the corresponding diseases. Such algorithms will enable the use of EMR clinical data directly for discovery research. In addition, we will develop treatment selection strategies for HIV-infected patients using randomized AIDS Clinical Trials Group (ACTG) trials and for dietary intervention in preventing CVD using WHI clinical trials. Incorporating genetic profiles and modifiable risk factors, along with biologic markers, into risk models is likely to improve the prediction of clinical outcomes and ultimately lead to personalized medicine.

Public Health Relevance

This research proposal addresses the pressing need for advanced statistical tools that meet the challenges in the current development of prediction models for disease risk and treatment benefit. By providing statistical tools that enable clinical investigators to effectively develop personalized disease management strategies, this proposal will join prior and ongoing research activities toward the goal of efficient and cost-effective personalized medicine.

National Institutes of Health (NIH)
Research Project (R01)
Study Section
Biomedical Computing and Health Informatics Study Section (BCHI)
Program Officer
Marcus, Stephen
Harvard University
Biostatistics & Other Math Sci
Schools of Public Health
United States
Matsouaka, Roland A; Li, Junlong; Cai, Tianxi (2014) Evaluating marker-guided treatment selection strategies. Biometrics 70:489-99
Parast, Layla; Tian, Lu; Cai, Tianxi (2014) Landmark Estimation of Survival and Treatment Effect in a Randomized Clinical Trial. J Am Stat Assoc 109:384-394
Sinnott, Jennifer A; Dai, Wei; Liao, Katherine P et al. (2014) Improving the power of genetic association tests with imperfect phenotype derived from electronic medical records. Hum Genet 133:1369-82
Yu, Sheng; Kumamaru, Kanako K; George, Elizabeth et al. (2014) Classification of CT pulmonary angiography reports by presence, chronicity, and location of pulmonary embolism with natural language processing. J Biomed Inform 52:386-93
Zhou, Qian M; Zheng, Yingye; Cai, Tianxi (2013) Assessment of biomarkers for risk prediction with nested case-control studies. Clin Trials 10:677-9
Zhao, Lihui; Tian, Lu; Cai, Tianxi et al. (2013) Effectively selecting a target population for a future comparative study. J Am Stat Assoc 108:527-539
Zheng, Yingye; Cai, Tianxi; Pepe, Margaret S (2013) Adopting nested case-control quota sampling designs for the evaluation of risk markers. Lifetime Data Anal 19:568-88
Parast, Layla; Cai, Tianxi (2013) Landmark risk prediction of residual life for breast cancer survival. Stat Med 32:3459-71
Zhou, Qian M; Zheng, Yingye; Cai, Tianxi (2013) Subgroup specific incremental value of new markers for risk prediction. Lifetime Data Anal 19:142-69
Cai, Tianxi; Gerds, Thomas A; Zheng, Yingye et al. (2011) Robust prediction of t-year survival with data from multiple studies. Biometrics 67:436-44
