In recent years, the health community has witnessed the digitization of medical records proceed at an astonishing pace. Longitudinal databases promise to let health informatics researchers conduct observational research at massive scale. This project's goal is to contribute to the statistical and computational advances that will make that promise a reality. Our three aims highlight clinical research scenarios in which current analytic procedures are inadequate for large-scale observational data; for each, we propose novel methods to address these shortcomings.
In Specific Aim 1, we will conduct a rigorous analysis of propensity score estimation tools for reducing bias in drug effect estimates in high-dimensional settings. We will develop an improved simulation framework and conduct negative control experiments to assess the performance of LASSO (L1-penalized) regularization and the high-dimensional propensity score (hdPS) algorithm.
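To make the LASSO propensity score idea concrete, here is a minimal sketch in Python. It models treatment assignment with L1-penalized logistic regression and returns each patient's estimated probability of treatment. The solver (plain proximal gradient descent), the function names, and the parameters `lam`, `step`, and `n_iter` are assumptions chosen for illustration; the abstract does not say how the models are fit, and a large-scale study would need a dedicated solver.

```python
import math


def sigmoid(z):
    """Numerically stable logistic function."""
    if z >= 0:
        return 1.0 / (1.0 + math.exp(-z))
    ez = math.exp(z)
    return ez / (1.0 + ez)


def soft_threshold(value, threshold):
    """Proximal operator of the L1 penalty: shrink toward zero."""
    if value > threshold:
        return value - threshold
    if value < -threshold:
        return value + threshold
    return 0.0


def fit_lasso_propensity(X, treated, lam, step=0.1, n_iter=2000):
    """Fit an L1-penalized logistic model of treatment on covariates.

    Uses proximal gradient descent (an assumption of this sketch).
    The intercept is left unpenalized.
    """
    n, p = len(X), len(X[0])
    intercept, beta = 0.0, [0.0] * p
    for _ in range(n_iter):
        # Residual: predicted treatment probability minus observed treatment.
        resid = [
            sigmoid(intercept + sum(b * v for b, v in zip(beta, row))) - t
            for row, t in zip(X, treated)
        ]
        intercept -= step * sum(resid) / n
        for j in range(p):
            grad_j = sum(r * row[j] for r, row in zip(resid, X)) / n
            # Gradient step followed by shrinkage; small effects become exactly 0.
            beta[j] = soft_threshold(beta[j] - step * grad_j, step * lam)
    return intercept, beta


def propensity_scores(X, intercept, beta):
    """Estimated probability of treatment for each row of X."""
    return [
        sigmoid(intercept + sum(b * v for b, v in zip(beta, row))) for row in X
    ]
```

With a large enough penalty every coefficient is shrunk to exactly zero, which is how the L1 penalty performs covariate selection when there are far more candidate covariates than a conventional model could handle.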
In Specific Aim 2, we will establish the viability of conditional logistic regression (CLR) for conducting large-scale observational cohort studies. CLR, which has demonstrated bias reduction benefits over unconditional logistic regression in case-control studies, has not been evaluated for cohort studies because of its computational burden. We propose combining a dynamic programming algorithm with majorization-minimization (MM) algorithms to substantially speed up CLR computation.
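The computational burden of CLR comes from the denominator of its conditional likelihood, which for a stratum with n subjects and m cases sums over every size-m subset of subjects. The sketch below shows one standard dynamic programming recursion for that sum (it is the elementary symmetric polynomial of the per-subject weights exp(x·β)), reducing the cost from combinatorial to O(n·m). The abstract does not specify which recursion the project uses, so this is an illustrative assumption, and all function names are hypothetical.

```python
import math
from itertools import combinations


def clr_denominator(weights, n_cases):
    """Sum, over all subsets of size n_cases, of the product of weights."""
    # dp[k] is the degree-k elementary symmetric polynomial
    # of the weights processed so far.
    dp = [0.0] * (n_cases + 1)
    dp[0] = 1.0
    for w in weights:
        # Iterate k downward so each weight enters a subset at most once.
        for k in range(n_cases, 0, -1):
            dp[k] += w * dp[k - 1]
    return dp[n_cases]


def clr_denominator_bruteforce(weights, n_cases):
    """Same quantity by explicit enumeration; feasible only for tiny strata."""
    return sum(math.prod(subset) for subset in combinations(weights, n_cases))


def stratum_log_likelihood(x, y, beta):
    """Conditional log-likelihood contribution of one stratum.

    x: list of covariate vectors, y: list of 0/1 outcomes, beta: coefficients.
    """
    eta = [sum(b * v for b, v in zip(beta, row)) for row in x]
    weights = [math.exp(e) for e in eta]
    n_cases = sum(y)
    numerator = sum(e for e, yi in zip(eta, y) if yi == 1)
    return numerator - math.log(clr_denominator(weights, n_cases))
```

In a cohort design, strata can contain many subjects and many cases, which is where enumeration becomes infeasible and a recursion of this kind matters.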
In Specific Aim 3, we challenge the one-drug/one-outcome study paradigm that dominates drug safety surveillance research by developing computational tools for analyzing thousands of drug-outcome pairs. We apply MM algorithms to observational study designs such as logistic regression and the self-controlled case series so that their statistical computation can be massively parallelized on graphics processing units (GPUs).
For each specific aim, we conduct evaluation studies in pressing clinical scenarios; for instance, we study the relative risks of serious complications such as stroke and major bleeding associated with anticoagulation medications. The anticipated impact of successfully completing these aims includes greater clarity in the selection of propensity score models, the viability of conditional logistic regression for cohort studies, and substantial improvement in the statistical computation for standardized drug safety studies that analyze thousands of drug-outcome pairs.
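To illustrate why MM algorithms suit parallel hardware, the sketch below fits an ordinary logistic regression by repeatedly minimizing a separable quadratic surrogate that bounds the negative log-likelihood from above. Because the surrogate is separable, each coefficient's update depends only on its own gradient entry, so all updates in an iteration could run concurrently. The particular bound used here (curvature at most 1/4, combined with a Cauchy-Schwarz split across coordinates) is one possible choice assumed for illustration, not necessarily the one used in the project; the code runs serially on the CPU and all names are hypothetical.

```python
import math


def sigmoid(z):
    """Numerically stable logistic function."""
    if z >= 0:
        return 1.0 / (1.0 + math.exp(-z))
    ez = math.exp(z)
    return ez / (1.0 + ez)


def neg_log_likelihood(X, y, beta):
    """Negative log-likelihood of a logistic regression model."""
    total = 0.0
    for row, yi in zip(X, y):
        eta = sum(b * v for b, v in zip(beta, row))
        # log(1 + exp(eta)), computed without overflow.
        total += max(eta, 0.0) + math.log1p(math.exp(-abs(eta))) - yi * eta
    return total


def mm_logistic(X, y, n_iter=200):
    """Fit logistic regression with a separable quadratic MM surrogate."""
    p = len(X[0])
    # Per-coordinate curvature bounds: d_j = 1/4 * sum_i |x_ij| * sum_k |x_ik|.
    # diag(d) dominates the Hessian everywhere, so every step cannot
    # increase the objective.
    d = [
        0.25 * sum(abs(row[j]) * sum(abs(v) for v in row) for row in X)
        for j in range(p)
    ]
    beta = [0.0] * p
    history = [neg_log_likelihood(X, y, beta)]
    for _ in range(n_iter):
        resid = [
            sigmoid(sum(b * v for b, v in zip(beta, row))) - yi
            for row, yi in zip(X, y)
        ]
        grad = [sum(r * row[j] for r, row in zip(resid, X)) for j in range(p)]
        # Each coordinate is updated independently of the others.
        beta = [
            b - g / dj if dj > 0 else b for b, g, dj in zip(beta, grad, d)
        ]
        history.append(neg_log_likelihood(X, y, beta))
    return beta, history
```

The trade-off is that a separable surrogate typically needs more iterations than Newton's method, but each iteration avoids matrix inversion and consists of independent per-coefficient work.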

Public Health Relevance

In this project, we aim to introduce a suite of methods that will enable statistically rigorous and computationally efficient analysis of high-dimensional observational health data. These methods can be broadly applied in health informatics for drug safety surveillance, comparative effectiveness research, and patient-level prediction of medical outcomes.

National Institutes of Health (NIH)
National Library of Medicine (NLM)
Predoctoral Individual National Research Service Award (F31)
Study Section
Biomedical Library and Informatics Review Committee (BLR)
Program Officer
Sim, Hua-Chuan
University of California Los Angeles
Biostatistics & Other Math Sci
Schools of Medicine
Los Angeles
United States
Tian, Yuxi; Schuemie, Martijn J; Suchard, Marc A (2018) Evaluating large-scale propensity score performance through real-world and synthetic data experiments. Int J Epidemiol 47:2005-2014