The modernization and standardization of clinical care information systems are creating large networks of linked electronic health records (EHR) that capture key treatments and select patient outcomes for millions of patients throughout the country. The observational data emerging from these systems provide an unparalleled opportunity to learn about the effectiveness of existing and novel treatments, and to monitor potential safety issues that may arise when interventions are used in broad patient populations. However, exposures in observational clinical data are driven by many factors, so aggressive adjustment is needed to remove as much confounding bias as possible before effects can be attributed to select exposures. The field of machine learning provides a powerful collection of data-driven approaches for performing flexible, thorough confounding adjustment, but reliable statistical inference is particularly challenging when these techniques are part of the analytic strategy. We propose to advance reproducible research methods by developing and illustrating novel targeted learning tools that leverage the flexibility of machine learning methods to detect and characterize health effect signals using large-scale EHR data. Specifically, we will first develop techniques for making efficient, statistically valid and robust inference for treatment effects using state-of-the-art machine learning tools. We will also develop online learning techniques to make such inference in the context of streaming EHR data. These methodological advances will enable us to formulate a formal, rigorous and practical framework for conducting continuous, effective and reliable surveillance for safety endpoints.
Finally, we will develop statistical approaches for incorporating prior information, such as demographic, epidemiologic or pharmacodynamic knowledge, to improve health effect estimation and inference when the health outcome of interest is rare and the statistical problem is thus difficult, as often occurs in safety surveillance. The ultimate goal of the proposed research is to enable biomedical researchers and public health regulators to carefully monitor and protect the health of the public by allowing them to detect, more effectively and more reliably, critical health effect signals that may be contained in population-scale EHR data.
The modernization and standardization of clinical care information systems are creating large networks of linked electronic medical records that capture key treatments and select patient outcomes for millions of U.S. subjects. The population scale of contemporary health care data is opening new opportunities for quickly learning from observational data, and is now supporting ongoing national surveillance that will monitor the risks and benefits of both existing and novel treatment paths. The objective of this proposal is to provide an inferential framework that leverages the flexibility of machine learning methods to detect health effect signals, including in the important setting of high-dimensional confounders and/or rare events, and to develop a real-time sequential updating methodology for safety signal detection.