Electronic health record (EHR) databases collect data that reflect routine clinical care. These databases are increasingly used in comparative effectiveness research, patient-centered outcomes research, quality improvement assessment, and public health surveillance to generate actionable evidence that improves patient care. It is often necessary to analyze multiple databases that cover large and diverse populations to improve the statistical power of a study or the generalizability of its findings. A common approach to analyzing multiple databases is a distributed research network (DRN) architecture, in which data remain under the physical control of the data partners. Although EHRs are generally thought to contain rich clinical information, that information is not uniformly collected: certain information is available only for some patients, and only at some time points for a given patient. There are generally two types of missing information in EHRs. The first is the conventionally understood, obvious kind of missing data, in which some data fields (e.g., body mass index) are incomplete for various reasons, e.g., the clinician does not collect the information or the patient chooses not to provide it. The second is less obvious because the data field is not empty, but the recorded value may be incorrect because the underlying data are missing. For example, EHRs generally do not have complete data for care that occurs in a different delivery system. A medical condition (e.g., asthma) may be coded as "no" when the true value would have been "yes" had more complete data been available, e.g., from claims data, because the other delivery system would submit a claim to the patient's health plan for the care it provided. In other words, one may incorrectly treat "absence of evidence" as "evidence of absence". EHRs hold great promise, but we must address several outstanding methodological challenges inherent in these databases, specifically missing data.
Addressing missing data is more challenging in DRNs because missing data mechanisms differ across databases.
The specific aims of the study are: (1) Apply and assess missing data methods developed in single-database settings to handle obvious and well-recognized missing data in DRNs; (2) Apply and assess machine learning and predictive modeling techniques to address less obvious and under-recognized missing data for select variables in DRNs; and (3) Apply and assess a comprehensive analytic approach that combines conventional missing data methods and machine learning techniques to address missing data in DRNs. The analytic methods developed in this project, including the extension of existing missing data methods to DRNs, the innovative use of machine learning techniques to address missing data, and their integration with privacy-protecting analytic methods, will have a direct impact on the design and analysis of future comparative effectiveness and safety studies and of patient-centered outcomes research conducted in DRNs.