Enormous healthcare resources are devoted to compiling electronic medical record (EMR) databases, which are increasingly integrated and rich in patient population. These databases offer potential for identifying disease risk factors through statistical analyses that predict a patient's disease risk as a function of various factors (e.g., clinical and demographic) for that patient. Unfortunately, the disease event data may have high miscoding error rates, because clerical personnel with limited training are employed to enter the codes. For example, in one EMR database of patients with cardiac workup, a review of a random sample of cases recorded as sudden cardiac arrest events found an error rate of 75 percent. To take such errors into account and avoid developing unreliable risk assessment models, it is imperative that a doctor perform chart reviews to validate a sample of cases and determine whether the recorded events were true events. However, the number of chart reviews is limited by the high cost of doctors' time. The objective of this research is to develop a methodology for judiciously and efficiently selecting validation cases for maximum information content, which will allow reliable disease risk assessment even with highly error-prone EMR data. The anticipated benefits to the health and well-being of society are substantial, as this research will allow the enormous untapped potential of large EMR databases to be more fully utilized for discovering new disease risk factors. It is also anticipated that this research can be extended to other big-data application domains that require extracting reliable information from large quantities of data of questionable quality.

Large electronic medical record (EMR) databases offer potential for developing clinical hypotheses and identifying disease risk associations by fitting statistical models that predict the likelihood that a patient develops a particular condition as a function of various predictor variables (e.g., clinical, phenotypical, and demographic data) for that patient. Although the predictor variable data are often recorded reliably, the event data may have high error rates due to ICD-9 disease miscoding. To avoid developing unreliable risk assessment models, previous research used random validation sampling to estimate error probabilities, which were then used to correct biases in logistic regression models fit to the entire data set. That approach is both inefficient and unreliable when error rates are high. In contrast, this research will develop a validation sampling and reliable risk assessment (VSRRA) methodology for judiciously designing a validation sample. The intellectual underpinning is the observed analogy between VSRRA and traditional design of experiments (DOE), whereby validating the response for one error-prone case in VSRRA corresponds to conducting one experimental run in DOE. In light of this analogy, this research will develop (i) suitable VSRRA design criteria based on the Fisher information matrix for the model parameters and Bayesian counterparts such as posterior and preposterior parameter covariance matrices, applicable to a broad class of generalized linear models commonly used in medical risk studies; (ii) heuristic and more exact hybrid algorithms for selecting the validation sample to optimize the design criteria; (iii) multistage, sequential versions of the VSRRA sampling strategies that refine the designs based on information learned along the way, as new cases are validated; and (iv) methods that determine whether and how the full set of unvalidated data can be reliably included, along with the validated data, in the final model fitting.
A fundamental tenet of data analysis is that carefully designed experimental studies produce far more reliable statistical conclusions than observational studies. Likewise, it is anticipated that the DOE-based VSRRA methodology will allow far more reliable disease risk assessment and hypothesis generation.
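
To make the DOE analogy concrete, the following is a minimal sketch of one classical Fisher-information criterion (local D-optimality) applied to choosing which cases to validate under a logistic regression model. It is an illustration under stated assumptions, not the VSRRA methodology itself: the function names, the pilot parameter guess `beta_guess`, the small ridge term, and the synthetic data are all hypothetical, and the sketch ignores the information carried by the error-prone unvalidated responses. Each greedy step adds the case whose validation most increases the log-determinant of the information matrix, in the same way a sequential design adds one experimental run at a time.

```python
import numpy as np


def logistic_weights(X, beta):
    """Per-case Fisher information weights p(1 - p) under a logistic model."""
    p = 1.0 / (1.0 + np.exp(-X @ beta))
    return p * (1.0 - p)


def greedy_d_optimal(X, beta_guess, n_validate, ridge=1e-6):
    """Greedily choose cases that maximize log det of the Fisher information.

    X          : (n, d) predictor matrix, including an intercept column.
    beta_guess : (d,) pilot parameter values (assumed; a local design needs one).
    n_validate : number of cases to send for chart review.
    ridge      : small multiple of the identity that keeps the starting
                 information matrix invertible (an assumption of this sketch).
    """
    n, d = X.shape
    w = logistic_weights(X, beta_guess)
    M_inv = np.eye(d) / ridge          # inverse of the starting matrix ridge * I
    available = np.ones(n, dtype=bool)
    chosen = []
    for _ in range(n_validate):
        # Adding case i multiplies det(M) by 1 + w_i * x_i' M^{-1} x_i,
        # so the largest quadratic form gives the largest log-det gain.
        gain = w * np.einsum("ij,jk,ik->i", X, M_inv, X)
        gain[~available] = -np.inf
        i = int(np.argmax(gain))
        chosen.append(i)
        available[i] = False
        # Sherman-Morrison rank-one update of the inverse information matrix.
        u = M_inv @ X[i]
        M_inv -= (w[i] * np.outer(u, u)) / (1.0 + w[i] * (X[i] @ u))
    return chosen


# Synthetic stand-in for EMR predictors (intercept plus two covariates).
rng = np.random.default_rng(0)
X_demo = np.column_stack([np.ones(200), rng.normal(size=(200, 2))])
chosen_demo = greedy_d_optimal(X_demo, beta_guess=np.zeros(3), n_validate=20)
```

In a multistage version of this idea, `beta_guess` would be re-estimated from the cases validated so far before the next batch is chosen.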

Budget Start: 2014-09-01
Budget End: 2018-08-31
Fiscal Year: 2014
Total Cost: $399,999
Name: Northwestern University at Chicago
City: Chicago
State: IL
Country: United States
Zip Code: 60611