A clinical data warehouse (CDW) is a repository that aggregates patient data from many different sources: billing records and electronic medical records, including structured data (e.g., codes for diagnoses and procedures, vital signs), semi-structured reports, and free-text dictations. A key benefit of maintaining a CDW is that it provides the raw data needed for large-scale studies of real-world health care, for example, finding a previously unknown association between a painkiller (e.g., Vioxx) and heart disease. Unfortunately, CDWs are riddled with systematic errors that make it difficult to answer even the simplest questions (such as "What fraction of female outpatients have breast cancer?") with any accuracy.

This project focuses on statistical models and learning algorithms for quantifying and correcting errors in CDW records. For example, the project is developing semi-supervised learning methods that use the structured data present in electronic medical records (patient age, weight, medications, billing codes, etc.) to quantify the likelihood that a diagnosis code in the record is wrong (for example, being able to state "There is a 0.2 probability that the correct code was migraine instead of the listed headache"). The project will also develop methods that control for confounding variables present in the records, in order to remove systematic biases from the data.
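
To make the "probability that the correct code was migraine" statement concrete, the following is a minimal sketch of how such a number could be computed with plain Bayes' rule. It is not the project's semi-supervised method: every probability below is a hypothetical value chosen for illustration, and the names PRIOR, CONFUSION, ON_TRIPTAN, and posterior_true_code are assumptions introduced here, not part of the project or of any CDW software.

```python
# All numbers are hypothetical and for illustration only.

# P(true code): assumed base rates of the two diagnoses.
PRIOR = {"headache": 0.8, "migraine": 0.2}

# P(listed code | true code): an assumed coding-error model.
CONFUSION = {
    "headache": {"headache": 0.95, "migraine": 0.05},
    "migraine": {"headache": 0.30, "migraine": 0.70},
}

# P(patient has a triptan prescription | true code): one assumed
# structured feature taken from the medication record.
ON_TRIPTAN = {"headache": 0.02, "migraine": 0.40}


def posterior_true_code(listed, on_triptan):
    """Return P(true code | listed code, medication feature) via Bayes' rule."""
    scores = {}
    for true_code, prior in PRIOR.items():
        p_listed = CONFUSION[true_code][listed]
        p_feature = ON_TRIPTAN[true_code] if on_triptan else 1.0 - ON_TRIPTAN[true_code]
        scores[true_code] = prior * p_listed * p_feature
    total = sum(scores.values())
    return {code: score / total for code, score in scores.items()}


with_drug = posterior_true_code("headache", on_triptan=True)
without_drug = posterior_true_code("headache", on_triptan=False)
print(with_drug)
print(without_drug)
```

Under these assumed numbers, a record listing "headache" becomes much more likely to be a miscoded migraine once the medication feature is taken into account. A learned model would estimate the error rates and feature likelihoods from data rather than fixing them by hand.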

These models and learning algorithms will allow CDW users to manage and monitor the uncertainty and error in the data. This in turn will allow fundamentally new types of analysis to be undertaken, which will result in the discovery of actionable medical knowledge that saves both lives and money. To make the models and algorithms accessible to medical professionals who may lack a computational or statistical background, the models and algorithms will be added to an open-source release of the widely used I2B2 CDW software.

The project is a collaboration between the Computer Science Department at Rice University and the School of Biomedical Informatics at the University of Texas Health Science Center at Houston. All project results will be made available online (www.cs.rice.edu/~cmj4/CDW.htm).

Agency: National Science Foundation (NSF)
Institute: Division of Information and Intelligent Systems (IIS)
Application #: 0964526
Program Officer: Maria Zemankova
Budget Start: 2010-09-01
Budget End: 2014-08-31
Fiscal Year: 2009
Total Cost: $600,000

Institution
Name: Rice University
City: Houston
State: TX
Country: United States
Zip Code: 77005