A real-time clinical repository contains a wealth of detailed information useful for clinical care, research, and administration. In their raw form, however, the data are difficult to use: there is too much volume and detail, and there are missing values and inaccuracies. Clinicians, researchers, and administrators require higher-level interpretations that address their questions. For example, a clinician may need to know whether a patient is at sufficient risk of having active tuberculosis to warrant respiratory isolation. The answer may be spread across the clinical repository in chest radiographs, laboratory tests, medication histories, vital signs, and physicians' notes. Translating these raw data into the interpretation (at risk or not) is a difficult and laborious task. The hypothesis of this proposal is that data mining techniques can be applied to a real-time clinical repository to discover knowledge and generate accurate clinical interpretations, and that these interpretations can be automated. The project differs from earlier machine learning studies in its emphasis on a real clinical repository and its use of natural language processing to supply coded clinical data.
The specific aims are:
(1) Select clinical domains--Several clinical domains with interesting, non-trivial clinical problems will be selected. Problems for which a gold standard answer can be or has been assembled for a retrospective cohort will be chosen.
(2) Prepare raw clinical data for mining--The raw data from a clinical repository will be transformed into a structure that facilitates data mining. The data will be flattened, pivoted, summarized, and mapped as needed for the domains. Narrative data will be coded using the MedLEE natural language processor. The preparation process will be automated.
(3) Use data mining algorithms to discover knowledge--Several data mining algorithms will be applied to the selected clinical domains. Algorithms will include decision tree generation, rule discovery, neural networks, nearest neighbor, logistic regression, and composite algorithms (for variable reduction). The algorithms will be trained on a training set for each domain, and their predictive accuracy will be measured and compared to each other and to expert-written rules. The performance of human experts writing rules using manual data mining visualization techniques (which do not require an explicit training set) will also be measured.
(4) Study the dependence of data mining on the training set--The performance of data mining algorithms depends on the data used to train them. The sensitivity of the algorithms to noise (inaccurate data), missing data, and training set size will be measured.
(5) Use the discovered knowledge to generate real-time interpretations--The output of the algorithms (decision tree, rules, neural network equation, or logistic regression equation, but not nearest neighbor), along with the necessary data preparation steps, will be encoded in Arden Syntax Medical Logic Modules. They will be run against the clinical repository to verify that the interpretation can be automated in real time.
(6) Disseminate the methods and results--The methods and results will be disseminated via publications and a Web site, and tools will be made available.
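The flattening, pivoting, and summarizing step in aim (2) can be illustrated with a minimal sketch. It is not the project's actual preparation pipeline: the event layout, the variable names (temperature_c, cxr_cavitary_lesion), the helper pivot_and_summarize, and the choice of max as the summary function are all hypothetical, chosen only to show how repeated raw observations might be turned into one row per patient with explicit missing values.

```python
from collections import defaultdict

# Hypothetical long-format repository rows: (patient_id, variable, value).
# The variable names and values are illustrative assumptions, not real data.
raw_events = [
    ("p1", "temperature_c", 38.9),
    ("p1", "temperature_c", 37.2),
    ("p1", "cxr_cavitary_lesion", 1),
    ("p2", "temperature_c", 36.8),
]


def pivot_and_summarize(events, variables, summarize=max):
    """Return one flat row per patient with one column per variable.

    Repeated measurements are collapsed with `summarize`; a variable that
    was never recorded for a patient is left as None (missing).
    """
    grouped = defaultdict(lambda: defaultdict(list))
    for patient_id, variable, value in events:
        grouped[patient_id][variable].append(value)

    table = {}
    for patient_id, by_variable in grouped.items():
        table[patient_id] = {
            v: summarize(by_variable[v]) if v in by_variable else None
            for v in variables
        }
    return table


flat = pivot_and_summarize(raw_events, ["temperature_c", "cxr_cavitary_lesion"])
```

A table of this shape is the kind of input that the algorithms in aim (3) typically expect, and the None entries make the missing-data question in aim (4) explicit rather than hidden.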

Agency: National Institutes of Health (NIH)
Institute: National Library of Medicine (NLM)
Type: Research Project (R01)
Project #: 5R01LM006910-02
Application #: 6391286
Study Section: Biomedical Library and Informatics Review Committee (BLR)
Program Officer: Sim, Hua-Chuan
Project Start: 2000-04-01
Project End: 2003-03-31
Budget Start: 2001-04-01
Budget End: 2002-03-31
Support Year: 2
Fiscal Year: 2001
Total Cost: $396,315
Indirect Cost:

Name: Columbia University (N.Y.)
Department: Miscellaneous
Type: Schools of Medicine
DUNS #: 167204994
City: New York
State: NY
Country: United States
Zip Code: 10032
