An unsolved problem in health informatics is how to apply the past experiences of patients, stored in large-scale medical records systems, to predict the outcomes of patients and to individualize care. One approach to prediction, heretofore impractical, is to rapidly find a patient cohort "similar enough" to an index case that the health experiences and outcomes of this cohort are informative for prediction. This task is formidable because of the large variability across the vast numbers of patient attributes, with the added complexity of sequences of patient encounters evolving over time. Epidemiological considerations such as confounding by indication for treatment also come into play. The objective of this research effort is to (1) create a modular test bed that uses a "big data" systems architecture to support research in rapid individualized prediction of outcomes from large clinical repositories and (2) explore various approaches to making "pragmatic" near-term predictions of outcomes. Using the Department of Veterans Affairs' (VA) Informatics and Computing Infrastructure database (VINCI), a research database with records of tens of millions of patients, we will explore, in an elastic cloud environment, two synergistic strategies for rapidly finding a cohort of patients who are similar enough to an index patient to predict near-term treatment response and/or adverse effects: (1) use of temporal alignment of critical events, including use of gene sequence alignment methods to relax requirements for exact temporal matching; and (2) use of conceptual distance metrics to model the degree of content similarity of case records. The initial domain of application will be treatment of Type 2 diabetes. The approach will apply open source "big data" methodologies, including Hadoop and Accumulo, to store and filter "medical log" files.
The content of these "logs" will be processed by a combination of strategies, including conceptual markup of events using natural language processing tools, matching of event streams, and statistical data mining methods, to rapidly retrieve and identify patients who are sufficiently similar to an index case to be able to make personalized yet pragmatic clinical predictions of outcomes.
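
As an illustration of what relaxing exact temporal matching through sequence alignment could look like, the sketch below applies Needleman-Wunsch global alignment, a standard method from gene sequence analysis, to two ordered lists of clinical events. It is a minimal sketch under stated assumptions and not the project's implementation: the function name `align_events`, the event labels, and the match, mismatch, and gap scores are all hypothetical.

```python
from typing import List, Optional, Sequence, Tuple

Pair = Tuple[Optional[str], Optional[str]]


def align_events(
    a: Sequence[str],
    b: Sequence[str],
    match: int = 2,      # hypothetical reward for identical events
    mismatch: int = -1,  # hypothetical penalty for differing events
    gap: int = -1,       # hypothetical penalty for an unmatched event
) -> Tuple[int, List[Pair]]:
    """Globally align two event sequences (Needleman-Wunsch)."""
    n, m = len(a), len(b)
    # score[i][j] is the best score for aligning a[:i] with b[:j].
    score = [[0] * (m + 1) for _ in range(n + 1)]
    for i in range(1, n + 1):
        score[i][0] = i * gap
    for j in range(1, m + 1):
        score[0][j] = j * gap

    def sub(i: int, j: int) -> int:
        return match if a[i - 1] == b[j - 1] else mismatch

    for i in range(1, n + 1):
        for j in range(1, m + 1):
            score[i][j] = max(
                score[i - 1][j - 1] + sub(i, j),  # pair the two events
                score[i - 1][j] + gap,            # event in a left unmatched
                score[i][j - 1] + gap,            # event in b left unmatched
            )

    # Trace back from the bottom-right cell to recover one optimal alignment.
    pairs: List[Pair] = []
    i, j = n, m
    while i > 0 or j > 0:
        if i > 0 and j > 0 and score[i][j] == score[i - 1][j - 1] + sub(i, j):
            pairs.append((a[i - 1], b[j - 1]))
            i, j = i - 1, j - 1
        elif i > 0 and score[i][j] == score[i - 1][j] + gap:
            pairs.append((a[i - 1], None))
            i -= 1
        else:
            pairs.append((None, b[j - 1]))
            j -= 1
    pairs.reverse()
    return score[n][m], pairs


# Hypothetical event streams for an index patient and a candidate match.
index_case = ["dx_t2dm", "metformin", "a1c_high", "insulin"]
candidate = ["dx_t2dm", "metformin", "statin", "a1c_high", "insulin"]
total, alignment = align_events(index_case, candidate)
```

In this example the candidate's extra event is absorbed as a single gap rather than breaking the match, which is the sense in which alignment tolerates inexact ordering. A fuller treatment would also need to account for elapsed time between events and for conceptually related, rather than identical, events.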

Public Health Relevance

This proposal studies how to use the experience of past patients, stored in electronic medical records systems, to help clinicians make practical decisions on the care of complex patients with type 2 diabetes. The research applies methods adapted from Internet search engines and from studies of the human genome to determine what it means for one patient's disease experiences to be similar to and relevant to another's.

National Institutes of Health (NIH)
Research Project (R01)
Study Section
Special Emphasis Panel (ZRG1)
Program Officer
Lyster, Peter
University of Utah
Schools of Medicine
Salt Lake City
United States