An unsolved problem in health informatics is how to apply the past experiences of patients, stored in large-scale medical records systems, to predict patient outcomes and to individualize care. One approach to prediction, heretofore impractical, is to rapidly find a patient cohort similar enough to an index case that the health experiences and outcomes of this cohort are informative for prediction. This task is formidable because of the large variability across the vast number of patient attributes, with the added complexity that sequences of patient encounters evolve over time. Epidemiological considerations, such as confounding by indication for treatment, also come into play. The objective of this research effort is to (1) create a modular test bed that uses a big data systems architecture to support research in rapid individualized prediction of outcomes from large clinical repositories and (2) explore various approaches to making pragmatic near-term predictions of outcomes. Using the Department of Veterans Affairs' (VA) Informatics and Computing Infrastructure database (VINCI), a research database with records of tens of millions of patients, we will explore, in an elastic cloud environment, two synergistic strategies for rapidly finding a cohort of patients who are similar enough to an index patient to predict near-term treatment response and/or adverse effects: (1) temporal alignment of critical events, including the use of gene sequence alignment methods to relax requirements for exact temporal matching; and (2) conceptual distance metrics to model the degree of content similarity of case records. The initial domain of application will be treatment of Type 2 diabetes. The approach will apply open source big data methodologies, including Hadoop and Accumulo, to store and filter medical log files. The content of these logs will be processed by a combination of strategies, including conceptual markup of events using natural language processing tools, matching of event streams, and statistical data mining methods, to rapidly retrieve and identify patients who are sufficiently similar to an index case to support personalized yet pragmatic clinical predictions of outcomes.
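As an illustration only, the two strategies named in the abstract (sequence-style alignment of event streams and a conceptual distance between events) can be combined in a small sketch. Everything here is an assumption made for the example, not the project's method: the dot-separated event codes, the `concept_similarity` and `alignment_score` helpers, the gap penalty of -0.5, and the choice of Needleman-Wunsch global alignment as the representative gene sequence alignment technique.

```python
from typing import Callable, Sequence


def concept_similarity(a: str, b: str) -> float:
    """Toy conceptual similarity in [0, 1] (hypothetical coding scheme).

    Events are dot-separated hierarchical codes such as "rx.metformin".
    Similarity is the fraction of leading path segments the codes share.
    """
    pa, pb = a.split("."), b.split(".")
    shared = 0
    for x, y in zip(pa, pb):
        if x != y:
            break
        shared += 1
    return shared / max(len(pa), len(pb))


def alignment_score(
    s: Sequence[str],
    t: Sequence[str],
    sim: Callable[[str, str], float] = concept_similarity,
    gap: float = -0.5,  # assumed penalty for an event with no counterpart
) -> float:
    """Needleman-Wunsch global alignment score of two event sequences.

    Gaps let one record contain extra or missing events, which relaxes
    the need for exact event-by-event matching.
    """
    m = len(t)
    prev = [j * gap for j in range(m + 1)]
    for i in range(1, len(s) + 1):
        cur = [i * gap] + [0.0] * m
        for j in range(1, m + 1):
            # Map similarity from [0, 1] to a substitution score in [-1, 1].
            substitute = prev[j - 1] + (2.0 * sim(s[i - 1], t[j - 1]) - 1.0)
            cur[j] = max(substitute, prev[j] + gap, cur[j - 1] + gap)
        prev = cur
    return prev[m]


# Hypothetical, hand-written event streams (not real patient data).
index_case = ["dx.diabetes.t2", "rx.metformin", "lab.hba1c.high", "rx.insulin"]
candidate_a = ["dx.diabetes.t2", "rx.metformin", "visit.routine",
               "lab.hba1c.high", "rx.insulin"]
candidate_b = ["dx.hypertension", "rx.lisinopril", "lab.creatinine.normal"]

score_a = alignment_score(index_case, candidate_a)
score_b = alignment_score(index_case, candidate_b)
print(score_a, score_b)
```

In this toy data, `candidate_a` differs from the index case by one extra routine visit and so scores higher than `candidate_b`, whose events share only top-level categories with the index case. A real system would need clinically validated concept distances and a treatment of elapsed time between events, neither of which this sketch attempts.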

Public Health Relevance

This proposal studies how to use the experience of past patients, stored in electronic medical records systems, to help clinicians make practical decisions on the care of complex patients with type 2 diabetes. The research applies methods adapted from Internet search engines and from studies of the human genome to determine what it means for one patient's disease experiences to be similar to and relevant to another's.

National Institutes of Health (NIH)
National Institute of General Medical Sciences (NIGMS)
Research Project (R01)
Study Section
Special Emphasis Panel (ZRG1)
Program Officer
Marcus, Stephen
Medical University of South Carolina
Internal Medicine/Medicine
Schools of Medicine
United States
Frey, Lewis J (2018) Data integration strategies for predictive analytics in precision medicine. Per Med 15:543-551
Frey, Lewis J; Bernstam, Elmer V; Denny, Joshua C (2016) Precision medicine informatics. J Am Med Inform Assoc 23:668-70
Dunlea, Robert; Lenert, Leslie (2015) Understanding Patients' Preferences for Referrals to Specialists for an Asymptomatic Condition. Med Decis Making :
Frey, L J; Lenert, L; Lopez-Campos, G (2014) EHR Big Data Deep Phenotyping. Contribution of the IMIA Genomic Medicine Working Group. Yearb Med Inform 9:206-11
Lenert, Leslie; Dunlea, Robert; Del Fiol, Guilherme et al. (2014) A model to support shared decision making in electronic health records systems. Med Decis Making 34:987-95