Patterns extracted from Electronic Medical Records (EMRs) and other biomedical datasets can provide valuable feedback to a learning healthcare system, but our ability to find them is limited by certain manual steps. The dominant approach to finding the patterns uses supervised learning, where a computational algorithm searches for patterns among input variables (or features) that model an outcome variable (or label). This usually requires an expert to specify the learning task, construct input features, and prepare the outcome labels. This workflow has served us well for decades, but the dependence on human effort prevents it from scaling, and it misses the most informative patterns, which are almost by definition the ones that nobody anticipates. It is poorly suited to the emerging era of population-scale data, in which we can conceive of massive new undertakings such as surveilling for all emerging diseases, detecting all unanticipated medication effects, or inferring the complete clinical phenotype of all genetic variants. The approach of unsupervised feature learning overcomes these limitations by identifying meaningful patterns in massive, unlabeled datasets with little or no human involvement. While there is a large literature on feature creation, a new surge of interest in unsupervised methods is being driven by the recent development of deep learning, in which a compact hierarchy of expressive features is learned from large unlabeled datasets. In the domains of image and speech recognition, deep learning has produced features that meet or exceed (by as much as 70%) the previous state of the art on difficult standardized tasks. Unfortunately, the noisy, sparse, and irregular data typically found in an EMR is a poor substrate for deep learning. Our approach uses Gaussian process regression to convert such an irregular sequence of observations into a longitudinal probability density that is suitable for use with a deep architecture.
With this approach, we can learn continuous unsupervised features that capture the longitudinal structure of sparse and irregular observations. In our preliminary results, unsupervised features were as powerful (0.96 AUC) in an unanticipated classification task as gold-standard features engineered by an expert with full knowledge of the domain, the classification task, and the class labels. In this project we will learn unsupervised features for records of all individuals in our deidentified EMR image, for each of 100 laboratory tests and 200 medications of relevance to type 1 or type 2 diabetes. We will evaluate the features using three pattern recognition tasks that were unknown to the feature-learning algorithm: 1) an easy supervised classification task of distinguishing diabetics vs. nondiabetics, 2) a much more difficult task of distinguishing type 1 vs. type 2 diabetics, and 3) a genetic association task that considers the features as micro-phenotypes and measures their association with 29 different single nucleotide polymorphisms with known associations to type 1 or type 2 diabetes.
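
The Gaussian process regression step described above can be illustrated with a minimal sketch. It is not the project's implementation: it assumes a plain stationary squared-exponential kernel, hand-picked hyperparameters, and invented laboratory values, and the names `rbf_kernel` and `gp_posterior` are hypothetical. It shows only how a handful of irregularly timed measurements can be turned into a mean and standard deviation on a regular time grid, which is the kind of longitudinal density a downstream feature learner could consume.

```python
import numpy as np

def rbf_kernel(a, b, length_scale, variance):
    """Squared-exponential covariance between two 1-D arrays of times."""
    d = a[:, None] - b[None, :]
    return variance * np.exp(-0.5 * (d / length_scale) ** 2)

def gp_posterior(t_obs, y_obs, t_grid, length_scale=30.0, variance=400.0, noise=5.0):
    """Posterior mean and standard deviation of a GP on a regular grid.

    All hyperparameter defaults are illustrative assumptions, not fitted values.
    """
    mean0 = y_obs.mean()  # constant prior mean taken from the data
    K = rbf_kernel(t_obs, t_obs, length_scale, variance)
    K += noise ** 2 * np.eye(len(t_obs))  # measurement noise on the diagonal
    K_s = rbf_kernel(t_grid, t_obs, length_scale, variance)

    # Cholesky factorisation for a numerically stable solve
    L = np.linalg.cholesky(K)
    alpha = np.linalg.solve(L.T, np.linalg.solve(L, y_obs - mean0))
    mean = mean0 + K_s @ alpha

    v = np.linalg.solve(L, K_s.T)
    var = variance - np.sum(v ** 2, axis=0)
    return mean, np.sqrt(np.maximum(var, 0.0))

# Hypothetical, irregularly spaced lab measurements (days, arbitrary units)
t_obs = np.array([0.0, 3.0, 10.0, 45.0, 47.0, 120.0, 300.0])
y_obs = np.array([110.0, 118.0, 105.0, 160.0, 155.0, 130.0, 142.0])

# Regular 5-day grid over roughly one year
t_grid = np.arange(0.0, 361.0, 5.0)
mean, std = gp_posterior(t_obs, y_obs, t_grid)
```

In this sketch the posterior standard deviation is small near observed time points and grows back toward the prior value in long gaps between measurements, so the regular-grid output carries the uncertainty caused by sparse sampling rather than hiding it.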

Public Health Relevance

Every piece of data generated in the course of medical care can provide crucial feedback to a learning healthcare system, but only if relevant patterns can be found among them. The current practice of using human experts to guide the pattern search has served us well for decades, but it is poorly suited to the emerging era of population-scale data, in which we may conceive of massive new undertakings such as detecting all emerging diseases or all unanticipated medication effects. This project will develop methods to mathematically identify such patterns at a large scale, with no need for human judgment to specify what patterns to look for or where to look for them.

National Institutes of Health (NIH)
National Library of Medicine (NLM)
Exploratory/Developmental Grants (R21)
Study Section: Biomedical Library and Informatics Review Committee (BLR)
Program Officer: Sim, Hua-Chuan
Vanderbilt University Medical Center
Internal Medicine/Medicine
Schools of Medicine
United States
Davis, Sharon E; Lasko, Thomas A; Chen, Guanhua et al. (2017) Calibration Drift Among Regression and Machine Learning Models for Hospital Mortality. AMIA Annu Symp Proc 2017:625-634
Lasko, Thomas A (2015) Nonstationary Gaussian Process Regression for Evaluating Clinical Laboratory Test Sampling Strategies. Proc Conf AAAI Artif Intell 2015:1777-1783
Lasko, Thomas A (2014) Efficient Inference of Gaussian-Process-Modulated Renewal Processes with Application to Medical Event Data. Uncertain Artif Intell 2014:469-476