Semi-supervised learning with electronic medical records

Gronsbell, Jessica

Abstract

The implementation of electronic medical record (EMR) systems in routine healthcare has resulted in a rich and inexpensive source of data for translational research. When linked with specimen biobanks, these extensive databases offer a unique opportunity to accelerate the goals of disease genomics, as they contain large amounts of detailed clinical and genetic data collected for the purposes of medical care [4; 6; 7; 8; 9]. However, statistical methods for analyzing EMR data are limited and are thus the focus of this proposal. In particular, extracting accurate disease phenotype information is a major challenge impeding EMR-based research [10]. Currently, ICD-9 codes are used to confirm the presence or absence of a disease in cohorts derived from EMRs. These codes are extremely variable and therefore have a significant impact on the statistical power of genetic studies [11; 12]. An alternative approach is to develop a highly accurate algorithm to classify disease status. However, because of the laborious medical record review required to obtain validated phenotype information for classifier estimation, phenotypes are available only for a small training set nested in a large cohort. In contrast, predictors of phenotype are available for all observations. To improve accuracy and efficiency in model estimation and evaluation, it is therefore of interest to develop semi-supervised learning (SSL) methods that utilize the so-called unlabeled data, that is, observations without confirmed phenotype status, in addition to the labeled training set. Although a large body of literature on SSL exists, nearly all methods concern the estimation of classifiers or prediction rules when the labeled training set is a simple random sample from the large unlabeled data set [13; 14; 15; 16; 17; 18; 19; 20; 21]. Despite the practical importance of evaluating the prediction performance of an estimated model, no SSL procedures currently exist to improve the estimation of model performance parameters.
Additionally, the simple random sampling assumption is restrictive, and the development of semi-supervised (SS) methods that accommodate more flexible sampling schemes in the context of both model estimation and evaluation is needed. In this proposal, these limitations are addressed through formulation of an efficient method to estimate various prediction performance measures, including the ROC curve, within the traditional SS framework of simple random sampling. The stratified random sampling design in the SS setting is also considered, and methods to estimate a classifier and its accuracy are developed. These procedures will be applied to EMR-based studies of bipolar disorder and depression. The success of this work will thus improve efficiency in analyzing EMR data and expedite the use of EMRs in clinical and genetic research in neuropsychiatry.

Public Health Relevance

The use of electronic medical records (EMRs) in routine healthcare has generated a rich source of data for in-depth study of disease risk factors. However, EMR data typically consist of a very small number of observations with information on disease status, which is expensive to obtain, and a large amount of automatically extracted data on risk factors such as laboratory results and previous health history. Statistical methods that accommodate this data structure are limited and are thus the focus of this proposal.