Extensive Electronic Medical Records (EMRs) combined with large-scale genotyping create an important opportunity for a new type of genetic association study. Phenotype information in EMRs consists primarily of billing codes (ICD9 codes). The main difficulty in using EMR data to represent relevant phenotypes is identifying the relationships among these ICD9 codes. Manual grouping of the codes has been used, but this approach is prone to bias, and a data-driven method is preferable. In this project we will apply logistic regression with elastic net regularization (the Elastic Net method) to EMR data to define natural and statistically significant groups of ICD9 codes associated with a given genotype difference. The Elastic Net method deals efficiently with large, high-dimensional datasets in which the variables may be highly correlated, which is exactly the case in EMR data. Our primary hypothesis is that the elastic net analysis will identify natural groups of related ICD9 codes that are statistically significantly over- or under-represented between two groups of patients defined by genetic characteristics.
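For concreteness, one common way to write an elastic-net-penalized logistic regression is given below. The notation is an assumption made for illustration and is not taken from the project description: y_i is the binary genotype-group label for patient i, x_i is that patient's vector of ICD9 code indicators, lambda sets the overall penalty strength, and alpha sets the mix between the lasso (L1) and ridge (L2) penalties.

```latex
\[
(\hat{\beta}_0, \hat{\beta})
  = \arg\min_{\beta_0,\,\beta}\;
    -\frac{1}{n}\sum_{i=1}^{n}
      \Bigl[\, y_i\,\eta_i - \log\bigl(1 + e^{\eta_i}\bigr) \Bigr]
    + \lambda \Bigl[\, \alpha\,\lVert\beta\rVert_1
      + \tfrac{1-\alpha}{2}\,\lVert\beta\rVert_2^2 \Bigr],
\qquad
\eta_i = \beta_0 + x_i^{\top}\beta .
\]
```

The L1 term drives the coefficients of uninformative codes to exactly zero, while the L2 term tends to spread weight across correlated codes rather than keeping only one of them, which is the property that makes the penalty a natural fit for grouping related codes.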
Aim 1 of this project is to develop and test an Elastic Net method for the analysis of genetic studies in EMR data. To define the genotype sets, we have chosen a set of approximately 275 genes that are important to mitochondrial function. Using a panel of pathogenicity prediction methods, we have defined a set of over 600 predicted pathogenic nonsynonymous variants in these genes that we have found to occur in one or more individuals within the de-identified Vanderbilt EMR (BioVU). We will classify the study subjects as "carriers" or "non-carriers" of any of these predicted pathogenic nonsynonymous variants. The hypothesis for our second aim is that patients with rare variants predicted to be pathogenic in a set of fundamental genes for mitochondrial proteins will have a complex set of phenotypes that can be identified by the elastic net method.
In Aim 2 of this project, the Elastic Net method will be used to identify sets of ICD9 codes whose frequencies differ significantly between carriers of the mitochondrial protein variants and non-carriers.
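The carrier versus non-carrier comparison can be illustrated with a minimal sketch on simulated data. Everything here is an assumption made for illustration: the cohort is synthetic, the code names (`code_0` to `code_5`) are placeholders rather than real ICD9 codes, and the fitting routine is a plain proximal-gradient solver with hypothetical settings (`lam`, `alpha`, `step`), not the project's actual pipeline. It covers only the selection step and does not assess statistical significance.

```python
import math
import random


def sigmoid(z):
    """Numerically stable logistic function."""
    if z >= 0:
        return 1.0 / (1.0 + math.exp(-z))
    e = math.exp(z)
    return e / (1.0 + e)


def soft_threshold(value, threshold):
    """Proximal operator of the L1 penalty: shrink toward zero, clip at zero."""
    if value > threshold:
        return value - threshold
    if value < -threshold:
        return value + threshold
    return 0.0


def fit_elastic_net_logistic(X, y, lam=0.05, alpha=0.5, step=0.5, n_iter=400):
    """Minimise mean log-loss + lam * (alpha * L1 + (1 - alpha) / 2 * L2^2)
    by proximal gradient descent. The intercept is not penalised.
    step=0.5 is an assumed value suited to a handful of 0/1 features."""
    n, p = len(X), len(X[0])
    beta = [0.0] * p
    intercept = 0.0
    for _ in range(n_iter):
        grad = [0.0] * p
        grad0 = 0.0
        for xi, yi in zip(X, y):
            eta = intercept + sum(b * v for b, v in zip(beta, xi))
            resid = sigmoid(eta) - yi
            grad0 += resid
            for j, v in enumerate(xi):
                if v:
                    grad[j] += resid * v
        intercept -= step * grad0 / n
        for j in range(p):
            # Gradient step on the smooth part (log-loss plus ridge term),
            # then soft-thresholding to apply the L1 part of the penalty.
            g = grad[j] / n + lam * (1.0 - alpha) * beta[j]
            beta[j] = soft_threshold(beta[j] - step * g, step * lam * alpha)
    return intercept, beta


def simulate_patients(n, rng):
    """Toy cohort: codes 0 and 1 both track a latent condition that is more
    common in carriers; codes 2-5 are unrelated to carrier status."""
    X, y = [], []
    for _ in range(n):
        carrier = 1 if rng.random() < 0.5 else 0
        latent = rng.random() < (0.6 if carrier else 0.1)
        row = [1 if rng.random() < (0.9 if latent else 0.05) else 0
               for _ in range(2)]
        row += [1 if rng.random() < 0.3 else 0 for _ in range(4)]
        X.append(row)
        y.append(carrier)
    return X, y


rng = random.Random(0)
code_names = ["code_0", "code_1", "code_2", "code_3", "code_4", "code_5"]
X, y = simulate_patients(300, rng)
intercept, beta = fit_elastic_net_logistic(X, y)

# Codes with nonzero coefficients are the ones the penalised model retains.
selected = {name: round(b, 3) for name, b in zip(code_names, beta) if b != 0.0}
print(selected)
```

In this toy setting the two correlated codes should both receive positive coefficients, while the unrelated codes are shrunk toward or to zero. A real analysis would more likely rely on an established implementation (for example glmnet, or scikit-learn's logistic regression with an elastic net penalty), choose the penalty strength by cross-validation, and add a separate procedure for assessing significance.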
Electronic Medical Records are potentially a major source of insight into the effects of genetics on health. However, understanding the complex patterns of medical conditions documented in these records requires modern statistical methods developed in the past few years. This project will combine these new statistical methods with methods for predicting the severity of protein changes to discover how rare genetic variants in genes for proteins of the mitochondria (the cellular energy source) affect health.