The Million Veteran Program (MVP) is currently the largest biobank study in the world. The resource provides an unprecedented opportunity to identify the genetic causes of a variety of human diseases that disproportionately affect our veterans, including diseases of the neurological, cardiovascular, pulmonary, gastrointestinal, endocrine, and musculoskeletal systems. Fast-paced technological progress over the last 10 years now allows us to reliably and densely profile individuals across their entire genome. Such data have already been generated and linked to a wide spectrum of human diseases and physiologic traits. However, many more links remain to be made, which will provide the scientific community with additional important clues to the root causes of many life-threatening diseases as well as valuable insights into how to develop new drugs to treat or prevent these same diseases. The current challenge in making these additional discoveries is no longer the generation of high-quality genetic data in large numbers but rather the organization and querying of the very large and complex electronic health records (EHR) being leveraged by these large biobank studies. Until now, much effort and time have been expended to painstakingly develop and validate rules-based definitions to identify individuals with a specific disease, syndrome, or state across a variety of EHR platforms. However, the recent mapping of the VA Corporate Data Warehouse to the Observational Medical Outcomes Partnership common data model (OMOP-CDM) provides us with unprecedented opportunities to apply new "electronic phenotyping" tools that can identify individuals with a specific disease, syndrome, or state in a much more efficient manner than rules-based methods.
The goal of this proposal is to comprehensively test the ability of one of these new tools, named APHRODITE (Automated PHenotype Routine for Observational Definition, Identification, Training and Evaluation), to identify established genetic links among MVP participants. APHRODITE was developed at Stanford by one of our co-investigators and uses state-of-the-art machine learning algorithms to identify individuals with a condition in a fraction of the time it takes to identify them through rules-based definitions. The algorithm has shown great promise within the Stanford clinical data warehouse but requires validation in other EHR cohorts.
In aim 1, we will compare the accuracy of an APHRODITE classifier with that of a rules-based classifier for at least 5 diseases using gold-standard sets in the VA.
In aim 2, we will test whether APHRODITE classifiers from aim 1 can be applied to MVP participants to replicate established genetic associations. If automated methods in APHRODITE perform as well as or better than rules-based methods for multiple diseases, automated methods may be leveraged for phenotypes where rules-based methods may not exist, maximizing the efficiency of genetic discovery in MVP and facilitating rapid replication of MVP findings in other EHRs mapped to the OMOP-CDM.

Public Health Relevance

Inherited differences in our DNA play an important role in the development of nearly all human diseases. Linking these differences to diseases has recently been greatly facilitated by large studies of humans with electronic health records and genetic profiling. In this proposal, we will test the ability of a new computer algorithm named APHRODITE to efficiently identify individuals with a disease within the Million Veteran Program and to link them to inherited changes in the DNA that are known to predispose to the same disease.

National Institutes of Health (NIH)
Veterans Affairs (VA)
Non-HHS Research Projects (I01)
Study Section
Special Initiatives - MVP Projects (SPLM)
Veterans Admin Palo Alto Health Care Sys
Palo Alto
United States