Today's healthcare infrastructure supports the production and storage of clinical data on a massive scale. A central goal in clinical informatics is to leverage these data to improve our understanding of health and disease. However, a major challenge is the paucity of reliable disease labels in observational data. Disease phenotypes address this issue by summarizing the characteristics of specific diseases in terms of commonly observed clinical variables. Classically, disease phenotypes are engineered via a manual, expert-driven approach that does not scale to large numbers of diseases. Data-driven methods for disease phenotyping aim to obtain large numbers of disease phenotypes by directly modeling large-scale observational clinical data. Such high-throughput methods may scale, but generally cannot guarantee identifiability; that is, inferred phenotypes are not guaranteed to map to specific diseases. In addition, data-driven disease phenotyping methods generally model phenotypes independently, with no effort to capture relationships among diseases that would be consistent with our understanding of comorbidities, disease progression trends, and disease type/subtype relationships. The long-term goal of the proposed research is to support large-scale analysis of observational clinical data by introducing a family of closely related models for high-throughput disease phenotyping that resolve the issue of identifiability and model relationships among diseases. My work is inspired by UPhenome, an unsupervised probabilistic graphical model for high-throughput phenotyping. My objective is to derive, implement, validate, and disseminate UPhenome-based models that will (1) process both biomedical knowledge and clinical data to yield identifiable phenotypes and (2) model co-occurrence, temporal, and hierarchical relationships among inferred phenotypes.
My central hypothesis is that UPhenome-based models can support large-scale clinical data analysis by inferring phenotypes that effectively represent the clinical characteristics of specific diseases while also capturing common comorbidities (co-occurrence model), capturing patterns of disease progression (temporal model), and organizing diseases into types and subtypes (hierarchical model). To test this hypothesis, I propose the following aims.
Aim 1: I describe Guided UPhenome, a model that processes biomedical knowledge and clinical data to yield identifiable phenotypes. The model's capacity for capturing disease-specific traits is evaluated qualitatively by clinical experts, and quantitatively in disease-specific cohort selection tasks against a gold standard and a competing algorithm.
Aim 2: I detail extensions to UPhenome that allow for modeling of disease relationships. The meaningfulness of these relationships is evaluated qualitatively using a series of custom "intrusion tasks" inspired by the topic modeling literature.
Aim 3: I will disseminate UPhenome-based models by ensuring their compatibility with the Observational Medical Outcomes Partnership (OMOP) common data model and by promoting their adoption within the Observational Health Data Sciences and Informatics (OHDSI) community.

Public Health Relevance

The high-throughput phenotyping methods I describe in this proposal will serve as tools for exploring disease comorbidities, patterns of disease progression, and disease subtypes in observational clinical data. When adopted and applied broadly across a network of clinical institutions, such as the Observational Health Data Sciences and Informatics (OHDSI) collaborative, these methods could support disease-oriented analysis of a population spanning the breadth of the nation. Such large-scale analysis could yield critical insights into patterns of disease that would inform the study and understanding of public health.

Agency: National Institutes of Health (NIH)
Institute: National Library of Medicine (NLM)
Type: Predoctoral Individual National Research Service Award (F31)
Project #: 1F31LM012894-01
Application #: 9541096
Study Section: Special Emphasis Panel (ZLM1)
Program Officer: Sim, Hua-Chuan
Project Start: 2018-07-01
Project End: 2023-06-30
Budget Start: 2018-07-01
Budget End: 2019-06-30
Support Year: 1
Fiscal Year: 2018
Total Cost:
Indirect Cost:

Name: Columbia University (N.Y.)
Department: Internal Medicine/Medicine
Type: Schools of Medicine
DUNS #: 621889815
City: New York
State: NY
Country: United States
Zip Code: 10032