Big Data Methods for Comprehensive Similarity based Risk Prediction

Kiryluk, Krzysztof; Wang, Shuang; Weng, Chunhua

Abstract

Electronic health records (EHRs) provide a rich source of data about representative populations but have yet to be fully utilized to enhance clinical decision-making. Conventional approaches to clinical decision-making start with the identification of relevant biomarkers based on subject-matter knowledge, followed by detailed but limited analysis using these biomarkers exclusively. As the current scientific literature indicates, many human disorders share a complex etiological basis and exhibit correlated disease progression. Therefore, it is desirable to use comprehensive patient data to assess patient similarity. This proposal focuses on deriving a comprehensive and integrated score of patient similarity from the complete patient characteristics currently available, including but not limited to 1) demographic similarity; 2) genetic similarity; 3) clinical phenotype similarity; 4) treatment similarity; and 5) exposome similarity (here the exposome is defined as all available attributes of the living environment to which an individual is exposed), where some of these aspects may overlap and interact. We will optimize information fusion and task-dependent feature selection for assessing patient similarity for clinical risk prediction. Because no pipeline currently exists that can extract complete, executable patient determinant data, we propose to first deliver an open-source data preparation pipeline based on a widely used clinical data standard, the OMOP (Observational Medical Outcomes Partnership) Common Data Model (CDM) version 5.2. Moreover, to mitigate the common missingness and sparsity challenges in clinical data, we describe the first attempt to represent patients' sparse clinical information with missingness, including diagnosis information, medication data, and treatment interventions, as a fixed-length feature vector (i.e., Patient2Vec). This project has four specific aims.
Aim 1 is to develop a clinical data processing pipeline for harmonizing patient information from multiple sources into a standards-based uniform data representation and to evaluate its efficiency, interoperability, and accuracy.
Aim 2 is to leverage a powerful machine learning technique, Document2Vec, from the natural language processing literature, to create an open-source Patient2Vec framework for the derivation of informative numerical representations of patients.
Aim 3 is to develop a unified machine learning clinical-outcome-prediction framework for Optimized Patient Similarity Fusion (OptPSF) that integrates traditional medical covariates with the derived numerical patient representations from Patient2Vec (Aim 2) for improved clinical risk prediction.
Aim 4 is to evaluate our similarity framework for predicting 1) the risk of end-stage kidney disease (ESKD) in the general EHR patient population and 2) the risk of death among patients with chronic kidney disease (CKD).
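
The idea behind Aims 2 through 4 (fuse several per-domain similarities into one score, then use the most similar patients to estimate risk) can be illustrated with a minimal sketch. This is not the OptPSF method itself: the function names, the three domains, the hand-picked weights, and the similarity-weighted nearest-neighbor estimate are all assumptions made for illustration, and the short `embedding` lists merely stand in for a Patient2Vec-style fixed-length vector. The proposal aims to optimize the fusion per task, whereas the weights here are fixed by hand.

```python
from math import sqrt


def jaccard(a, b):
    """Jaccard similarity of two code sets (1.0 when both are empty)."""
    a, b = set(a), set(b)
    if not a and not b:
        return 1.0
    return len(a & b) / len(a | b)


def cosine(u, v):
    """Cosine similarity of two equal-length vectors (0.0 if either is all zeros)."""
    dot = sum(x * y for x, y in zip(u, v))
    norm_u = sqrt(sum(x * x for x in u))
    norm_v = sqrt(sum(y * y for y in v))
    if norm_u == 0.0 or norm_v == 0.0:
        return 0.0
    return dot / (norm_u * norm_v)


def fused_similarity(p, q, weights):
    """Weighted average of per-domain similarities (hypothetical fusion rule)."""
    parts = {
        "diagnoses": jaccard(p["diagnoses"], q["diagnoses"]),
        "medications": jaccard(p["medications"], q["medications"]),
        # Clamp at zero so every component stays in [0, 1].
        "embedding": max(0.0, cosine(p["embedding"], q["embedding"])),
    }
    total = sum(weights[name] for name in parts)
    return sum(weights[name] * parts[name] for name in parts) / total


def knn_risk(target, cohort, weights, k=2):
    """Similarity-weighted mean outcome of the k most similar cohort patients."""
    if not cohort:
        raise ValueError("cohort must contain at least one patient")
    scored = sorted(
        ((fused_similarity(target, patient, weights), outcome)
         for patient, outcome in cohort),
        key=lambda pair: pair[0],
        reverse=True,
    )[:k]
    sim_sum = sum(sim for sim, _ in scored)
    if sim_sum == 0.0:
        # No neighbor is similar at all: fall back to the unweighted mean.
        return sum(outcome for _, outcome in scored) / len(scored)
    return sum(sim * outcome for sim, outcome in scored) / sim_sum


# Illustrative data only; codes and outcomes are made up for the example.
weights = {"diagnoses": 0.5, "medications": 0.25, "embedding": 0.25}
target = {"diagnoses": {"N18.3", "I10"}, "medications": {"lisinopril"},
          "embedding": [1.0, 0.0]}
cohort = [
    ({"diagnoses": {"N18.3", "I10"}, "medications": {"lisinopril"},
      "embedding": [1.0, 0.0]}, 1),
    ({"diagnoses": {"J45"}, "medications": {"albuterol"},
      "embedding": [0.0, 1.0]}, 0),
]
print(knn_risk(target, cohort, weights, k=1))
```

In this sketch, outcomes are coded 1 for patients who experienced the event and 0 otherwise, so the returned value can be read as a crude risk estimate between 0 and 1. Replacing the fixed weights with values learned against a specific outcome would be one way to approximate the task-dependent fusion the abstract describes.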

Public Health Relevance

The project focuses on developing a novel data science pipeline that includes a clinical data processing pipeline to format comprehensive patient health determinants from a variety of clinical, genomic, and socioenvironmental data sources, and a clinical-outcome-prediction framework that optimally fuses relevant patient health determinants to define patient similarity for improved clinical risk prediction.

Funding Agency

Agency: National Institutes of Health (NIH)
Institute: National Library of Medicine (NLM)
Type: Research Project (R01)
Project #: 5R01LM013061-02
Application #: 9870948
Study Section: Biomedical Library and Informatics Review Committee (BLR)
Program Officer: Sim, Hua-Chuan

Project Start: 2019-02-12
Project End: 2024-01-31
Budget Start: 2020-02-01
Budget End: 2021-01-31
Support Year: 2
Fiscal Year: 2020
Total Cost
Indirect Cost

Institution

Name: Columbia University (N.Y.)
Department: Biostatistics & Other Math Sci
Type: Schools of Public Health
DUNS #: 621889815

City: New York
State: NY
Country: United States
Zip Code: 10032

Related projects


NIH 2021 R01 LM	Big Data Methods for Comprehensive Similarity based Risk Prediction Kiryluk, Krzysztof; Wang, Shuang; Weng, Chunhua / Columbia University (N.Y.)
NIH 2020 R01 LM	Big Data Methods for Comprehensive Similarity based Risk Prediction Kiryluk, Krzysztof; Wang, Shuang; Weng, Chunhua / Columbia University (N.Y.)
NIH 2019 R01 LM	Big Data Methods for Comprehensive Similarity based Risk Prediction Kiryluk, Krzysztof; Wang, Shuang; Weng, Chunhua / Columbia University (N.Y.)
