Electronic health records (EHR) provide rich source of data about representative populations and are yet to be fully utilized to enhance clinical decision-making. Conventional approaches in clinical decision-making start with the identification of relevant biomarkers based on subject-matter knowledge, followed by detailed but limited analysis using these biomarkers exclusively. As the current scientific literature indicates, many human disorders share a complex etiological basis and exhibit correlated disease progression. Therefore, it is desirable to use comprehensive patient data for patient similarity. This proposal focuses on deriving a comprehensive and integrated score of patient similarity from complete patient characteristics currently available, including but not limited to 1) demographic similarity; 2) genetic similarity; 3) clinical phenotype similarity; 4) treatment similarity; and 5) exposome similarity (here exposome defined as all available attributes of the living environment an individual is exposed to), when some of the aspects may overlap and interact. We will optimize information fusion and task-dependent feature selection for assessing patient similarity for clinical risk prediction. Since currently there does not exist a pipeline that is able to extract executable complete patient determinant data, to achieve the research goal described above, we propose first deliver an open- source data preparation pipeline that is based on a widely used clinical data standard, the OMOP (Observational Medical Outcomes Partnership) Common Data Model (CMD) version 5.2. Moreover, to mitigate common missingness and sparsity challenges in clinical data, we describe the first attempt to represent patients' sparse clinical information with missingness, including diagnosis information, medication data, treatment intervention, with a fixed-length feature vector (i.e. the Patient2Vec). This project has four specific aims.
Aim 1 is to develop a clinical data processing pipeline for harmonizing patient information from multiple sources into a standards-based uniformed data representation and to evaluate its efficiency, interoperability, and accuracy.
Aim 2 is to leverage a powerful machine learning technique, Document2Vec, from the natural language processing literature, to create an open-source Patient2Vec framework for the derivation of informative numerical representations of patients.
Aim 3 is to develop a unified machine learning clinical- outcome-prediction framework for Optimized Patient Similarity Fusion (OptPSF) that integrates traditional medical covariates with the derived numerical patient representations from Patient2Vec (Aim 2) for improved clinical risk prediction.
Aim 4 is to evaluate our similarity framework for predicting 1) the risk of end-stage kidney disease (ESKD) in general EHR patient population and 2) the risk of death among patients with chronic kidney disease (CKD).

Public Health Relevance

The project focus on developing a novel data science pipeline which includes a clinical data processing pipeline to format comprehensive patient health determinants from a variety of sources of clinical, genomic, socioenvironmental data, and a clinical-outcome-prediction framework that optimally fuses relevant patient health determinants to define patient similarity for improved clinical risk predictions.

National Institute of Health (NIH)
National Library of Medicine (NLM)
Research Project (R01)
Project #
Application #
Study Section
Biomedical Library and Informatics Review Committee (BLR)
Program Officer
Sim, Hua-Chuan
Project Start
Project End
Budget Start
Budget End
Support Year
Fiscal Year
Total Cost
Indirect Cost
Columbia University (N.Y.)
Biostatistics & Other Math Sci
Schools of Public Health
New York
United States
Zip Code