This project will introduce new paradigms for dealing with missing values in electronic health record (EHR) data, with the objective of developing novel approaches for early diagnosis of diastolic ventricular dysfunction, a silent disease responsible for one-third of all heart failure-related deaths worldwide. EHRs are often messy and suffer from missing data for various reasons; for example, clinical exams become more frequent after the first symptoms of a disease appear and are less frequent during routine screening. Missing data often limit the ability to extract useful information from these sources (e.g., for early diagnosis). The goal of this project is to leverage the fact that missing information sometimes satisfies mathematical or physical principles to develop innovative model-based imputation approaches, combining numerical models and efficient privacy-preserving learning techniques in large EHR datasets. Computationally efficient algorithms will be developed to train numerical models while preserving patient privacy, and the feasibility and practical usefulness of these approaches will be demonstrated at a scale that has not yet been addressed in the literature. The approaches for predictive numerical models developed for this project can be applied broadly in various fields. Additional project goals include development of infrastructure for research and education through freely available, open-source software libraries. This project will also provide invaluable multi-disciplinary skills to undergraduate and graduate students. Both research and outreach efforts focus on increasing the participation of women, people with disabilities, and underrepresented groups.
The team will develop novel regularization approaches through numerical models, i.e., optimally trained models able to suggest distributions of missing data based on the underlying physics. For EHRs characterizing cardiovascular function, lumped parameter hemodynamic models offer an ideal regularizer. Parameter estimation for these models using Markov chain Monte Carlo is computationally expensive and therefore incompatible with fast application to large EHR collections. Additionally, optimally trained numerical models of the cardiovascular system can be thought of as a type of query, raising issues of patient privacy. The proposed research tackles these issues through: (1) Acquisition and analysis of a large heart failure EHR dataset. (2) Development of privacy-preserving variational inference for hemodynamic models, enhanced using homotopy-based optimization. (3) Implementation and extensive testing of novel imputation approaches for missing data, combining uncertainty quantification and numerical models. (4) Demonstration on a large patient cohort.
This award reflects NSF's statutory mission and has been deemed worthy of support through evaluation using the Foundation's intellectual merit and broader impacts review criteria.