There is a growing awareness of the need for multi-center clinical databases and multi-institutional analyses of healthcare data to ensure the reproducibility and generalizability of research findings. Algorithms developed on a single-institution database are prone to three distinct problems. First, in the context of Big Data science, the number of variables relative to the size of any single database makes it difficult to develop complex predictors without overfitting, while more traditional learning algorithms may yield over-simplified models that fail to capture important influences or interactions among different types of healthcare information. Second, training and testing predictive models on a single database can lead to learning noise, or other irrelevant local practices or differences in definitions that are correlated with, but not causally related to, the outcome in question; the resulting models do not transfer to other institutions, and fail in the future when practices or the environment change. Third, sharing data between institutions, and in particular across borders, is extremely problematic because of trust, legal issues, privacy concerns, and national policies. The significance of solving these issues is threefold: 1) it would allow the creation of strong, generalizable data science models that leverage enormous pools of data from around the world; 2) it would allow the identification of rare diseases or patient types, which become less rare as databases are pooled; and 3) perhaps most importantly, it would allow the free exchange of data science models and generalized approaches to solving medical problems in the cloud.

This project aims to develop a set of distributed deep learning and cloud computation techniques for cross-institution and cross-border machine learning on health and medical data, without protected health information ever leaving the generating institution. The goals are to create demonstration programs that illustrate feasibility and to open-source the architecture. The scope of this project encompasses the broad set of machine learning tasks that multiple institutions may want to apply to their healthcare data in the cloud, as well as the technical issues surrounding transfer learning of knowledge across domains (e.g., institutions/demographics) and tasks (e.g., types of classification and prediction problems). The project has three specific aims: 1) develop a cloud-based infrastructure that preserves regional autonomy of data but allows the sharing of parameters of the partially trained deep neural network (including weights and hyperparameters) between regions, enabling transfer learning across domains and tasks; 2) develop a standardized coded model for deep learning approaches in medical applications; and 3) evaluate the effect of training and testing the model across multiple centers and national boundaries, by measuring the performance improvement from cross-institutional training without loss of privacy protection, using sensitivity, specificity, positive predictive value, area under the receiver operating characteristic (ROC) curve, and model calibration as metrics. Aims 1-3 will be achieved by taking four databases (a database of intensive care unit patients with sepsis, a free-text corpus of nursing progress notes, voice recordings from a public corpus classically used for speaker identification, and a public database of full-face images used for classification of facial expressions), placing them in the cloud (Google Cloud, AWS, and Azure) at different geopolitical locations (namely the US and Europe), and developing a distributed deep learning architecture that learns to improve its performance by sharing weights across borders, but not sensitive patient data. This project has the potential to make several contributions to the field. First, it will demonstrate that medical data across geopolitical boundaries can be made available in an interoperable manner (using the FHIR standard) and used to train deep learning algorithms in a privacy-preserving manner, addressing both the concerns of the Health Insurance Portability and Accountability Act (HIPAA) and interoperability. Second, it will provide open-source deep learning algorithms for several medical datasets and data types that can be used across institutions to solve similar problems with some fine-tuning (e.g., via transfer learning). Third, it will provide a set of open-source meta-algorithms for transfer learning (across domains and tasks), implemented in the cloud as Docker containers that can be downloaded for local use or moved across the different cloud vendors.
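The weight-sharing mechanism of Aim 1 is closely related to federated averaging: each institution trains on its own records, and only model parameters cross the border. Below is a minimal sketch of one such training round, assuming PyTorch; SepsisNet, train_locally, federated_round, and the site loaders are illustrative placeholders, not the project's actual code.

```python
# Minimal sketch of cross-site weight sharing in the spirit of Aim 1:
# each institution trains on its own private data, and only model weights
# (never patient records) are exchanged and averaged. All names here are
# hypothetical stand-ins for the project's components.
import copy
import torch
import torch.nn as nn

class SepsisNet(nn.Module):
    """Toy classifier standing in for the project's deep models."""
    def __init__(self, n_features: int = 32):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(n_features, 64), nn.ReLU(), nn.Linear(64, 1)
        )

    def forward(self, x):
        return self.net(x)

def train_locally(model, loader, epochs=1):
    """Runs entirely inside one institution; data never leaves the site.
    Assumes the loader yields (features, float labels) batches."""
    opt = torch.optim.Adam(model.parameters(), lr=1e-3)
    loss_fn = nn.BCEWithLogitsLoss()
    for _ in range(epochs):
        for x, y in loader:
            opt.zero_grad()
            loss = loss_fn(model(x).squeeze(-1), y)
            loss.backward()
            opt.step()
    return model.state_dict()  # only parameters are shared upstream

def federated_average(state_dicts):
    """Coordinator-side step: average parameter tensors from all sites."""
    avg = copy.deepcopy(state_dicts[0])
    for key in avg:
        avg[key] = torch.stack([sd[key].float() for sd in state_dicts]).mean(0)
    return avg

def federated_round(global_model, site_loaders):
    """One round: broadcast global weights, train locally at each site,
    then aggregate the returned weights into the shared model."""
    local_states = []
    for loader in site_loaders:
        site_model = copy.deepcopy(global_model)
        local_states.append(train_locally(site_model, loader))
    global_model.load_state_dict(federated_average(local_states))
    return global_model
```

In this sketch only state_dict tensors leave a site; the raw records stay behind each institution's firewall, and a production system would likely layer secure aggregation or differential privacy on top of the exchange. The cross-task fine-tuning mentioned in the contributions can be sketched in the same vein: keep the federated-trained feature layers and retrain only a task-specific head (the layer indices below are specific to the toy model above).

```python
def fine_tune_for_new_task(global_model: SepsisNet, n_out: int = 1):
    """Reuse federated-trained features at a new institution or task:
    freeze the shared feature layer, retrain a fresh output head."""
    for p in global_model.net[0].parameters():  # shared feature layer
        p.requires_grad = False
    global_model.net[2] = nn.Linear(64, n_out)  # new task-specific head
    return global_model
```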

This award reflects NSF's statutory mission and has been deemed worthy of support through evaluation using the Foundation's intellectual merit and broader impacts review criteria.

Agency: National Science Foundation (NSF)
Institute: Division of Advanced Cyberinfrastructure (ACI)
Type: Standard Grant
Application #: 1822378
Program Officer: Alejandro Suarez
Budget Start: 2018-03-15
Budget End: 2021-02-28
Fiscal Year: 2018
Total Cost: $300,000
Name: Emory University
City: Atlanta
State: GA
Country: United States
Zip Code: 30322