Large, incomplete datasets create major challenges for statistical prediction in research. This project will develop a data curation service that manages large, incomplete, and heterogeneous datasets and provides uncertainty measures for the curated data. The project identifies and collaborates with several communities in which such a data service is central to scientific research, including civil engineering, building science, urban energy, and social science.
The effort creates a parallel data curation service, provides uncertainty measures for the curated data, and develops supplementary imputation algorithms. The team is developing a data curation platform with imputation for incomplete, heterogeneous data; robust machine learning (ML) and statistical predictions will be enabled by an easy-to-use, general-purpose imputation program that scales to large data. The focus is on a novel combination of three established imputation methods: two-level finite mixture model-based imputation (FMMI), fractional hot deck imputation (FHDI), and Gaussian mixture model-based imputation (GMMI), with parallel implementations in R provided for each.
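To make the hot deck idea concrete, the toy sketch below imputes each missing value from several observed "donor" values, each carrying a fractional weight. This is a minimal single-variable illustration in Python rather than the project's R implementations; the function name and setup are assumptions for illustration, and real FHDI additionally forms imputation cells and preserves joint distributions.

```python
import numpy as np

def fractional_hot_deck(y, m=3, seed=0):
    """Toy fractional hot deck imputation for one variable.

    Each missing entry receives m donor draws from the observed
    values, each with fractional weight 1/m; the stored imputed
    value is the weighted (here, simple) mean of the donors.
    """
    rng = np.random.default_rng(seed)
    y = np.asarray(y, dtype=float)
    donors = y[~np.isnan(y)]          # observed values act as donors
    imputed = y.copy()
    for i in np.flatnonzero(np.isnan(y)):
        draws = rng.choice(donors, size=m, replace=True)
        imputed[i] = draws.mean()     # fractional weights of 1/m each
    return imputed
```

Because each recipient keeps several donors rather than a single draw, fractional imputation reduces the extra variance that single-donor hot deck methods introduce, while still imputing only observed, plausible values.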
This award by the NSF Office of Advanced Cyberinfrastructure is jointly funded by the Established Program to Stimulate Competitive Research (EPSCoR).
This award reflects NSF's statutory mission and has been deemed worthy of support through evaluation using the Foundation's intellectual merit and broader impacts review criteria.