The rise of the Internet, smartphones, and wireless sensors has resulted in a vast trove of data about all aspects of our lives, from our social interactions to our personal preferences to our vital signs and medical records. Increasingly, "data science" teams want to collaboratively analyze these datasets, to understand trends and to extract actionable business, scientific, or social insights. Unfortunately, while tools exist to support data analysis, much-needed underlying infrastructure and data management capabilities are missing. To this end, "DataHub", a collaborative platform for cleaning, storing, understanding, sharing, and publishing datasets, will be developed. DataHub will be a publicly accessible platform that will host private user datasets as well as public datasets retrieved from online sources. DataHub will serve as the common substrate for data science, freeing end users from tedious dataset bookkeeping tasks and instead supporting them in their search for useful insights. DataHub will be deployed on a large scale at MIT; partnerships with organizations and groups from a variety of sectors will be leveraged to demonstrate benefits for real data scientists and to ensure that the proposed techniques meet real-world big data challenges. The curriculum development part of this project will lead to the training of new data scientists, and the project will also provide opportunities for graduate and undergraduate students to participate in research and learn how to conduct collaborative research.

Unlike most systems, which focus on improving performance or on supporting ever more sophisticated analyses, DataHub will instead focus on simplifying and automating many fundamental bookkeeping operations that are prerequisites to data science. Key features of DataHub will include: (1) a flexible, source code control-like versioning system for data that efficiently branches, merges, and diffs datasets; (2) new data ingest, cleaning, and wrangling tools designed to automate the data cleaning process; (3) the ability to search for "related" tables and to integrate them into the analysis process; and (4) the ability to selectively share and collaborate on datasets across users and teams. Overall, DataHub will significantly reduce the effort data scientists spend preparing, analyzing, sharing, and managing data.
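
To make feature (1) concrete, the following is a minimal sketch of what branching, diffing, and merging a dataset could look like. It is an illustration only and does not reflect DataHub's actual design or API: the in-memory table keyed by primary key, the function names `diff` and `merge`, and the sample rows are all hypothetical assumptions. A branch is modeled simply as a deep copy of the table, and merging is a basic three-way merge against the common ancestor.

```python
import copy

# Hypothetical model: a table is a dict mapping primary key -> row (a dict).


def diff(old, new):
    """Return the keys added, removed, and changed going from old to new."""
    return {
        "added": sorted(k for k in new if k not in old),
        "removed": sorted(k for k in old if k not in new),
        "changed": sorted(k for k in new if k in old and new[k] != old[k]),
    }


def merge(base, ours, theirs):
    """Three-way merge; returns (merged table, list of conflicting keys)."""
    merged, conflicts = {}, []
    for key in sorted(set(base) | set(ours) | set(theirs)):
        b, o, t = base.get(key), ours.get(key), theirs.get(key)
        if o == t:        # both branches agree (including both deleting the row)
            winner = o
        elif o == b:      # only "theirs" touched this row
            winner = t
        elif t == b:      # only "ours" touched this row
            winner = o
        else:             # both branches changed the row in different ways
            conflicts.append(key)
            winner = o    # keep "ours" but report the conflict to the caller
        if winner is not None:
            merged[key] = winner
    return merged, conflicts


# Illustrative data (made up for this sketch).
main = {
    1: {"city": "Boston", "temp": 20},
    2: {"city": "Cambridge", "temp": 21},
}

# Branching: each collaborator works on an independent copy.
alice = copy.deepcopy(main)
bob = copy.deepcopy(main)

alice[2]["temp"] = 22                           # Alice corrects a value
alice[3] = {"city": "Somerville", "temp": 19}   # ...and adds a row
del bob[1]                                      # Bob removes a row

print(diff(main, alice))
merged, conflicts = merge(main, alice, bob)
print(merged, conflicts)
```

Copying the whole table for every branch is the simplest possible choice and would not scale to large datasets; a system aiming to branch and merge "efficiently", as the abstract puts it, would presumably store only the differences between versions.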

For more information, see the project website at: http://data-hub.org

Agency: National Science Foundation (NSF)
Institute: Division of Information and Intelligent Systems (IIS)
Application #: 1513443
Program Officer: Sylvia Spengler
Project Start:
Project End:
Budget Start: 2015-09-01
Budget End: 2019-08-31
Support Year:
Fiscal Year: 2015
Total Cost: $333,333
Indirect Cost:
Name: Massachusetts Institute of Technology
Department:
Type:
DUNS #:
City: Cambridge
State: MA
Country: United States
Zip Code: 02139