This Small Business Innovation Research (SBIR) Phase I project is a preliminary feasibility study toward offering a scalable data cleaning service in the Cloud Computing environment for small businesses. Despite active research on data cleaning and the availability of commercial data cleaning solutions, many small businesses are unable to clean their in-house data to a satisfactory level. Often, they simply do not have the human or IT resources to do so. To address these challenges, this SBIR Phase I proposal claims that extending existing state-of-the-art data cleaning solutions to be more scalable and available in the Cloud Computing environment, and offering data cleaning as a service in the Software-as-a-Service (SaaS) paradigm, will give small businesses easy access to sophisticated data cleaning for nominal fees. This effort will extend current software infrastructure to build high-performance data cleaning solutions, which will form the basis for addressing many data quality problems. Its intellectual merit lies in establishing a unifying framework that improves the scalability of data cleaning solutions and extends them to fit the Cloud Computing environment.
The broader impact/commercial potential of this project can be far-reaching, since data quality issues are ubiquitous across businesses. With the explosive growth of data, as in "Big Data," in virtually all industries and disciplines, the ability to execute rich data cleaning solutions at scale becomes ever more important. By offering such scalable data cleaning solutions in the Cloud Computing environment as a paid service, this SBIR project aims to reach the many small businesses that need to clean their complex in-house data without much investment. By implementing the whole cleaning-as-a-service offering on Amazon's web service infrastructure, such as EC2, the company believes that the project has great commercial potential to serve a market that did not exist before.
The SBIR Phase I project (Data Cleaning as a Service) ran from July 2012 to June 2013, led by Nittany System Research, LLC in collaboration with Penn State. The project aimed to investigate and develop a novel Cloud-based data cleaning service using the pay-as-you-go model. The term "data cleaning" generally refers to the task of detecting and correcting inaccurate records in a database so that subsequent business analysis can be based on more accurate data. While many solutions to the data cleaning problem exist, our solution is novel in two ways. First, because it is entirely Cloud-based, businesses can clean their data simply by uploading it to the Cloud and cleaning it there. Second, our solution is highly scalable, as it runs on multiple machines in parallel using the MapReduce framework. Based on these two key ideas, the PIs have built a prototype of the Cloud-based de-duplication system, named Dedool (for deduplication tool), using Amazon.com’s Cloud infrastructure (AWS). The prototype is publicly accessible at www.dedool.com/. In addition, the PIs have investigated novel ways to monitor and optimize parallel environments such as MapReduce so that a very large number of cleaning tasks can be run more efficiently. Finally, the PIs have identified the need to perform cleaning in the Cloud while keeping the data being cleaned anonymous, so that businesses using the proposed Dedool service do not need to worry about the privacy of their potentially sensitive data. Overall, the project hired and trained a total of six developers and produced four research publications (three published and one under review).
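To make the idea of MapReduce-based de-duplication mentioned above more concrete, the following is a minimal single-process sketch of how such a job could be organized. It is an illustration under stated assumptions, not Dedool's actual implementation: the record layout `(id, name)`, the three-character blocking key, the similarity measure (Python's `difflib.SequenceMatcher`), the 0.85 threshold, and the function names `map_phase`, `reduce_phase`, and `run_job` are all hypothetical choices made for this example.

```python
from collections import defaultdict
from difflib import SequenceMatcher
from itertools import combinations


def normalize(name):
    # Lowercase and collapse whitespace so trivial differences are ignored.
    return " ".join(name.lower().split())


def map_phase(record):
    # Map step: emit (blocking key, record). The key (first three characters
    # of the normalized name) is an assumption chosen for illustration; it
    # limits comparisons to records that share the same key.
    rec_id, name = record
    yield normalize(name)[:3], (rec_id, name)


def reduce_phase(key, records, threshold=0.85):
    # Reduce step: compare every pair of records within one block and emit
    # the id pairs whose similarity reaches the (assumed) threshold.
    for (id_a, name_a), (id_b, name_b) in combinations(records, 2):
        score = SequenceMatcher(None, normalize(name_a), normalize(name_b)).ratio()
        if score >= threshold:
            yield (id_a, id_b)


def run_job(records, threshold=0.85):
    # Local stand-in for the shuffle: group mapped values by key. In a real
    # MapReduce deployment, each key group could be reduced on a different
    # machine in parallel.
    groups = defaultdict(list)
    for record in records:
        for key, value in map_phase(record):
            groups[key].append(value)
    pairs = []
    for key, values in groups.items():
        pairs.extend(reduce_phase(key, values, threshold))
    return sorted(pairs)
```

For example, calling `run_job` on the records `(1, "Acme Corporation")`, `(2, "ACME  Corporation")`, `(3, "Acme Corp")`, and `(4, "Zenith Labs")` flags records 1 and 2 as likely duplicates at the default threshold. Because each block is reduced independently, the pairwise comparisons can be spread across machines, which is the property that makes this style of job scale.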