Data cleaning technologies, traditionally designed to improve the quality of data in back-end data warehouses, are fast emerging as a vital component of real-time information access. As the Web evolves toward supporting interactive analytics, and basic search migrates from simple keyword retrieval to retrieval based on semantically richer concepts (e.g., entities) extracted from web pages, the need for "on-the-fly" cleaning techniques that can alleviate data quality challenges is rapidly increasing. This project explores three innovations that will help advance data cleaning toward becoming an embedded enabling technology for real-time information access. The first innovation is "query-aware data cleaning," which is based on the observation that the specificity of a real-time task, such as a query, can be exploited to bring new optimizations to the data cleaning process. The second innovation is a data cleaning framework that moves from the "best-effort" ad hoc setup of today's systems to a principled approach that exposes and exploits a fundamental tradeoff between the cost of cleaning and the quality of the results achieved. Finally, since the results of cleaning need to be fed to the end user or to analysis code, the project investigates how results processed by data cleaning code can be presented to the end recipient. The primary contribution here is a set of mechanisms that hide the uncertainty in the data and determinize the results while maximizing the end application's goals.

The proposed research is intended to bring transformative improvements to interactive analytics and search on the Web by facilitating real-time data cleaning and data quality enhancement. The project also aims to benefit the research community by incorporating the mechanisms developed in this research into the Web People Search Technology (WEST), enabling WEST to become a real-time, on-the-fly web people search tool. The goal is to support WEST as a plug-and-play system in which other researchers can embed and test their own data cleaning algorithms and tools. Finally, the planned research, system development, and educational activities are expected to enhance the educational experience of students, preparing them for today's knowledge-driven society.

For further information, see the project web site: http://sherlock.ics.uci.edu

Agency: National Science Foundation (NSF)
Institute: Division of Information and Intelligent Systems (IIS)
Application #: 1118114
Program Officer: Frank Olken
Project Start:
Project End:
Budget Start: 2011-08-01
Budget End: 2014-07-31
Support Year:
Fiscal Year: 2011
Total Cost: $500,000
Indirect Cost:
Name: University of California Irvine
Department:
Type:
DUNS #:
City: Irvine
State: CA
Country: United States
Zip Code: 92697