Data cleaning technologies, traditionally designed to improve the quality of data in back-end data warehouses, are fast emerging as a vital component of real-time information access. As the Web evolves towards supporting interactive analytics, and basic search migrates from simple keyword retrieval to retrieval based on semantically richer concepts (e.g., entities) extracted from web pages, the need for "on-the-fly" cleaning techniques that can help alleviate data quality challenges is rapidly increasing. This project explores three innovations that will help advance data cleaning towards becoming an embedded enabling technology for real-time information access. The first innovation is "query-aware data cleaning", which is based on the observation that the specificity of a real-time task, such as a query, can be exploited to bring new optimizations to the data cleaning process. The second innovation is a data cleaning framework that moves from the "best-effort" ad hoc setup of today's systems to a principled approach that exposes and exploits a fundamental tradeoff between the cost of cleaning and the quality of the results achieved. Finally, since the results of cleaning need to be fed to the end user or to analysis code, the project investigates how results processed through data cleaning code can be presented to the end recipient. The primary contribution here is a set of mechanisms that hide the uncertainty in the data and determinize the results while maximizing the end application's goals.

The proposed research is intended to bring transformative improvements in interactive analytics and search on the web by facilitating real-time data cleaning and data quality enhancements. The project also aims to benefit the research community by incorporating mechanisms developed as part of this research into the Web People Search Technology (WEST), enabling WEST to become a real-time, on-the-fly web people search tool. The goal is to support WEST as a plug-and-play system in which other researchers can embed and test their data cleaning algorithms and tools. Finally, the planned research, system development, and educational activities will significantly enhance the educational experience of students, preparing them for a brighter future in today's knowledge-driven society.

For further information see the project web site at the URL:

National Science Foundation (NSF)
Division of Information and Intelligent Systems (IIS)
Program Officer: Frank Olken
University of California Irvine
United States