The unprecedented amounts of data available to individuals, companies, governments, and scientists promises to revolutionize the way entertainment, business, governance, and science operate. And while data are cheap and plentiful, much of this data is lower quality than the precise data that has been managed for the last 30 years. Building an application that processes this imprecise data is difficult: it requires that developers handle both standard data management challenges (e.g., concurrency and scalability), while at the same time coping with imprecise and incomplete data, which is typically done using statistical or machine learning techniques (e.g., interpolation and classification). The Hazy project addresses this challenge by building a system that integrates the paradigms of relational database management systems with statistical machine learning techniques. This project conducts the following major tasks: (I) designing a language to integrate these techniques with standard SQL, (II) proposing an algebra to implement this language along with support for automatic optimization (similar to a standard RDBMS), and (III) discovering techniques to efficiently maintain the statistical models as the underlying data are changed or updated. The end goal is a system that makes it as easy to develop scalable applications that use imprecise data as it is to develop their precise counterparts. Hazy allows users to process larger amounts of data with more sophisticated statistical processing than ever before. In turn, this enables new applications in a divese set of areas, such as life and physical science sensing applications, health-care and environmental monitoring, and enterprise-based and Web-based information extraction.
The research of this project is used to develop the data and infrastructure for new practicum-style courses that are under development at the University of Wisconsin-Madison. In addition, this infrastructure will be used as part of an outreach effort to enable high school students to gain access to data analysis tools. The source code of Hazy is released into open source and the results are disseminated on the project Web site (www.cs.wisc.edu/hazy/).