Real-world data, especially when generated by distributed measurement infrastructures such as sensor networks, tends to be incomplete, imprecise, and erroneous, making it impossible to present to users or to feed directly into applications. The goal of this project is to develop MauveDB, a data management system that offers a principled approach to dealing with this problem by supporting a new abstraction called "model-based views." MauveDB is built by leveraging the Apache Derby Database Management System (DBMS) codebase. Analogous to traditional database views, model-based views provide independence from the details of the underlying data-generating mechanism and hide the irregularities of the data by using statistical and probabilistic "models" to present a consistent view of the data to users. MauveDB provides a declarative language for defining model-based views, and supports declarative querying over those views using an extended version of SQL that handles continuous and probabilistic queries. Being a full-fledged DBMS, MauveDB also enables easy storage, archival, and querying of historical data. By relieving users of the burden of dealing with noisy real-world data, MauveDB enables a high-impact new class of real-world applications based on networks of monitoring and sensing devices, such as traffic monitoring, location-based services, environmental monitoring, health services, and military applications. This research will be used to develop code, datasets, and sensor network deployments for the new course modules being developed at the University of Maryland; the code and course material will also be made freely available at the project web site www.cs.umd.edu/~amol/MauveDB.
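As a rough illustration of the model-based view idea described above, the following Python sketch exposes noisy, irregularly spaced readings as a regular grid of values predicted by a fitted model. It is a minimal sketch under stated assumptions, not MauveDB's view-definition language or its implementation: the choice of a simple linear regression, the function names fit_line and regression_view, and the sample readings are all hypothetical.

```python
from statistics import mean


def fit_line(points):
    """Ordinary least-squares fit of y = a + b * x to a list of (x, y) pairs."""
    xs = [x for x, _ in points]
    ys = [y for _, y in points]
    x_bar, y_bar = mean(xs), mean(ys)
    sxx = sum((x - x_bar) ** 2 for x in xs)
    sxy = sum((x - x_bar) * (y - y_bar) for x, y in points)
    b = sxy / sxx          # slope
    a = y_bar - b * x_bar  # intercept
    return a, b


def regression_view(raw_readings, grid):
    """Hypothetical 'view': model predictions at regular grid positions.

    The caller sees a complete, regular table even though the raw
    readings are noisy and taken at irregular positions.
    """
    a, b = fit_line(raw_readings)
    return [(x, a + b * x) for x in grid]


# Made-up raw data: (position, temperature) at irregular positions.
raw = [(0.0, 20.1), (1.5, 21.4), (2.0, 22.2), (4.5, 24.4)]

# The view can be queried at positions where no sensor reported a value.
view = regression_view(raw, grid=[0, 1, 2, 3, 4])
```

In this sketch the view is recomputed from the raw readings on every call; a database system supporting such views would have to decide when to materialize and incrementally maintain them, which the sketch does not attempt.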
The goal of this NSF project was to develop data management abstractions and tools to store, manage, and process noisy and incomplete data generated in a wide range of practical application domains. The data uncertainties may be a result of the fundamental limitations of the underlying measurement infrastructures, or the inherent ambiguity in the domain, or they may be a side-effect of the rich probabilistic modeling typically performed to extract high-level events from the data. For example, there has been a tremendous increase in the number of distributed measurement infrastructures, such as wireless sensor networks, that continuously generate invaluable data about our everyday world. However, the potential of that data has been hard to realize because it is typically incomplete and erroneous, which makes it too hard to base critical decisions on it. Experimental data generated in scientific domains exhibits many similar uncertainties. Similarly, when attempting to integrate heterogeneous data sources on the Internet ("data integration") or to extract structured information from text ("information extraction"), the results are approximate and uncertain at best. Because functionally rich and easy-to-use data management tools that can reason about large volumes of uncertain data are lacking, the information about the uncertainty is often either discarded or reasoned about only superficially. The key outcomes of the project can be broadly divided into two parts. First, we developed abstractions and techniques to integrate a variety of statistical modeling techniques into relational database systems to seamlessly manage and process uncertain data. Statistical analysis and modeling are perhaps the most ubiquitous tasks that need to be performed on real-world data, especially on uncertain data.
Models can often be an end unto themselves, e.g., in scientific data analysis, but are also widely used in non-scientific application domains for smoothing noisy data, predicting missing values, forecasting, pattern recognition, and event or anomaly detection. However, statistical modeling techniques typically cannot scale to the very large volumes of data commonly encountered in today's world. On the other hand, relational database systems can easily handle very large data volumes, but do not have the rich processing capabilities of statistical models. This forces users to employ an awkward combination of databases and external tools like R, Matlab, and SAS for their analysis. To address this problem, we proposed a new declarative abstraction called "model-based views" that enables scalable and real-time statistical modeling of data by tightly integrating statistical models into the core query processing engine of a database system. We illustrated the viability and benefits of model-based views by building a prototype system that supports this abstraction. We showed how to efficiently integrate a large variety of statistical models into relational databases, significantly enriching the capabilities of relational database systems without compromising the ability to efficiently query large volumes of data. Second, we developed techniques and algorithms for representing and querying large volumes of data annotated with "probabilities" in relational databases. Such annotations are one of the most common ways to both associate uncertainty with data and reason about it. However, traditional data management tools cannot properly reason about or operate upon such annotated data. Further, even simple analysis of such data turns out to be intractable in many cases. During the course of the project, we addressed many different facets of this challenging problem.
We developed a uniform framework for managing probabilistic data that combines the rich modeling power of probabilistic graphical models with the efficiency of database storage and querying. This enables us to represent and query complex correlations that may be present in the data uncertainty. We also developed algorithms and data structures to execute many different types of queries specified in a declarative query language (specifically, a modified subset of SQL) over large volumes of such probabilistic data, including ranking and inference queries. Overall, our work has shown how to extend relational database systems to effectively represent, store, and query large volumes of uncertain data, which are increasingly encountered in a range of real-world applications. More details about the project, and the publications that resulted from it, can be found at www.cs.umd.edu/~amol/MauveDB and www.cs.umd.edu/~amol/PrDB.
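To make the notion of querying data annotated with probabilities concrete, the following Python sketch evaluates two simple queries over a small probabilistic table. It is only a sketch under a strong simplifying assumption: every tuple is treated as independent of the others, whereas the framework described above uses probabilistic graphical models precisely so that correlations among tuples can be represented. The table contents and the names prob_exists, expected_count, and hot are hypothetical.

```python
from math import prod

# Made-up probabilistic table: (sensor_id, temperature, probability that
# the tuple is actually present). Tuples are ASSUMED independent here.
readings = [
    ("s1", 31.0, 0.9),
    ("s2", 35.5, 0.5),
    ("s3", 29.0, 0.8),
    ("s4", 36.0, 0.2),
]


def prob_exists(table, predicate):
    """Probability that at least one tuple satisfying the predicate exists.

    Under tuple independence this is 1 minus the probability that every
    matching tuple is absent.
    """
    return 1.0 - prod(1.0 - p for *attrs, p in table if predicate(*attrs))


def expected_count(table, predicate):
    """Expected number of matching tuples (linearity of expectation)."""
    return sum(p for *attrs, p in table if predicate(*attrs))


def hot(sensor_id, temperature):
    """Hypothetical selection predicate: temperature above 30 degrees."""
    return temperature > 30.0


p_any_hot = prob_exists(readings, hot)
n_hot = expected_count(readings, hot)
```

The expected count remains valid even when tuples are correlated, because expectation is linear; the existence probability does not, which is one reason correlated data calls for the richer graphical-model representation.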