A growing number of users have access to large datasets that they need to analyze. Astronomers, oceanographers, and other domain scientists rely on data analysis for their science. Journalists may want to analyze data to use in their articles. Over the past several years, cloud service providers have offered an increasingly large selection of data management services for data analytics (e.g., Amazon Elastic MapReduce or Google BigQuery). Cloud services provide seamless access to powerful data analysis tools, often directly through the browser. Too many services, however, remain too close to the traditional mode of operating a database management system. They reveal too much information about their internal architecture and deployment: users are required to reason at the level of service instances, instance types, and gigabytes processed. As a result, users today must be data management experts to choose among these services and use them cost-effectively. This project will develop new data management techniques that enable cloud service providers to insulate users from the details of service internals while still letting them trade off price and performance. The project will further develop tools that explain query performance and help users rewrite their queries to improve it.
More specifically, the project will develop new approaches to (1) predict not only query runtime but also whether a query is likely to execute more slowly than estimated due to failures, skew, cardinality estimation errors, or contention; (2) guarantee query runtimes by dynamically changing both the resources allocated to a query and its failure-handling and skew-handling mechanisms as needed; (3) post specific slowdown factors for periods of heavy load and guarantee them through novel scheduling algorithms; and (4) explain query performance and suggest rewrites in a way that does not require users to understand query plans. The project will implement all of these algorithms in Myria, the open-source cloud data management system (and service) recently developed and in continuous operation at the University of Washington.
For further information, see the project web site at: http://cloudperf.cs.washington.edu