"Big data" have been growing in volume and diversity at an explosive rate, bringing enormous potential for transforming science and society. Driven by the desire to convert data into insights, analysis has become increasingly statistical, and there are more people than ever interested in analyzing big data. The rise of cloud computing in recent years offers a promising possibility for supporting big data analytics. However, it remains frustratingly difficult for many scientists and statisticians to use the cloud for any non-trivial statistical analysis of big data.
The first challenge is development: users must code and think in low-level, platform-specific ways and, in many cases, resort to extensive manual tuning to achieve acceptable performance. The second challenge is deployment: users face a maddening array of choices, ranging from hardware provisioning (e.g., the type and number of machines to request) and software configuration (e.g., the number of parallel execution slots per machine) to execution parameters and implementation alternatives.
This project aims to build Cumulon, an end-to-end solution for making statistical computing over big data easier and more efficient in the cloud. For development, users can think and code in a declarative fashion, without worrying about how to map data and computation onto specific hardware and software. For deployment, Cumulon presents users with the best "plans" that meet their requirements, along with estimated completion time and monetary cost, to help them make decisions. A plan encodes choices of not only implementation alternatives and execution parameters but also cluster resource and configuration parameters. The project develops effective cost modeling and efficient optimization techniques for the vast search space of possible plans. Cumulon addresses the challenges of uncertainty and extensibility (in terms of not only functionality but also optimizability). It also features a performance trace repository, which collects data from past deployments and uses it to improve cost modeling and optimization.
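To make the notion of a plan concrete, the following sketch (in Python) enumerates a hypothetical deployment search space and prunes it to the time/cost trade-offs that would be presented to the user. The machine types, prices, and the simple linear cost model here are invented for illustration only; Cumulon's actual cost models are far richer and are calibrated against collected performance traces rather than fixed formulas.

import math
from dataclasses import dataclass
from itertools import product

# Hypothetical machine types and hourly prices; names and numbers are
# invented for illustration, not drawn from the Cumulon system.
MACHINE_TYPES = {
    "small": {"max_slots": 2, "price_per_hour": 0.10},
    "large": {"max_slots": 8, "price_per_hour": 0.45},
}

@dataclass(frozen=True)
class Plan:
    """One point in the deployment search space: cluster resource,
    configuration, and execution-parameter choices."""
    machine_type: str
    num_machines: int
    slots_per_machine: int

    def estimate(self, total_work_hours):
        """Return (estimated completion time in hours, monetary cost).

        A deliberately naive cost model: work divides evenly across
        parallel slots, with a small per-slot contention penalty, and
        machines are billed by the whole hour.
        """
        slots = self.num_machines * self.slots_per_machine
        time = total_work_hours * (1.0 + 0.02 * self.slots_per_machine) / slots
        price = MACHINE_TYPES[self.machine_type]["price_per_hour"]
        cost = math.ceil(time) * self.num_machines * price
        return time, cost

def enumerate_plans(max_machines=16):
    """Enumerate the cross product of provisioning and configuration choices."""
    for mtype, spec in MACHINE_TYPES.items():
        for n, s in product(range(1, max_machines + 1),
                            range(1, spec["max_slots"] + 1)):
            yield Plan(mtype, n, s)

def pareto_frontier(plans, total_work_hours):
    """Discard any plan beaten on both completion time and monetary cost."""
    scored = [(p,) + p.estimate(total_work_hours) for p in plans]
    return sorted(
        ((p, t, c) for p, t, c in scored
         if not any(t2 <= t and c2 <= c and (t2, c2) != (t, c)
                    for _, t2, c2 in scored)),
        key=lambda x: x[1])

if __name__ == "__main__":
    for plan, time, cost in pareto_frontier(enumerate_plans(), 100.0):
        print(f"{plan.machine_type:>5} x{plan.num_machines:2d}, "
              f"{plan.slots_per_machine} slot(s)/machine: "
              f"{time:6.2f} h, ${cost:6.2f}")

Even this toy model exhibits the core trade-off a user faces: adding machines shortens completion time but can raise monetary cost, so the optimizer returns a frontier of non-dominated plans for the user to choose from rather than a single answer.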
Cumulon aims to make statistical computing over big data easier and more cost-effective for a wide range of users, including scientists and statisticians. Beyond leveraging the cloud's on-demand, pay-as-you-go access to computing resources, Cumulon simplifies development and deployment, reduces reliance on programming and tuning support, and accelerates data-driven discoveries. More than a one-shot solution, Cumulon is designed as the basis for an evolvable, open-source ecosystem that keeps up with advances in big-data analytics. Its repository of performance traces also benefits the community in its own right, independently of the system itself.
With the growing importance of quantitative, data-driven methods, Cumulon can impact many domains. The interdisciplinary team of PIs, drawn from computer science, statistics, and related fields, is applying Cumulon to concrete applications in biomedical research and computational journalism. Through these collaborations, the PIs seek to attract diverse talents, motivate them to work on problems with potential societal impact, and help prepare them for the new challenges of big data.
For further information, see the project website: http://db.cs.duke.edu/projects/cumulon