Big data and the evolving field of data science are fundamentally shifting the meaning of analytics. Highly complex computations have come to define the typical workload, with jobs ranging from machine learning to large-scale visualization. However, there is a fundamental discrepancy between the analytical tools available to big Internet companies and those available to non-tech enterprises. Current analytics frameworks, like Spark or Hadoop, are designed to meet the needs of giant Internet companies; that is, they are built to process petabytes of data in cloud deployments consisting of thousands of cheap commodity machines. Yet non-tech companies like banks and retailers - or even the typical data scientist - seldom operate deployments of that size, instead preferring smaller clusters, so-called enterprise clusters, with more reliable hardware. In fact, recent industry surveys reported that the median Hadoop cluster had fewer than 10 nodes, and that over 65% of users operate clusters smaller than 50 nodes. Targeting complex analytics workloads on smaller clusters, however, fundamentally changes the way we should design analytics tools. Most current systems focus on the major challenges associated with large cloud deployments, where network and disk I/O are the primary bottlenecks and failures are common, whereas the next generation of analytics frameworks should optimize specifically for the computation bottleneck. As part of this project, the PIs will systematically design a new open-source analytics engine, called Tupleware, built specifically for the infrastructure of non-tech companies. Tupleware will make complex analytics more accessible and push the boundaries of what computations are possible.
Specifically, the PIs will design, implement, and evaluate various program synthesis (i.e., query compilation) techniques for complex analytics on enterprise clusters with fast interconnects and considerable available memory. Existing query compilation techniques focus on SQL; they are not designed for workloads where UDFs and iterations dominate the computation, nor do they target distributed setups. The PIs will address all of these issues in this project. Furthermore, the PIs aim to combine high-level query optimization with compiler technology to holistically optimize complex analytical workflows by combining statistics about the data (e.g., the selectivity of predicates) with low-level statistics about the UDFs (e.g., the number of registers used). Finally, all results will be integrated into the Tupleware system and thus made accessible to a broader range of users.
For further information, see the project web site at: http://tupleware.cs.brown.edu/