Recent technological advances enable the collection of massive amounts of data in science, commerce, and society. These datasets bring us closer than ever before to solving important problems such as decoding human genomes and coping with climate change. Meanwhile, the exponential growth in data volume creates an urgent challenge: many existing analysis tools assume that datasets fit in memory, and when applied to massive datasets they become unacceptably slow because of excessive disk input/output (I/O) operations.
Across application domains, much of advanced data analysis is done with custom programming by statisticians. Progress has been hindered by the lack of easy-to-use statistical computing environments that support I/O-efficient processing of large datasets. There have been many approaches to I/O-efficiency, but none has gained traction with statisticians, because of issues ranging from efficiency to usability. Disk-based storage engines and I/O-efficient function libraries are only a partial solution, because many sources of I/O-inefficiency in programs remain at a higher, inter-operation level. Database systems seem to be a natural solution, with efficient I/O and a declarative language (SQL) enabling high-level optimizations. However, much work on integrating databases and statistical computing remains database-centric, forcing statisticians to learn unfamiliar languages and to cope with the impedance mismatch between those languages and their host language.
To make a practical impact on statistical computing, this project postulates that a better approach is to make it transparent to users how I/O-efficiency is achieved. Transparency means no SQL and no new language to learn. Transparency means that existing code should run without modification and automatically gain I/O-efficiency. The project, nicknamed RIOT, aims to extend R---a widely popular open-source statistical computing environment---to transparently provide efficient I/O. Achieving transparency is challenging; RIOT does so with an end-to-end solution addressing issues on all fronts: I/O-efficient algorithms, pipelined execution, deferred evaluation, I/O-cost-driven expression optimization, smart storage and materialization, and seamless integration with an interpreted host language.
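The deferred-evaluation idea above can be illustrated with a minimal sketch (in Python for concreteness; the class names and structure here are illustrative assumptions, not RIOT's actual design). Operations build an expression tree instead of computing immediately, so the system sees the whole expression, and can optimize across operations, before any work is done:

```python
# Illustrative sketch of deferred evaluation (not RIOT code).
# Arithmetic on deferred objects builds an expression tree; nothing is
# computed until a concrete result is demanded via evaluate(), which is
# where a real system would apply cross-operation optimization and
# pipelining to avoid materializing intermediates.

class Expr:
    def __add__(self, other):
        return Add(self, other)  # defer: record the operation, don't run it

class Vector(Expr):
    """A leaf node holding concrete data."""
    def __init__(self, data):
        self.data = list(data)
    def evaluate(self):
        return self.data

class Add(Expr):
    """An internal node representing a deferred elementwise addition."""
    def __init__(self, left, right):
        self.left, self.right = left, right
    def evaluate(self):
        l, r = self.left.evaluate(), self.right.evaluate()
        return [a + b for a, b in zip(l, r)]

x = Vector([1, 2, 3])
y = Vector([4, 5, 6])
z = x + y            # builds an Add node; no computation yet
print(z.evaluate())  # computation happens only here: [5, 7, 9]
```

Because `z` is just a tree until `evaluate()` is called, an optimizer could rewrite the whole expression (reorder operations, fuse loops, choose I/O-efficient plans) before execution begins.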
RIOT integrates research and education, and continues the tradition of involving undergraduates through REU and independent studies. As a database researcher, the PI is committed to learning from and drawing on work in programming languages and high-performance computing. Findings from RIOT help create synergy and seed further collaboration with these communities. To ensure practical impact on statistical computing, RIOT has enlisted collaboration from statisticians and the R core development team on developing, evaluating, and disseminating RIOT.
Further information can be found at: www.cs.duke.edu/dbgroup/Main/RIOT
Recent technological advances have enabled the collection of massive amounts of data in science, commerce, and society. These large, high-resolution datasets have brought us closer than ever before to solving important problems such as decoding human genomes and coping with climate change. Meanwhile, the exponential growth in the amount of data has created an urgent and difficult technical challenge: many existing data analysis tools still assume that datasets fit in the main memory of a single machine, and they are unable to cope with massive datasets. Across application domains, much of advanced data analysis is done with programs custom-developed by statisticians. Unfortunately, progress has been hindered by the lack of easy-to-use statistical computing environments that support efficient and scalable execution of programs over large datasets. High-performance libraries provide only a partial solution, because many optimization opportunities in a program remain at a higher, inter-operation level. The goal of this project is to provide a more usable platform for big data analytics.

Intellectual Merit

To make a practical impact on the statistical computing community, this project postulates that a better approach is to make it transparent to users how efficiency and scalability are achieved. Transparency means no new language to learn; the system automatically optimizes programs written in a high-level language familiar to statisticians. To achieve I/O-efficiency, this project has developed an end-to-end solution that addresses issues on all fronts in an innovative way: efficient and flexible storage and indexing, pipelined execution to avoid materializing intermediate results, I/O-cost-driven optimization through aggressively deferred evaluation, and seamless integration of these features with the interpreted host language R.
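A concrete example of the inter-operation optimization opportunities mentioned above (a standard illustration, not taken from RIOT itself): for matrices A and B of size n x n and a vector v of length n, a literal left-to-right evaluation of A*B*v costs O(n^3) operations and materializes an n x n intermediate, while the rewritten order A*(B*v) costs O(n^2) and keeps every intermediate at size n. Deferred evaluation lets the system see the whole expression and pick the cheaper order automatically:

```python
# Matrix-chain reordering example (illustrative; uses NumPy for brevity).
import numpy as np

n = 200
rng = np.random.default_rng(0)
A = rng.random((n, n))
B = rng.random((n, n))
v = rng.random((n, 1))

left_to_right = (A @ B) @ v   # materializes an n x n intermediate: O(n^3) work
right_to_left = A @ (B @ v)   # only n x 1 intermediates: O(n^2) work

# Both orders yield the same mathematical result.
assert np.allclose(left_to_right, right_to_left)
```

When data resides on disk rather than in memory, the difference is even starker: the n x n intermediate in the first plan must be written out and read back, which is exactly the kind of I/O cost a whole-expression optimizer can avoid.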
The project has also investigated various methods for leveraging emerging hardware and platforms for scalable statistical analysis, including the use of a computing cloud, solid-state drives (SSDs), and graphics processing units (GPUs). Users can benefit from these technologies without having to rewrite programs specifically for them. This project has generated many publications in database research venues, including CIDR 2009, ICDE 2010, PVLDB 2011, PVLDB 2012, CIKM 2012, SIGMOD 2013, PVLDB 2013, and the IEEE Data Engineering Bulletin 2014. The software artifacts developed by the project include a proof-of-concept implementation on top of a database system (available from the project website), a prototype system for I/O-efficient linear algebra built from the ground up to overcome the limitations and inefficiency of database systems (demonstrated at ICDE 2010), and a prototype system that jointly optimizes parallel execution and deployment strategies for linear algebra workloads on a cloud.

Broader Impact

The PI has been part of other interdisciplinary projects funded by NSF: one studied how to collect and analyze ecological data from a sensor network, and another investigated how to simplify the development and deployment of statistical data analysis in a cloud. Much of the work in this project was motivated by the ecological data analysis problems faced in the first project on sensors, while many of the results from this project are now being applied in the second project to problems in statistics and political science. The project has provided training for a number of PhD students: Yi Zhang, Risi Thonangi, and Botong Huang. Yi Zhang, the lead student on this project, graduated with a PhD in 2012. The PI has also supervised undergraduate researchers, Weiping Zhang and Jiaqi Yan, to work on this project.