Many scientific domains have entered a data-driven era, in which scientific discovery depends heavily on effective and efficient analysis of large-scale data generated by wet-bench experiments or computer simulations. Current database management systems (DBMSs), while very popular in the business world, fall short of the high-throughput data processing required by scientific applications. The goal of this project is to design and implement a novel data management software architecture that enables high-throughput data management services for general scientific communities. The project pursues this goal via (1) a novel one-scan-fits-all data processing framework based on repetitive scans of large data sources; (2) a query engine that leverages the massive computing power of modern Graphics Processing Unit (GPU) hardware; and (3) design and implementation of algorithms for popular analytics in three scientific domains on top of the query engine to demonstrate the effectiveness and efficiency of the proposed architecture. The project also aims to build a software prototype and evaluate it with real-world scientific datasets and query workloads.
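
To make the one-scan-fits-all idea concrete, the following is a minimal, hypothetical sketch of shared-scan query processing: many registered queries are answered during a single sequential pass over a large data source, rather than one scan per query. The names (ScanQuery, SharedScanEngine) and the toy dataset are illustrative assumptions, not the project's actual interfaces, and the sketch omits the GPU acceleration and domain-specific analytics described above.

    from dataclasses import dataclass
    from typing import Any, Callable, Iterable

    @dataclass
    class ScanQuery:
        """A query expressed as a per-record predicate plus an aggregator."""
        predicate: Callable[[Any], bool]
        aggregate: Callable[[Any, Any], Any]
        state: Any = None

    class SharedScanEngine:
        """Evaluates all registered queries during one scan of the data."""
        def __init__(self):
            self.queries = []

        def register(self, query: ScanQuery) -> None:
            self.queries.append(query)

        def run(self, records: Iterable[Any]) -> list:
            # A single pass over the data source serves every registered query.
            for rec in records:
                for q in self.queries:
                    if q.predicate(rec):
                        q.state = q.aggregate(q.state, rec)
            return [q.state for q in self.queries]

    if __name__ == "__main__":
        engine = SharedScanEngine()
        # Query 1: count records greater than 0.5
        engine.register(ScanQuery(lambda r: r > 0.5, lambda s, r: (s or 0) + 1))
        # Query 2: sum of records below 0.2
        engine.register(ScanQuery(lambda r: r < 0.2, lambda s, r: (s or 0.0) + r))
        print(engine.run([0.1, 0.7, 0.05, 0.9, 0.4]))  # -> [2, 0.15...]

Because every query shares the same scan, the cost of reading the data source is amortized across the whole workload, which is the basic efficiency argument behind the proposed framework.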
The project is expected to provide a highly efficient solution to satisfy the data management needs of a wide range of scientific fields. The proposed architecture delivers performance comparable to existing systems while requiring only a fraction of their hardware and energy costs. As a result, it has the potential to make feasible scientific studies that are currently regarded as difficult or impossible. Other planned broader-impact activities include integrating the proposed research into educational endeavors that broaden the influence of computer science, nurture the next generation of multidisciplinary scientists, and boost the success of minority and women students in the computer science and engineering field.