Scientific discovery is increasingly dependent on huge datasets that require computing at unprecedented scale. Laboratory computers and spreadsheets simply cannot handle the data flowing from modern measurement devices, such as DNA sequencers. Scientific experiments now require an understanding of both the underlying science and the cyberinfrastructure (CI) ecosystem in order to design and execute the necessary computations. Fortunately, significant and strategic support from the public and private sectors is creating a distributed computational ecosystem at the national level to help meet the computational demands of large datasets. This project, Scientific Data Analysis at Scale (SciDAS), is designed to improve the flexibility and accessibility of national resources, helping researchers more effectively use a broader array of them. SciDAS is developed using large-scale systems biology and hydrology datasets, but is extensible to many other domains.
On a technical level, SciDAS federates access to multiple national CI resources, including NSF Cloud, the Open Science Grid, the Extreme Science and Engineering Discovery Environment (XSEDE v2.0), petascale supercomputers such as COMET, and campus resources. Central to SciDAS is the use of ExoGENI dynamic networked infrastructure to enable Layer-2 connectivity and data movement between these resources and data repositories. SciDAS relies on the integrated Rule-Oriented Data System (iRODS), enhanced with software-defined networking (SDN) capabilities, to support network-aware data management decisions and efficient use of network resources. The distributed and scalable nature of both the data-sharing and compute infrastructure is exploited to optimize for compute and data locality, boosting the performance of workflows and scientific productivity. Scientific discovery use cases in systems biology and hydrology will drive cyberinfrastructure development at the petascale level while simultaneously generating useful results for domain scientists.
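The locality-aware placement described above can be illustrated with a minimal sketch: given several replicas of a dataset on different federated sites, choose the one whose measured network path minimizes estimated transfer time. This is a hypothetical illustration only; the site names, bandwidth figures, and function names are assumptions for the example and are not part of the SciDAS or iRODS APIs.

```python
# Hypothetical sketch of network-aware replica selection, in the spirit of
# SciDAS's SDN-enhanced data management. All names and numbers below are
# illustrative assumptions, not the actual SciDAS/iRODS interface.

def transfer_time(size_gb, bandwidth_gbps):
    """Estimated transfer time in seconds for size_gb gigabytes of data."""
    return (size_gb * 8) / bandwidth_gbps  # gigabytes -> gigabits

def pick_replica(replicas, size_gb):
    """Choose the replica whose measured path bandwidth minimizes transfer time."""
    return min(replicas, key=lambda r: transfer_time(size_gb, r["bandwidth_gbps"]))

# Example: three hypothetical replicas of a 500 GB sequencing dataset,
# with bandwidths as might be reported by SDN path measurements.
replicas = [
    {"site": "campus", "bandwidth_gbps": 1.0},
    {"site": "osg", "bandwidth_gbps": 10.0},
    {"site": "comet", "bandwidth_gbps": 40.0},
]
best = pick_replica(replicas, 500)
print(best["site"])  # the highest-bandwidth path wins for a fixed size
```

In a real deployment the decision would also weigh replica load, storage policy rules, and where the computation itself will run, since moving compute to the data is often cheaper than the reverse.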