Science is becoming a data management problem. Advances in sensing and computational modeling have dramatically increased data acquisition rates, establishing queries -- "in ferro" experiments -- as an essential method of scientific discovery alongside in situ, in vitro, and in silico experiments. Unfortunately, the infrastructure for designing and conducting in ferro experiments over massive datasets has not kept pace with our collective ability to create these datasets. Computational modelers, who have long benefited from a research focus on building larger and faster CPU farms, now face terabytes of simulation results in which deep insights into the health of the planet remain locked. The key is seamlessness: the ability to analyze these data interactively, both quantitatively and qualitatively, without regard to boundaries imposed by time and space domains, physical location, hardware architecture, storage medium, or file organization.
This project is building a new infrastructure that uses the CluE platform to support ad hoc, longitudinal querying and visualization of massive ocean simulation results at interactive speeds. This infrastructure leverages and extends two existing systems: GridFields, a library for general and efficient manipulation of simulation results; and VisTrails, a comprehensive platform for scientific workflow, collaboration, visualization, and provenance. By cloud-enabling these systems, the proposed infrastructure provides: (1) seamless access to the 10-year history of simulation results at interactive speeds; (2) an architecture and execution strategy that exploit both remote cloud and local desktop resources; and (3) a provenance capture and manipulation platform that enables repeatability, code reuse, forensics, and collaboration.