This EAGER award creates an interoperability test bed to identify the components of an effective layered architecture for geoscience and environmental science research. In a layered architecture, every layer consists of different technologies, each of which uses different interaction protocols. The proposed project will examine a wide variety of existing technologies in terms of their effectiveness in working across present data silos. These technologies include data grids, workflow systems, policy management systems, web visualization services, and security protocols that work with various repository catalogs. Project goals are focused on developing cyberinfrastructure tools and approaches that allow geoscience data repositories to enable new science and more effectively make their data holdings discoverable and available to the public. Essential elements of the project include collecting and comparing various approaches and existing tools to assess their effectiveness in handling and integrating geoscience data, and automating the processes needed to integrate various databases and data types. The project is led by a team of experts in cyberinfrastructure and geoscience data management and employs a spiral software development approach. Broader impacts of the work include building infrastructure for science in order to facilitate data-enabled science in the geosciences. It will also produce results that are likely to be applicable to fields outside of the geosciences. The effort supports a larger NSF effort to establish a new paradigm in the development of an integrative and interoperable data and knowledge management system for the geosciences under a new NSF initiative called EarthCube.
The EarthCube Layered Architecture concept award assessed the types of cyberinfrastructure needed to support geoscience disciplines. In particular, the major goal was to demonstrate whether a loosely coupled federation environment could serve as the unifying infrastructure that links community resources to research environments. A collaboration environment was proposed as the unifying infrastructure that would support the interoperability mechanisms needed to interact with existing cyberinfrastructure. This approach captures the knowledge needed to interact with existing resources within interoperability mechanisms provided by the collaboration environment. The demonstrations required the identification of appropriate EarthCube science scenarios, associated use cases, and the implementation of appropriate interoperability mechanisms. The NSF DataNet Federation Consortium federation hub was used as the testbed for the demonstrations. A secondary goal was the identification of the types of interoperability mechanisms that were needed, and whether those mechanisms could be implemented within policy-based data management systems.

Major Activities

A total of ten interoperability mechanisms were demonstrated. They included mechanisms for accessing each community resource from the DataNet Federation Consortium collaboration environment, and either querying an information catalog, retrieving data, initiating a workflow, or archiving results:

- Kepler workflow system
- NCSA Cyberintegrator workflow management system
- GeoBrain broker
- Data Access Broker (DAB)
- Consortium of Universities for the Advancement of Hydrologic Science, Inc. (CUAHSI) Hydrologic Information System
- Data Observation Network for Earth (DataONE) information catalog and repository
- Data Conservancy repository
- NOAA Environmental Research Division's Data Access Program (ERDDAP)
- Sustainable Environment Actionable Data (SEAD) repository
- iRODS workflow system for reproducible data-driven research

Four science scenarios were developed that used the interoperability mechanisms to support a research goal:

- Study of hypoxia in the Gulf of Mexico
- Ecohydrology analysis of a watershed
- Analysis of river flow for the Texas drought
- Reproducible data-driven research using workflow provenance

Significant results

The interoperability mechanisms were organized into three types of knowledge encapsulation. The loosely coupled federated architecture captured knowledge needed to support:

- Interaction with a remote community resource. This included support for executing the required access protocol and for quantifying input and output variables.
- Management of the products cached within the collaboration environment. This included management of the shared collection, as well as management of the interactions with the remote community resources.
- Reproducible data-driven research. This included management of the provenance associated with each analysis workflow, sharing of workflows, and re-execution of workflows.

The interoperability mechanisms were implemented through three basic approaches:

- Drivers that apply partial I/O manipulation at the community resource. This requires installing middleware at the community resource to manipulate data before transmission over the network. A simple example is subsetting a terabyte data set to avoid sending the entire file over the network.
- Micro-services that execute the interaction protocol required by a community resource. In this case, no modifications are needed at the community resource. Instead, the interactions are controlled from the collaboration environment and implemented within it. Data sets can be identified, retrieved, and cached in the collaboration environment in support of research analyses.
- Policies that control interactions with the community resource, manage the shared collection, or control access to shared workflow analyses. The policies are executed within the collaboration environment middleware, independently of the remote community resource.

These three interoperability mechanisms were implemented on the DataNet Federation Consortium testbed. Each mechanism could be applied by any participating member of a collaboration, shared with other members, and re-used across research analyses.

Key outcomes

The creation of national-scale cyberinfrastructure that integrates geoscience community resources is feasible through a loosely coupled federation architecture. We demonstrated an approach that minimizes the effort to assemble the federation by implementing the interoperability mechanisms within collaboration environment middleware. Community resources could be integrated into the federation without any modification. Interoperability mechanisms could be applied that managed access, retrieved data sets, and executed analysis workflows.

The power of the approach was illustrated in two demonstrations: the ecohydrology watershed analysis and access to data resources within the NOAA ERDDAP server. In the latter case, NetCDF files could be parsed at the remote community resource through a "driver" embedded within the collaboration environment middleware. Both data subsets and metadata could be extracted from a remote NetCDF file. In the ecohydrology use case, multiple federal repositories were accessed for input data, transformations were applied to the data to generate the appropriate form, and the watershed analysis model was then automatically run.
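The micro-service and policy mechanisms described above can be illustrated with a small sketch. This is a hypothetical, stdlib-only Python illustration of the pattern, not the project's actual implementation (the testbed used policy-based data management middleware on the DataNet Federation Consortium hub); every class, function, and identifier below is invented for the example.

```python
from dataclasses import dataclass, field

# Hypothetical stand-in for a remote community resource. On the real
# testbed each resource (ERDDAP, the CUAHSI HIS, DataONE, ...) exposes
# its own protocol; a dict of named data sets is enough to show the idea.
@dataclass
class CommunityResource:
    name: str
    holdings: dict

# A micro-service encapsulates the interaction protocol for one resource.
# Nothing is installed at the resource; the logic lives entirely in the
# collaboration environment.
def retrieve_microservice(resource: CommunityResource, dataset_id: str):
    """Execute the (simulated) access protocol and return the data set."""
    if dataset_id not in resource.holdings:
        raise KeyError(f"{dataset_id} not held by {resource.name}")
    return resource.holdings[dataset_id]

# Policies run in the collaboration environment middleware, independently
# of the remote resource. Here a policy is just a predicate over the
# requesting user and the data set identifier.
def public_data_policy(user: str, dataset_id: str) -> bool:
    return not dataset_id.startswith("restricted/")

@dataclass
class CollaborationEnvironment:
    policies: list = field(default_factory=lambda: [public_data_policy])
    shared_collection: dict = field(default_factory=dict)

    def fetch_and_cache(self, user, resource, dataset_id):
        # Apply every policy before touching the remote resource.
        if not all(policy(user, dataset_id) for policy in self.policies):
            raise PermissionError(f"policy denied {dataset_id} for {user}")
        data = retrieve_microservice(resource, dataset_id)
        # Cache the product in the shared collection for re-use by
        # other members of the collaboration.
        self.shared_collection[(resource.name, dataset_id)] = data
        return data
```

The design point the sketch makes is the one in the text: the knowledge of how to talk to each resource, and the rules governing access and caching, are both held in the collaboration environment, so new resources join the federation without modification.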
The amount of time needed to do an analysis was reduced from months to hours. Capturing workflow provenance information enabled re-execution of workflows and comparison of results across different sets of input. The workflow, input files, and output files could be shared, making it possible for a second researcher to verify the results. This is a key requirement for reproducible data-driven research.
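The provenance-based verification just described can be sketched in a few lines: record content hashes of a workflow's inputs and outputs, then re-execute and compare. This is an illustrative stdlib-only Python sketch, not the project's iRODS implementation; the function names and record format are invented for the example.

```python
import hashlib
import json

def _digest(obj) -> str:
    """Content hash of a JSON-serializable object."""
    return hashlib.sha256(json.dumps(obj, sort_keys=True).encode()).hexdigest()

def run_with_provenance(workflow, inputs: dict) -> dict:
    """Execute a workflow and record the provenance needed to verify it.

    `workflow` is any deterministic function of the input dict; the
    record ties together the workflow name, input hash, and output hash
    so a later re-execution can be compared against the original run.
    """
    outputs = workflow(inputs)
    return {
        "workflow": workflow.__name__,
        "input_digest": _digest(inputs),
        "output_digest": _digest(outputs),
        "outputs": outputs,
    }

def verify_reexecution(workflow, inputs: dict, record: dict) -> bool:
    """Re-run the workflow and check the result against the stored record."""
    rerun = run_with_provenance(workflow, inputs)
    return (rerun["input_digest"] == record["input_digest"]
            and rerun["output_digest"] == record["output_digest"])
```

Sharing the record alongside the workflow and input files is what lets a second researcher independently confirm the published outputs, which is the reproducibility requirement the report highlights.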