Research progress often depends critically on the ability to move data over long distances: for example, from a remote supercomputer, telescope, or genome sequencing center to a researcher's laboratory or to a university data center. Yet despite major improvements in wide area and campus networks, data movement often remains frustratingly difficult and slow, due to obstacles such as firewalls and inadequate tools.
To overcome this problem, this project develops an easy-to-use, high-performance data transfer and synchronization solution that can operate everywhere researchers need their data to be: desktops and research labs, campus servers and HPC facilities, supercomputing centers, scientific instruments, and public and private cloud providers. This new data movement solution builds on the widely used Globus implementation of the GridFTP protocol and the newer Globus Transfer hosted data movement service, plus the Indiana University implementation of the eXtensible Session Protocol (XSP), which provides for negotiation of end-to-end circuits via protocols such as OpenFlow. It incorporates numerous new capabilities, including negotiation of end-to-end circuits; communication between pairs of nodes that are both behind network address translation (NAT) devices or firewalls; new data movement capabilities such as efficient verification of transfers; and HTTP for communication with object stores.
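As an illustration of what verification of a transfer can involve, the following minimal sketch compares a source file and its copy by size and then by a streamed checksum. It is not the method used by Globus or GridFTP; the function names `file_checksum` and `verify_transfer`, the choice of SHA-256, and the 1 MiB chunk size are all assumptions made for this example.

```python
import hashlib
import os


def file_checksum(path, algorithm="sha256", chunk_size=1 << 20):
    """Return the hex digest of a file, reading it in fixed-size chunks.

    Hypothetical helper: streaming keeps memory use constant even for
    very large files.
    """
    digest = hashlib.new(algorithm)
    with open(path, "rb") as handle:
        # iter() with a sentinel stops when read() returns an empty bytes object.
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def verify_transfer(source_path, destination_path, algorithm="sha256"):
    """Return True if the destination file matches the source file.

    The size comparison is a cheap first check that avoids hashing
    files that cannot possibly match.
    """
    if os.path.getsize(source_path) != os.path.getsize(destination_path):
        return False
    return (file_checksum(source_path, algorithm)
            == file_checksum(destination_path, algorithm))
```

In a wide area setting, each endpoint would presumably compute the digest of its own copy locally, so that only the short digests, rather than the file contents, cross the network a second time.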
Resulting software solutions are delivered to the NSF community via a combination of Globus Transfer and the GridFTP and XSP open source software distributions. Partner science and engineering projects in cosmology, biology, physics, cyberinfrastructure, and other domains are assisting with evaluation.
Advances in wide area and campus network connectivity often do not translate into improved end-to-end data movement performance for NSF researchers. The reasons for this unfortunate state of affairs include the following: (1) Network Address Translation (NAT) devices and firewalls frequently prevent data movement altogether, particularly when data needs to move in a "peer-to-peer" manner between one researcher’s desktop or lab and another’s. (2) Contention within shared networks results in slow transfers, which is particularly challenging for projects in areas such as astronomy, genomics, and physics that must operate reliable distributed data analysis pipelines against large amounts of remotely sourced data. (3) The growing use of public data sets and cloud-based object stores with HTTP interfaces has outpaced the development of software that can easily and efficiently transfer large amounts of data via HTTP over high-performance networks. (4) Tools often leave end-to-end needs beyond simply moving the bits unaddressed, such as file verification, automatic tuning based on end-to-end configuration considerations, and determining the root cause of errors when things do not work as expected. And while solutions to many of these problems are known to the networking community, those solutions are not available to most NSF researchers via tools that are sufficiently easy to adopt and use.
This CC-NIE Integration project: (1) extended the popular Globus data management service (www.globus.org) to address these problems, providing easy-to-use, high-performance data transfer, replication, and sharing capabilities that can operate everywhere researchers need their data to be: desktops and research labs, campus servers and HPC facilities, supercomputing centers, scientific instruments, and cloud providers; (2) extended the Indiana University implementation of the eXtensible Session Protocol (XSP), which provides for negotiation of end-to-end network circuits, enabling better control and use of network infrastructure among competing workloads; and (3) extended the Globus software stack to allow network management tools, such as XSP, to be plugged into Globus so that the benefits of advanced network technologies can be more easily realized by end users. The resulting software has been delivered to the NSF community via a combination of the cloud-based Globus software-as-a-service (SaaS) web application and the Globus GridFTP and XSP open source software distributions, and is already being used by thousands of researchers across virtually all areas of NSF science and engineering.
Intellectual merit: This software integration and enhancement project has achieved its primary goal of providing a broadly useful capability to the NSF research community. In addition, the vertical integration of state-of-the-art tools, ranging from low-level network management tools to end-user software services, has enhanced our understanding of the issues and challenges involved in making advanced network capabilities usable by scientists and engineers.
Broader impacts: By reducing barriers to moving large quantities of data, this work contributes to a further democratization of access to scientific data.
Particularly for researchers at smaller institutions, poor network connectivity and limited or no local network expertise can be major obstacles to participation in research. Globus, including the new methods introduced by this project, reduces barriers to high-speed data movement. The highly effective Globus outreach program, and partnerships with projects such as XSEDE and Open Science Grid, have further enhanced adoption.