Rapid increases in data volumes and velocities are overwhelming finite human capabilities. Continued progress in science and engineering demands that we automate a broad spectrum of currently manual research data manipulation tasks, from transfer and sharing to acquisition, publication, indexing, analysis, and inference. To address this need, which arises across essentially all scientific disciplines, this project will work with scientists in astronomy, engineering, geosciences, materials science, and neurosciences to develop and apply Globus Automate, a cloud-hosted distributed research automation platform. Its purpose is to increase productivity and research quality across many science disciplines by allowing scientists to offload the management of a broad range of data acquisition, manipulation, and analysis tasks to the platform. By enabling scientists to hand off responsibility for frequently performed tasks, such as acquiring, analyzing, and storing data, Globus Automate will increase the productivity of scientific instruments and of the scientists who use them.
This project will expand the capabilities and reach of the highly successful Globus research data management platform. Globus combines a professionally operated, cloud-hosted management service with Globus Connect software deployed on more than 12,000 storage system endpoints, spanning most research universities, NSF-funded compute facilities, and NSF disciplines. Users employ Globus web interfaces and APIs to drive data movement, synchronization, and sharing tasks at and among endpoints. This ability to hand off responsibility for such tasks to cloud-hosted management logic has enabled substantial increases in data management efficiency and spurred development of a wide range of innovative data management applications. Globus Automate will extend these capabilities to produce a full-featured distributed research automation platform that will enable the reliable, secure, and efficient automation of a wide range of research data management and manipulation activities. It will extend intuitive trigger-action programming models, suitable for non-programming users, to enable the specification and execution of sequences of actions. It will provide for the detection of data events both at Globus storage system endpoints (e.g., creation or modification of data files, extraction of new metadata) and at other sources (e.g., completion or failure of Globus transfer tasks); the propagation of such events to a cloud-hosted orchestration engine for reliable, efficient, and secure processing; and the invocation of remote actions on Globus endpoints and other resources. The project will leverage these event mechanisms to implement solutions to challenging science problems posed by partner science projects and to create a library of automation flows, both general-purpose (e.g., data publication and data replication) and domain-specific (e.g., feature detection in experimental data). These data event mechanisms will be made available on all storage systems relevant to research (Globus already supports most on-premises and cloud systems) and integrated with the Python language and JupyterLab environment that have become popular in science, so that researchers can define and share data automation behaviors as simple Python programs. A quantitative and qualitative research agenda will analyze the usability and adoption of both the platform and the research automation paradigm.
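To make the trigger-action model concrete, the following is a minimal, self-contained sketch of how a researcher might express an automation flow as a simple Python program. The names used here (Flow, watch_directory, and the example actions) are hypothetical illustrations of the programming model, not the Globus Automate API; in the envisioned platform, triggers would be data events detected at Globus endpoints and actions would execute remotely under cloud-hosted orchestration.

    # Hypothetical sketch of a trigger-action automation flow.
    # None of these names correspond to the actual Globus Automate API.
    import time
    from pathlib import Path
    from typing import Callable, List

    class Flow:
        """A named sequence of actions applied to each detected data event."""

        def __init__(self, name: str, actions: List[Callable[[Path], None]]):
            self.name = name
            self.actions = actions

        def run(self, path: Path) -> None:
            # Execute each action in order; a production platform would add
            # retries, secure credential delegation, and remote invocation.
            for action in self.actions:
                action(path)

    def extract_metadata(path: Path) -> None:
        print(f"[{path.name}] extracting metadata ({path.stat().st_size} bytes)")

    def replicate(path: Path) -> None:
        print(f"[{path.name}] replicating to archival storage (placeholder)")

    def publish(path: Path) -> None:
        print(f"[{path.name}] registering in a data catalog (placeholder)")

    def watch_directory(directory: Path, flow: Flow, interval: float = 1.0) -> None:
        """Trigger: poll a directory and run the flow on each newly created file."""
        seen = set(directory.glob("*"))
        while True:  # runs until interrupted
            current = set(directory.glob("*"))
            for new_file in sorted(current - seen):
                flow.run(new_file)
            seen = current
            time.sleep(interval)

    if __name__ == "__main__":
        # Run the flow on every new file appearing in ./instrument_output.
        output_dir = Path("instrument_output")
        output_dir.mkdir(exist_ok=True)
        flow = Flow("publish-on-acquisition", [extract_metadata, replicate, publish])
        watch_directory(output_dir, flow)

In this sketch the trigger is a local directory poll and the actions merely print messages; the point is only that an acquire-analyze-store flow can be stated as a short, shareable Python program of the kind the abstract describes.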
This award reflects NSF's statutory mission and has been deemed worthy of support through evaluation using the Foundation's intellectual merit and broader impacts review criteria.