Researchers in an expanding number of scientific disciplines, ranging from astronomy to medicine and from economics to botany, would like to move large volumes of data across institutional and national boundaries. Placing data at different locations expands data processing capacity, facilitates collaborative analysis, and accelerates the translation of gained insight into scientific discoveries. Yet we have little understanding of how a complex ensemble of interrelated factors affects the cost, performance, and resource consumption of these data movements. These factors include the number and size distribution of files in a data set, the performance characteristics of the source and destination file/storage systems, time-of-day and time-of-week fluctuations in the carrying capacity of the network between sites, and the availability of IPv6 paths. This project will explore models that capture both the characteristics and the interplay of the facets involved in placing data at locations separated by local, national, or international networks. Exploratory work will establish the organizational and functional foundations for an international data placement laboratory (iDPL) to support at-scale, end-to-end data placement experiments.
The iDPL will be constructed with international partners and is intended to be an extensible facility. The UCSD and UW-Madison team brings together extensive experience in software tools (HTCondor, Open Science Grid, Rocks Clustering), high-performance campus-area networks (Quartzite and Prism@UCSD), storage systems (Data Oasis), and engaged science communities. The core iDPL software architecture treats data placement experiments as workflows and uses HTCondor's DAGMan and the Metronome continuous testing framework to author, instantiate, and manage experiments. Synthetic data sets will be offered to facilitate diagnostics. Whenever possible, international networking providers will be engaged to correlate the infrastructure view of the network with user experiences.
The broad and long-term impact is to lay the groundwork for a persistent data- and information-sharing vehicle that can be easily expanded beyond the initial international group of institutions.