The ability to aggregate, share, and analyze large, important data sets while optimizing time-to-science is essential to supporting multi-disciplinary and multi-institutional data-driven discovery. This project is deploying a federated cloud computing system in New York State and California composed of data infrastructure building blocks designed to support scientists who require flexible workflows and analysis tools for large-scale data sets. Data challenges from seven communities (earth and atmospheric sciences, finance, chemistry, astronomy, civil engineering, genomics, and food science) are being addressed using a rich set of open-source software, optimized frameworks, and cloud usage modalities. The federated cloud is operating at Cornell University (project lead) and at partner sites at the University at Buffalo and the University of California, Santa Barbara. The project team is supporting multi-disciplinary research groups with over forty global collaborators and documenting science use cases. The broader goal of this project is to develop a federated cloud model that encourages and rewards institutions for sharing large-scale data analysis resources that can be expanded internally with common, incremental building blocks and externally through meaningful collaborations with other institutions, public clouds, and NSF cloud resources.
Project documentation and webinars feature best practices, including how to create virtual machine instances, run at federated sites, burst to Amazon Web Services, and access, move, and store large-scale data. A new cloud metrics tool is being built into Open XDMoD (XD Metrics on Demand) that uses QBETS (Queue Bounds Estimation from Time Series) statistics to let users make online forecasts of future performance and allocation-level availability, and to predict when to burst from federation resources. A new allocations and accounting model allows institutional administrators to track utilization across federated sites and use this data as an exchange mechanism. Together, these tools provide a better understanding of how sharing data infrastructure building block capacity across institutional boundaries can create wider science and engineering collaborations and increase data sharing in a scalable and sustainable way.
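As a minimal, illustrative sketch (not drawn from the project's documentation), bursting to Amazon Web Services from Python might look like the following, using the boto3 library; the region, machine image, instance type, and key pair name are hypothetical placeholders, and AWS credentials are assumed to be configured locally.

    # Illustrative sketch only: launch an EC2 instance to absorb overflow
    # work when federation capacity runs out. All identifiers below are
    # hypothetical placeholders.
    import boto3

    ec2 = boto3.resource("ec2", region_name="us-east-1")
    instances = ec2.create_instances(
        ImageId="ami-0123456789abcdef0",  # placeholder machine image
        InstanceType="m5.large",          # placeholder instance type
        KeyName="my-keypair",             # placeholder SSH key pair
        MinCount=1,
        MaxCount=1,
    )
    instance = instances[0]
    instance.wait_until_running()  # block until AWS reports the VM as running
    instance.reload()              # refresh attributes such as the public DNS name
    print(f"Burst instance {instance.id} is up at {instance.public_dns_name}")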
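QBETS itself is a nonparametric forecasting method. A minimal sketch of its core idea, under the assumption that recent history behaves like draws from a stationary distribution (the full method uses change-point detection to justify this), is to take the k-th order statistic of the history as an upper bound on a target quantile, choosing k via the binomial distribution. The function and variable names below are ours, not the project's.

    # Illustrative sketch of a QBETS-style bound: given a history of
    # measurements (e.g., VM start-up or queue wait times), find a value
    # that upper-bounds the q-th quantile with the requested confidence,
    # using the binomial method on order statistics.
    from math import comb

    def qbets_upper_bound(history, q=0.95, confidence=0.95):
        """Return an observation that exceeds the q-quantile of the
        underlying distribution with at least the given confidence,
        assuming the history is i.i.d."""
        xs = sorted(history)
        n = len(xs)
        cdf = 0.0
        for k in range(1, n + 1):
            # After this step, cdf = P(Binomial(n, q) < k): the chance that
            # fewer than k of the n samples fall below the true q-quantile.
            cdf += comb(n, k - 1) * q ** (k - 1) * (1 - q) ** (n - k + 1)
            if cdf >= confidence:
                return xs[k - 1]  # k-th smallest observation
        return None  # not enough history for this (q, confidence) pair

    # Example: bound the 95th percentile of (made-up) start-up times in seconds.
    waits = [34, 41, 29, 55, 38, 47, 31, 62, 44, 36, 58, 40, 33, 49, 45,
             37, 52, 39, 43, 35, 61, 42, 30, 48, 46, 50, 53, 32, 57, 54]
    print(qbets_upper_bound(waits, q=0.95, confidence=0.75))

With only 30 observations, a 95th-percentile bound is attainable only at modest confidence (the largest sample supports roughly 0.79 here), which is why the example asks for 0.75; longer histories support tighter guarantees.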