This project will further develop and enhance the Stork Data Scheduler to support the Azure cloud computing environment and to mitigate the end-to-end data handling bottleneck in data-intensive cloud computing applications. The Stork data scheduler has been very actively used in many data-intensive application areas, including coastal hazard prediction and storm surge modeling; oil flow and reservoir uncertainty analysis; numerical relativity and black hole collisions; educational video processing and behavioral assessment; digital sky imaging; and multiscale computational fluid dynamics. Making Stork available in the Azure environment will enable this existing user base to migrate easily to the Azure Cloud Computing platform, and will let other Azure application groups benefit from Stork's large-scale data handling capabilities.

The Stork Data Scheduler for Azure will make a distinctive contribution to the cloud computing community because it focuses on planning, scheduling, monitoring, and management of data placement tasks, and on application-level end-to-end optimization of networked I/O for petascale data-intensive applications. Unlike existing approaches, it will treat data resources and the tasks related to data access and movement as first-class entities, just like computational resources and compute tasks, and not simply as a side effect of computation. The Stork data scheduler for Azure will provide enhanced functionality for cloud computing, such as data aggregation and connection caching; peer-to-peer and streamed data management; early error detection, classification, and recovery in data transfers; scheduled storage management; optimal protocol tuning; and end-to-end performance prediction services. The Stork data scheduler for Azure will dramatically change how domain scientists perform their research by making it fast to share large amounts of data in cloud computing environments.

Project Report

The Stork Data Scheduler has been ported to Windows-based systems and to the Azure Cloud Computing Environment. This allows Azure users to start using a broad range of Stork data management capabilities immediately. We have implemented several inter-protocol translation modules as part of Stork Azure and have started using them in real data transfers. Stork Azure can act as a negotiating system between different data storage systems and protocols and Azure. The modularity of Stork allows users to easily insert a plug-in to support their favorite storage system, protocol, or middleware. Stork can currently interact with data transfer protocols such as FTP, GridFTP, HTTP, SCP, SMTP, and BitTorrent, and with data storage systems such as iRODS. Stork maintains a library of pluggable 'data placement' modules, which are executed in response to data placement job requests coming to Stork.
Thin clients were implemented for the cloud-hosted Stork Azure services, including a web interface and an Android smartphone app. These interfaces use the REST API to access Stork Azure services. Through these thin clients, users can conveniently submit, manage, and monitor their data transfer tasks on Stork Azure.
End-to-end data transfer throughput prediction models were implemented for Windows Azure. Using these prediction models, we are now able to provide a cloud-hosted 'data transfer completion time estimation service' as a SaaS. This estimation service will allow data movement operations to be scheduled in advance with a preferred time constraint given by the user, stating the earliest start time and the desired latest completion time. This will allow users and higher-level meta-schedulers to use data placement as a service, where they can plan ahead and reserve the time period for their data movement operations between Azure and external storage systems.
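The time-constraint idea behind the estimation service can be illustrated with a minimal sketch. It is not Stork's implementation or API: the names `TransferRequest`, `estimate_duration`, and `latest_feasible_start` are hypothetical, and the predicted throughput is assumed to be a single constant value supplied by some prediction model.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta
from typing import Optional


@dataclass
class TransferRequest:
    """Hypothetical request record; field names are illustrative only."""
    size_bytes: int              # amount of data to move
    earliest_start: datetime     # user-supplied earliest start time
    latest_completion: datetime  # user-supplied desired latest completion time


def estimate_duration(size_bytes: int, predicted_throughput_bps: float) -> timedelta:
    """Estimate transfer time from a predicted throughput in bits per second."""
    if predicted_throughput_bps <= 0:
        raise ValueError("predicted throughput must be positive")
    return timedelta(seconds=size_bytes * 8 / predicted_throughput_bps)


def latest_feasible_start(
    req: TransferRequest, predicted_throughput_bps: float
) -> Optional[datetime]:
    """Return the latest start time that still meets the deadline.

    Returns None when the estimated duration does not fit between the
    earliest start time and the latest completion time.
    """
    duration = estimate_duration(req.size_bytes, predicted_throughput_bps)
    latest_start = req.latest_completion - duration
    if latest_start < req.earliest_start:
        return None
    return latest_start
```

Under these assumptions, a scheduler could reserve any start time between `earliest_start` and the returned value, and reject or renegotiate the request when the function returns `None`.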
This service will eliminate possible long delays in the completion of a transfer operation and will increase utilization of Azure by giving an opportunity to provision the required network and storage resources in advance.
We have analyzed various factors that affect end-to-end data transfer throughput in wide-area distributed environments, such as the number of parallel streams, CPU speed, and disk I/O speed. We have shown the effects of CPU-, disk-, and network-level parallelism in removing the bottlenecks one by one and increasing the end-to-end data transfer throughput. We have developed models and algorithms to set the best values for application-level transfer tuning parameters such as pipelining, parallelism, and concurrency. Tests conducted over high-speed networking and cloud testbeds show that our algorithms outperform the most popular data transfer tools, such as globus-url-copy, Globus Online, and UDT, in the majority of cases.
The Stork Data Scheduler for Azure makes a distinctive contribution to cloud computing environments because it focuses on planning, scheduling, monitoring, and management of data placement tasks, and on application-level end-to-end optimization of networked I/O for petascale data-intensive applications. Unlike existing approaches, it treats data resources and the tasks related to data access and movement as first-class entities, just like computational resources and compute tasks, and not simply as a side effect of computation. The Stork data scheduler for Azure provides enhanced functionality for cloud computing, such as data aggregation and connection caching; peer-to-peer and streamed data management; early error detection, classification, and recovery in data transfers; scheduled storage management; optimal protocol tuning; and end-to-end performance prediction services.
The Stork data scheduler has been very actively used in many application areas, including coastal hazard prediction and storm surge modeling; oil flow and reservoir uncertainty analysis; numerical relativity and black hole collisions; educational video processing and behavioral assessment; digital sky imaging; and multiscale computational fluid dynamics. Making Stork available in the Azure environment will enable this existing user base to migrate easily to the Azure Cloud Computing platform. The Stork data scheduler for Azure will dramatically change how domain scientists perform their research by making it fast to share experience, raw data, and results in cloud computing environments.

Agency
National Science Foundation (NSF)
Institute
Division of Computer and Communication Foundations (CCF)
Type
Standard Grant (Standard)
Application #
1115805
Program Officer
Almadena Chtchelkanova
Project Start
Project End
Budget Start
2011-04-15
Budget End
2014-03-31
Support Year
Fiscal Year
2011
Total Cost
$97,122
Indirect Cost
Name
Suny at Buffalo
Department
Type
DUNS #
City
Buffalo
State
NY
Country
United States
Zip Code
14228