Petascale applications produce terabytes of data per run, and the storage systems of large-scale machines are significantly stressed because I/O rates are not growing fast enough to keep pace with data production. A variety of HPC activities, such as writing output and checkpoint data, are stymied by this I/O bandwidth bottleneck. Moreover, the post-processing and subsequent analysis/visualization of computational results is increasingly time consuming due to the widening gap between the storage and processing capacities of supercomputers and those of users' local clusters.
This research focuses on building a novel in-job dynamic data staging architecture and on bringing it to bear on the looming petascale I/O crisis. To this end, the following objectives are investigated: (i) the concerted use of node-local memory and emerging hardware such as Solid State Disks (SSDs), from a dedicated set of nodes, as a means to alleviate the I/O bandwidth bottleneck; (ii) the multiplexing of traditional user post-processing pipelines and secondary computations with asynchronous I/O on the staging ground to perform scalable I/O and data analytics; (iii) bypassing memory to access the staging area; and (iv) enabling quality of service (QoS) both in the staging ground and in the communication channels connecting it to compute clients and persistent storage.
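To make the asynchronous staging idea in objective (ii) concrete, the following is a minimal sketch, not the project's actual middleware: a compute process hands an output buffer to a background thread, which absorbs the write onto a node-local SSD while computation continues; the staging path and buffer size are illustrative assumptions.

/* Minimal sketch (not the proposal's actual staging library): a background
 * thread writes an output snapshot to a node-local SSD path while the
 * application continues computing.  Path and size are illustrative. */
#include <pthread.h>
#include <stdio.h>
#include <stdlib.h>
#include <string.h>

#define BUF_BYTES (64UL * 1024 * 1024)                 /* one 64 MiB snapshot */
static const char *STAGING_PATH = "/ssd/stage/snapshot.bin";  /* hypothetical SSD mount */

struct stage_req {
    const char *path;
    const void *data;
    size_t      len;
};

/* Staging thread: absorbs the write at SSD speed; a later drain step (not
 * shown) would move the file to the parallel file system asynchronously. */
static void *stage_writer(void *arg)
{
    struct stage_req *req = arg;
    FILE *f = fopen(req->path, "wb");
    if (!f) { perror("fopen staging file"); return NULL; }
    fwrite(req->data, 1, req->len, f);
    fclose(f);
    return NULL;
}

int main(void)
{
    void *snapshot = malloc(BUF_BYTES);
    if (!snapshot) return 1;
    memset(snapshot, 0xAB, BUF_BYTES);                 /* stand-in for simulation output */

    struct stage_req req = { STAGING_PATH, snapshot, BUF_BYTES };
    pthread_t tid;
    pthread_create(&tid, NULL, stage_writer, &req);

    /* ... the application's next computation phase would run here,
     * overlapped with the staging write ... */

    pthread_join(tid, NULL);                           /* wait before reusing the buffer */
    free(snapshot);
    return 0;
}

In a full design, the drain from the SSD to persistent storage would itself be asynchronous and subject to the QoS controls of objective (iv).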
This study will have a wide-ranging impact on the future provisioning of extreme-scale machines and will provide formative guidelines to that end. The result of this research will be a set of integrated techniques that can fundamentally change the current parallel I/O model and accelerate petascale I/O pipelines. Further, this research will help analyze the utility of SSDs in day-to-day supercomputing I/O and inform the wider HPC community of their viability.
Petascale applications are producing terabytes of data at a rate that outpaces Moore's-law scaling. Storage systems in large-scale machines are significantly stressed because I/O rates are not growing fast enough to cope with data production. A variety of HPC activities, such as writing outputs, checkpointing and post-processing, are all stymied by the I/O bandwidth bottleneck. Consequently, state-of-the-art supercomputing applications often cannot generate as much data as scientists desire (in terms of data or time resolution) due to the cost of I/O and post-processing pipelines. Our approach in this proposal has been to accelerate these I/O pipelines by building a novel in-job data staging architecture on a dedicated set of nodes, using a combination of node-local memory and emerging hardware such as Solid State Disks (SSDs). This staging ground is then used to perform scalable I/O and data analytics while a much larger number of nodes run the parallel application. We have further devised a novel middleware library that allows this staging area to be used as a memory extension device, so that applications can perform out-of-core computations on emerging SSD devices on large-scale HPC machines.
Intellectual Merit: The techniques developed in this research project revitalize data analytics in an in-situ fashion on SSDs and in an out-of-core fashion.
Impact: These techniques can benefit a variety of scientific applications that are both I/O- and memory-starved on modern HPC systems.
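To illustrate the memory-extension idea, the following is a minimal sketch, not the middleware library itself: an SSD-backed file is memory-mapped so that an array larger than DRAM can be addressed like ordinary memory, with the operating system paging blocks to and from the SSD on demand. The mount point /ssd/stage and the 8 GiB array size are assumptions for the example.

/* Minimal out-of-core sketch (not the project's middleware): an SSD-backed
 * file is memory-mapped and used as a large array; the OS pages data to and
 * from the SSD on demand.  Path and size are illustrative. */
#include <fcntl.h>
#include <stdio.h>
#include <sys/mman.h>
#include <unistd.h>

#define N_ELEMS (1UL << 30)                    /* 2^30 doubles = 8 GiB, may exceed DRAM */
static const char *BACKING = "/ssd/stage/ooc_array.bin";  /* hypothetical SSD path */

int main(void)
{
    size_t bytes = N_ELEMS * sizeof(double);

    int fd = open(BACKING, O_RDWR | O_CREAT, 0600);
    if (fd < 0) { perror("open"); return 1; }
    if (ftruncate(fd, (off_t)bytes) != 0) { perror("ftruncate"); return 1; }

    /* The mapping behaves as a memory extension: loads and stores fault
     * pages in from the SSD file and write them back under memory pressure. */
    double *a = mmap(NULL, bytes, PROT_READ | PROT_WRITE, MAP_SHARED, fd, 0);
    if (a == MAP_FAILED) { perror("mmap"); return 1; }

    for (size_t i = 0; i < N_ELEMS; i += 512)  /* touch one element per 4 KiB page */
        a[i] = (double)i;

    munmap(a, bytes);
    close(fd);
    return 0;
}

A production middleware would additionally manage placement, prefetching, and write-back policies tuned to SSD characteristics rather than relying solely on OS paging.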