This award funds the development and deployment of the Data-Scope, a computational instrument specifically designed to enable data analysis tasks that are simply not possible today. The instrument's unprecedented capabilities combine approximately five Petabytes of storage with a sequential IO bandwidth close to 500 GBytes/sec and 600 Teraflops of GPU computing. The need to keep acquisition costs and power consumption low, while maintaining high performance and storage capacity, introduces difficult tradeoffs. The Data-Scope will provide extreme data analysis performance over PB-scale datasets at the expense of generic features such as fault tolerance and ease of management. This is acceptable, however, because the Data-Scope is a research instrument rather than a traditional computational facility.

Project Report

The nature of science is changing: it is increasingly limited by our ability to analyze the large amounts of complex data generated by our instruments and simulations. We are seeing the emergence of Jim Gray's "Fourth Paradigm" of science. Computers themselves are becoming the source of much new data; the largest numerical simulations of nature today are on par in size with the experimental data sets. This is not simply a computational problem; it requires a fresh look and a holistic approach. We need to combine scalable algorithms and statistical tools with novel hardware and software solutions, such as a deep integration of GPU computing with database indexing and fast spatial search capabilities (a minimal illustration of such spatial indexing appears below). We proposed to build a new kind of instrument, a 'Data-Scope', capable of observing very large amounts of scientific data, with unique features in its design.

In the sciences today, tackling data-intensive problems at the 5-10 TB scale is easy: one can perform such analyses at a typical generic departmental computing facility. Problems at 50-100 TB are quite difficult; only about 10-15 universities in the world can analyze such data sets. When one needs to deal with a petabyte of data, fewer than a handful of places anywhere in the world can address the challenge. At the same time, many projects are crossing the 100 TB boundary today. Astrophysics, High Energy Physics, Environmental Science, Computational Fluid Dynamics, Genomics, and Bioinformatics are all encountering data challenges in the several-hundred-terabyte range and beyond, even within a single university. The large data sets are here, but the off-the-shelf solutions for their analysis are not!

The Data-Scope instrument has unique capabilities: it combines about 6.5 Petabytes of storage with a sequential IO bandwidth exceeding 500 GBytes/sec and 120 Teraflops of GPU computing. Keeping the cost of the instrument down and its performance and storage capacity very high, all at low power consumption, required tradeoffs. The Data-Scope was tuned to provide extreme data analysis performance over petabytes at the expense of some generic features. It is a highly specialized tool to study data, a microscope for data (hence "Data-Scope"), which is why we consider it more similar to a research instrument than to a traditional computational facility. Since its commissioning it has enabled analysis tasks that would have been extremely difficult otherwise. Two of JHU's Nobel Laureates and their students are among the early users of the Data-Scope.

This new, data-intensive nature of science is becoming more important by the day. There is a vacuum today in our ability to handle large data sets, similar to the one in the 1990s from which the concept of the Beowulf cluster emerged. Many universities and scientific disciplines are looking for a new template that would enable them to address PB-scale data analysis problems. By providing an inexpensive hardware and software architecture, we believe we can substantially accelerate the development of data-intensive science across the whole country. To accelerate the acceptance of this approach, we collaborate with researchers across many different disciplines and many different institutions nationwide (Los Alamos, Oak Ridge, UCSC, NMSU, UW, UC, UIC, UIUC).
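The report above mentions integrating GPU computing with database indexing and fast spatial search, but does not spell out the indexing scheme. As a purely hypothetical illustration of one common approach, the Python sketch below orders points by a Morton (Z-order) key: a space-filling-curve index that turns 3-D box queries into a few contiguous key ranges that can be scanned sequentially and streamed to a GPU. This is a minimal sketch of the general technique under stated assumptions, not the Data-Scope's actual implementation.

    import numpy as np

    def morton_key(ix, iy, iz, bits=10):
        # Interleave the bits of three grid coordinates into one
        # Z-order (Morton) key: points close in 3-D space tend to
        # land close together in key order, and hence on disk.
        key = 0
        for b in range(bits):
            key |= ((int(ix) >> b) & 1) << (3 * b)
            key |= ((int(iy) >> b) & 1) << (3 * b + 1)
            key |= ((int(iz) >> b) & 1) << (3 * b + 2)
        return key

    # Hypothetical point set: quantize coordinates in [0, 1) onto a
    # 1024^3 grid, compute a key per point, and sort once at load time.
    points = np.random.random((100_000, 3)).astype(np.float32)
    grid = (points * 1024).astype(np.int64)
    keys = np.array([morton_key(*g) for g in grid])
    points_sorted = points[np.argsort(keys)]
    # A 3-D range query now decomposes into a handful of contiguous
    # key intervals, i.e. sequential IO rather than random seeks.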
The Data-Scope is hosting public services on some of the largest data sets in astronomy and fluid mechanics, both observational and simulated. Our public turbulence database services (close to 500 Terabytes) have delivered over 10 trillion data points to the world (an example query appears below). Students and postdoctoral fellows using the Data-Scope are gaining a substantial career advantage: these will be the job skills of the 21st-century scientist!

We also have strong industrial involvement. We have been working with Microsoft Research and the SQL Server team for over a decade on bringing data-intensive computations as close to the data as possible. Microsoft provided substantial funding to build the GrayWulf facility, the forerunner of the Data-Scope, and continues to provide collaboration and access to software licenses. NVIDIA is extremely interested in using GPUs in data-intensive computations and in building data-balanced architectures; it has awarded JHU CUDA Center of Excellence status and donated 90 Tesla cards for the system. The Data-Scope was among the first US-based systems with 100G connectivity to Internet2, which has been used to download more than a Petabyte of data so far.
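As an illustration of how the turbulence services mentioned above are typically consumed, here is a minimal sketch using the public pyJHTDB Python client. The dataset name, interpolation codes, and demo token follow the pyJHTDB documentation rather than this report, and should be verified against the library's current documentation before use.

    import numpy as np
    import pyJHTDB

    # Open a connection to the JHU Turbulence Database services.
    lJHTDB = pyJHTDB.libJHTDB()
    lJHTDB.initialize()
    # Demo token from the pyJHTDB documentation; production use
    # requires requesting a personal token from the database team.
    lJHTDB.add_token('edu.jhu.pha.turbulence.testing-201311')

    # Sample the velocity field of the forced isotropic dataset at a
    # few random points inside its (2*pi)^3 periodic domain.
    points = (2.0 * np.pi * np.random.random((5, 3))).astype(np.float32)
    u = lJHTDB.getData(
        0.0, points,
        data_set='isotropic1024coarse',
        sinterp=4,           # 4th-order Lagrange interpolation in space
        tinterp=0,           # no interpolation in time
        getFunction='getVelocity')
    print(u)                 # (5, 3) array: one velocity vector per point

    lJHTDB.finalize()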

Agency: National Science Foundation (NSF)
Institute: Division of Advanced CyberInfrastructure (ACI)
Type: Standard Grant (Standard)
Application #: 1040114
Program Officer: Amy Walton
Budget Start: 2010-10-01
Budget End: 2014-09-30
Fiscal Year: 2010
Total Cost: $2,087,760
Name: Johns Hopkins University
City: Baltimore
State: MD
Country: United States
Zip Code: 21218