Scientists are increasingly limited by their ability to analyze the large amounts of complex data available. These data sets are generated not only by instruments but also by computational experiments; the largest numerical simulations are now on par in size with data collected by instruments, crossing the petabyte threshold this year. Large synthetic data sets are becoming increasingly important as scientists compare their experiments to reference simulations. All disciplines need a new "instrument for data" that can deal not only with large data sets but also with the cross product of large and diverse data sets.

While the largest data sets have captured most of the public attention, they represent only the tip of the iceberg. What is often missed is that scientific data sets follow a power-law distribution. At one end are the very large data collections compiled by hundreds of scientists collaborating over multiple years. These projects typically have coherent data management plans and the organization needed to ensure that the data products are accessible to a wide community. Nevertheless, the long-term curation of the data is still an unsolved problem.

At the other end of the distribution, in the "long tail", are the very large numbers of small data sets, such as the images, spreadsheets and tables collected in laboratories and field studies. While the individual files are small, their numbers add up; in fact, there is as much data aggregated in these small items as in the biggest collections. However, these data sets are often not as well documented as their bigger counterparts. For most scientists there is little reward in becoming a data management expert and devoting the time required to document the data for later reuse. In fact, the process of manually cleaning data sets has been called the strip mining of big data: an ugly and resource-intensive effort that leaves big scars.

Scientists at the Johns Hopkins University have built innovative frameworks to publish scientific data across a wide range of disciplines, from astronomy to turbulence and environmental science. These projects already share some common components for data management. This project will connect more of the existing independent components into a coherent whole, explore how to scale the data services to deal with the "long tail" of the data distribution, and demonstrate the overlap in the basic data management tasks across disciplines. The project has four parts: (i) continue and enhance the efforts on the Sloan Digital Sky Survey, (ii) turn large numerical simulations into easy-to-use numerical laboratories, (iii) enhance an existing end-to-end system for environmental sensors and integrate it with other field data, and (iv) enhance and generalize a set of core collaborative tools, and apply these to help with the challenge of the "long tail" of scientific data.

The projects involve the Sloan Digital Sky Survey (SDSS) -- the world's most used astronomy facility -- and its CASJOBs/MyDB collaborative environment. The framework will be extended to other areas of science, such as in situ environmental monitoring and field biology. This will be demonstrated by integrating soil ecology data from the Baltimore Ecosystem Study project with data collected automatically via a wireless sensor network. The project will also test how a simple, "DropBox"-like interface (i.e., online storage and sharing) can be used to overcome some of the barriers that prevent scientists from publishing much of their value-added data. Finally, the project will explore how smaller and larger numerical simulations can be placed into interactive, publicly accessible numerical laboratories, using data sets currently drawn from turbulence and astronomy.

The funds will support people: a combination of data scientists, database administrators, postdoctoral fellows, students and programmers working together to "connect the dots" and bring additional data sets online. The project will enhance the public interfaces of several publicly available data sets, prototype an easy-to-use environment for uploading small user data sets into a collaborative environment, and create a framework for a new citizen-science project in environmental science.

Agency: National Science Foundation (NSF)
Institute: Division of Advanced CyberInfrastructure (ACI)
Type: Standard Grant (Standard)
Application #: 1244820
Program Officer: Robert Chadduck
Budget Start: 2012-10-01
Budget End: 2016-09-30
Fiscal Year: 2012
Total Cost: $1,051,100

Name: Johns Hopkins University
City: Baltimore
State: MD
Country: United States
Zip Code: 21218