Current petascale platforms can run large-scale simulations and generate massive amounts of data at unprecedented rates, and these rates are expected to increase as exascale platforms are introduced. This growing volume of data presents new challenges for scientists, who struggle to analyze, sort, and select scientifically meaningful results. When a very large number of data records is spread across many nodes of a distributed-memory system, even a small number of comparisons can be costly or outright impossible. New methodologies are therefore needed to analyze large scientific datasets at scale.

The goal of this project is to develop a transformative analysis method that models the properties of large scientific datasets in a distributed manner, on petascale systems today and on exascale systems in the future. The research activity includes (1) the design of new algorithms that use space reduction techniques to encode, in parallel, the properties embedded in distributed data; (2) the design of new algorithms for clustering and classifying these properties by using distributed paradigms such as MapReduce; (3) the deployment of the algorithms on diverse datasets in structural biology and astronomy; and (4) the tuning of the algorithms for both performance and accuracy of results on emerging storage technologies.
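As a rough illustration of how activities (1) and (2) could fit together, the following minimal Python sketch encodes each record locally into a compact key and then groups the keys in a MapReduce-style pass. It is not the project's algorithm: the grid-based encoding, the density threshold, and all function names (`encode`, `map_phase`, `reduce_phase`, `dense_cells`, `analyze`) are assumptions made for this example, and the "nodes" are simulated as in-memory lists.

```python
from collections import defaultdict
from typing import Dict, Iterable, Iterator, List, Tuple

Record = Tuple[float, ...]   # one data record, e.g. a feature vector
Cell = Tuple[int, ...]       # reduced-space key for a record


def encode(record: Record, cell_size: float, dims: int = 2) -> Cell:
    """Assumed space reduction: keep the first `dims` coordinates
    and snap each one to a grid cell of width `cell_size`."""
    return tuple(int(x // cell_size) for x in record[:dims])


def map_phase(local_records: Iterable[Record], cell_size: float) -> Iterator[Tuple[Cell, int]]:
    """Runs independently on each node: emit (cell, 1) per local record,
    so raw records never have to leave the node."""
    for record in local_records:
        yield encode(record, cell_size), 1


def reduce_phase(pairs: Iterable[Tuple[Cell, int]]) -> Dict[Cell, int]:
    """Sum the counts emitted for each cell across all nodes."""
    counts: Dict[Cell, int] = defaultdict(int)
    for cell, n in pairs:
        counts[cell] += n
    return dict(counts)


def dense_cells(counts: Dict[Cell, int], min_count: int) -> List[Cell]:
    """Treat cells holding at least `min_count` records as clusters."""
    return sorted(cell for cell, n in counts.items() if n >= min_count)


def analyze(partitions: List[List[Record]], cell_size: float = 1.0, min_count: int = 2) -> List[Cell]:
    """Simulate the distributed run: each inner list stands for one node's data."""
    pairs = (pair for part in partitions for pair in map_phase(part, cell_size))
    return dense_cells(reduce_phase(pairs), min_count)


# Hypothetical data: two "nodes", three records near the origin and one outlier.
partitions = [
    [(0.1, 0.2, 9.0), (0.5, 0.9, 1.0)],
    [(0.7, 0.3, 4.0), (5.2, 5.1, 0.0)],
]
clusters = analyze(partitions)
```

The point of the sketch is that only small keys and counts cross node boundaries, so no pairwise comparison between records held on different nodes is required.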

The analysis method will provide the scientific community with infrastructure and instrumentation to identify features that can be used to predict class membership, find recurrent patterns in datasets, and identify class membership from a specific feature or property. By capturing scientific information effectively, accurately, and scalably, this infrastructure and instrumentation will break the traditional constraint of data centralization and allow scientists to overcome the difficulties posed by fully distributed data.

The project's educational component promotes training and learning in computational modeling and analysis techniques, as well as in data-intensive algorithms and platforms. It does so by involving undergraduate and graduate students in research activities and by integrating big data analytics into the undergraduate curriculum at the University of Delaware. The research-based educational materials developed in this project will be made available to the scientific community through the project portal and through tutorials at XSEDE and Supercomputing (SC) conferences.

Agency: National Science Foundation (NSF)
Institute: Division of Computer and Communication Foundations (CCF)
Type: Standard Grant (Standard)
Application #: 1318417
Program Officer: Almadena Chtchelkanova
Project Start:
Project End:
Budget Start: 2013-09-01
Budget End: 2017-08-31
Support Year:
Fiscal Year: 2013
Total Cost: $69,038
Indirect Cost:

Name: University of California San Diego
Department:
Type:
DUNS #:
City: La Jolla
State: CA
Country: United States
Zip Code: 92093