The growth of scientific data sets to petabyte sizes offers significant opportunities for discoveries in fields such as combustion chemistry, nanoscience, astrophysics, climate prediction, and biology, as well as from data on the internet. However, the realization of new scientific insights from this data is limited by the difficulty of creating scalable applications, which stems from the lack of easy-to-use programming models and tools. To address the challenges in creating data-intensive applications, the project will build an extensible language framework, backed by an expressive collection of high-performance libraries (I/O and analytic). The framework will provide a development environment in which multiple domain-specific language extensions allow programmers and scientists to more easily and directly specify solutions to data-intensive problems as programs written in domain-adapted languages. The project will build on recent attribute grammar research to build an extensible specification of C to host domain-specific language extensions, which will also address the inadequate performance of storage, I/O, and analysis capabilities in low-level languages such as C.

The proposed extensible language and library framework has the potential to be a transformative problem-solving environment for programmers and scientists, since it allows scalable and efficient solutions to data-intensive problems to be specified at a high level of abstraction. The resulting language framework and libraries will be freely available to researchers writing climate applications and other applications involving spatio-temporal data. This includes many applications in the physical sciences and engineering, and thus it is expected that the framework will find use in other scientific domains as well.

Project Report

Ecosystem scientists now have petabytes of data available for analysis; one source of such data is Earth-orbiting satellites. Effective analysis of this data can help us understand how the Earth's climate is changing and determine the factors that cause these changes. This in turn provides an opportunity to predict and prevent future ecological problems by managing the ecology and health of our planet.
Performing the necessary analysis on this data is difficult. Although data sets containing the spatial and temporal data can be analyzed at various scales, many phenomena of interest become apparent only at a finer scale, making it critical to develop tools for large-scale data analysis. For example, it is difficult to detect slow changes (such as logging) in land cover at coarse resolutions. But higher-resolution data sets have billions of data points for just one time instance, making change-point detection on a global scale extremely computationally intensive.
Writing efficient, scalable, and portable data-intensive applications that deal with data on this scale is immensely challenging. In practice, programmers get bogged down in the low-level details of managing resources such as the many parallel processors on modern supercomputers. They then spend more time on these issues than on the core computational problem. This significantly increases the time required to build these applications. In many cases the burden is so great that problems scientists would like to address are never implemented, because a solution cannot be achieved within the time constraints.
In addition, there is considerable potential for application of this work to sustainability. Many of the important issues involve spatio-temporal data, e.g., deforestation, water, food, and energy, and a number of the capabilities we have developed could be used to detect important trends, patterns, and associations that could help inform decision makers.
To address these challenges we have developed new programming language tools and techniques that can dramatically simplify the process of writing this kind of computer application, as well as new algorithms for analyzing this type of data at the fine scales needed to detect the various climate phenomena of interest. The programming language techniques are based on the notion of "extensible languages." Extensible languages, and their supporting tools, can be extended with new linguistic features (new notations) that allow programmers to express the solution to their programming problem at a much higher level of abstraction. This simplifies the job of the programmer and also makes it possible for the language implementation to identify more ways to optimize the program so that it will run more quickly or use less memory. Our results here include improvements to the tools used to create and modify extensible languages and the creation of new programming language features, packaged as composable language extensions, that are useful in writing parallel programs for mining and manipulating climate data. One important result is an analysis of language extension specifications that can be used to ensure that different language extensions, developed independently by different parties, can be used together in a single application and that the composition of these different language features will be successful.
In the data mining algorithm work, we developed algorithms that use complex networks for the analysis of climate data, change detection algorithms, and algorithms for tracking and detecting eddies in the ocean.
This type of work presents many challenges in the design and implementation of extensible languages for this domain, and we have continued to work to better understand these challenges and their solutions. Some issues are common to the complex network, change detection, and eddy tracking work, such as dealing with the size of the data and its spatial and temporal nature; others are specific to each of these three areas. More specifically, for complex networks, there is the challenge of efficiently constructing the edges in the network, while for change detection, an important issue is finding similar time series, either in response to a query or to summarize a set of similar changes in the same local area. For eddy detection, there are challenges in tracking evolving objects over time and space. It is this domain-specific understanding of the challenges and their solutions that we are using to build language extensions that directly support the development of applications for change detection and complex network analysis on very large data sets. The eddy work also faces the challenge of large data, along with the additional complexity of handling local context and multiple potential tracks.
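
To make the edge-construction challenge for complex networks concrete, here is a minimal sketch in Python. It is not the project's method or code: it assumes a common approach in which each grid cell is a node and two nodes are linked when the absolute Pearson correlation of their time series meets a threshold. The names `pearson`, `build_edges`, `cells`, and the threshold of 0.9 are all hypothetical and chosen only for illustration.

```python
from itertools import combinations
from math import sqrt


def pearson(x, y):
    """Pearson correlation of two equal-length sequences (0.0 if either is constant)."""
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    var_x = sum((a - mean_x) ** 2 for a in x)
    var_y = sum((b - mean_y) ** 2 for b in y)
    if var_x == 0 or var_y == 0:
        return 0.0
    return cov / sqrt(var_x * var_y)


def build_edges(series_by_cell, threshold=0.9):
    """Return (cell_a, cell_b) pairs whose |correlation| is at least the threshold.

    Every pair of cells is compared, so the work grows quadratically with
    the number of cells.
    """
    edges = []
    for a, b in combinations(sorted(series_by_cell), 2):
        if abs(pearson(series_by_cell[a], series_by_cell[b])) >= threshold:
            edges.append((a, b))
    return edges


# Hypothetical toy data: three grid cells, five time steps each.
cells = {
    "c0": [1.0, 2.0, 3.0, 4.0, 5.0],
    "c1": [2.0, 4.0, 6.0, 8.0, 10.0],  # a scaled copy of c0
    "c2": [5.0, 1.0, 4.0, 2.0, 3.0],   # only weakly related to the others
}
edges = build_edges(cells)
```

The all-pairs loop is what makes this naive version impractical at scale: with billions of data points per time instance, the number of pairs to compare becomes enormous, which is one way to see why efficient edge construction is singled out as a challenge.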

Agency: National Science Foundation (NSF)
Institute: Division of Information and Intelligent Systems (IIS)
Type: Standard Grant (Standard)
Application #: 0905581
Program Officer: Vasant G. Honavar
Project Start:
Project End:
Budget Start: 2009-09-01
Budget End: 2014-08-31
Support Year:
Fiscal Year: 2009
Total Cost: $794,000
Indirect Cost:
Name: University of Minnesota Twin Cities
Department:
Type:
DUNS #:
City: Minneapolis
State: MN
Country: United States
Zip Code: 55455