This NSF award to Princeton University funds U.S. researchers participating in a project competitively selected by the G8 Research Councils Initiative on Multilateral Research through the Interdisciplinary Program on Application Software towards Exascale Computing for Global Scale Issues. This is a pilot collaboration among the U.S. National Science Foundation, the Canadian Natural Sciences and Engineering Research Council (NSERC), the French Agence Nationale de la Recherche (ANR), the German Deutsche Forschungsgemeinschaft (DFG), the Japan Society for the Promotion of Science (JSPS), the Russian Foundation for Basic Research (RFBR), and the United Kingdom Research Councils (RC-UK), supporting collaborative research projects, selected on a competitive basis, that are composed of researchers from at least three of the partner countries.

This international project targets the rapidly growing demands of climate science data management as models increase the precision with which they depict spatial structure and the completeness with which they describe a vast range of physical processes. The ExArch project is principally a framework for the scientific interpretation of multi-model ensembles at the peta- and exa-scale. It applies a strategy, a prototype infrastructure and demonstration usage examples in the context of the imminent CMIP5 archive, which will be the largest of its kind ever assembled in this domain. It will also attach the ExArch framework to the CORDEX experiment, pushing beyond CMIP5 in resolution, albeit at regional scale.

This international project, involving collaborating researchers in six countries, will explore the challenges of developing a software management infrastructure that will scale to the multi-exabyte archives of climate data likely to be crucial to major policy decisions by the end of the decade. In the short term, strategies will be evaluated by applying them to existing data archives. The NSF funding primarily supports early career scientists at Princeton designing a system for querying and processing data from distributed web archives.

Project Report

The issue of climate change occupies centre stage in considering the future of the planet. The climate system is extraordinarily complex, being affected by everything from solar radiation to the dynamics of glaciers, from the waters of the ocean abyss to the behaviour of leaves and soils. It in turn affects all our human systems: climate change has impacts on agriculture, migration, international security, public health, air quality, water resources, travel and trade. How is it possible to understand such a complex system, and predict its future evolution? Even given a scientific understanding of the climate, how do we undertake scientific experimentation and envision alternative futures? Computers have been central in the advance of climate science. We build computer models of the Earth System, verify them against the historical climate record, and then project them into the future. These models apply Newton's laws to spherical fluid layers that are subject to solar heating at the Equator and lose heat at the poles. The response required to move that energy polewards leads to atmospheric circulations that explain the persistent rain that gives rise to forests around the Equator, the dry desert zones in a band around the globe in either hemisphere near the Tropics, and the sweeping storm systems that periodically come through the higher latitudes in winter. A similar computation in the oceans explains the well-known marine currents and a giant circulation system that connects all the seas, pole to pole. Exact predictions are not possible, but we can run many instances of a model -- an ensemble -- to describe a range of possible outcomes and their probability distribution, under various scenarios of global industrial and agricultural policy (which prescribe how much CO2 and other pollutants would enter the climate system).
The results from such models are sufficiently alarming that the world has begun to take serious notice that humans are directly interfering with a complex system that is the only one in the universe known to support carbon-based life forms. We must also sample across uncertainties in our understanding. There are some processes -- clouds, for example -- whose representation in models is still in question. Different modeling groups around the world use differing representations. By coordinating and running the same experiment and pooling results, we arrive at a method for the comparative study of models, the multi-model ensemble. Such analyses are the basis for policy documents such as the Intergovernmental Panel on Climate Change (IPCC) Assessment Reports, which are periodic accounts of the state of the science, released every few years. The IPCC won a Nobel Peace Prize in 2007 (along with Al Gore) for its efforts in informing the world community of the causes and consequences of climate change. The global data network for distributing climate model projections, along with the network for collecting data from atmospheric and ocean sensors, and satellites, is now part of the "vast machine" that is today's global weather and climate science enterprise. The volumes of data involved in this enterprise are quite staggering and are beyond the resources of most research centers to put in one place: the data must also be examined and analyzed in distributed fashion. Imagine an analysis on such a distributed archive. Let's say you were a policymaker in Colorado who wished to understand how snowfall in the Rockies responded to a major volcanic eruption such as Pinatubo in 1991. That might help you to plan for future climate shocks. A climate researcher in Switzerland has developed a little tool to perform a snowfall analysis, but it is now your job to apply it across many models, developed at many different institutions, archived at different places.
The computation that is needed may itself only run on the Swiss node: you cannot run it on your own computer. This poses the question: how do you, sitting in Colorado, run an analysis on a Swiss computer that requires data from 20 different archives distributed around the globe? ExArch is designed to enable this sort of analysis. You will be able to query a future ExArch network for all models in the archive that have run experiments that included simulated responses to a Pinatubo eruption, find the data associated with snowfall, download the minimal subset of data needed for this analysis on an ExArch node, and run the analysis, all without having to download a million Dropboxes' worth of data to your laptop. The hardware and software for enabling the "exascale" are still being built, and are the best part of a decade away. Yet when we see the rate at which climate data is growing, we know we have to plan for it now. ExArch is designed to be part of that solution.
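
The workflow described above (query the archives, keep only the matching data, cut out the region of interest, run the analysis where the data lives, and send back only the small result) can be illustrated with a toy sketch. This is not the ExArch interface: every name here (Dataset, CATALOG, find_datasets, subset, run_remote_analysis), the archive and model labels, and all the numbers are hypothetical and exist only to make the steps concrete.

```python
from dataclasses import dataclass
from statistics import mean


@dataclass
class Dataset:
    """Toy stand-in for one model's output held at one archive (hypothetical)."""
    archive: str     # made-up archive node name
    model: str       # made-up model name
    experiment: str  # e.g. "volcanic" or "control"
    variable: str    # e.g. "snowfall"
    cells: dict      # (lat, lon) -> list of yearly values, all invented


# A made-up catalog spanning several archives.
CATALOG = [
    Dataset("archive-a", "model-1", "volcanic", "snowfall",
            {(40.0, -106.0): [2.0, 4.0], (10.0, 20.0): [100.0, 100.0]}),
    Dataset("archive-b", "model-2", "volcanic", "snowfall",
            {(39.0, -105.0): [1.0, 3.0], (41.0, -107.0): [5.0, 7.0]}),
    Dataset("archive-c", "model-3", "control", "snowfall",
            {(40.0, -106.0): [9.0, 9.0]}),
    Dataset("archive-a", "model-1", "volcanic", "temperature",
            {(40.0, -106.0): [280.0, 281.0]}),
]


def find_datasets(catalog, experiment, variable):
    """Step 1: query the catalog for matching experiments and variables."""
    return [d for d in catalog
            if d.experiment == experiment and d.variable == variable]


def subset(dataset, lat_range, lon_range):
    """Step 2: keep only the grid cells inside the requested box."""
    lat_min, lat_max = lat_range
    lon_min, lon_max = lon_range
    return {pos: series for pos, series in dataset.cells.items()
            if lat_min <= pos[0] <= lat_max and lon_min <= pos[1] <= lon_max}


def regional_mean(cells):
    """Step 3: the 'analysis', here just a mean over cells and years."""
    return mean(mean(series) for series in cells.values())


def run_remote_analysis(catalog, experiment, variable, lat_range, lon_range):
    """Run all steps near the data and return one number per model."""
    results = {}
    for dataset in find_datasets(catalog, experiment, variable):
        region = subset(dataset, lat_range, lon_range)
        if region:  # skip models with no data in the box
            results[dataset.model] = regional_mean(region)
    return results
```

Calling run_remote_analysis(CATALOG, "volcanic", "snowfall", (37.0, 41.0), (-109.0, -102.0)) returns a small dictionary with one regional mean per matching model. The point of the design is that only this summary, not the underlying archive data, would need to travel back to the person asking the question.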

National Science Foundation (NSF)
Division of Advanced CyberInfrastructure (ACI)
Program Officer: Irene M. Qualters
Princeton University
United States