This proposal lays down a comprehensive framework for carrying out statistical inference on point-referenced high-dimensional spatial data available from a large number of locations. The focus of the proposal is methodological rather than purely theoretical or purely applied. Thus, statistical theory is used to develop mathematically formal but computationally feasible methods that can have a broad range of applications. Theoretical derivations and new results that will enhance current methods (including findings by the PI in prior NSF-funded research) will be explored, but always keeping in mind the practicing spatial analyst. The basic framework is to use a low-rank spatial process obtained by projecting the original process onto a lower-dimensional subspace. The PI intends to explore approximation properties of the low-rank spatial process with regard to different metrics. The long-term goal of the PI is to develop a full suite of statistical methods that estimate spatial models in a wide variety of experiments in forestry, ecology and the broader environmental sciences. A recurrent underlying theme of the proposed methods, and one that sets them apart from existing methods, is that the modeler does not need to sacrifice richness in modeling as a compromise for large datasets. This addresses a statistical irony: large datasets are precisely where complex relationships can be detected effectively, yet they are also where rich models have been hardest to fit.
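
The low-rank projection mentioned above can be illustrated with one common construction. The abstract does not spell out its construction, so the following is only a sketch under stated assumptions: a zero-mean Gaussian process w(s) with covariance function C(s, s'; θ), a set of m "knots" s*_1, ..., s*_m with m much smaller than the number of observed locations n, and the projection taken to be the conditional expectation of w(s) given its values at the knots. All symbols here are illustrative and do not come from the proposal.

```latex
% Assumed setup (not from the abstract): parent process and knot values
w(s) \sim GP\bigl(0,\, C(\cdot,\cdot;\theta)\bigr), \qquad
w^{*} = \bigl(w(s^{*}_{1}), \dots, w(s^{*}_{m})\bigr)^{\top} \sim N\bigl(0,\, C^{*}(\theta)\bigr),
\qquad [C^{*}(\theta)]_{ij} = C(s^{*}_{i}, s^{*}_{j};\theta).

% Low-rank process: projection of w(s) onto the span of the knot values
\tilde{w}(s) = E\bigl[w(s) \mid w^{*}\bigr]
             = c(s;\theta)^{\top}\, C^{*}(\theta)^{-1}\, w^{*},
\qquad c(s;\theta) = \bigl(C(s, s^{*}_{1};\theta), \dots, C(s, s^{*}_{m};\theta)\bigr)^{\top}.

% Induced covariance: the n x n matrix has rank at most m
\operatorname{Cov}\bigl(\tilde{w}(s), \tilde{w}(s')\bigr)
  = c(s;\theta)^{\top}\, C^{*}(\theta)^{-1}\, c(s';\theta).
```

Under these assumptions, the n × n covariance matrix of the original process is replaced by a matrix of rank at most m. When an independent noise variance is added on the diagonal, the Sherman-Morrison-Woodbury identity reduces the cost of each likelihood evaluation from order n^3 to order n m^2, which is the kind of saving that makes rich models practical at thousands of locations.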

Modern spatial technologies such as Geographical Information Systems (GIS) and Global Positioning Systems (GPS) routinely identify geographical coordinates with a simple hand-held device. Consequently, scientists and researchers in a variety of disciplines today have access to geocoded data as never before. With data becoming increasingly high-dimensional, both in the number of observed locations and in the number of observations per location, scientists are seeking to hypothesize complex relationships. These, in turn, yield complex hierarchical models that are computationally expensive even for moderately sized datasets. This team recognizes a need for statistical modeling of large multivariate spatial data and proposes a model-based setup to tackle a wide variety of large geostatistical datasets. Although some of the more demanding statistical modeling will require multi-processor capabilities, the emphasis of this project is on methodology implementable with moderately powerful computing tools. The proposed methodologies would, therefore, be accessible to a large number of researchers. The broader impact of the proposed methods is best assessed by connecting the outcome of this research with the widely recognized impact of GIS on human society. From identifying spatial disparities in health standards to more precise weather predictions, GIS technology is used today in almost every sphere of society, and the proposed methods can have far-reaching beneficial effects in environmental research that potentially touch unexpected corners of society.

Project Report

With the increasing popularity and availability of spatial referencing technologies such as Geographical Information Systems (GIS) and Global Positioning Systems (GPS) that can identify geographical coordinates with a simple hand-held device, scientists and researchers in a variety of disciplines today have access to geocoded data as never before. Statistical models accounting for spatial associations have, not surprisingly, become an enormously active area of research over the last decade. With spatial data becoming increasingly high-dimensional – both in terms of number of observed locations and the number of observations per location – scientists are seeking to hypothesize extremely complex relationships. These, in turn, lead to complex and computationally expensive models even for moderately sized datasets. Fitting such models becomes impractical with a large number of spatial locations (say, thousands). This project has addressed the need for statistical modeling of large multivariate spatial data. The team of investigators has developed a model-based setup to tackle a wide variety of large geostatistical datasets. The technology emerging from this project, including a statistical software product that assists users in implementing the proposed methods, allows scientists and engineers to analyze large spatial datasets with moderately powerful computing tools. Some of the more advanced modeling techniques, also developed as a part of this project, can handle massive spatial databases using multi-processor computing capabilities. The focus of the proposal is methodological, rather than purely theoretical or purely applied. Thus, statistical theory has been used to develop computationally feasible methods having a broad range of scientific applications. Theoretical derivations and new results have been explored, but always keeping in mind the practicing spatial analyst.
The long-term goal of the PI is to develop a full suite of statistical methods that estimate spatial models in a wide variety of experiments in forestry, ecology and the broader environmental sciences. The outcomes from this project differ from existing methods because the modeler does not need to sacrifice richness in modeling as a compromise for large datasets. Project outcomes include over 20 peer-reviewed articles published in highly regarded statistical journals; a software package called spBayes, which has been developed and disseminated through the R statistical computing environment and already ranks among the most popular statistical packages for handling spatial data; and a textbook designed for statisticians making forays into spatial analysis. In addition, numerous short courses and tutorials have been administered by the PI and the co-PI. The broader impact of this project is best assessed by connecting the outcome of this research with the widely recognized impact of GIS on human society. From identifying spatial disparities in health standards to more precise weather predictions, GIS technology is used today in almost every sphere of society, and the novel methods developed and disseminated through this project will have far-reaching beneficial effects in environmental research that potentially touch unexpected corners of society. As the scientific community moves into a data-rich era, it enjoys unprecedented opportunities to build understanding about how forest ecosystems function and will respond to changing environmental conditions.
Although development of the proposed modeling frameworks was originally motivated by substantive questions in forestry, ecology and other environmental systems, the potential advancements in data modeling using the outcomes of this research will find use in fields such as public and environmental health, meteorology, engineering, and geosciences where the fundamental goal is the same – use new findings to help improve society. Further, the development of open source software and associated learning material will make these methodological advances accessible to researchers in applied fields. By freeing the current and subsequent generations of investigators from ad hoc methods that can often present deceptive stories, the methods, software, and educational material (e.g., books, scientific journal articles, workshops, tutorials, software documentation and so on) that have emerged from this project can have far-reaching beneficial effects in environmental research that potentially touch unexpected corners of society.

Agency: National Science Foundation (NSF)
Institute: Division of Mathematical Sciences (DMS)
Application #: 1106609
Program Officer: Gabor Szekely
Budget Start: 2011-06-01
Budget End: 2014-05-31
Fiscal Year: 2011
Total Cost: $303,478
Name: University of Minnesota Twin Cities
City: Minneapolis
State: MN
Country: United States
Zip Code: 55455