With developments in modern information technology, massive datasets with complicated structures have been collected in many scientific fields, such as astronomy, biology, and climatology. In this project, the investigator plans to explore the connections between the data sampling distribution and the spectrum of distribution-dependent operators, and to develop a theoretical foundation for analyzing commonly used spectral techniques based on pairwise distance/dissimilarity measures. Based on this theoretical analysis, a new class of statistical inference tools will be proposed for robust estimation, dimension reduction, clustering, and data summarization. Computationally efficient algorithms will be designed, and their software implementations will be disseminated. In addition to theoretical development in statistical methodology, the proposed inference tools will be applied to climate change studies using satellite data and climate model outputs.

The proposed research is motivated by real-world scientific problems that require statistical inference from massive datasets. The proposed methods are designed to extract useful information and knowledge from massive datasets with complicated structures. The novel algorithms to be developed in this project have the potential not only to help geoscientists and climate modelers analyze climate records and calibrate climate models, but also to provide statistical tools for researchers in a wide range of disciplines.

Project Report

With numerous developments in modern information technology, massive high-dimensional datasets with complicated structures have been collected in many scientific fields, such as astronomy, biology, and climatology. This project addressed the statistical challenge of extracting useful information and knowledge from such massive datasets. The project concentrated on a class of statistical methods that search massive datasets for simple but critical data structures by assessing pairwise distances or dissimilarities among data points. In developing new statistical theory and methodology, the PI collaborated with his Ph.D. students and colleagues on a wide range of statistical issues. The connection between the spectrum of the graph Laplacian and community structures in directed networks was formulated and proved. Further, scalable algorithms were constructed for community detection in directed networks. In addition, scalable spectral algorithms for data clustering were investigated, and variable screening methods for high-dimensional clustering and classification were developed. We studied scientific applications of dimension reduction and clustering methods to NASA's Atmospheric Infrared Sounder (AIRS) level-3 quantization data to assess local climate change. We also investigated NASA's global aerosol distribution product, which is produced from data collected by the Multi-angle Imaging SpectroRadiometer (MISR) instrument onboard NASA's Terra satellite. We collaborated with a scientist at NASA-JPL to quantify the changes in the distribution of aerosols over the last decade.
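As a generic illustration of the kind of spectral technique discussed above, the following minimal sketch clusters points using only their pairwise distances. It is not the project's algorithm: the Gaussian similarity, the `bandwidth` parameter, and the function names `pairwise_sq_distances` and `spectral_bipartition` are assumptions chosen for the example, and it handles only undirected similarities and two groups.

```python
import numpy as np

def pairwise_sq_distances(X):
    """Squared Euclidean distance between every pair of rows of X."""
    sq_norms = np.sum(X**2, axis=1)
    d2 = sq_norms[:, None] + sq_norms[None, :] - 2.0 * (X @ X.T)
    return np.maximum(d2, 0.0)  # clip tiny negatives caused by round-off

def spectral_bipartition(X, bandwidth=1.0):
    """Split the rows of X into two groups (labels 0/1) via the Fiedler vector.

    bandwidth is an assumed tuning parameter of the Gaussian similarity.
    """
    # Similarity matrix built from pairwise distances; no self-loops.
    W = np.exp(-pairwise_sq_distances(X) / (2.0 * bandwidth**2))
    np.fill_diagonal(W, 0.0)
    degrees = W.sum(axis=1)
    d_inv_sqrt = 1.0 / np.sqrt(degrees)
    # Symmetric normalized graph Laplacian: I - D^{-1/2} W D^{-1/2}.
    L = np.eye(len(X)) - d_inv_sqrt[:, None] * W * d_inv_sqrt[None, :]
    # eigh returns eigenvalues in ascending order; column 1 is the
    # eigenvector of the second-smallest eigenvalue (the Fiedler vector).
    _, eigvecs = np.linalg.eigh(L)
    fiedler = eigvecs[:, 1] * d_inv_sqrt
    # The sign pattern of the Fiedler vector defines the two groups.
    return (fiedler > 0).astype(int)

if __name__ == "__main__":
    rng = np.random.default_rng(0)
    # Synthetic data: two groups of 20 points around (0, 0) and (4, 0).
    group_a = rng.normal([0.0, 0.0], 0.3, size=(20, 2))
    group_b = rng.normal([4.0, 0.0], 0.3, size=(20, 2))
    print(spectral_bipartition(np.vstack([group_a, group_b])))
```

Which group receives label 0 and which receives label 1 depends on the arbitrary sign of the eigenvector, so only the partition itself is meaningful.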

Agency: National Science Foundation (NSF)
Institute: Division of Mathematical Sciences (DMS)
Type: Standard Grant (Standard)
Application #: 1007060
Program Officer: Gabor J. Szekely
Project Start:
Project End:
Budget Start: 2010-07-01
Budget End: 2014-06-30
Support Year:
Fiscal Year: 2010
Total Cost: $144,984
Indirect Cost:

Name: Ohio State University
Department:
Type:
DUNS #:
City: Columbus
State: OH
Country: United States
Zip Code: 43210