Application of Random Matrix Theory to Structured High-dimensional Data

Paul, Debashis

Abstract

The main goal of this application is to use spectral analysis techniques to address high-dimensional inferential problems. Techniques of random matrix theory, especially Stieltjes transforms of spectral measures, will be used to improve understanding of how dependencies among observations affect commonly used statistical procedures in high-dimensional settings. As a key component, the spectral characteristics of large random matrices with dependencies among both rows and columns will be investigated. In addition, new regularization schemes will be developed that are tuned to the characteristics of the data, including possible non-stationarity of the observations, and that make use of the intrinsic parsimonious structures in the data.
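For readers unfamiliar with the central tool named above, the standard definition of the Stieltjes transform of an empirical spectral measure can be written as follows. The notation (a symmetric p x p matrix S with eigenvalues \lambda_1, ..., \lambda_p) is chosen here for illustration and does not come from the proposal itself.

```latex
% Empirical spectral distribution of a symmetric p x p matrix S
% with eigenvalues \lambda_1, \dots, \lambda_p (illustrative notation).
\[
  F^{S}(x) \;=\; \frac{1}{p} \sum_{j=1}^{p} \mathbf{1}\{\lambda_j \le x\},
\]
% Its Stieltjes transform, defined for z in the upper half-plane.
\[
  s_{F^{S}}(z) \;=\; \int \frac{1}{\lambda - z}\, dF^{S}(\lambda)
  \;=\; \frac{1}{p}\,\operatorname{tr}\bigl(S - zI_p\bigr)^{-1},
  \qquad z \in \mathbb{C}^{+}.
\]
```

The trace form is what makes the transform convenient for random matrices: it turns a question about eigenvalues into one about the resolvent of S.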

The proposed application is motivated by problems in a wide range of scientific fields such as wireless communication, spectrometry, genomics, environmental modeling, atmospheric science, brain imaging and econometrics. The emphasis of this proposal is on developing theoretical understanding and practical tools for analyzing complex and large-scale data arising in these disciplines. The research outputs from this project are expected to give scientists and practitioners in various disciplines wider access to modern statistical tools and concepts for dealing with high-dimensional data. In addition, the tools and ideas developed through this project are likely to contribute to downstream technologies that require sophisticated real-time data analysis techniques for complex time-varying signals.

Project Report

Statistical analysis of high-dimensional and structured data is increasingly taking center stage in decision making in the public and private domains. Thus, developing new techniques for carrying out such analysis in a timely manner is of paramount importance. This NSF-funded research project aimed to address certain aspects of this challenge. The broad objective has been to blend statistical theory for high-dimensional data with mathematical and computational techniques into a research program that provides useful generalizations for dealing with various scientific problems.

The specific scientific problems addressed through this research relate to various kinds of data. These include data on a large number of entities collected over a period of time, as is common in finance and economics. They also include genomic data in which the activity levels of possibly thousands of genes are measured simultaneously, which is helpful in detecting possible anomalies related to diseases. They further include magnetic resonance imaging data of brain tissue that provide information about the structure and function of different parts of the brain. Given the complexity of each individual data type, it was imperative to resort to useful, though somewhat simplified, models for describing such data, and then to formulate statistical hypotheses and conjectures about the behavior of practically useful statistics computed from such data. The investigation was then carried out using tools such as random matrix theory, differential equation techniques and nonparametric function estimation methods.

One major accomplishment of this project is an enhanced understanding of the behavior of covariance and autocovariance matrices of large dimensional stationary time series. This model is widely used in signal processing, economics and finance.
The main finding here is the establishment of a stabilizing behavior of the spectrum of the sample autocovariances. This phenomenon is a result of the high dimensionality of the data, and it generalizes similar and well-studied phenomena in the context of temporally independent data. This finding raises the possibility that short-term prediction for such time series, a problem that remains challenging, may yet be feasible even for relatively high dimensions. Related investigations also led to the discovery of a different type of stabilizing behavior of the spectra of appropriately normalized sample autocovariance matrices when the dimension is only moderately large compared to the number of data points. The latter discovery will make it possible to test the suitability of some classes of practically useful models for describing data that show dependence on both spatial and temporal scales, as is typical, for example, in climate studies.

Through a separate project on testing the difference between two biological pathways based on measurements of gene expression, a new method has been developed that makes use of the nature of covariation in such data. Data analysis has demonstrated that the proposed method is much more sensitive to the differences than existing test procedures that depend on the bulk behavior of the sample covariance matrices. In addition, the results have shown the distinct advantage of incorporating information about the eigenstructure of the covariance matrices when carrying out genome-wide association studies, which traditionally depended on the separation of populations according to their means.

Another project has been related to the estimation of neuronal fiber directions based on diffusion-weighted MRI data. The proposed approach has been shown to determine the fiber directions much more accurately than the commonly used estimate, namely, the leading eigenvector of the estimated diffusion tensor.
This new approach also very effectively solves the "crossing fiber" problem, where traditional diffusion tensor imaging (DTI) techniques are known to be problematic. Consequently, the proposed method performs much better at reconstructing fiber bundles in regions of the brain with crossing fibers than is achievable through tractography schemes relying upon the DTI approach.

In addition to the aforementioned projects, the award enabled completion of a project studying the dynamics of human growth at a population level through the framework of a specific kind of statistical model based on ordinary differential equations. It also helped in the completion of a project on data-adaptive estimation for a class of inverse problems, making use of the theory of wavelets and penalized regression schemes.

Apart from uncovering some interesting scientific facts and developing new techniques for statistical analysis of complex, structured data, the project also provided training opportunities for several graduate students. The scientific findings have been disseminated through seminars and published articles. Some of the research findings are being incorporated into a graduate-level course to be taught at the University of California, Davis. The interdisciplinary nature of some of the research has also created an opportunity for practitioners of different disciplines to learn and use some modern statistical techniques for data-driven analysis in their own fields of research.
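The report describes the stabilization of sample autocovariance spectra as a generalization of well-studied phenomena for temporally independent data. As a minimal sketch of that independent baseline only, and not of the project's own results, the following Python simulation compares the eigenvalues of a sample covariance matrix with the Marchenko-Pastur support edges. The function names, the dimensions `p` and `n`, and the Gaussian unit-variance data are all assumptions made for illustration.

```python
import numpy as np


def sample_autocovariance(X, lag=0):
    """Symmetrized lag-`lag` sample autocovariance of a p x n matrix X.

    Columns of X are time points. The data are assumed to have mean zero,
    so no centering is applied. With lag=0 this is the sample covariance.
    """
    p, n = X.shape
    A = X[:, : n - lag]          # observations at times 1, ..., n - lag
    B = X[:, lag:]               # observations at times 1 + lag, ..., n
    C = A @ B.T / (n - lag)
    return (C + C.T) / 2.0       # symmetrize so the eigenvalues are real


def marchenko_pastur_edges(p, n, sigma2=1.0):
    """Support edges of the Marchenko-Pastur law for ratio c = p / n <= 1."""
    c = p / n
    return sigma2 * (1 - np.sqrt(c)) ** 2, sigma2 * (1 + np.sqrt(c)) ** 2


# Hypothetical setting: i.i.d. standard Gaussian entries, p / n = 0.25.
rng = np.random.default_rng(0)
p, n = 200, 800
X = rng.standard_normal((p, n))

eigs = np.linalg.eigvalsh(sample_autocovariance(X, lag=0))
lo, hi = marchenko_pastur_edges(p, n)

print(f"smallest / largest eigenvalue: {eigs.min():.3f} / {eigs.max():.3f}")
print(f"Marchenko-Pastur edges:        {lo:.3f} / {hi:.3f}")
```

Although every population eigenvalue equals 1 in this setting, the sample eigenvalues spread over an interval whose endpoints depend only on the ratio p / n. Calling `sample_autocovariance(X, lag=1)` gives the symmetrized lag-1 matrix for the same data; its limiting spectrum is different and is not described by the edges computed here.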

Funding Agency

Agency: National Science Foundation (NSF)
Institute: Division of Mathematical Sciences (DMS)
Type: Standard Grant (Standard)
Application #: 1106690
Program Officer: Gabor J. Szekely

Budget Start: 2011-07-01
Budget End: 2014-06-30
Fiscal Year: 2011
Total Cost: $169,987

Institution

University of California Davis, Davis, CA, United States
