Multiscale Generalized Correlation: A Unified Distance-Based Correlation Measure for Dependency Discovery

Shen, Cencheng

Abstract

Detecting relationships between two data sets has long been one of the most important questions in statistics and is fundamental to scientific discovery in the big-data era. By developing an open-source, robust, efficient, and scalable statistical methodology for testing dependence on modern data, this project aims to advance the understanding and utility of dependence testing, tackle a number of related statistical inference questions, and accelerate a broad range of data-intensive research. The project incorporates fundamental research in mathematics, statistics, and computer science to further develop a multiscale generalized correlation framework that enables discovery and decision-making through the analysis of large and complex data. The tools under development will allow scientists to better explore and understand high-dimensional, nonlinear, and multi-modal data in a wide range of applications. The project aims to provide a unified framework for discovering relationships between observations in an efficient and theoretically sound manner.

Multiscale generalized correlation (MGC) combines the notion of generalized correlation with the locality principle: it equals the optimal local correlation among all possible local scales. Built on distance correlation and nearest neighbors, the MGC test statistic is a dependence measure that is consistent for testing against all dependencies with finite second moments, and it performs better than existing state-of-the-art methods under a wide variety of nonlinear and high-dimensional dependencies. By investigating the theoretical aspects of distance-based correlations, this project aims to further improve the finite-sample performance of MGC-style tests, extend their capability to testing dependence on network and kernel data, and broaden their utility to general inferential questions beyond dependence testing, such as two-sample testing, outlier detection, and feature screening, as well as to applications in brain activity, networks, and text analysis. Overall, this project intends to establish a unified methodological framework for statistical testing on high-dimensional, noisy, big data through theoretical advancements, comprehensive simulations, and real-data experiments.
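As a rough illustration of the building block named above, the following sketch computes the sample distance correlation for one-dimensional data using only the Python standard library. It is not MGC itself: MGC additionally restricts the computation to nearest neighbors and selects the best local scale, which is omitted here. The function names (`distance_matrix`, `double_center`, `distance_correlation`, `pearson_correlation`) and the toy data are assumptions made for illustration and do not come from the project.

```python
import math


def distance_matrix(values):
    """Pairwise absolute differences for a list of 1-D observations."""
    return [[abs(a - b) for b in values] for a in values]


def double_center(d):
    """Subtract row and column means and add back the grand mean."""
    n = len(d)
    row_means = [sum(row) / n for row in d]
    col_means = [sum(d[i][j] for i in range(n)) / n for j in range(n)]
    grand_mean = sum(row_means) / n
    return [
        [d[i][j] - row_means[i] - col_means[j] + grand_mean for j in range(n)]
        for i in range(n)
    ]


def distance_correlation(x, y):
    """Biased sample distance correlation between two equal-length lists."""
    if len(x) != len(y) or len(x) < 2:
        raise ValueError("x and y must have the same length, at least 2")
    n = len(x)
    a = double_center(distance_matrix(x))
    b = double_center(distance_matrix(y))
    # Squared sample distance covariance and distance variances.
    dcov2 = sum(a[i][j] * b[i][j] for i in range(n) for j in range(n)) / n**2
    dvar_x = sum(a[i][j] ** 2 for i in range(n) for j in range(n)) / n**2
    dvar_y = sum(b[i][j] ** 2 for i in range(n) for j in range(n)) / n**2
    denom = math.sqrt(dvar_x * dvar_y)
    if denom == 0:
        return 0.0  # one sample is constant, so no dependence is measurable
    return math.sqrt(max(dcov2, 0.0) / denom)


def pearson_correlation(x, y):
    """Ordinary Pearson correlation, included only for contrast."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    cov = sum((u - mx) * (v - my) for u, v in zip(x, y))
    sx = math.sqrt(sum((u - mx) ** 2 for u in x))
    sy = math.sqrt(sum((v - my) ** 2 for v in y))
    return cov / (sx * sy)


# Toy data (hypothetical): a symmetric grid and a purely quadratic response.
x = [-1 + 2 * i / 50 for i in range(51)]
y_quadratic = [v * v for v in x]

print("Pearson:", pearson_correlation(x, y_quadratic))
print("Distance correlation:", distance_correlation(x, y_quadratic))
```

On this toy data the Pearson correlation is essentially zero because the relationship is symmetric and nonlinear, while the distance correlation is clearly positive. For real analyses, a maintained implementation is preferable to this sketch; SciPy, for example, provides `scipy.stats.multiscale_graphcorr`.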

Funding Agency

Agency: National Science Foundation (NSF)
Institute: Division of Mathematical Sciences (DMS)
Application #: 1921310
Program Officer: Gabor Szekely

Budget Start: 2018-09-11
Budget End: 2020-08-31
Fiscal Year: 2019
Total Cost: $142,651

Institution

University of Delaware, Newark, DE, United States
