Detecting relationships between two data sets has long been one of the most important questions in statistics and is fundamental to scientific discovery in the big-data era. By developing an open-source, robust, efficient, and scalable statistical methodology for testing dependence in modern data, this project aims to advance the understanding and utility of dependence testing, tackle a number of related statistical inference questions, and accelerate a broad range of data-intensive research. The project incorporates fundamental research in mathematics, statistics, and computer science to further develop a multiscale generalized correlation framework that enables discovery and decision-making via analysis of large and complex data. The tools under development will allow scientists to better explore and understand high-dimensional, nonlinear, and multi-modal data in a myriad of applications. The project aims to provide a unified framework for discovering relationships between observations in an efficient and theoretically sound manner.
Combining the notion of generalized correlation with the locality principle, multiscale generalized correlation (MGC) is a correlation measure that equals the optimal local correlation among all possible local scales. By building upon distance correlation and making use of nearest neighbors, the resulting MGC test statistic is a dependence measure that is consistent for testing against all dependencies with finite second moments, and it exhibits better performance than existing state-of-the-art methods under a wide variety of nonlinear and high-dimensional dependencies. By investigating the theoretical aspects of distance-based correlations, this project aims to further improve the finite-sample performance of MGC-style tests, extend their capability to testing dependence on network and kernel data, and broaden their utility to general inferential questions beyond dependence testing, such as two-sample testing, outlier detection, and feature screening, as well as applications to brain activity, networks, and text analysis. Overall, this project intends to establish a unified methodological framework for statistical testing on high-dimensional, noisy, big data through theoretical advancements, comprehensive simulations, and real-data experiments.
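To illustrate the style of test described above, the sketch below runs SciPy's `multiscale_graphcorr`, a public implementation of MGC, on a synthetic nonlinear relationship. The quadratic data, noise level, and permutation count are illustrative assumptions, not choices made by the project itself:

```python
import numpy as np
from scipy.stats import multiscale_graphcorr

# Simulate a nonlinear (quadratic) relationship that Pearson correlation
# would largely miss, but that a consistent dependence test should detect.
rng = np.random.default_rng(0)
n = 100
x = rng.uniform(-1.0, 1.0, n)
y = x**2 + 0.1 * rng.standard_normal(n)

# MGC test: the statistic is the optimal local distance correlation over
# all neighborhood scales; the p-value comes from a permutation null.
stat, pvalue, mgc_dict = multiscale_graphcorr(x, y, reps=200, random_state=0)

print(f"MGC statistic: {stat:.3f}")
print(f"permutation p-value: {pvalue:.4f}")
# The returned dictionary reports, among other diagnostics, the optimal
# neighborhood scale at which the local correlation is maximized.
print("optimal scale:", mgc_dict.get("opt_scale"))
```

A small p-value here reflects the nonlinear dependence of `y` on `x`; on independent data the p-value would be approximately uniform on [0, 1].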