Two important challenges addressed by the proposed work arise in high-throughput bioinformatics studies and neuroscience studies, which generate data typical of research aligned with two recent White House initiatives: the Precision Medicine Initiative and the BRAIN Initiative, respectively. Both initiatives are investing heavily in generating high-resolution data which, once analyzed properly, can yield insights that will lay the foundations for advanced treatments of genetic and nervous system disorders. To maximize the potential impact of collecting such data, this proposal develops a new framework for identifying complicated underlying patterns in multiway arrays. Specifically, the project fills a gap in nonparametric estimation of low-dimensional structure and geometry in large, noisy data arrays. The PI plans to develop a training program, based on applications of the proposed research, to recruit and retain talented high school, undergraduate, and graduate students from underrepresented minority (URM) groups for potential careers as innovative data scientists.

Modern data matrices present two challenges for analysis: (1) they are transposable, in the sense that both their rows and columns are often of interest and may contain non-trivial dependencies, and (2) they may be very large. The first challenge has been only partially addressed by existing biclustering or co-clustering methods, which can identify only very simple coupled structures that organize the rows and columns. To flexibly model and extract more complicated patterns in large data matrices in which both rows and columns are high-dimensional, one requires a new co-manifold learning framework that can discover a wider range of intrinsic geometries of the rows and columns. To meet the first challenge, this project develops a framework for performing joint nonlinear dimension reduction on the rows and columns. The proposed methods construct multiscale distances that are invariant to row and column permutations, equipping practitioners with a means to estimate the intrinsic organization of the rows and columns of a data matrix without prior information on any row or column ordering. To meet the second challenge, this project formulates the key computations as optimization problems that admit distributed parallel algorithms with nearly linear speed-up. The framework also extends naturally to the higher-order generalizations of matrices, namely multiway arrays or tensors. Finally, the estimated intrinsic geometries possess stability guarantees: small perturbations in the data due to noise, or small adjustments to input parameters, cannot lead to disproportionately large variations in the estimated intrinsic geometry. The proposed procedures have the potential to enable practitioners to stably extract complicated patterns from massive data tensors with non-trivial dependencies along their modes or axes.
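
The notion of a distance that is invariant to row and column permutations can be illustrated with a small sketch. The example below is not the project's multiscale distance, which the abstract does not specify; it assumes plain Euclidean distance as a stand-in, and the helper names (`row_distances`, `transpose`, `same_up_to_relabeling`) are hypothetical. It checks that shuffling the rows and columns of a matrix leaves the pairwise row and column distances unchanged apart from a relabeling of indices, so no prior ordering is needed to compute them.

```python
import math
import random


def row_distances(matrix):
    """Euclidean distance between every pair of rows (stand-in metric)."""
    return [[math.dist(r, s) for s in matrix] for r in matrix]


def transpose(matrix):
    """Swap rows and columns so column distances reuse row_distances."""
    return [list(col) for col in zip(*matrix)]


def same_up_to_relabeling(d_new, d_old, perm, tol=1e-9):
    """True if d_new[a][b] equals d_old[perm[a]][perm[b]] for all a, b."""
    n = len(perm)
    return all(
        abs(d_new[a][b] - d_old[perm[a]][perm[b]]) <= tol
        for a in range(n)
        for b in range(n)
    )


random.seed(0)
n_rows, n_cols = 6, 4

# A small random data matrix (illustrative data only).
X = [[random.gauss(0.0, 1.0) for _ in range(n_cols)] for _ in range(n_rows)]

# Shuffle rows and columns, as if their orderings were unknown.
row_perm = random.sample(range(n_rows), n_rows)
col_perm = random.sample(range(n_cols), n_cols)
X_shuffled = [[X[i][j] for j in col_perm] for i in row_perm]

D_rows = row_distances(X)
D_cols = row_distances(transpose(X))
D_rows_shuffled = row_distances(X_shuffled)
D_cols_shuffled = row_distances(transpose(X_shuffled))

# Column shuffling does not change row distances; row shuffling only relabels them.
rows_ok = same_up_to_relabeling(D_rows_shuffled, D_rows, row_perm)
# The symmetric statement holds for column distances.
cols_ok = same_up_to_relabeling(D_cols_shuffled, D_cols, col_perm)
print(rows_ok, cols_ok)
```

Any distance built only from such permutation-invariant quantities inherits the same property, which is what allows the row and column organization to be estimated from the data alone.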

This award reflects NSF's statutory mission and has been deemed worthy of support through evaluation using the Foundation's intellectual merit and broader impacts review criteria.

Agency: National Science Foundation (NSF)
Institute: Division of Mathematical Sciences (DMS)
Application #: 1752692
Program Officer: Gabor Szekely
Project Start:
Project End:
Budget Start: 2018-07-01
Budget End: 2023-06-30
Support Year:
Fiscal Year: 2017
Total Cost: $234,841
Indirect Cost:
Name: North Carolina State University Raleigh
Department:
Type:
DUNS #:
City: Raleigh
State: NC
Country: United States
Zip Code: 27695