The goal of this project is to advance the state of the art in search algorithms for modern high-dimensional datasets. Such datasets arise naturally from various representations of data objects: e.g., one 20x20 image can be represented as a 400-dimensional vector, with one coordinate per pixel. Modern representations of objects have evolved into sophisticated methods (e.g., deep-learned representations for images), which invariably lead to the basic setup of a dataset of high-dimensional vectors endowed with some similarity measure. A central problem is then that of similarity search: given a dataset, find similar objects inside it, or, given a new object, find the most similar dataset objects. These two primitives are central to many standard data-science tasks, such as clustering and inference. This project will develop new algorithms for these primitives in a variety of important application contexts, which use a variety of similarity measures. Most prior research focused on algorithms that are "data oblivious": these can be thought of as "coloring" the dataset objects green, yellow, etc., and then comparing only green-vs-green, yellow-vs-yellow, etc., objects, thereby improving algorithmic performance. The fundamental concept explored in this project is the data-dependent design of such coloring methods, which takes into account the entire dataset (i.e., the "color" of an object depends on all the other objects as well). This project has foundational connections to emerging research directions in mathematics and computational-complexity theory, on the theoretical side, as well as to data-science applications where similarity search is ubiquitous, on the practical side.
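To make the "coloring" metaphor concrete, the sketch below shows one classic data-oblivious scheme, random-hyperplane locality-sensitive hashing for cosine similarity: each point's color is computed from the point alone, independently of the rest of the dataset. This is purely an illustrative sketch; the choice of hash family, the helper names, and all parameters are assumptions made for the example, and it is not the data-dependent approach proposed in this project.

import numpy as np

def color_points(points, num_bits=8, seed=0):
    # Data-oblivious "coloring": a point's bucket is the sign pattern of its
    # projections onto random hyperplanes, independent of all other points.
    rng = np.random.default_rng(seed)
    hyperplanes = rng.standard_normal((num_bits, points.shape[1]))
    signs = points @ hyperplanes.T > 0            # shape (n, num_bits)
    return signs @ (1 << np.arange(num_bits))     # pack the bits into a bucket id

def candidate_neighbors(points, query, num_bits=8, seed=0):
    # Compare the query only against points of the same "color" (same bucket).
    colors = color_points(points, num_bits, seed)
    query_color = color_points(query[None, :], num_bits, seed)[0]
    return np.flatnonzero(colors == query_color)

# Example: 1000 random 400-dimensional vectors (e.g., flattened 20x20 images).
points = np.random.default_rng(1).standard_normal((1000, 400))
query = points[17] + 0.01 * np.random.default_rng(2).standard_normal(400)
print(candidate_neighbors(points, query))

The point of such a scheme is that only a small fraction of the dataset shares the query's color, so far fewer comparisons are needed than in a full scan.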
This project will explore data-dependent methods for the two primitives, referred to as the closest-pair (CP) and nearest-neighbor search (NNS) problems respectively, under various notions of similarity. Prior research by the investigator and co-authors suggests that data-dependent methods may yield dramatic improvements in algorithmic performance for problems on which progress has remained elusive for decades. While such methods are common in practice, their theoretical understanding, including their limits, is largely lacking, leaving room for substantial improvements in both theory and practice. The project focuses on three specific directions of exploration. 1) To design efficient algorithms for important classic distances, or classes of distances, via the new data-dependent approach of the cutting modulus, which is related to a new functional-analysis notion of "average embedding". 2) To develop data-dependent tools beyond this notion, e.g., for similarity measures that are not distances, as well as for obtaining algorithms with instance-optimal performance. 3) To develop faster algorithms for CP by exploiting the structure of the dataset, ultimately combining the benefits of the two disjoint approaches used for CP so far: geometric and algebraic. The last direction may shed new light on the Strong Exponential Time Hypothesis, which is central to the newly emerged field of Fine-Grained Complexity.
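For reference, the two primitives can be pinned down by their brute-force definitions, sketched below with Euclidean distance standing in for the various similarity measures mentioned above. This baseline is only an illustration of the problem statements under that assumed distance, not of the data-dependent or algebraic techniques the project pursues.

import numpy as np

def nearest_neighbor(points, query):
    # NNS primitive: return the index of the dataset point closest to the query.
    return int(np.argmin(np.linalg.norm(points - query, axis=1)))

def closest_pair(points):
    # CP primitive: return indices (i, j) of the closest pair of distinct points.
    best, best_dist = (0, 1), np.inf
    for i in range(len(points) - 1):
        dists = np.linalg.norm(points[i + 1:] - points[i], axis=1)
        j = int(np.argmin(dists))
        if dists[j] < best_dist:
            best, best_dist = (i, i + 1 + j), float(dists[j])
    return best

These baselines cost time linear (per NNS query) and quadratic (for CP) in the dataset size; the algorithms sought in this project aim to beat such scaling.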
This award reflects NSF's statutory mission and has been deemed worthy of support through evaluation using the Foundation's intellectual merit and broader impacts review criteria.