Analysing and organising the massive amount of data collected everyday is one of the biggest challenges faced by computer scientists. This impacts almost every field of modern human life, such as health care, military applications, smart cities, transportation, networks, etc. To quickly analyse and organise such huge amounts of data, one needs to develop space and time efficient algorithms, that accelerate the search for information while minimising memory requirements. This project aims to develop space-efficient similarity filters, which quickly filter out queries that have very little similarity from the records existing in the database. This has applications in security, databases, machine learning, file systems, vision, pattern recognition, compression and networks. Apart from the targeted impact of these applications to society, this project also aims at producing the next generation of researchers and promoting under-represented groups in science and technology. Minority undergraduate and high school students will be involved in this project. Queens College, CUNY is a diverse institution, representing a wide range of ethnic minorities and has 28% Hispanic students. In an effort to engage more undergraduate students and women in research (Queens College has roughly 56% female students), the PI plans to offer a new course that develops an understanding of approximate membership and similarity search data structures in the setting of big data.

Similarity searching is a fundamental problem in this context, where a massive data of records needs to be organised so as to answer quickly queries of the form: given a record, is it similar to some record in the data? Similarity search, modeled as near(est) neighbor search (NNS) is an important and well-studied problem with numerous applications: Given a set P of n points lying in some d-dimensional metric space, the goal is to preprocess P so as to return for a given query point q the point p in P that is closest to q. Exact versions of this problem are hard to solve, and all solutions to the approximate version of the problem require superlinear (more than nd) space. This space usage is prohibitive for many big data settings, where not just n, but d could be huge (e.g., the number of pixels in an image). This project is aimed at developing the theory and applications of Similarity Filters, which are decision (yes/no) based data structures. Given a query q, a (c, r)-similarity filter always returns yes if q is within a distance r of some input point, and returns no with a small error probability (thus some false positives are allowed) if all input points are at least cr away from the query. Recent results from the researcher shows that such filters can be built with sublinear (asymptotically less than nd) space. The project takes on two tasks: 1) to understand the approximation factor-space-query tradeoff for similarity filters, developing tight upper and lower bounds, especially in the sublinear space regime, and 2) to develop I/O efficient Similarity Filters and nearest neighbor data structures, akin to quotient or cascade filters for the approximate membership problem (where the in-RAM Bloom Filter is the standard data structure, used in many networking applications). The overarching goal is to use similarity filters on RAM in conjunction with a disk-based data structure: faraway queries will be filtered quickly, and nearby queries will have their neighbor(s) returned following a slower but cache friendly search. The availability of such a two-layered space-and-time-efficient data structure will greatly boost the scope of applications when the amount of data is massive, especially in databases, networks, security and machine learning.

Project Start
Project End
Budget Start
2018-02-15
Budget End
2022-01-31
Support Year
Fiscal Year
2017
Total Cost
$175,000
Indirect Cost
Name
CUNY Queens College
Department
Type
DUNS #
City
Flushing
State
NY
Country
United States
Zip Code
11367