Information theory has had a profound impact on the fields of data transmission and compression. In contrast, it has yielded comparably few insights into problems such as knowledge extraction from and efficient search of massive datasets. While current information-theoretic tools and techniques can be applied to these problems to some extent, the paradigms for which these tools were developed will be being carefully reexamined in this project. Models that accurately capture the fundamental challenges faced by efficient search in modern massive database systems will be developed and analyzed. The asymptotic fundamental limits, which characterize the tradeoffs between accuracy, compression rate and search efficiency, will be investigated, along with development of practical algorithms that approach the ultimate benchmarks. One concrete problem being pursued is that of compression for efficient query and search. In this setting, the goal is, given a compressed representation, to answer search queries about the data that was compressed. This is in stark contrast to traditional compression, where the data need be merely reconstructible from the compressed form. The approach taken is tailored to distributed database design, but is also relevant to compression schemes that allow search within the compressed domain.

The fundamental quantities studied play a similar role to that of the channel capacity and entropy/rate-distortion in channel and source coding, respectively. On one hand, they yield an understanding of the fundamental limits on the performance that any system for similarity queries based on compressed representations can hope to attain. On the other, the insights obtained from the theory are guiding the construction of schemes that approach these limits in practice. We will investigate how existing practical approaches (such as various hashing and clustering techniques) perform with respect to the information theoretic limits, and the extent to which approaches that have proved to be practical in source and channel coding can be used as building blocks to develop new efficient search algorithms that significantly improve on the current state of the art

Project Start
Project End
Budget Start
2013-07-01
Budget End
2017-06-30
Support Year
Fiscal Year
2013
Total Cost
$250,000
Indirect Cost
Name
Princeton University
Department
Type
DUNS #
City
Princeton
State
NJ
Country
United States
Zip Code
08544