In this report, we highlight the intellectual challenges of the proposal, both theoretical and practical. The interplay between theory and practice distinguishes this proposal from many in the database search area. An entropy-compressed technology could have a major impact on data structure design and implementation. We give a three-year research plan of desired and expected outcomes.

In this proposal, we introduce a new data structures model and technology called Entropy-Compressed Data Structures, which addresses the need to limit the size of text databases and of the search structures that index huge volumes of data. The main goal is to achieve "optimal" space (a leading coefficient of 1, with provably small second-order terms) without sacrificing optimal lookup time. We measure "optimal" space usage in a data-aware manner, as a function of the inherent randomness (or entropy) of the input data set. This model is an outgrowth of recent space-efficient discoveries made by the PI and coauthors in the area of text indexing, work that has caused a rebirth of research in text indexing and has spawned considerable interest in space-efficient data structures.

In order to analyze the space required, we need measures (or models) to evaluate these succinct structures and to quantify the space savings. The space required for a structure should relate to the entropy of the data that the structure is built upon. We formalize a variety of intuitive and meaningful models as a foundation for a uniform and structured study across a number of applications.
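
To make such a measure concrete, the sketch below (our illustration, with hypothetical function names) computes two standard data-aware quantities: the zeroth-order empirical entropy H_0 of a text and the information-theoretic minimum log2(n choose t) bits for storing a t-item subset of {0, ..., n-1}.

```python
import math
from collections import Counter

def h0(text: str) -> float:
    """Zeroth-order empirical entropy in bits per symbol:
    H_0 = sum over symbols c of (n_c/n) * log2(n/n_c)."""
    n = len(text)
    return sum((c / n) * math.log2(n / c) for c in Counter(text).values())

def subset_bound(n: int, t: int) -> float:
    """Information-theoretic minimum, in bits, to store an
    (unordered) t-item subset of a universe of size n."""
    return math.log2(math.comb(n, t))

print(h0("mississippi"))       # ~1.82 bits/symbol vs. 2 bits raw (4-letter alphabet)
print(subset_bound(1000, 50))  # ~282 bits vs. 1000 for an explicit bitmap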

We plan a three-pronged approach to developing Entropy-Compressed Data Structure technology:

First, we focus on the fundamental dictionary problem, where the task is to represent a set S of t items out of a universe U = {0, ..., n-1}. Dictionaries are critical in text indexing and other database applications as a building block in designing entropy-compressed data structures. For text-based applications, dictionaries serve as a powerful black box that operates within some entropy-aware partitioning of the data. Any improvement to a dictionary structure would have tremendous impact on all such dependent applications.
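
As a point of reference, the following minimal sketch (ours, deliberately non-succinct, with hypothetical names) shows the fully indexable dictionary interface of rank, select, and predecessor that the proposed structures are meant to support within entropy bounds.

```python
import bisect

class NaiveDictionary:
    """Illustrative (non-succinct) rank/select dictionary over a set S
    drawn from the universe {0, ..., n-1}. A succinct fully indexable
    dictionary answers the same queries in near-entropy space."""

    def __init__(self, items):
        self.items = sorted(items)

    def rank(self, x):
        """Number of elements of S that are <= x."""
        return bisect.bisect_right(self.items, x)

    def select(self, i):
        """The i-th smallest element of S (1-indexed)."""
        return self.items[i - 1]

    def pred(self, x):
        """Largest element of S that is <= x, composed from rank/select."""
        r = self.rank(x)
        return self.select(r) if r > 0 else None

d = NaiveDictionary([3, 7, 12, 30])
print(d.rank(12), d.select(2), d.pred(10))  # 3 7 7
```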

In the first year, we expect to devote our efforts to developing a powerful and succinct fully indexable dictionary that operates in near-optimal time. In particular, we hope to achieve a dictionary supporting the operations rank and select (and thus the predecessor operation as well) in gap + o(log {n choose t}) bits, where gap denotes the optimal space cost of a gap encoding of the items in the dictionary, while matching the time bounds of Andersson and Thorup, namely O(min{ sqrt(log t/log log t), (log log n)(log log t)/log log log n, log log t + log t/log log n }) time. In a similar vein, we wish to push our structure further, to gap + o(gap) bits without sacrificing lookup time. These contributions would be major theoretically, and their impact could be considerable in practice.
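
For intuition about the gap measure, here is one reasonable instantiation (definitions vary slightly across the literature, so treat this as an assumption rather than the proposal's exact measure): charge roughly log2 of each gap between consecutive items, which can fall well below log2(n choose t) on clustered data.

```python
import math

def gap_measure(items):
    """One variant of the gap measure: the sum, over consecutive sorted
    items, of the bits needed to write each gap g, i.e.
    floor(log2 g) + 1 = ceil(log2(g + 1)). The first item is charged
    as a gap from -1 in this variant."""
    s = sorted(items)
    gaps = [s[0] + 1] + [b - a for a, b in zip(s, s[1:])]
    return sum(math.ceil(math.log2(g + 1)) for g in gaps)

items = [3, 7, 12, 30, 31]
print(gap_measure(items))           # 15 bits under this gap encoding
print(math.log2(math.comb(64, 5)))  # ~22.9 bits for log2(n choose t), n=64
```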

As a further step, we would also like our dictionary to be dynamic, as this development would be of immediate interest to the database community. In practice, we want our structures to be simple, so that they can be readily implemented. For instance, a good implementation also becomes a potential solution to the IP lookup problem, since an IP lookup can be abstracted as a query for the longest matching prefix (which is closely related to the predecessor problem).
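
To make the IP-lookup connection concrete, here is a toy reduction of longest-prefix matching to predecessor search, in the spirit of the classic binary-search-on-prefix-ranges idea (the names and the 8-bit address width are ours for illustration).

```python
import bisect

def build_lpm(prefixes, width=8):
    """Toy longest-prefix-match table reduced to predecessor search.
    Each prefix covers an address interval; between consecutive interval
    endpoints, the longest matching prefix is constant, so we precompute
    it per segment and answer queries with a single predecessor search."""
    points = set()
    for p in prefixes:
        points.add(int(p.ljust(width, '0'), 2))      # first covered address
        points.add(int(p.ljust(width, '1'), 2) + 1)  # first address past range
    keys = sorted(points)

    def lpm_of(x):  # brute force, used only at build time
        matches = [p for p in prefixes
                   if format(x, f'0{width}b').startswith(p)]
        return max(matches, key=len) if matches else None

    answers = [lpm_of(k) if k < 2 ** width else None for k in keys]
    return keys, answers

def query(keys, answers, x):
    """Predecessor query: the segment containing x holds the answer."""
    i = bisect.bisect_right(keys, x) - 1
    return answers[i] if i >= 0 else None

keys, ans = build_lpm(['0', '01', '0110'])
print(query(keys, ans, 0b01101100))  # '0110', the longest matching prefix
```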

Second, once a powerful dictionary is developed to serve as a black box, we can begin to focus on the technology of space-efficient representations of the application data structures themselves. Text indexing has received a great deal of attention in recent years, and we focus on improvements to these structures to motivate a similar progression of work in other application areas. We will make explicit the relationship between text indexing and the dictionary problem, and show how the two can be used as a basic paradigm for making data structures entropy-compressed.

A strong component of this proposal, in comparison with other projects in the IDM program, is the marriage of theoretical analysis and practical implementation. We have demonstrated rigorous mathematical proofs of optimality in our earlier work, and that mathematical elegance has translated into efficient implementations in practice; we regard this combination of theoretical design and practical engineering as a distinguishing feature of the project.

In the second year, we propose to improve the state of the art in text indexing. The existing work on text indexes offers two separate results. The first achieves nH_h + o(n) bits of space with O(m) lookup time (where H_h is the hth-order empirical entropy), which is nearly optimal in space (aside from lower-order terms) but not in time. The second achieves epsilon^{-1} nH_h + o(n) bits with o(m) lookup time, which is optimal in lookup time (aside from lower-order terms) but not in space. We expect to achieve a text-indexing data structure that takes nH_h + o(n) bits of space while simultaneously supporting optimal o(m) lookup time. A result of this type would satisfy all of the goals of developing entropy-compressed data structures. Further improvements involve adding more power to compressed suffix arrays (CSAs). We hope to dynamize compressed suffix arrays without increasing time or space, and to support range searching and occurrence queries in optimal time without increasing space. The choice of context length h carries a potentially nontrivial model cost; to alleviate this inefficiency, we will use gap encoding. We also expect to present a compressed suffix array that adaptively chooses the best context length h, using nH_h + o(n) bits with o(m) lookup time.
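
For context, the sketch below shows the uncompressed baseline that compressed suffix arrays improve upon: a plain suffix array answering pattern-counting queries by binary search in O(m log n) time and Theta(n log n) bits, far from the nH_h + o(n) bits and o(m) time targeted above (illustrative code, ours; requires Python 3.10+ for bisect's key argument).

```python
from bisect import bisect_left, bisect_right

def suffix_array(t):
    """Plain suffix array: sorted starting positions of all suffixes.
    Simple O(n^2 log n) construction, n log n bits of space -- the
    baseline that CSAs shrink toward nH_h + o(n) bits."""
    return sorted(range(len(t)), key=lambda i: t[i:])

def count_occurrences(t, sa, pattern):
    """Count occurrences of pattern via binary search over sa:
    O(m log n) character comparisons, versus the o(m) target."""
    m = len(pattern)
    lo = bisect_left(sa, pattern, key=lambda i: t[i:i + m])
    hi = bisect_right(sa, pattern, key=lambda i: t[i:i + m])
    return hi - lo

t = "mississippi"
sa = suffix_array(t)
print(count_occurrences(t, sa, "ssi"))  # 2
```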

In the third year, we expand our efforts to the area of multidimensional matching. We begin our exploration by developing the crucial notion of a multidimensional Burrows-Wheeler Transform. In particular, we are considering a series of novel transformations of the data that simultaneously allow fast access to the data and ease of compression, while respecting the constraints posed by Giancarlo. We then expect to achieve a multidimensional suffix array that operates on d-dimensional data in just n^d H_h + o(n^d) bits with O(polylog(n^d)) time.
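
For background, the classical one-dimensional Burrows-Wheeler Transform is sketched below (ours, the textbook construction): the transform groups symbols by their context, which is precisely the property that enables hth-order entropy compression and that a multidimensional analogue must preserve.

```python
def bwt(s, sentinel="$"):
    """Classical 1-D Burrows-Wheeler Transform: the last column of the
    sorted rotations of s + sentinel. Symbols with similar contexts end
    up adjacent, making the output highly compressible."""
    s += sentinel
    rotations = sorted(s[i:] + s[:i] for i in range(len(s)))
    return "".join(rot[-1] for rot in rotations)

print(bwt("mississippi"))  # 'ipssm$pissii'
```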

Once we have a generic space-efficient technology, we will aim to make it more practical through algorithm engineering, emphasizing dynamic updating, adaptivity, and simpler implementation design. We envision many applications to spatial databases, GIS, geometric processing, and numerical algorithms.

Agency: National Science Foundation (NSF)
Institute: Division of Information and Intelligent Systems (IIS)
Application #: 0415097
Program Officer: Frank Olken
Budget Start: 2004-09-01
Budget End: 2007-08-31
Fiscal Year: 2004
Total Cost: $255,000
Name: Purdue University
City: West Lafayette
State: IN
Country: United States
Zip Code: 47907