B-trees have been the data structures of choice for external-memory searching for decades because they minimize the number of disk-block accesses performed during a search. It is well known, however, that B-trees are empirically suboptimal because they exploit data locality at only one level of granularity, typically disk blocks, but not at coarser granularities, such as disk tracks, or finer granularities, such as cache lines.
Theoretical developments on cache-oblivious data structures and algorithms have shown how to achieve nearly optimal locality of reference simultaneously at every granularity. A striking feature of cache-oblivious data structures is that they free the programmer from the burden of tuning the code for cache and disk effects. The PIs' recent experiments suggest that cache-oblivious B-trees (CO B-trees) can surpass the performance of highly tuned traditional B-trees. CO B-trees achieve superior performance because they approximately optimize for all memory effects. In contrast, cache-aware algorithms ignore important aspects of the memory hierarchy. CO B-trees are not yet ready to be used in file systems and data bases, however, because they lack essential capabilities of industrial-strength B-trees, such as support for variable-size keys, concurrent accesses, and transactions.
The researchers propose to investigate how CO B-trees can achieve their potential. The researchers plan to study the wide range of algorithmic problems in data structures, stringology, and distributed systems required to develop a full-featured CO B-tree. In addition, the researchers plan to solve online scheduling problems so that virtual-memory systems can provide efficient support for cache-obliviousness. This algorithmic work is necessary to transfer CO technology to other areas of computer science, engineering, and scientific computing and is intended to transform how scientists and engineers manipulate massive data sets.