9307895 Bookstein
Testing and Exploiting Clustering for Data Compression

This is the first-year funding of a three-year continuing award. This project develops techniques for compressing concordances of large, full-text databases. The practical significance is clear: concordances are huge, consuming as many resources as the data themselves, yet they are necessary for accessing the database efficiently. The theoretical implications are also important, since the highly structured organization of concordances makes them suitable for modeling. This project models clustering in concordances. Sequential clustering is important in Information Retrieval generally: substantive terms tend to occur together in a document, and documents containing a given term often cluster in a typical database. This project develops and evaluates statistical tests that indicate when clustering is important; identifies measures of sequential clustering strength; and creates models of concordance generation that recognize clustering and thereby improve compression effectiveness. The models studied include Markov models and Bayesian learning models. Sequential clustering is widespread, and the results of this research should have implications well beyond data compression, for example in analyzing term occurrence to identify content-bearing terms for retrieval purposes. Thus this project promises direct benefits in improving our ability to store very large textual databases and, indirectly, in developing methodology of wider interest.

Agency: National Science Foundation (NSF)
Institute: Division of Information and Intelligent Systems (IIS)
Application #: 9307895
Program Officer
Program Director
Project Start
Project End
Budget Start: 1993-08-01
Budget End: 1996-12-31
Support Year
Fiscal Year: 1993
Total Cost: $122,929
Indirect Cost
Name: University of Chicago
Department
Type
DUNS #
City: Chicago
State: IL
Country: United States
Zip Code: 60637