9307895 Bookstein Testing and Exploiting Clustering for Data Compression This is the first-year funding of a three-year continuing award. This project develops techniques for compressing concordances of large, full-text databases. The practical significance is clear: concordances are huge, consuming as many resources as the data themselves, yet they are necessary for accessing the database efficiently. The theoretical implications are also important, since the highly structured organization of concordances makes them suitable for modeling. This project models clustering in concordances. Sequential clustering is important in Information Retrieval generally: substantive terms tend to occur together in a document, and documents containing a given term often cluster in a typical database. This project develops and evaluates statistical tests that indicate when clustering is important; identifies measures of sequential clustering strength; and creates models of concordance generation that recognize clustering and thereby improve compression effectiveness. The models studied include Markov models and Bayesian learning models. Sequential clustering is widespread, and the results of this research should have implications well beyond data compression, for example in analyzing term occurrence to identify content-bearing terms for retrieval purposes. Thus this project promises direct benefits in improving our ability to store very large textual databases and, indirectly, in developing methodology of wider interest.