This research develops improved methods for the automatic processing of text documents. It focuses on two text-handling problems: lossless data compression and classification. Lossless data compression reduces the storage and transmission requirements of large text documents. It is also an important step in sending text securely via cryptography. Classification arises frequently in web search and in forensics, where the goal is to determine the authorship of an anonymous document.

The approach is to gain new insight into these problems by using a novel asymptotic regime in information theory, one that models real text sources more accurately than classical models do. Specifically, the investigators consider a regime in which the size of the data set and the size of the source alphabet are comparably large. It can be shown that most classical information-theoretic techniques fail in this "rare-events" regime. Nonetheless, new techniques tailored to this regime can be developed, and these yield new algorithms and insights.
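
As a rough illustration of why classical techniques can struggle when the alphabet and the data set are comparably large, the sketch below compares the standard plug-in (empirical) entropy estimate with the true entropy of a source. The uniform source, the plug-in estimator, and the names `plugin_entropy_bits` and `entropy_gap` are assumptions chosen for illustration only; they are not the investigators' methods.

```python
import math
import random
from collections import Counter


def plugin_entropy_bits(samples):
    """Empirical (plug-in) entropy of a sample, in bits."""
    n = len(samples)
    counts = Counter(samples)
    return -sum((c / n) * math.log2(c / n) for c in counts.values())


def entropy_gap(alphabet_size, num_samples, seed=0):
    """True entropy minus the plug-in estimate for a uniform source.

    The uniform source is an illustrative assumption; its true entropy
    is log2(alphabet_size) bits per symbol.
    """
    rng = random.Random(seed)
    samples = [rng.randrange(alphabet_size) for _ in range(num_samples)]
    return math.log2(alphabet_size) - plugin_entropy_bits(samples)


# Classical regime: small alphabet, many samples per symbol.
classical_gap = entropy_gap(alphabet_size=4, num_samples=10_000)

# Rare-events-style regime: alphabet size comparable to the sample size,
# so most symbols are seen only a handful of times, if at all.
rare_events_gap = entropy_gap(alphabet_size=10_000, num_samples=10_000)

print(f"classical regime gap:   {classical_gap:.4f} bits")
print(f"rare-events regime gap: {rare_events_gap:.4f} bits")
```

In the first setting the estimate should land very close to the true entropy, while in the second it falls well short of it, because a sample of size comparable to the alphabet cannot reveal how probability is spread over symbols that appear rarely or never.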

Budget Start: 2008-09-15
Budget End: 2012-08-31
Fiscal Year: 2008
Total Cost: $225,001
Name: University of Illinois Urbana-Champaign
City: Champaign
State: IL
Country: United States
Zip Code: 61820