The purpose of this project is to develop algorithms and tools for exploring and categorizing extremely large collections of documents, especially those from the World Wide Web. The technical approach is based on a new hierarchical divisive partitioning method that has produced good-quality clusters quickly in preliminary tests. The research issues to be addressed include scalability analysis, theoretical foundations, incremental updating methods, generalizations (such as handling missing values and different scaling), and interfaces to one or more Web agents for various applications. Given the project's interdisciplinary nature, educational seminars and tutorials are a natural part of it. The anticipated results are a set of algorithms and tools for organizing large document collections that offer (1) scalability to very large datasets, (2) unsupervised operation, and (3) reasonable quality and usefulness of the categories found. The anticipated benefits include an order-of-magnitude increase in the size of datasets from which useful categories can practically be extracted in an unsupervised manner. Potential applications include client-side WWW organization and search aids, server-side aids for creating document ratings in a consistent manner, and tools for maintaining and updating the organization and classification of the contents of specialized databases, all with a minimum of human intervention.

www.cs.umn.edu/~boley/PDDP.html

Agency: National Science Foundation (NSF)
Institute: Division of Information and Intelligent Systems (IIS)
Application #: 9811229
Program Officer: Maria Zemankova
Project Start:
Project End:
Budget Start: 1998-09-15
Budget End: 2002-08-31
Support Year:
Fiscal Year: 1998
Total Cost: $185,019
Indirect Cost:
Name: University of Minnesota Twin Cities
Department:
Type:
DUNS #:
City: Minneapolis
State: MN
Country: United States
Zip Code: 55455