This project proposes to explore integrated methodologies for pattern discovery and network analysis in multilingual text corpora. The interdisciplinary project team will focus on recent developments in network analysis of complex problems associated with visual query systems, topic discovery, anomaly detection, and rapid mining of complex time-stamped data, as a means of extending these approaches to noisy data from a range of disciplinary source materials. To examine the problems and solutions surrounding these major issues in current scholarship, the PIs have chosen three disparate datasets: Buddhist canonical texts (Chinese and Sanskrit); Irish studies journals (English and Gaelic); and Danish folklore (English and Danish). The research exercise to be performed is of a scale and complexity never before attempted. The project anticipates finding deficiencies in existing network analysis algorithms that deal with rich external data available on nodes and links, and developing new network analysis algorithms to overcome those deficiencies.

Project Report

This project advanced the way Humanities scholars analyze data, particularly textual data and their related metadata, and brought rich, real-world research problems from the Humanities to the attention of the computer science community, through the joint effort of PIs from both Computer Science and the Humanities. It examined how to discover communities of translators or storytellers and how to understand their formation, their development, and their position in a large Humanities corpus across space and time. Two large and complex Humanities corpora were used: a collection of Nordic folklore and the Chinese Buddhist canon. The complexity of the problems and the multilingual datasets in Chinese and Danish provide a much broader range of vocabularies and topics than most of the well-studied scenarios in Computer Science. By taking advantage of the metadata stored on each node and/or edge in the graph representation of the target corpus, "knowledge-infused" graphs enable navigation of rich cultural datasets, interrogation of patterns, and discovery of new areas for deeper investigation.

An initial analysis was made of the feasibility of using graph theory to quantify the impact of making reference works available when reading scholarly texts. The rationale is that any text corpus will mention persons, institutions, places, events, and other topics but will typically provide only limited explanation of them. Reference works, however, are designed to provide such explanations. It was found that applying explanatory resources to a text corpus greatly increases the quality and connectedness of the graphs representing that text, to an extent that depends on the relevance of the particular resource applied. This promises to be a significant step towards assessing the value added by providing explanatory resources and towards comparing the increased connectedness produced by alternative resources or any given combination of resources. In this approach, the ground truth provided by the Humanities data and scholars is an important asset for guiding the development of these methods and their instantiation in accessible plug-in tools that offer novel solutions to these challenges.
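The idea of measuring the connectedness gained from a reference work can be illustrated with a small sketch. It is not the project's method: the toy corpus, the toy reference entry, the function names, and the choice of "number of connected components" and "largest component size" as connectedness measures are all assumptions made for illustration. Texts are linked to the entities they mention, and a reference work contributes extra links between an entry and the entities used to explain it.

```python
from collections import defaultdict


def components(nodes, edges):
    """Return the connected components of an undirected graph as a list of sets."""
    parent = {n: n for n in nodes}

    def find(x):
        # Follow parent pointers to the root, halving the path as we go.
        while parent[x] != x:
            parent[x] = parent[parent[x]]
            x = parent[x]
        return x

    for a, b in edges:
        parent[find(a)] = find(b)

    groups = defaultdict(set)
    for n in nodes:
        groups[find(n)].add(n)
    return list(groups.values())


def build_graph(mentions, reference=None):
    """Build a text-entity graph, optionally enriched with reference-work links."""
    nodes = set(mentions)
    edges = []
    for text, entities in mentions.items():
        for entity in entities:
            nodes.add(entity)
            edges.append((text, entity))
    for entry, related in (reference or {}).items():
        for other in related:
            nodes.update((entry, other))
            edges.append((entry, other))
    return nodes, edges


def summary(mentions, reference=None):
    """Return (number of components, size of the largest component)."""
    nodes, edges = build_graph(mentions, reference)
    comps = components(nodes, edges)
    return len(comps), max(len(c) for c in comps)


# Hypothetical toy corpus: text id -> entities mentioned in that text.
mentions = {
    "text_A": {"person_1"},
    "text_B": {"place_1"},
    "text_C": {"person_2"},
    "text_D": {"person_2", "place_2"},
}

# Hypothetical reference work: the entry for person_1 explains it via place_1.
reference = {"person_1": {"place_1"}}

before = summary(mentions)            # corpus alone
after = summary(mentions, reference)  # corpus plus reference work
print("before:", before, "after:", after)
```

In this toy case the single reference entry joins two otherwise separate clusters of texts, so the component count drops while the largest component stays the same size. Comparing such before and after figures for different reference works is one simple way to express the relative value of alternative resources.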

Agency: National Science Foundation (NSF)
Institute: Division of Information and Intelligent Systems (IIS)
Type: Standard Grant (Standard)
Application #: 0970179
Program Officer: William Bainbridge
Budget Start: 2010-09-01
Budget End: 2012-08-31
Fiscal Year: 2009
Total Cost: $299,982
Name: University of California Berkeley
City: Berkeley
State: CA
Country: United States
Zip Code: 94704