There has been an enormous amount of work in recent years directed toward understanding the structural and dynamical properties of "informatics graphs" or "complex networks." Most of this work has been on small to medium-sized networks, and it has led to an improved understanding of the properties of networks arising in many graph mining applications. In spite of this, formulating appropriate models for and answering even basic questions about larger informatics graphs remains challenging. For instance, recent work has shown that dynamic properties as well as basic structural properties of large informatics graphs are not reproduced even qualitatively by popular network generative models.

The proposed work will use traditional and recently-developed approximation algorithms for the graph partitioning problem as "experimental probes" of large informatics graphs, in order to characterize their structural and dynamic properties in a more robust and scalable manner. This will include extending and implementing recently-developed algorithms such as "local" spectral methods and algorithms that intuitively "interpolate" between spectral and flow-based methods, as well as revisiting traditional methods, such as the global spectral method and the ideas underlying the popular package Metis, in light of new applications. A central goal will be to provide the analyst with tools that have sufficient algorithmic and statistical flexibility to characterize the local and global structures of large networks in a rich and robust way.
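To make the spectral side of this concrete, the classical global spectral method can be sketched in a few lines. The following is a minimal illustration written for this summary, not any of the proposed algorithms: it computes the Fiedler vector of the normalized Laplacian, orders vertices by it, and returns the prefix ("sweep") cut of minimum conductance. The function name and the small test graph are our own.

```python
import numpy as np

def spectral_sweep_cut(A):
    """Global spectral partitioning (minimal sketch): order vertices by
    the Fiedler vector of the normalized Laplacian, then return the
    prefix ("sweep") cut of minimum conductance."""
    n = A.shape[0]
    d = A.sum(axis=1)                              # vertex degrees
    D_inv_sqrt = np.diag(1.0 / np.sqrt(d))
    L = np.eye(n) - D_inv_sqrt @ A @ D_inv_sqrt    # normalized Laplacian
    _, eigvecs = np.linalg.eigh(L)                 # eigenvalues in ascending order
    fiedler = D_inv_sqrt @ eigvecs[:, 1]           # second eigenvector, degree-rescaled
    order = np.argsort(fiedler)
    vol_total = d.sum()
    in_set = np.zeros(n, dtype=bool)
    best_phi, best_k = np.inf, 0
    for k in range(n - 1):                         # sweep over prefixes of the ordering
        in_set[order[k]] = True
        vol = d[in_set].sum()
        cut = A[in_set][:, ~in_set].sum()          # weight of edges crossing the cut
        phi = cut / min(vol, vol_total - vol)      # conductance of this prefix
        if phi < best_phi:
            best_phi, best_k = phi, k
    return {int(v) for v in order[: best_k + 1]}, best_phi
```

The "local" methods mentioned above differ in that they compute an analogous vector (e.g., an approximate personalized PageRank vector) using only the neighborhood of a seed node, so the cost depends on the size of the output cluster rather than of the whole graph.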

The Intellectual Merit of the proposed work lies in extending recent theoretical and algorithmic developments and applying them to very large real-world problems. The Broader Impact of the project lies in enhancing interdisciplinary education at Berkeley and Stanford and more generally. This will involve the organization of meetings and courses that will include the opportunity for research projects, including by students from underrepresented groups, that focus on bridging theoretical methods and real-world applications. For further information, see the project web page: http://cs.stanford.edu/people/mmahoney/graphmining/

Project Report

Given a corpus of documents, what topics does the corpus cover, and which topics appear in each document? Automatically discovering these topics with a computer is termed topic modelling. Numerous approaches to this problem have been proposed in the last few years. This project evaluated these techniques in a common framework and developed both more effective evaluation techniques and more effective methods. We decided that simply inspecting the topics a model outputs is interesting but hard to evaluate. Instead, we regarded a set of topics as useful if it could help us predict something new about a document; for example, other related topics, or related language that might be useful. Our framework therefore dropped half of each document and tried to recover information about the dropped half using the topics found by the various approaches. We found that recent sophisticated approaches were actually inferior to older, more established, but more straightforward approaches. We then examined various difficult cases and derived approaches, intuitive variations on the traditionally effective ones, that yielded still better topic models. In sum, we took a fresh look at topic modelling as a tool for prediction, found that this view reflected negatively on currently popular methods, and found an improved method which works better under this view of the world.
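The document-completion evaluation described above can be sketched in code. The following is a hypothetical illustration written for this report, not the project's actual implementation: given a topic-word matrix produced by any topic model, it "folds in" the observed half of a document with a short EM loop to estimate the document's topic mixture, then scores the held-out half under the resulting predictive word distribution. A higher held-out log-likelihood means the topics predicted the dropped half better. All names are our own, and the sketch assumes every word has nonzero probability under some topic.

```python
import numpy as np

def completion_log_likelihood(topics, observed, heldout, n_iter=50):
    """Document-completion evaluation (minimal sketch).

    topics:   (K, V) array of topic-word probabilities (rows sum to 1)
    observed: length-V word-count vector for the observed half
    heldout:  length-V word-count vector for the held-out half
    Returns the log-likelihood of the held-out half under the mixture
    fitted to the observed half.
    """
    K, _ = topics.shape
    theta = np.full(K, 1.0 / K)                    # initial topic mixture
    for _ in range(n_iter):
        # E-step: responsibility of each topic for each word type
        r = theta[:, None] * topics                # (K, V)
        r /= r.sum(axis=0, keepdims=True)
        # M-step: re-estimate the mixture from the observed counts
        theta = (r * observed).sum(axis=1)
        theta /= theta.sum()
    p_w = theta @ topics                           # predictive word distribution
    return float((heldout * np.log(p_w)).sum())
```

Under this metric, a topic model whose topics match the document's actual word usage scores strictly higher than an uninformative one, which is what lets the framework rank competing approaches.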

Agency: National Science Foundation (NSF)
Institute: Division of Information and Intelligent Systems (IIS)
Application #: 0963904
Program Officer: Vasant G. Honavar
Project Start:
Project End:
Budget Start: 2010-07-01
Budget End: 2013-06-30
Support Year:
Fiscal Year: 2009
Total Cost: $418,011
Indirect Cost:
Name: University of California Berkeley
Department:
Type:
DUNS #:
City: Berkeley
State: CA
Country: United States
Zip Code: 94704