Frequent subgraph mining is a core task in data mining which can be applied to various real-life problems related to graphs and networks. Presently, the research value of this task has been heightened by the increased availability of massive network data in the domains of life and social sciences. However, existing algorithms for subgraph mining suffer from various limitations; noteworthy among these are lack of scalability, lack of user interaction and the absence of a mechanism to mine dynamic graphs. This research aims to overcome the above limitations by accomplishing the following three related tasks: (1) use of Monte Carlo sampling mechanisms for designing scalable graph mining algorithms; (2) develop real-time interactive graph mining systems using subgraph sampling approaches; and (3) discover models for graph evolution that are based on sampling and driven by the principles of game theory and economics.

This research builds a novel paradigm for subgraph mining that is based on Monte Carlo sampling. This allows the development of algorithms that are scalable, by avoiding the need to enumerate all subgraph patterns. The resulting algorithms will be applied to subgraph mining problems in systems biology, e.g., predicting disease pathways by mining graphs from genomics and proteomics co-expression networks. A second outcome of this research is an interactive pattern mining framework using subgraph sampling where user feedback guides updates of the sampling distribution such that subsequent sampling prioritizes patterns that are considered "interesting" to the user. A third outcome is a subgraph sampling method that uses a game theoretic mechanism to design a subgraph evolution model for prediction tasks (such as link prediction) in dynamic networks.

Broader Impacts: Availability of tools for mining large graphs enables the opportunity to build network biomarkers, which are novel signatures for disease diagnosis and risk factor analysis. A sampling based interactive pattern system is instrumental to mine "interesting" associations between diseases and medicines from numerous hidden datasets that are currently unexplored in many hospitals and health clinics. Scalable graph mining algorithms are also likely to find use in search, e-commerce and social networks based industry. The educational goal of this research is to leverage the PI's industrial experience to develop a "Large-scale data analysis" course on methods needed to build data mining systems that work on industry-scale data.

Additional information about the project, including the findings, methods, open source implementations of algorithms, publications and data can be accessed through the project website at www.cs.iupui.edu/~alhasan/graph_mining.

Agency
National Science Foundation (NSF)
Institute
Division of Information and Intelligent Systems (IIS)
Application #
1149851
Program Officer
Sylvia Spengler
Project Start
Project End
Budget Start
2012-03-01
Budget End
2017-02-28
Support Year
Fiscal Year
2011
Total Cost
$547,427
Indirect Cost
Name
Indiana University
Department
Type
DUNS #
City
Bloomington
State
IN
Country
United States
Zip Code
47401