In the context of social networks, "big data" generally involves information on very large social systems whose elements of interest display complex dependence. State-of-the-art statistical models for such systems require computationally expensive stochastic simulation techniques to capture this dependence, and these techniques do not generally scale well to the large-population case. One potential solution is to focus detailed modeling efforts on smaller subpopulations (e.g., groups or communities) extracted from the larger system. While scaling the subsystem models is less challenging in this case, one must have appropriate methods for sampling from large networks in a way that permits principled inference, as well as modeling techniques that recognize the coupling between local subpopulations and the broader network in which they are embedded.

The PIs will bridge the gap between expensive, highly detailed models and the limits of computability imposed by big data by combining expertise from machine learning and social network modeling within a unifying exponential family framework. The research will develop novel methods for the scalable measurement and analysis of large social networks, validating these techniques by deploying them in the context of dynamic data collection from online social networks. Specifically, the researchers will combine probabilistic graphical models and exponential family random graph models (ERGMs) to: (i) identify models with low computational requirements by exploiting limited-range dependence; (ii) develop machine learning techniques for identifying weakly coupled regimes in large networks to facilitate sampling and subgraph modeling; and (iii) develop integrated sampling and modeling strategies for inference from subgraphs of large networks that capture coupling to the structures in which they are embedded. The project investigates these questions in both the cross-sectional and dynamic contexts, for networks with and without vertex attributes. The sampling techniques created via this project will be deployed as an extension of a broader infrastructure for data collection in online social networks developed and maintained by one of the PIs, allowing for evaluation in a practical setting.

The methods developed via this research will allow for analysis of data relating to many problems of public interest, including epidemiological, security, and emergency management applications; data collection and analysis activities within the project will include applications in the natural hazard context, with the potential to inform policies that can save lives and property during disasters. The project will be integrated with graduate and undergraduate education, as well as postdoctoral mentoring. Tools developed via this project will be released as part of a widely used open-source toolkit for statistical network analysis (statnet), allowing widespread dissemination to researchers and practitioners in a range of fields.

Agency: National Science Foundation (NSF)
Institute: Division of Information and Intelligent Systems (IIS)
Type: Standard Grant (Standard)
Application #: 1251267
Program Officer: Sylvia Spengler
Budget Start: 2013-09-01
Budget End: 2016-08-31
Fiscal Year: 2012
Total Cost: $746,783
Name: University of California Irvine
City: Irvine
State: CA
Country: United States
Zip Code: 92697