Online social networks (OSNs) have witnessed tremendous growth and popularity over the recent years. The huge success and increasing popularity of social networks makes it important to characterize and study their behavior in detail. Recent work in analyzing online social network data has focused primarily on either static social network structure or evolving social networks. However, popular OSNs sites provide mechanisms to form and maintain community over time by facilitating communication, content sharing, and other forms of activities.
This research will develop a suite of algorithmic and analytic methods to support the characterization and modeling of activity networks. In particular, we will conduct static and temporal characterization studies of social network activity, study sampling techniques that can preserve graph properties for different communication activity graphs, investigate the fundamental theoretical trade-offs between preserving different properties of the graph, and develop procedural modeling techniques to generate social network activity graphs to better represent the temporal dynamics and burstiness of activity patterns.
We firmly believe that the insights garnered from the proposed algorithm development and theoretical analysis will have a significant impact on sampling and analysis in other network-centric domains in addition to OSNs. All the ideas that come out of the research will be incorporated into both graduate as well as undergraduate level networking and data-mining courses.
This goal of this project included: (1) Conducting static and temporal characterization studies of social network activity graphs; (2) Development of sampling techniques to preserve graph properties; (3) Investigation of the fundamental theoretical trade-offs between preserving different graph properties and different sampling errors; (4) Development of procedural modeling techniques to generate networks and represent the temporal dynamics of activity patterns. We investigated these issues across a wide set of real-world datasets and developed a suite of algorithmic and analytic methods that support the characterization and modeling of activity networks. Specifically, we developed a framework to outline the general problem of network sampling, which highlights the different objectives, population and units of interest, and classes of network sampling methods. We proposed a spectrum of computational models for network sampling methods, ranging from the traditionally studied model based on the assumption of a static domain to a more challenging model that is appropriate for streaming domains. Within this framework, we designed a family of sampling methods based on the concept of graph induction that generalize across the full spectrum of computational models (from static to streaming) while efficiently preserving many of the topological properties of the input graphs. We also developed a generic framework for big-graph analytics called Graph Sample and Hold (gSH). The gSH framework samples from massive graphs sequentially in a single pass, one edge at a time, while maintaining a small state typically less than 1% of the total number of edges in the graph. gSH provides a generic framework for unbiased estimation of the counts of arbitrary subgraphs. By varying the dependence of sampling probabilities on previous history, one can tune the estimation of various properties of the original graph efficiently with arbitrary degrees of accuracy. Next, we formulated the task of "Active Exploration" (AE), which aims to identify all items in a network with a desired trait (i.e., positive labels) given only partial information about the network. The AE process iteratively queries for labels or network structure within a limited budget; thus, accurate predictions prior to making each query is critical to maximizing the number of positives gathered. However, the targeted AE query process produces partially observed networks that can exhibit extreme label correlation bias, which makes it difficult for conventional relational learning methods to accurately estimate relational parameters. To address this issue, we modeled probabilistic relationships among the border vertices and developed an efficient collective inference method to jointly infer the item labels at the same time as the missing edges. We also explored the problem of creating synthetic networks that match characteristics traditionally found in real world networks. We proposed an extension to the Chung Lu random graph model, the Transitive Chung Lu (TCL) model, which incorporates the notion transitive edges to preserve clustering. In addition, we investigated the Kronecker family of generative graph models (KPGM) due to their scalable sampling algorithms, which have facilitated analysis of large-scale, real-world networks. We showed that the scalable sampling methods---in contrast to prior belief---do not in fact sample from the underlying KPGM distribution and often result in sampling graphs that are very unlikely. To address this issue, we developed a new representation that exploits the structure of KPGMs and facilitates the development of novel grouped sampling methods that are provably correct. Another focus of this project has been on modeling network datasets that record the interactions and/or transactions among a set of entities. A notable characteristic of activity networks, is that the structure of the networks change over time and these temporal dynamics are key to understanding system behavior. We proposed a temporal behavior model that captures the "roles'" of nodes in the graph and how they evolve over time in a novel dynamic behavioral mixed-membership model (DBMM) suitable for large temporal networks. Our proposed DBMM technique allows us to investigate the properties of temporal networks and understand both global and local behaviors, detect anomalies, as well as predict future structural changes. The work described above, while published in data mining and machine learning venues, is likely to have a broad impact across network-centric fields where researchers study the properties of complex systems by modeling the data as a network or graph. In these fields, a deeper understanding of the statistical properties of current network analytic methods and improved algorithms to learn and/or sample from these models, will increase the accuracy and robustness of these investigations. Moreover, this project has had a small, but important, impact on the education and training of several URMs students. As these individuals graduate and move to academic and industrial STEM settings, they will serve as role models in the areas of machine learning, graph mining, and network science more broadly.