Every network dataset poses unique challenges, but there is a growing toolkit of methods that can be adapted to specific situations. This research will extend that toolkit by continuing the development of machine learning ideas that perform aggressive local variable selection to fit local metrics, thus allowing nearby nodes to have similar models for edge formation, but distant nodes to have very different models. Also, the research will provide a new approach to goodness-of-fit assessment for network models, based upon minimum description length inference.

Network modeling has emerged as a critical methodology across many fields of science, including biochemistry, sociology, and Internet communication. Important applications include social interactions leading to fission in baboon troops, biochemical knowledge derived from protein-protein interactions, and insight into the growth and structure of the Wikipedia. This research will develop novel statistical models for network growth and new ways to assess how well they explain a given dataset.

Project Report

Network science is interdisciplinary. It draws tools and problems from mathematics, computer science, biology, sociology, statistics, and many other areas. This research program develops and applies new methodology to areas related to text networks, metabolic networks, and social networks.

Regarding text networks, this research studied (a) the Wikipedia, (b) cross-links in lists of civilian casualties in conflict regions, and (c) all blog posts in 2012 from the top 1,509 U.S. political blogs. Text networks are exciting because they unite two very modern threads in statistical research: text mining and dynamic network modeling. The goal is to let the text mining improve the network model and the network model improve the text mining.

For the Wikipedia data, an ongoing project is estimating the topics within the Wikipedia and the clique structure within the Wikipedia. These represent independent sources of information about the organization of human knowledge. Preliminary results from a subset of the Wikipedia articles suggest that there is generally strong modularity in topics and cliques, and that cliques often correspond directly to topics.

For the civilian casualty data, the texts are records of deaths, and the network structure (a hypergraph) determines which lists (newspapers, Human Rights Watch, etc.) contain each record. In terms of impact, a new end-to-end uncertainty analysis indicates that the civilian casualty count in Syria is about twice the figure of 100,000 that is conventionally reported in the media.

For the political blogosphere data, we focused particularly on the posts regarding the shooting death of Trayvon Martin. We found four major topics in the blog discussion: racism, the 2012 presidential campaign, the trial of George Zimmerman, and the events of the shooting. These topics were passed among the blogs in the network, often in response to the news cycle. Additionally, the conservative bloggers had communication patterns very different from those of the moderate and liberal bloggers.

Regarding the metabolic network, we used information on known pathways to improve accuracy in estimating the abundance of specific metabolites. Such estimates are important in diagnosing conditions such as diabetes, prostate cancer, ALS, early-onset Alzheimer’s, and other health problems. The approach uses Bayesian methods to incorporate both the network information and knowledge about the mass/charge ratio of important metabolites. We estimate that overall error is reduced by a factor of 6.

Regarding social networks, our research focused upon data on grooming relationships among baboons in the Amboseli National Reserve over a four-year period. The research questions were how baboon troops will split (typically, a troop splits about every 16 to 20 years) and whether baboons have a sense of politics (e.g., is an enemy of an enemy more likely to be a friend?). The analysis involved a novel dynamic model with a latent space representation for grooming relations, after accounting for grooming intensity that reflects kinship, genetics, gender, hierarchy, and rainfall (which drives the louse population). Our results confirm that baboon troops need not split along matrilines, and there is intriguing evidence that at least some of the baboons make strategic alliances.
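The "enemy of an enemy" question can be made concrete with a simple structural balance count. The sketch below is illustrative only: it is not the dynamic latent space model described in this report, and the function name `triad_balance`, the signed-edge encoding, and the toy data are all assumptions introduced here. A triad is treated as balanced when the product of its three edge signs is positive, so two hostile ties plus one friendly tie (an enemy of an enemy is a friend) counts as balanced.

```python
from itertools import combinations


def triad_balance(signs):
    """Count balanced and unbalanced triads in a signed, undirected graph.

    `signs` maps frozenset({u, v}) to +1 (friendly) or -1 (hostile).
    A triad is balanced when the product of its three edge signs is positive.
    Returns a tuple (balanced, unbalanced).
    """
    nodes = sorted({node for pair in signs for node in pair})
    balanced = unbalanced = 0
    for a, b, c in combinations(nodes, 3):
        edges = [frozenset(p) for p in ((a, b), (b, c), (a, c))]
        if not all(edge in signs for edge in edges):
            continue  # skip triples that do not form a complete triangle
        product = signs[edges[0]] * signs[edges[1]] * signs[edges[2]]
        if product > 0:
            balanced += 1
        else:
            unbalanced += 1
    return balanced, unbalanced


# Hypothetical toy data: four individuals, not drawn from the Amboseli study.
toy = {
    frozenset({"A", "B"}): -1,  # A and B are hostile
    frozenset({"B", "C"}): -1,  # B and C are hostile
    frozenset({"A", "C"}): +1,  # A and C are friendly (shared enemy B)
    frozenset({"A", "D"}): +1,
    frozenset({"C", "D"}): -1,
}
print(triad_balance(toy))
```

In real data one would compare the balanced share against what a suitable null model produces, since a raw count alone does not show that alliances are strategic.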

Agency: National Science Foundation (NSF)
Institute: Division of Mathematical Sciences (DMS)
Type: Standard Grant (Standard)
Application #: 1106980
Program Officer: Gabor J. Szekely
Budget Start: 2011-07-01
Budget End: 2014-06-30
Fiscal Year: 2011
Total Cost: $75,000
Name: Harvard University
City: Cambridge
State: MA
Country: United States
Zip Code: 02138