Many modern data collections, gathered to provide insight into matters of national interest such as medical and technological innovation, measure quickly evolving interactions in the context of a network, in addition to traditional unit-level measurements. This project develops an integrated research and educational program to enable scientific and quantitative analyses of interactions and other combinatorial measurements as they change over time. The technical problems addressed include, but are not limited to: an efficient representation that facilitates quantitative analyses of large-scale networks; models of how information and behavior evolve over time as a consequence of the network context in which they are embedded; and fast algorithms for estimating critical parameters in these models. These methods will be demonstrated on case studies exploring the diffusion of medical innovations among physicians and its impact on health; technological innovation dynamics in the United States and the role of non-compete agreements; and the estimation of point-to-point communications on a network from aggregate traffic that is passively monitored.

The presence of interactions and other combinatorial measurements as a source of observed variation in the data creates new statistical and inferential challenges. For instance, generalized linear model theory needs to be extended to responses on a network. The analysis of processes on a network often induces constraints that make the inferential problems ill-posed, since they involve a large number of unknown quantities relative to the few observations available. Estimation may require sampling from, and integrating over, extremely constrained parameter spaces. Importantly, interactions do not necessarily encode statistical dependence. In this sense, dealing with observed interactions requires original thinking; the data settings they entail are not amenable to analysis with classical methods, in which interactions are inferred as a means to encode dependence among unit-level observations. This project tackles these technical challenges with a statistical and machine learning approach. Anticipated technical results include, but are not limited to: (1) a new wavelet decomposition of multivariate and dynamic networks; (2) statistical models of diffusion of information on a given network, and models of inhomogeneous network dynamics in continuous time; (3) scalable estimation algorithms for these models; and (4) theoretical foundations of inference with big data. This research will be evaluated qualitatively and quantitatively, at Harvard and in collaboration with industrial partners.
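
As a toy illustration of why such problems can be ill-posed (a sketch that is not taken from the project itself), consider estimating point-to-point traffic from aggregate link counts. The three-leaf star topology, the node names, the helper `link_loads`, and the flow values below are all assumptions chosen only for illustration. Two different sets of origin-destination flows produce identical link counts, so the counts alone cannot identify the flows.

```python
# Hypothetical star network: leaves A, B, C all route traffic through one hub.
# Unknowns: 6 origin-destination (OD) flows between distinct leaves.
# Observations: traffic counts on the 6 directed links (leaf -> hub, hub -> leaf).


def link_loads(od_flows):
    """Aggregate OD flows into per-link traffic counts on the star network."""
    loads = {}
    for (src, dst), volume in od_flows.items():
        # Every flow crosses its source's uplink and its destination's downlink.
        loads[(src, "hub")] = loads.get((src, "hub"), 0) + volume
        loads[("hub", dst)] = loads.get(("hub", dst), 0) + volume
    return loads


# Two different, nonnegative sets of OD flows (illustrative values only).
flows_one = {("A", "B"): 3, ("A", "C"): 1,
             ("B", "A"): 2, ("B", "C"): 2,
             ("C", "A"): 1, ("C", "B"): 1}
flows_two = {("A", "B"): 4, ("A", "C"): 0,
             ("B", "A"): 1, ("B", "C"): 3,
             ("C", "A"): 2, ("C", "B"): 0}

loads_one = link_loads(flows_one)
loads_two = link_loads(flows_two)

# The flows differ, yet the passively monitored link counts coincide.
print("flows differ:", flows_one != flows_two)
print("link counts match:", loads_one == loads_two)
```

In this sketch the six link counts satisfy one linear constraint (total traffic into the hub equals total traffic out), so they carry at most five independent pieces of information about six unknowns. On realistic networks the gap between unknown flows and observed counts is typically much wider, which is one reason additional modeling structure is needed.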

The proposed research is integrated with an interdisciplinary educational program, which will attract undergraduates to research at the intersection of statistics and computer science, in the context of problems of national importance. It will provide opportunities to actively encourage students from underrepresented groups to pursue careers in statistics and computer science. Key elements of the educational program include the development of a statistical machine learning curriculum; lectures on YouTube available to everyone; tutorials at national and international conferences and workshops; and a monograph. Outreach activities include open-source software and web tools for the community at large, and a collaborative effort with industrial partners to leverage the new computational tools and algorithms to benefit their pools of users worldwide. Additional details regarding the project can be found at: www.fas.harvard.edu/~airoldi/career.html.

Agency: National Science Foundation (NSF)
Institute: Division of Information and Intelligent Systems (IIS)
Application #: 1149662
Program Officer: Sylvia Spengler
Project Start:
Project End:
Budget Start: 2012-07-01
Budget End: 2019-08-31
Support Year:
Fiscal Year: 2011
Total Cost: $495,208
Indirect Cost:
Name: Harvard University
Department:
Type:
DUNS #:
City: Cambridge
State: MA
Country: United States
Zip Code: 02138