Modern technology has completely transformed the concept of data in the biological and information sciences. Data collections about the flow of information on the web, for instance, or about the regulatory and metabolic dynamics that drive cellular functionality, are extremely large and heterogeneous. These collections are often characterized as networks of websites or proteins, where directed edges denote information flow or chemical reactions, and where node information is described in terms of web pages or chains of amino acids. Discovering and managing knowledge in such collections is a key challenge. The goal of this proposal is to create novel computational and statistical approaches to store, search, and quantify patterns in large networks efficiently, and to explore the extent to which these new tools help address a number of important open problems and computational issues. The research plan includes theoretical, methodological, data analysis, and dissemination aspects.

The approach is to develop new models, methods, and algorithms for analyzing large biological and information networks with rich node information. New tools will be developed: to assess the complexity of networks; to compare the fit of alternative network models; to store information about both connectivity and nodes in a network efficiently; to calibrate informative priors for empirical Bayesian analyses, so that they reflect the reality of signaling both in metabolic networks and in the spread of news on the web; to estimate the effects of node information on the local connectivity in a network; and to infer influence potentials and diffusion channels in online information networks. The proposed research is focused on three specific technical tasks: (1) establishing a new representation of valued, multivariate networks based on statistical models; (2) developing a flexible family of probabilistic graphical models to link local connectivity in the network to high-dimensional node attributes; and (3) developing scalable algorithms to infer a non-observable network structure from multiple trails of informational artifacts on the network itself. In addition, two in-depth case studies will be developed to illustrate the potential of the proposed methodology. The first is an analysis of the effects of local influence patterns among online newspapers, news collectors, and blogs on the diffusion of news and information items. The second is an analysis of the effects of local perturbations of signaling in regulatory networks on global cellular responses, for many known functions, in organisms from bacteria to humans. Insights gained in tackling the case studies will in turn generalize and foster the development of the next wave of core methodology and theory in machine learning.

The proposed work meets an urgent need for the development of new and principled methods for analyzing massive amounts of network data, as well as the creation of large-scale data sets for testing and benchmarking, to the benefit of the community at large. The research plan is tightly integrated with an interdisciplinary educational program and with the development of a statistical machine learning curriculum, which will attract many undergraduates to research at the intersection of machine learning and the sciences, and will provide opportunities to actively encourage students from underrepresented groups to pursue careers in computer science and statistics. The team will distribute open-source software and set up websites to enable the community to use and build upon the tools.

Project Report

Modern technology has completely transformed the concept of data in the biological and information sciences. Data collections are often characterized as networks, for example of websites, where edges denote information flow and node information is described in terms of web pages. Computational and statistical approaches to store, search, and quantify patterns in large networks efficiently are needed. This research addressed the fundamental problem of estimating the effects of local structural elements of a network on outcomes of interest. We developed new methodology for analyzing large biological and information networks with rich node information. In particular: (1) We developed a model-based representation of valued networks, which relies on a family of statistical models that can be used for mapping network connectivity to global outcomes. (2) We developed joint models of connectivity and node information. These statistical models can be used for mapping network connectivity to high-dimensional node-specific attributes. Models in this family include parameters that quantify the extent to which node attributes are informative about network connectivity, and vice versa. (3) We developed a strategy to carry out network inference from information trails, based on a family of statistical models for characterizing the spread of information on a network from multiple information trails. We used these tools to characterize the influence network underlying the spread of news on the web, and to estimate local diffusion potentials of different websites, news collectors, and blogs.

Agency: National Science Foundation (NSF)
Institute: Division of Information and Intelligent Systems (IIS)
Application #: 1017967
Program Officer: Sylvia Spengler
Budget Start: 2010-08-01
Budget End: 2014-07-31
Fiscal Year: 2010
Total Cost: $513,780
Name: Harvard University
City: Cambridge
State: MA
Country: United States
Zip Code: 02138