This award is funded under the American Recovery and Reinvestment Act of 2009 (Public Law 111-5).
Probabilistic graphical models provide a powerful mechanism for representing and reasoning with uncertain information. These methods have been successfully applied in diverse domains such as bioinformatics, social networks, sensor networks, robotics, and web mining; in turn, such application areas have posed new computational challenges driving graphical model research. This project is motivated by challenges in emerging application areas such as epidemiological simulation, geoscience modeling, and studies of interacting proteins, where there are rich sets of information of multiple types and at multiple levels of granularity. While the methods developed will be general, the research will focus on protein-protein interactions, which drive the molecular machinery of the cell by forming transient or persistent complexes to propagate signals, catalyze reactions, transport molecules, and so forth. The mixed-mode information available includes amino acid sequences, three-dimensional structures and associated physical models, and binary, rank-ordered, or even quantitative interaction data. The proposed techniques address key challenges in information integration, prediction, and generation using graphical models.
Intellectual merits: The intellectual merits of this work derive both from the new capabilities for information integration and reasoning with probabilistic graphical models and from their application to the study of protein-protein interactions. Proteins offer some of the most complex, multi-faceted datasets for integration using computational methods; hence the lessons learned here can be applied to similarly rich information spaces, such as epidemiology and the geosciences. These integrated models of interacting proteins and new algorithms for prediction and generation will also support significant applications such as protein engineering and systems biology, bridging interaction networks to the underlying residue-level interactions in order to better understand and control them.
Broader impacts: This project will reach out to both the bioinformatics and larger computer science communities to maximize the impact of its contributions. An open-source integrator platform will be developed for integrating protein datasets; it can also be extended to information integration in other domains. To stimulate community building and foster discovery, the research team will advocate situating computer science research in the context of concrete applications. Building on prior successes, the team will organize a workshop at a suitable venue such as ICML/AAAI/NIPS/KDD focused on an 'information integration challenge' dataset involving protein modeling. Finally, through programs such as Women@SCS at Carnegie Mellon, WISP (Women in Science Program) at Dartmouth, Howard Hughes education grant internships at Purdue, and the MAOP/VTURCS (Minority Academic Opportunities Program and VT Undergraduate Research in Computer Science) program at Virginia Tech, the team will provide cross-disciplinary training to undergraduate students from underrepresented groups.
Keywords: Probabilistic Graphical Models, Information Integration, Mixed-Mode Datasets, Bioinformatics, Proteins, Markov Chain Monte Carlo (MCMC) methods.