This award is funded under the American Recovery and Reinvestment Act of 2009 (Public Law 111-5). This study addresses a core problem in biology: predicting function from massive and diverse experimental data that identify individual cell components, such as proteins, or suggest interactions among them. The approach will model all such information as networks. New network analysis methods will then be developed to overcome conflicting information due to experimental errors or inherent biological complexity. Other methods will aim to weight the different types of information optimally so that they can best be combined. The result will pool biological data on a dramatically larger scale than previously feasible to yield a self-consistent and improved picture of protein function. More broadly, the network analysis techniques developed here should apply widely and efficiently to any massive, diverse and conflicting data typical of complex systems.
Specifically, the investigators will integrate information from frustrated networks by diffusing diverse evolutionary, structural and functional data along the edges of protein graphs. To cope with large network sizes and with data inconsistencies or alternative interpretations, semi-supervised learning algorithms will be extended, in aim 1, to test prediction accuracy under different network information diffusion mechanisms and, in aim 2, under alternative weighting strategies to pool complementary information networks. The outcome will (1) implement realistic biological networks with millions of nodes and edges to benchmark computational efficiency; (2) predict protein functions and the phenotypes they induce based on integrated, massive and heterogeneous biological data sets; and (3) extend graph-based semi-supervised learning to a broad, cross-disciplinary class of complex networks with random interactions. Because high computational efficiency will be maintained even in networks with spin-glass-type frustration, these results will be transformative across a wide variety of fields. Students will be trained at the interface between computational science and biology, and the software tools developed will be made publicly available to the research community.
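
As an illustration only, the two ideas named above (pooling complementary networks by weighting, then diffusing known annotations along edges) can be sketched with a standard graph-based semi-supervised scheme. The abstract does not specify the investigators' algorithms, so everything here is an assumption: the function names `combine_networks` and `diffuse_labels`, the network weights, the damping parameter `alpha`, and the toy four-protein graph are hypothetical, and the update rule is the generic normalized label-spreading iteration. The sketch also assumes non-negative edge weights, so it does not model frustration from conflicting (signed) edges.

```python
import numpy as np

def combine_networks(networks, weights):
    """Pool several adjacency matrices into one by a weighted sum.

    The weights are hypothetical, user-chosen values; choosing them well
    is the kind of question the weighting strategies would address.
    """
    combined = np.zeros_like(networks[0], dtype=float)
    for weight, adjacency in zip(weights, networks):
        combined += weight * adjacency
    return combined

def diffuse_labels(adjacency, labels, alpha=0.8, n_iter=200):
    """Spread known labels along edges (generic label-spreading iteration).

    adjacency: symmetric, non-negative (n, n) array.
    labels: length-n array, +1 / -1 for annotated nodes and 0 for unknown.
    alpha: fraction of each score taken from neighbours at every step.
    """
    degree = adjacency.sum(axis=1)
    degree[degree == 0] = 1.0  # isolated nodes: avoid division by zero
    inv_sqrt = 1.0 / np.sqrt(degree)
    # Symmetrically normalized adjacency, D^{-1/2} W D^{-1/2}.
    normalized = adjacency * inv_sqrt[:, None] * inv_sqrt[None, :]
    scores = labels.astype(float).copy()
    for _ in range(n_iter):
        scores = alpha * (normalized @ scores) + (1.0 - alpha) * labels
    return scores

# Toy example: two evidence networks over four proteins (hypothetical data).
adj_a = np.array([[0, 1, 0, 0],
                  [1, 0, 0, 0],
                  [0, 0, 0, 1],
                  [0, 0, 1, 0]], dtype=float)  # edges 0-1 and 2-3
adj_b = np.array([[0, 0, 0, 0],
                  [0, 0, 1, 0],
                  [0, 1, 0, 0],
                  [0, 0, 0, 0]], dtype=float)  # edge 1-2

pooled = combine_networks([adj_a, adj_b], weights=[1.0, 1.0])
labels = np.array([1.0, 0.0, 0.0, -1.0])  # protein 0 annotated +1, protein 3 annotated -1
scores = diffuse_labels(pooled, labels)
```

In this toy case the unannotated proteins 1 and 2 receive scores whose signs follow their nearest annotated neighbour. Dense matrices are used purely for readability; at the scale of millions of nodes and edges mentioned above, a sparse representation would be needed.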