The project proposes improvements in tools for analyzing and modeling of Protein-Protein Interaction (PPI) networks. A PPI network is a mathematical representation of the physical interactions amongst proteins in a cell: it is a graph in which nodes represent proteins and edges between the nodes represent possible physical interactions between the corresponding proteins. Currently available PPI networks are partial and mostly resulting from various high throughput proteomics experiments that attempt to discover as many PPIs as possible in a number of organisms. The resulting PPI data sets are large including tens of thousands of interactions amongst thousands of proteins even for a unicellular eukaryotic organism such as baker's yeast.

The project introduces new, more sensitive measures of network local structure that are based on enumeration of "graphlets", small induced subgraphs of large graphs, as well as on distinguishing during the enumeration between the nodes in a graphlet that cannot be mapped to one another by an automorphism (an isomorphism of a graph with itself). In this way, the degree distribution is generalized into the spectrum of "graphlet degree distributions". These new measures produce graphlet-based network "feature vectors" that are further used to heuristically compare synthetic (model based) networks to experimentally determined PPI networks. The project's preliminary results indicate that geometric random graphs provide a superior fit to the currently available PPI networks than do other network models including scale-free networks. The project proposes to build upon these preliminary results towards an efficient PPI network alignment heuristic that would incorporate biological information of the proteins in the networks being aligned, such as protein functional, structural, and amino-acid sequence similarity. Furthermore, the project proposes to use geometric random graph models of PPI networks as the basis for algorithmic strategies that would optimize the cost of experimentally exploring the entire space of protein-protein interactions in the cell. The use of geometric random graph models and associated graphlet-based network feature vectors can be extended to other network modeling applications, such as epithelial planar cell polarization, chemical informatics, social network analyses, etc.

It is expected that the project will lead to better algorithms for various network comparison problems on PPI and other networks. Exploitation of a network model may significantly reduce the cost of characterizing the interactomes of various organisms.

The PI is involved in training high school, undergraduate and graduate students, including two female students. This work is being included in various UC Irvine courses. The software will be provided as a free open-source toolkit for other researchers. Additional information will be available at www.ics.uci.edu/~natasha/.

Project Report

My work involves the analysis of biological networks. To understand how biological networks fit into the larger picture of the new science that has come to be known as "systems biology", let us start with a brief discussion of the Human Genome. Sequencing of the Human and other genomes over the past decade has provided us with an unparalleled understanding of biology and our relationship to other life on Earth, as well as crucial insights into disease, ageing, and health. However, the genome only provides the low-level "program" that is used to create us and our cells. Although a crucial step in understanding biology, it is only the first step in a complete understanding of how our bodies work, just as Newton's Three Laws of Motion were just the first step in the 300-year development of physics that continues to this day. Genome analysis has allowed us to gain an understanding into what each individual gene does. At its most basic level, a gene is used to encode a protein; it is the protein which then does the "real work" inside the cell. The next logical step is to understand how these genes and their respective proteins interact with each other. This network of interactions is part of the larger science of "systems biology", which is the study of how biological systems (such as cells and ultimately a whole organism) work, from the ground up. We can study the Human protein-protein interaction network and attempt to discover important sub-networks of connections related to health and disease, and try to discover how these sub-networks get broken and how they might be fixed by drugs. We can also study the networks of several different species and compare the networks to each other, just like genomes of Human and Mouse can be compared. However, there is an important difference between genome analysis and network analysis: genome analysis is technically easier. Whereas a detailed and precise comparison of two large genomes can be done within a few minutes or hours on a modern computer, for technical reasons the comparison of two networks is far, far more difficult. In principle, a detailed and precise comparison of two large networks would take more computer time than has ever been used in the history of mankind; in fact, such a comparison would take more computer time than will be available on Earth for the next several centuries! To tackle such large problems, we create approximate comparisons that can run in a reasonable time. However, such approximate methods must be hand-tailored to the particular problem at hand. An approximate method that works well on social networks may not work well on a biological network; one that works well on a protein interaction network may not work well on a gene regulatory network. Thus, to design fast algorithms that work well in practice, we need to have a detailed understanding of the networks that we want to study. Unfortunately, gaining such an understanding requires us to perform detailed and precise comparisons of the networks---a Catch-22 situation! This situation is resolved by creating mathematical models of the networks and showing that the algorithms can work well on such model networks, while also demonstrating that the model networks closely resemble the real networks in ways that can be easily defined. The development of such models is the essence of the scientific method: we create a hypothesis (real networks can be modeled in such-and-such a way), and then test the hypothesis against real data. Our work has resulted in several very interesting observations. First, we have performed the first-ever detailed comparison of large networks between several different species. Our methods have demonstrated that there is a large amount of common "wiring" or "topology" between many different species on Earth, from Baker's Yeast all the way to mammals and humans. This is an important discovery because it means that we can learn about Human Systems Biology by studying that of other "model" organisms like yeast, fruit fly, and mouse. Furthermore, comparison of biological networks between species can also measure the evolutionary "distance" between them, providing a new way to build the "tree of life" that defines the taxonomic relationships between all species on Earth. In addition to studying protein interaction networks, we have studied the networks of interactions in the brain. Our methods can also be used outside of biology, to study the relationships and dynamics of social networks, economic networks, and communications networks. The development of precise and detailed methods and techniques for efficiently studying and comparing these networks will enable us to push the boundaries of knowledge in biology, sociology, and economics in ways that would otherwise be impossible due to the large expense of comparing networks.

Agency
National Science Foundation (NSF)
Institute
Division of Information and Intelligent Systems (IIS)
Application #
0644424
Program Officer
Vasant G. Honavar
Project Start
Project End
Budget Start
2007-01-15
Budget End
2011-02-28
Support Year
Fiscal Year
2006
Total Cost
$447,697
Indirect Cost
Name
University of California Irvine
Department
Type
DUNS #
City
Irvine
State
CA
Country
United States
Zip Code
92697