The complex behavior of the cell derives from an intricate network of molecular interactions of thousands of genes and their products. Understanding how this network operates and predicting its behavior are primary goals of biology and have broad implications for life science, medicine and biotechnology.

The genomic information revolution of the last ten years has enabled new systems-level and data-driven approaches for studying cellular networks. In particular, using machine learning to model gene regulatory networks (the switching on and off of genes by regulatory proteins that bind to non-coding DNA) has emerged as a central problem in systems biology. Now, an explosion of new high-throughput technologies for measuring physical interactions between proteins and between proteins and DNA poses a new data integration challenge for computational modeling of gene regulation. These new data can all be viewed as graph-structured data, or physical interaction networks.

The central computational goal of this project is to develop new machine learning algorithms for exploiting graph-structured data, including: (1) boosting with efficient graph mining; (2) graph kernels based on subgraph histogramming; and (3) information-based graph partitioning. These new algorithms will be used to integrate physical interaction network data into models of gene regulation in order to better represent underlying biological mechanisms. The focus will be on two fundamental modeling problems: inferring signal transduction pathways and modeling cis-regulatory modules at the level of DNA sequence and interacting regulatory proteins. The algorithms will be applied both to publicly available data and to primary gene expression data provided by one of the investigators to study hypoxia in yeast and the response to environmental toxins in mammalian neural cells.
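
As a rough illustration of what item (2) could look like, the sketch below builds a histogram of small labeled subgraphs (single edges and two-edge paths) for each graph and takes the inner product of two histograms as the kernel value. This is only a minimal sketch under stated assumptions, not the project's algorithm: the choice of subgraph types, the node labels "protein" and "dna", and the names `subgraph_histogram` and `histogram_kernel` are all hypothetical.

```python
from collections import Counter
from itertools import combinations


def subgraph_histogram(labels, edges):
    """Count small labeled subgraphs of an undirected graph.

    labels: dict mapping node id -> node label (hypothetical labels).
    edges:  iterable of (u, v) pairs; duplicates and self-loops are ignored.
    """
    unique_edges = {frozenset((u, v)) for u, v in edges if u != v}
    adjacency = {node: set() for node in labels}
    hist = Counter()
    for edge in unique_edges:
        u, v = tuple(edge)
        adjacency[u].add(v)
        adjacency[v].add(u)
        # Single labeled edge, with endpoint labels in canonical order.
        hist[("edge",) + tuple(sorted((labels[u], labels[v])))] += 1
    for center, neighbors in adjacency.items():
        # Two-edge path through `center`, with end labels in canonical order.
        for a, b in combinations(neighbors, 2):
            ends = tuple(sorted((labels[a], labels[b])))
            hist[("path", labels[center]) + ends] += 1
    return hist


def histogram_kernel(h1, h2):
    """Inner product of two subgraph histograms."""
    return sum(count * h2[key] for key, count in h1.items())


# Toy interaction networks (made-up data, for illustration only).
g1_labels = {"p1": "protein", "p2": "protein", "d1": "dna"}
g1_edges = [("p1", "p2"), ("p1", "d1")]
g2_labels = {"a": "protein", "b": "protein", "c": "dna", "d": "dna"}
g2_edges = [("a", "b"), ("a", "c"), ("b", "d")]

h1 = subgraph_histogram(g1_labels, g1_edges)
h2 = subgraph_histogram(g2_labels, g2_edges)
k12 = histogram_kernel(h1, h2)
```

Because the kernel is an inner product of count vectors, it is symmetric and positive semidefinite, so in principle it could be plugged into any kernel method; richer subgraph families would change only the histogram step.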

This project will learn systems-level models that lead to new insight into the underlying mechanisms of gene regulation and open the way to broader biological discoveries. All data, results and source code will be publicly available via the Web (www.cs.columbia.edu/compbio/cellular-networks) and disseminated through courses and bioinformatics software packages. The project will also create undergraduate research opportunities for joint dry and wet lab projects, as well as outreach activities to introduce New York City public high school students to new interdisciplinary areas of science.

Agency: National Science Foundation (NSF)
Institute: Division of Information and Intelligent Systems (IIS)
Application #: 0705580
Program Officer: Frank Olken
Project Start:
Project End:
Budget Start: 2007-08-15
Budget End: 2008-07-31
Support Year:
Fiscal Year: 2007
Total Cost: $253,905
Indirect Cost:
Name: Columbia University
Department:
Type:
DUNS #:
City: New York
State: NY
Country: United States
Zip Code: 10027