High-throughput gene expression profile measurements enabled by microarrays have spawned significant advances in functional genomics and systems biology. The vast number of microarray experiments conducted over the past decade has generated a wealth of expression data available from several public repositories. While algorithms for microarray data analysis and gene network inference have been well studied, most available methods and programs are sequential and cannot scale up to analyzing large numbers of experiments due to both memory and time constraints.

In this project, the investigators will develop high-performance, parallel computational methods for large-scale gene expression analysis and gene network inference utilizing tens of thousands of microarray experiments available in public repositories. The primary research goal is to develop the capability to simultaneously analyze the entire gamut of gene expression data available for an organism, and thereby to make biological discoveries and build robust, accurate networks that would not be possible through limited, compartmentalized analysis. The research will be carried out using gene expression profiles of the plant Arabidopsis thaliana, a well-studied model organism and the focus of the decade-long NSF Arabidopsis 2010 initiative. The investigators will develop 1) parallel algorithms for biclustering large gene expression matrices, 2) parallel algorithms for inferring gene networks using Mutual Information and Bayesian approaches, and 3) methods for querying and analyzing large-scale biological networks. The project will be led by an interdisciplinary team of investigators whose expertise spans parallel algorithms, scalable computing and software development, bioinformatics and systems biology, statistical analysis, microarray experimental techniques and analysis, and organism-specific knowledge of Arabidopsis. It will lead to the development of advanced computational methods and open-source software programs in systems biology.

Project Report

Genes act together in networks to execute various cellular functions in response to both internal and external stimuli. This phenomenon can be observed indirectly by measuring the expression levels of genes as they collectively execute various biological processes. This project is focused on building genome-scale gene regulatory networks from large collections of gene expression data. One can view gene expression observations as the output of a system and network inference as the problem of developing a system model that is consistent with the observations. Because of computational complexity, previous methods compromised on some or all of the following: size of the system, number of observations, quality of the inference method, and robustness through techniques such as permutation testing and bootstrapping. The primary goal of this project is to remove such limitations by developing techniques for network construction on parallel computers.

Under this project, we developed a parallel network construction technique (TINGe) based on mutual information theory that can build genome-scale networks of complex organisms (such as plants and animals/humans) using all available gene expression experimental data at once. This is significant because no currently available method achieves this feat. We demonstrated our technique by reconstructing a 15,495-gene network of the plant Arabidopsis thaliana from 3,137 gene expression experiments in just 9 minutes on a 1,024-core commodity cluster computer. In addition, we developed statistical methods and protocols for gene expression analysis, which are independently published for the benefit of the research community. We also developed a method (GeNA) to extract the subnetwork corresponding to a specific biological process by using known information about this process as a guide to navigate the whole-genome network.

Bayesian networks are another high-quality network modeling technique. Under this project, we developed the first parallel algorithm for exact Bayesian network structure learning (ParaBayL) and the first parallel heuristic algorithm for large-scale Bayesian network structure learning (PARABLE). While we applied both of these methods to constructing gene networks, they can also be used in any application involving Bayesian network structure learning.

As for broader impacts, all of the software produced under this project (TINGe, GeNA, ParaBayL, and PARABLE) is made available as open source. Using TINGe and GeNA, we analyzed 241 metabolic pathways in Arabidopsis and developed a web portal to make these available to the community for further study. Our analysis of each of these pathways contains a list of genes that we predict as having high potential for playing a role in the pathway. Thus, these resources provide valuable information for biologists working to improve their understanding of this organism. The project is a collaborative effort between three PIs in computer science, plant biology, and statistics, respectively. The project funded the work of a female Ph.D. student and a postdoctoral research associate, both of whom gained valuable interdisciplinary research experience. The student’s Ph.D. thesis is solely based on the work carried out under the project. The lead PI gave several invited and keynote presentations to inform the research community about results from this project.
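
To make the mutual-information approach with permutation testing more concrete, the following is a minimal sequential sketch in Python. It is not the TINGe implementation: the report does not describe TINGe's estimator or its parallelization, so a simple equal-width histogram estimator is swapped in here for illustration. The function names (`discretize`, `mutual_information`, `infer_network`), the bin count, the number of permutations, and the choice of the largest null value as the cutoff are all assumptions. For scale, a 15,495-gene network has 15,495 × 15,494 / 2 = 120,039,765 candidate gene pairs, which suggests why a sequential all-pairs loop like this one becomes impractical at genome scale.

```python
import math
import random
from collections import Counter
from itertools import combinations


def discretize(values, bins=4):
    """Map continuous expression values to equal-width bin indices."""
    lo, hi = min(values), max(values)
    width = (hi - lo) / bins or 1.0  # constant profiles fall into bin 0
    return [min(int((v - lo) / width), bins - 1) for v in values]


def mutual_information(x, y):
    """Plug-in mutual information estimate (in nats) of two discrete sequences."""
    n = len(x)
    px, py, pxy = Counter(x), Counter(y), Counter(zip(x, y))
    return sum(
        (c / n) * math.log(c * n / (px[a] * py[b]))
        for (a, b), c in pxy.items()
    )


def infer_network(expr, bins=4, permutations=200, seed=0):
    """Return edges (i, j, mi) whose MI exceeds a permutation-based cutoff.

    expr is a list of gene profiles, one list of expression values per gene.
    The cutoff (an assumption of this sketch) is the largest MI observed
    between a gene and a randomly shuffled copy of another gene.
    """
    rng = random.Random(seed)
    disc = [discretize(profile, bins) for profile in expr]

    # Build the null distribution by destroying any real dependence.
    null = []
    for _ in range(permutations):
        i, j = rng.sample(range(len(disc)), 2)
        shuffled = disc[j][:]
        rng.shuffle(shuffled)
        null.append(mutual_information(disc[i], shuffled))
    threshold = max(null)

    # Score every gene pair; this all-pairs loop is the expensive part.
    edges = []
    for i, j in combinations(range(len(disc)), 2):
        mi = mutual_information(disc[i], disc[j])
        if mi > threshold:
            edges.append((i, j, mi))
    return edges
```

Calling `infer_network(expr)` on a list of per-gene expression profiles returns the gene pairs whose estimated mutual information exceeds the permutation cutoff. Because each pair is scored independently of the others, the all-pairs loop is the natural place to distribute work across processors.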

Agency: National Science Foundation (NSF)
Institute: Division of Computer and Communication Foundations (CCF)
Application #: 0811804
Program Officer: Almadena Y. Chtchelkanova
Project Start:
Project End:
Budget Start: 2008-08-01
Budget End: 2012-07-31
Support Year:
Fiscal Year: 2008
Total Cost: $383,000
Indirect Cost:
Name: Iowa State University
Department:
Type:
DUNS #:
City: Ames
State: IA
Country: United States
Zip Code: 50011