In biology, network techniques have been applied to interpret the interactions between genes, including the physical interactions of proteins and the regulatory relationships between transcription factors and their targets. Although numerous methods have been developed to infer a network from expression data, several computational challenges remain unsolved: how to derive non-linear relationships between transcription factors and targets, how to properly decompose a network into individual sub-network modules, how to predict biologically significant genes via network-scale comparisons, how to integrate and use heterogeneous forms of biological interaction data to facilitate network analysis, and how to seamlessly visualize a large-scale network for interactive data mining. To address these problems, the primary goal of this project is to develop a software package, the Gini Network Analysis Toolkit (GNAT), that uses Gini-based methodologies, a family of mathematical solutions that have been widely used in economics, physics, information networks, and social networks to analyze non-normally distributed data. The core functional modules and algorithms in the GNAT include supervised machine learning methods to infer transcriptional networks, the Gini correlation coefficient to derive non-linear regulatory relationships, Gini regression analysis to decompose a time-series network, the Gini index to measure and compare the distributions of the network properties of modules and genes under different biological conditions, and, finally, system perturbation and decision tree analysis to discover biologically important genes. The PI will also develop a network explorer, BioNetscape, to efficiently organize and visualize the large volume of network data generated by the GNAT using the k-core decomposition algorithm, Ajax technology, and GPU (graphics processing unit) computing techniques.
The GNAT will be implemented in R and organized as a streamlined workflow to compensate for the shortcomings of traditional gene-scale transcriptome analysis methods.
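As a rough illustration of two of the Gini-based measures named above, the following minimal sketch computes a Gini index and a rank-based Gini correlation coefficient. It is not the GNAT implementation: it is written in Python rather than R purely for illustration, the function names (gini_index, average_ranks, covariance, gini_correlation) are hypothetical, and the correlation assumes one commonly used definition, cov(X, rank(Y)) / cov(X, rank(X)), which may differ from the form the project adopts.

```python
from math import exp


def gini_index(values):
    """Gini index of non-negative values: 0 means perfectly even."""
    xs = sorted(values)
    n = len(xs)
    total = sum(xs)
    if n == 0 or total == 0:
        raise ValueError("need at least one value and a positive sum")
    # Sorted-rank formula: G = 2 * sum(i * x_i) / (n * sum(x)) - (n + 1) / n
    weighted = sum(i * x for i, x in enumerate(xs, start=1))
    return 2.0 * weighted / (n * total) - (n + 1.0) / n


def average_ranks(values):
    """1-based ranks; tied values share the average of their positions."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    start = 0
    while start < len(order):
        end = start
        while (end + 1 < len(order)
               and values[order[end + 1]] == values[order[start]]):
            end += 1
        shared = (start + end) / 2.0 + 1.0
        for k in range(start, end + 1):
            ranks[order[k]] = shared
        start = end + 1
    return ranks


def covariance(a, b):
    """Population covariance of two equal-length sequences."""
    n = len(a)
    mean_a = sum(a) / n
    mean_b = sum(b) / n
    return sum((p - mean_a) * (q - mean_b) for p, q in zip(a, b)) / n


def gini_correlation(x, y):
    """Assumed rank-based form: cov(x, rank(y)) / cov(x, rank(x)).

    The measure is asymmetric: gini_correlation(x, y) and
    gini_correlation(y, x) can differ.
    """
    denom = covariance(x, average_ranks(x))
    if denom == 0:
        raise ValueError("x must not be constant")
    return covariance(x, average_ranks(y)) / denom


if __name__ == "__main__":
    # Hypothetical expression values: a regulator and an exponentially
    # responding target (a monotone but non-linear relationship).
    regulator = [0.0, 1.0, 2.0, 3.0, 4.0]
    target = [exp(v) for v in regulator]
    print(gini_correlation(regulator, target))
    print(gini_index([0, 0, 0, 1]))
```

Because this form of the coefficient uses the ranks of one variable and the values of the other, a strictly increasing but non-linear relationship such as the exponential one in the usage example still attains the maximum value, which is the kind of property that makes rank-aware measures attractive for non-normally distributed expression data.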
The GNAT software will greatly facilitate ongoing network development projects in plant research. The GNAT will be made available for integration into the iPlant Discovery Environment, The Arabidopsis Information Resource (TAIR), the Plant Expression Database (PLEXdb), and other consortium databases to enhance network analysis and gene discovery in plants. The GNAT will also be integrated into the Galaxy and GenePattern platforms to provide a user-friendly graphical interface. The source code and R packages will be released into the public domain for broader use in plant, animal, and microbial biology. To integrate research into education, the PI's laboratory will develop a web-based Virtual Next Generation Sequencing Workshop (VNW) for training biologists who are not specialists in bioinformatics to analyze genomic, epigenomic, transcriptomic, and small RNA data. The workshop courseware is composed of teaching materials prepared for the PI's class, self-practice datasets, and a virtual UNIX web console for training biologists to analyze different types of next-generation sequencing data with minimal requirements for programming skills. This project explicitly addresses cross-disciplinary research training at multiple levels and will encourage the participation of underrepresented groups in computer science and mathematics at the University of Arizona, who will work on answering biological questions. Students from the ASEMS (Arizona Science, Engineering, and Math Scholars) and IGERT programs at the University of Arizona will join the PI's team to develop the GNAT, BioNetscape, and VNW, and will use these tools in their research.