Gene Set Enrichment Analysis (GSEA) aims to identify essential pathways or, more generally, sets of biologically related genes that are involved in complex human diseases. Owing to its many advantages, GSEA has proved crucial in systems biology studies that can lead to an integrated understanding of the fundamental biological processes underlying disease pathogenesis, and of the elements defining therapeutic targets and responses to treatment selections. However, despite its potential importance in promoting human health, it is striking that conclusions of GSEA drawn from isolated studies are often sparse, and different studies may lead to inconsistent and sometimes contradictory results. This problem is largely related to the following limitations. First, studies have shown that isoform-specific expression variations play important roles in complex human diseases, yet the microarray technology traditionally used for mRNA profiling often lacks the resolution needed to measure isoform-specific expression. Second, sample sizes of individual genome-wide transcriptomic studies are typically insufficient relative to the overwhelming number of genes. With the advent of next generation sequencing (NGS) technologies, it has become possible to measure genome-wide isoform-specific expression levels, calling for next-generation innovations that can utilize this unprecedented resolution. Further, enormous amounts of data have been generated by various microarray and RNA-Seq experiments, and the volume continues to grow rapidly. All of this gives rise to a tremendous demand for methods of integrative GSEA (iGSEA) that explicitly utilize isoform-specific expression and combine multiple relevant studies, in order to avoid indecisive or potentially conflicting conclusions from individual datasets and thereby enhance the power, reproducibility and interpretability of the analysis.
The goal of this project is to develop novel statistical methods and bioinformatics tools for iGSEA that efficiently synthesize diverse mRNA expression data from studies involving newly emerging RNA-Seq experiments as well as conventional microarray experiments, with an emphasis on integrating isoform-specific expression.
In Aim 1, we will develop an innovative meta-analysis method for iGSEA using isoform-specific expression. Specifically, we will incorporate ideas from fixed-effects and random-effects models, recently proposed and tested for meta-analysis of genome-wide association studies, into iGSEA, in order to achieve the maximum possible statistical efficiency while allowing for the inclusion of heterogeneous studies.
Aim 2 will propose robust meta-analysis methods to integrate both isoform- and gene-level expression data from a variety of sources.
Aim 3 will develop a fully integrated Bayesian method to incorporate existing biological information more effectively. A powerful Bayesian hierarchical approach will be proposed to jointly model different sources of information. This will not only substantially improve the power of iGSEA, but also simultaneously reveal interesting genes and gene sets, as well as the "responsible" isoforms of each identified gene.
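As a rough illustration of the fixed-effects and random-effects ideas that Aim 1 draws on, the sketch below combines per-study effect estimates for a single isoform using standard inverse-variance weighting and the DerSimonian-Laird estimate of between-study variance. This is a generic textbook construction, not the method the project proposes. The function names and the inputs (`effects` as per-study effect estimates, `variances` as their sampling variances) are assumptions made for this example.

```python
def fixed_effect(effects, variances):
    """Inverse-variance weighted fixed-effect estimate and its variance.

    effects:   hypothetical per-study effect estimates for one isoform
    variances: the corresponding sampling variances (all > 0)
    """
    weights = [1.0 / v for v in variances]
    total = sum(weights)
    estimate = sum(w * y for w, y in zip(weights, effects)) / total
    return estimate, 1.0 / total


def random_effects(effects, variances):
    """DerSimonian-Laird random-effects estimate, its variance, and tau^2."""
    k = len(effects)
    weights = [1.0 / v for v in variances]
    total = sum(weights)
    fe_estimate, _ = fixed_effect(effects, variances)

    # Cochran's Q measures heterogeneity around the fixed-effect estimate.
    q = sum(w * (y - fe_estimate) ** 2 for w, y in zip(weights, effects))
    c = total - sum(w * w for w in weights) / total

    # Method-of-moments between-study variance, truncated at zero.
    tau2 = max(0.0, (q - (k - 1)) / c) if c > 0 else 0.0

    # Re-weight each study by its total (within + between) variance.
    re_weights = [1.0 / (v + tau2) for v in variances]
    re_total = sum(re_weights)
    estimate = sum(w * y for w, y in zip(re_weights, effects)) / re_total
    return estimate, 1.0 / re_total, tau2
```

When the studies agree, the estimated between-study variance is zero and the two estimates coincide. When they disagree, the random-effects variance grows, which is the usual trade-off between efficiency and tolerance of heterogeneous studies.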
To understand the molecular mechanisms underlying complex human diseases, one important task is to identify groups of related genes that are combinatorially involved in such biological processes, mainly through Gene Set Enrichment Analysis (GSEA). Many statistical methods have been developed for GSEA, and many studies have shown that GSEA is a very useful bioinformatics tool that plays a critical role in the innovation of disease prevention and intervention strategies. However, at the dawn of a new big data era, there is an increasingly urgent need to perform integrative GSEA (iGSEA), i.e., to integrate multiple relevant GSEA studies and turn individual data into collective knowledge. The goal of this project is to develop a comprehensive set of statistical methods and computational tools critically needed for iGSEA involving multiple RNA-Seq and/or microarray datasets. These methods and tools will allow utilization of isoform-specific expression, integration of mixed mRNA data from different technologies, and incorporation of elaborate biological information, to promote the power, reproducibility and interpretability of the analysis.