In recent years, the notion that 'one gene makes one protein that functions in one signaling pathway' in mammalian cells has been shown to be overly simplistic. Recent evidence suggests that more than 50% of human genes produce multiple protein isoforms through alternative splicing and alternative usage of transcription initiation and/or termination sites. Notably, the disruption of many of these genes is implicated in cancer and several neuropsychiatric disorders. For the majority of human genes, the resulting protein isoforms are functionally different and can participate in different signaling pathways. However, nearly a decade after the completion of the human genome draft sequence, we still treat the 'gene' as the basic functional unit in a cell. We argue that the isoform-level gene products, 'transcript variants' and 'protein isoforms', are the basic functional units in a mammalian cell and, accordingly, that the informatics resources for managing and analyzing gene regulation data in mammalian cells should adopt 'gene isoform centric' rather than 'gene centric' approaches. We propose to build an informatics platform for understanding gene regulation at the isoform level by developing statistically rigorous bioinformatics resources for processing Next-Generation Sequencing (NGS) data. Recently, computational approaches that combine seemingly disparate experimental data have been successful in developing concise gene regulation models and transcriptional modules. We plan to extend these methodologies to perform integrative analysis of multiple high-throughput data sets currently being generated across different laboratories, including ours at Wistar, and to build computational models that predict the different transcriptional isoforms of mammalian genes and protein-DNA interactions at the isoform level.
We will apply innovative statistical modeling approaches that combine state-of-the-art meta-classification algorithms, such as Naive Bayes Tree, Bagging and LogitBoost, with Random Forest feature selection to classify different types of target promoters with good classification accuracy and reduced instability, in order to predict gene promoters and infer protein-DNA interactions from ChIP-seq data. The computational models and the derived information will be integrated into a novel database, which will serve as an in silico platform for transcriptional regulation studies. This will be accomplished by pursuing the following aims: (1) develop computational pipelines to identify the orthologous promoters, corresponding transcript variants and protein isoforms that are conserved between human and mouse; (2) develop efficient algorithms and informatics pipelines for integrative analysis of NGS datasets to predict the activity and expression of both known and novel promoters and their transcript variants in various tissues, developmental stages, and disease conditions; and (3) develop a web-accessible database for integrating the information generated. The development of these methods and user-friendly software will provide useful tools to better understand gene regulatory mechanisms in mammalian cells and, more importantly, how dysregulation of these mechanisms leads to a variety of diseases.
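To make the bagging idea mentioned above concrete, the following is a minimal sketch in plain Python, not the project's actual method. It assumes a toy binary task (active vs. inactive promoter) with two hypothetical features, CpG content and ChIP-seq signal, and uses one-level decision stumps as base learners. The data values and all function names are illustrative assumptions; Naive Bayes Tree, LogitBoost and Random Forest feature selection are not shown.

```python
import random
from collections import Counter


def fit_stump(rows, labels):
    """Return the (feature, threshold, polarity) split with the fewest training errors."""
    best = None
    for f in range(len(rows[0])):
        for t in sorted({r[f] for r in rows}):
            for polarity in (1, 0):
                # polarity 1: predict 1 when value >= t; polarity 0: the reverse
                errors = sum(
                    (polarity if r[f] >= t else 1 - polarity) != y
                    for r, y in zip(rows, labels)
                )
                if best is None or errors < best[0]:
                    best = (errors, f, t, polarity)
    return best[1:]


def stump_predict(stump, row):
    f, t, polarity = stump
    return polarity if row[f] >= t else 1 - polarity


def fit_bagging(rows, labels, n_estimators=25, seed=0):
    """Train each stump on a bootstrap resample (drawn with replacement)."""
    rng = random.Random(seed)
    n = len(rows)
    ensemble = []
    for _ in range(n_estimators):
        idx = [rng.randrange(n) for _ in range(n)]
        ensemble.append(fit_stump([rows[i] for i in idx], [labels[i] for i in idx]))
    return ensemble


def bagging_predict(ensemble, row):
    """Majority vote over the stumps in the ensemble."""
    votes = Counter(stump_predict(s, row) for s in ensemble)
    return votes.most_common(1)[0][0]


# Hypothetical toy data: (cpg_content, chipseq_signal); label 1 = active promoter.
rows = [
    (0.60, 5.0), (0.70, 7.5), (0.75, 6.0), (0.80, 9.0),
    (0.20, 0.5), (0.30, 1.0), (0.35, 2.0), (0.40, 1.5),
]
labels = [1, 1, 1, 1, 0, 0, 0, 0]

ensemble = fit_bagging(rows, labels)
print(bagging_predict(ensemble, (0.90, 8.0)))  # high CpG content, strong signal
print(bagging_predict(ensemble, (0.10, 0.3)))  # low CpG content, weak signal
```

Averaging votes over bootstrap resamples is what reduces the instability of a single classifier; a real analysis would use a tested library implementation and properly derived promoter features.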

National Institutes of Health (NIH)
National Library of Medicine (NLM)
Research Project (R01)
Study Section
Biomedical Library and Informatics Review Committee (BLR)
Program Officer
Ye, Jane
Northwestern University Chicago
Public Health & Prev Medicine
Schools of Medicine
United States
Jung, Segun; Bi, Yingtao; Davuluri, Ramana V (2015) Evaluation of data discretization methods to derive platform independent isoform expression signatures for multi-class tumor subtyping. BMC Genomics 16 Suppl 11:S3
Plasschaert, Robert N; Vigneau, Sebastien; Tempera, Italo et al. (2014) CTCF binding site sequence differences are associated with unique regulatory and functional trends during embryonic stem cell differentiation. Nucleic Acids Res 42:774-89
Pal, Sharmistha; Bi, Yingtao; Macyszyn, Luke et al. (2014) Isoform-level gene signature improves prognostic stratification and accurately classifies glioblastoma subtypes. Nucleic Acids Res 42:e64
Stong, Nicholas; Deng, Zhong; Gupta, Ravi et al. (2014) Subtelomeric CTCF and cohesin binding site organization using improved subtelomere assemblies and a novel annotation pipeline. Genome Res 24:1039-50
Pal, Sharmistha; Gupta, Ravi; Davuluri, Ramana V (2014) Genome-wide mapping of RNA Pol-II promoter usage in mouse tissues by ChIP-seq. Methods Mol Biol 1176:1-9
Bi, Yingtao; Davuluri, Ramana V (2013) NPEBseq: nonparametric empirical bayesian-based procedure for differential expression analysis of RNA-seq data. BMC Bioinformatics 14:262
Ota, Hiromitsu; Sakurai, Masayuki; Gupta, Ravi et al. (2013) ADAR1 forms a complex with Dicer to promote microRNA processing and RNA-induced gene silencing. Cell 153:575-89
Zhang, ZhongFa; Pal, Sharmistha; Bi, Yingtao et al. (2013) Isoform level expression profiles provide better cancer signatures than gene level expression profiles. Genome Med 5:33
Bhattacharjee, M; Gupta, Ravi; Davuluri, R V (2012) Estimation of Gene Expression at Isoform Level from mRNA-Seq Data by Bayesian Hierarchical Modeling. Front Genet 3:239