Many diseases are understudied because they are rare or of little public interest. The effect of each understudied disease may be limited, but the cumulative effects of all these diseases could be profound. One common research challenge for these diseases is that the resources allocated to each are often limited. For instance, large-scale screening of drugs is often challenging, if not impossible, in small labs. The decreasing costs of next-generation sequencing make it possible to generate gene expression profiles of understudied disease samples. Integrating these expression profiles with other open data provides tremendous opportunities to gain insights into disease mechanisms and identify new therapeutics for understudied diseases. We have utilized a systems-based approach that employs gene expression profiles of disease samples and drug-induced gene expression profiles from cancer cell lines to predict new therapeutic candidates for hepatocellular carcinoma, Ewing sarcoma and basal cell carcinoma. All these candidates were successfully validated in preclinical models. The success of this approach relies on multiscale procedures, such as quality control of disease samples, selection of appropriate reference tissues, evaluation of disease signatures, and weighting of cell lines. Many relevant datasets and analysis modules are publicly available, yet they are isolated in distinct silos, making it tedious to implement this approach in translational research. A centralized informatics system that allows prediction of therapeutics for further experimental validation is thus of great interest to researchers working on understudied diseases.
Accordingly, we propose four specific aims: 1) developing novel deep learning methods to select precise reference normal tissues for disease signature creation, 2) developing computational methods to reuse drug profiles from other disease models for drug prediction, 3) integrating open efficacy data to identify new targets from the systems-based approach, and 4) developing a centralized platform and promoting the platform in the scientific community. This proposal will reuse several large open databases (e.g., TCGA, TARGET, GTEx, GEO, LINCS, CTRP, GDSC) and employ cutting-edge informatics methods (e.g., deep learning). To demonstrate the scalability of the system, we will investigate three representative understudied diseases: multiple organ dysfunction syndrome (Aim 1), diffuse intrinsic pontine glioma (Aim 2) and hepatocellular carcinoma (Aim 3). A successful implementation of the systems-based approach can serve as a model for using other large open omics data (proteins, metabolites) to discover therapeutics for diseases with unmet needs. This proposal will bring together experts in informatics, statistics, and computer science, as well as physicians, from Michigan State University, Stanford University, UC Berkeley and Spectrum Health. All data and code will be released to the public for continuing development. The system will be deployed to our OCTAD portal, an open workplace for therapeutic discovery.
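
The core idea of matching a disease expression signature against drug-induced expression profiles can be illustrated with a small sketch. The abstract does not state which scoring method the project uses, so the Spearman-based reversal score below is an assumption chosen for illustration, and every name in it (`reversal_score`, `rank_drugs`, the gene and drug labels, the toy values) is hypothetical. A strongly negative score suggests that a drug shifts expression in the direction opposite to the disease.

```python
from statistics import mean


def average_ranks(values):
    """Return 1-based ranks, giving tied values the mean of their ranks."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        # Extend j over the run of values tied with position i.
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        tied_rank = (i + j) / 2 + 1
        for k in range(i, j + 1):
            ranks[order[k]] = tied_rank
        i = j + 1
    return ranks


def spearman(x, y):
    """Spearman rank correlation: Pearson correlation of the ranks."""
    rx, ry = average_ranks(x), average_ranks(y)
    mx, my = mean(rx), mean(ry)
    cov = sum((a - mx) * (b - my) for a, b in zip(rx, ry))
    var_x = sum((a - mx) ** 2 for a in rx)
    var_y = sum((b - my) ** 2 for b in ry)
    if var_x == 0 or var_y == 0:
        raise ValueError("correlation is undefined for constant input")
    return cov / (var_x * var_y) ** 0.5


def reversal_score(disease_signature, drug_profile):
    """Hypothetical score: correlation over the genes shared by both inputs.

    Both arguments map gene name -> expression change (e.g. log fold change).
    Values near -1 mean the drug profile opposes the disease signature.
    """
    genes = sorted(set(disease_signature) & set(drug_profile))
    if len(genes) < 3:
        raise ValueError("need at least three shared genes")
    disease = [disease_signature[g] for g in genes]
    drug = [drug_profile[g] for g in genes]
    return spearman(disease, drug)


def rank_drugs(disease_signature, drug_profiles):
    """Sort drugs from most to least reversing (most negative score first)."""
    scores = {
        name: reversal_score(disease_signature, profile)
        for name, profile in drug_profiles.items()
    }
    return sorted(scores.items(), key=lambda item: item[1])


# Toy data, invented for illustration only.
disease_signature = {"G1": 2.0, "G2": 1.0, "G3": -1.0, "G4": -2.0}
drug_profiles = {
    "drug_a": {"G1": -1.5, "G2": -0.5, "G3": 0.8, "G4": 1.9},  # opposes disease
    "drug_b": {"G1": 1.0, "G2": 0.5, "G3": -0.2, "G4": -0.9},  # mimics disease
}
ranking = rank_drugs(disease_signature, drug_profiles)
```

In this toy example `drug_a` ranks first because its profile is the mirror image of the disease signature. A real pipeline would also need the steps the abstract lists, such as sample quality control, reference tissue selection, and cell line weighting, none of which are modeled here.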

Public Health Relevance

About 25 million people are living with understudied diseases in the U.S. Although there are voluminous high-dimensional molecular datasets that could be leveraged for research, individual labs have limited computational capacity to translate these molecular features into therapeutic hits. We propose to build a centralized information system that allows individual labs to easily harness open gene expression datasets and generate new therapeutic targets or drug candidates for further experimental validation.

National Institutes of Health (NIH)
National Institute of General Medical Sciences (NIGMS)
Research Project (R01)
Study Section
Biodata Management and Analysis Study Section (BDMA)
Program Officer
Ravichandran, Veerasamy
Michigan State University
Schools of Medicine
East Lansing
United States