The NCBI's Gene Expression Omnibus (GEO; www.ncbi.nlm.nih.gov/geo) holds data from over 1.6 million open digital samples. GEO houses high-quality experiments that the NIH has funded to measure the functional genomics of samples from diseased and healthy individuals. The majority of these samples interrogate cancer phenotypes and can be used to better characterize the genomics of that disease. However, GEO digital samples lack structured biological annotations (bioannotations) because they are described inconsistently across experiments by free-text attributes. This proposal uses the Search Tag Analyze Resource (STAR) as a genomics discovery platform to crowdsource the precise bioannotation of this open Big Data. We will demonstrate the utility of a well-structured GEO to better characterize cancer functional genomics and to estimate a robust molecular nosology across the disease. The robust gene signatures we define are a first step toward a more comprehensive genomic understanding of the spectrum of the disease and toward novel drug and biomarker discoveries. Successful funding and completion of this work therefore have the potential to improve translational discoveries that greatly reduce the burden of disease on patients and thus improve the overall health and wellbeing of society.

Public Health Relevance

This proposal is about crowdsourcing a deeper molecular understanding of cancer with open functional genomics data from the NCBI's Gene Expression Omnibus (GEO). These data comprise over 1.6 million digital samples across a great many diseases that can be mined for translational discovery and clinical impact. Although this Big Data is rich in content, it is difficult to interpret for molecular characteristics that can readily translate into novel drugs and biomarkers for disease. This is because samples are poorly described by unstructured free-text attributes with little biological semantics or interpretable meaning. We previously built the Search Tag Analyze Resource (STAR; stargeo.org) as an online tool that lets anyone bioannotate these data uniformly across studies to characterize disease genomics. Here, we will investigate how to drive precision in the STAR bioannotation process, how to characterize cancer genomics on a massive scale, and how GEO performs relative to other open functional genomics cancer datasets.

Agency: National Institutes of Health (NIH)
Institute: National Cancer Institute (NCI)
Type: Exploratory/Developmental Cooperative Agreement Phase I (UH2)
Project #: 5UH2CA203792-02
Application #: 9243231
Study Section: Special Emphasis Panel (ZRG1-BST-U (50)R)
Program Officer: Miller, David J
Project Start: 2016-04-01
Project End: 2018-03-31
Budget Start: 2017-04-01
Budget End: 2018-03-31
Support Year: 2
Fiscal Year: 2017
Total Cost: $317,000
Indirect Cost: $117,000

Name: University of California San Francisco
Department: Pediatrics
Type: Schools of Medicine
DUNS #: 094878337
City: San Francisco
State: CA
Country: United States
Zip Code: 94118

Hadley, Dexter; Pan, James; El-Sayed, Osama et al. (2017) Precision annotation of digital samples in NCBI's gene expression omnibus. Sci Data 4:170125
Himmelstein, Daniel Scott; Lizee, Antoine; Hessler, Christine et al. (2017) Systematic integration of biomedical knowledge prioritizes drugs for repurposing. Elife 6:
Chen, Bin; Sirota, Marina; Fan-Minogue, Hua et al. (2015) Relating hepatocellular carcinoma tumor samples and cell lines using gene expression data in translational research. BMC Med Genomics 8 Suppl 2:S5