Current technologies for cancer genomics research generate petabytes of data that are dispersed across multiple archives in a non-standard fashion. This dispersal poses major challenges to comprehensive analyses based on the integration of such data. Two common types of secondary data generated from sequencing-based studies are the mutations and gene expression changes associated with the cancer state, as inferred from comparing matched tumor and normal samples. Massive collaborations like The Cancer Genome Atlas (TCGA) and the International Cancer Genome Consortium (ICGC) are instrumental in facilitating the generation of the sequence data and providing a modicum of standardization through best practices, but they do not always follow the same standards between projects. Moreover, proprietary databases like the Catalogue of Somatic Mutations in Cancer (COSMIC) generally store and annotate data in a format uniquely optimized for their own database to meet individual business needs. Thus, integrating mutation and expression data across resources is a massive undertaking that requires data curation, unification, harmonization, and appropriate annotation for proper representation at a central location. Additionally, it is difficult to comprehensively collect protein functional sites from a variety of databases such as UniProt, RefSeq, and many others and map them to mutation sites, because the underlying sequences in these databases can differ. To address this challenge, the Early Detection Research Network (EDRN) Associate Membership funded the development of BioMuta and BioXpress, cancer-associated mutation and expression databases, respectively, to provide access to unified data from several popular cancer repositories and functional data from well-known molecular biology resources. Links to BioMuta are available through the EDRN portal and UniProt.
The focus of the proposed project is to provide a custom portal encompassing up-to-date releases of BioMuta and BioXpress, leveraging the existing EDRN framework and data. This will provide a broader understanding of the cancer landscape, moving toward the proteomic space and working synergistically with other ITCR resources. To supplement these data, we further propose to integrate normal expression data across several species that can be used to derive a deeper understanding of the cancer-associated expression profiles. Text mining will also be applied to the identified cancer-related mutation and expression profiles to gather evidence that aids interpretation of the findings. It is expected that such large-scale integration of cancer data and supporting information will not only benefit cancer research, but will also become a critical necessity for ensuring the most efficient synthesis of information and therefore the earliest detection methods possible.
The proposed research will simultaneously streamline and advance cancer biomarker identification pipelines by making a wide range of pre-analyzed cancer-relevant mutation and expression data, mapped to protein functional site data and protein functional information, available in a unified manner through a single user interface.
Hu, Yu; Dingerdissen, Hayley; Gupta, Samir et al. (2018) Identification of key differentially expressed MicroRNAs in cancer patients through pan-cancer analysis. Comput Biol Med 103:183-197
Gupta, Samir; Dingerdissen, Hayley; Ross, Karen E et al. (2018) DEXTER: Disease-Expression Relation Extraction from Text. Database (Oxford) 2018: