zed and structured in the form of annotations of biological entities such as genes, genetic variants, diseases, and pathways. These annotations are fragmented across dozens of data repositories like NCBI Entrez, Ensembl, UniProt, and hundreds (or more) of other specialized databases. While the volume and breadth of annotations are valuable, their fragmentation across many data silos is often frustrating and inefficient. Bioinformaticians everywhere must repeatedly engage in data wrangling to integrate knowledge from all these resources comprehensively, and these uncoordinated efforts represent an enormous duplication of work. The problem of fragmentation is exacerbated (perhaps even fundamentally caused) by the inability of data providers to efficiently contribute to existing repositories. As a result, annotation providers must create new resources to host newly generated annotations that are unavailable in the central repositories. In this proposal, we will create a hybrid solution that combines the high performance of a centralized system with the flexibility and breadth of a federated system. The centralized component will provide high-performance computational infrastructure for the integration, querying, and access of biological annotations. The technical design of this component will be based on our successful MyGene.info web services (http://mygene.info). The federated component builds on our extensive background in crowdsourcing. We will build community infrastructure that allows the small- and medium-scale data wrangling already being performed (and repeated) by many scientists to be aggregated into a single big-data resource. Additionally, we will add semantic interoperability to the system so that it integrates with current and future Linked Data applications.

Public Health Relevance

A primary challenge in the biomedical Big Data era is that the pace of scientific discovery exceeds the capacity of traditional efforts to structure those discoveries in a computable form. Successful completion of this work will result in a platform that harvests structured data directly from individual researchers and speeds up biomedical research with this aggregated community intelligence.

Agency: National Institutes of Health (NIH)
Institute: National Human Genome Research Institute (NHGRI)
Type: Research Project--Cooperative Agreements (U01)
Project #: 3U01HG008473-02S1
Application #: 9268840
Study Section: Special Emphasis Panel (ZRG1-BST-N (50)R)
Program Officer: Sofia, Heidi J
Project Start: 2015-06-01
Project End: 2018-05-31
Budget Start: 2016-09-26
Budget End: 2017-05-31
Support Year: 2
Fiscal Year: 2016
Total Cost: $372,625
Indirect Cost: $115,625
Name: Scripps Research Institute
Department:
Type:
DUNS #: 781613492
City: La Jolla
State: CA
Country: United States
Zip Code: 92037
Xin, Jiwen; Afrasiabi, Cyrus; Lelong, Sebastien et al. (2018) Cross-linking BioThings APIs through JSON-LD to facilitate knowledge exploration. BMC Bioinformatics 19:30
Wilkinson, Mark D; Sansone, Susanna-Assunta; Schultes, Erik et al. (2018) A design framework and exemplar metrics for FAIRness. Sci Data 5:180118
Cai, Binghuang; Li, Biao; Kiga, Nikki et al. (2017) Matching phenotypes to whole genomes: Lessons learned from four iterations of the personal genome project community challenges. Hum Mutat 38:1266-1276
Xin, Jiwen; Mark, Adam; Afrasiabi, Cyrus et al. (2016) High-performance web services for querying gene and variant annotation. Genome Biol 17:91