Biomedical investigators are generating increasing amounts of complex and diverse data. This data varies tremendously, from genome sequences through phenotypic measurements and imaging data. If researchers and data scientists can tap into this data effectively, then we can gain insights into disease mechanisms and how to tackle them. However, the main stumbling block is that it is increasingly hard to find and integrate the relevant datasets due to the lack of sufficient metadata. A researcher studying Crohn's disease may miss a crucial dataset on how certain microbial communities affect gut histology due to the lack of descriptive tags on the data. Currently, applying metadata is difficult, time-consuming and error prone due to the vast sea of confusing and overlapping standards for each datatype. Often specialized `data wranglers' are employed to apply metadata, but even these experts are hindered by lack of good tools. Here we propose to develop an intelligent agent that researchers and data wranglers can use to assist them apply metadata. The agent is based around a personalized dashboard of metadata elements that can be collected from multiple specialized portals, as well as sites such as Wikipedia. These elements can be coupled with classifiers that can be used to self-identify datasets to which they may be relevant, making the selection of appropriate vocabularies easier for researchers. We will deploy the system for a number of targeted use cases, including annotation of the National Center for Biomedical Information Bio-Samples repository, and annotation of images within the Figshare repository.

Public Health Relevance

Biomedical data is being generated at an increasing rate, and it is becoming increasingly difficult for researchers to be able to locate and effectively operate over this data, which has negative impacts on the rate of new discoveries. One solution is to attach metadata (data about data) onto all information generated in a research project, but application of metadata is currently difficult and time consuming due to the diverse range of standards on offer, typically requiring the expertise of trained data wranglers. Here we propose to develop an intelligent concept assistant that will allow researchers to generate and share sets of metadata elements relevant to their project, and will use machine learning techniques to automatically apply this to data.

Agency
National Institute of Health (NIH)
Institute
National Human Genome Research Institute (NHGRI)
Type
Research Project--Cooperative Agreements (U01)
Project #
5U01HG009453-02
Application #
9357656
Study Section
Special Emphasis Panel (ZRG1)
Program Officer
Sofia, Heidi J
Project Start
2016-09-23
Project End
2019-08-31
Budget Start
2017-09-01
Budget End
2018-08-31
Support Year
2
Fiscal Year
2017
Total Cost
Indirect Cost
Name
Lawrence Berkeley National Laboratory
Department
Type
DUNS #
078576738
City
Berkeley
State
CA
Country
United States
Zip Code
94720
Matentzoglu, Nicolas; Malone, James; Mungall, Chris et al. (2018) MIRO: guidelines for minimum information for the reporting of an ontology. J Biomed Semantics 9:6
Lizio, Marina; Harshbarger, Jayson; Abugessaisa, Imad et al. (2017) Update of the FANTOM web resource: high resolution transcriptome of diverse cell types in mammals. Nucleic Acids Res 45:D737-D743
Osumi-Sutherland, David; Courtot, Melanie; Balhoff, James P et al. (2017) Dead simple OWL design patterns. J Biomed Semantics 8:18
Köhler, Sebastian; Vasilevsky, Nicole A; Engelstad, Mark et al. (2017) The Human Phenotype Ontology in 2017. Nucleic Acids Res 45:D865-D876
Links, Amanda E; Draper, David; Lee, Elizabeth et al. (2016) Distributed Cognition and Process Management Enabling Individualized Translational Research: The NIH Undiagnosed Diseases Program Experience. Front Med (Lausanne) 3:39