The Big Data revolution requires that biomedical scientists be able to locate, analyze, and integrate the large datasets that now pervade biomedicine. Such work is possible only when experimental datasets are made available online and when they are annotated with metadata that explain how the data are organized, what the data represent, and how the data were collected. The Center for Expanded Data Annotation and Retrieval (CEDAR) will take advantage of the recent growth in community-driven metadata standards to develop innovative computational methods to ease the authoring and use of metadata annotations.
Our specific aims focus on working with communities of investigators to standardize descriptions of the data generated through biomedical studies; creating a computational collective for development, evaluation, use, and refinement of metadata templates for describing laboratory studies; developing a comprehensive and open repository of metadata that will inform the learning algorithms that will drive much of our Center's technology; training the biomedical community in the use of metadata and in CEDAR's resources; and evaluating our work in the context of ImmPort, an NIAID-supported multi-assay data repository that will offer end-to-end opportunities to demonstrate and validate our ideas. We anticipate a growing community of users, starting with the Human Immunology Project Consortium, then the BD2K Center Consortium, then the Stanford Digital Repository, growing until we have developed a wide user base leading to measurable changes in the quality of the metadata used to annotate online datasets. The Overall description of our project provides a synopsis of CEDAR's activities and overall specific aims.

Public Health Relevance

The ability to locate, analyze, and integrate Big Data depends on the metadata that describe datasets and the experiments that have been performed. This project will facilitate annotation of data with high-quality metadata. The results of our work will lead to better data and, thus, to better science. Ultimately, such results will lead to better health.

National Institutes of Health (NIH)
National Institute of Allergy and Infectious Diseases (NIAID)
Specialized Center--Cooperative Agreements (U54)
Study Section
Special Emphasis Panel (ZRG1)
Program Officer
Giovanni, Maria Y
Stanford University
Internal Medicine/Medicine
Schools of Medicine
United States
Panahiazar, Maryam; Dumontier, Michel; Gevaert, Olivier (2017) Predicting biomedical metadata in CEDAR: A study of Gene Expression Omnibus (GEO). J Biomed Inform 72:132-139
Raymond, Steven L; López, María Cecilia; Baker, Henry V et al. (2017) Unique transcriptomic response to sepsis is observed among patients of different age groups. PLoS One 12:e0184159
Haynes, Winston A; Vallania, Francesco; Liu, Charles et al. (2017) Empowering multi-cohort gene expression analysis to increase reproducibility. Pac Symp Biocomput 22:144-153
Posch, Lisa; Panahiazar, Maryam; Dumontier, Michel et al. (2016) Predicting structured metadata from unstructured metadata. Database (Oxford) 2016:
Sweeney, Timothy E; Braviak, Lindsay; Tato, Cristina M et al. (2016) Genome-wide expression for diagnosis of pulmonary tuberculosis: a multicohort analysis. Lancet Respir Med 4:213-24
Musen, Mark A; Bean, Carol A; Cheung, Kei-Hoi et al. (2015) The center for expanded data annotation and retrieval. J Am Med Inform Assoc 22:1148-52
Ayvaz, Serkan; Horn, John; Hassanzadeh, Oktie et al. (2015) Toward a complete dataset of drug-drug interaction information from publicly available sources. J Biomed Inform 55:206-17
Andres-Terre, Marta; McGuire, Helen M; Pouliot, Yannick et al. (2015) Integrated, Multi-cohort Analysis Identifies Conserved Transcriptional Signatures across Multiple Respiratory Viruses. Immunity 43:1199-211