Generating insights from data repositories: Using the repositories available to us, such as the CMS Virtual Research Data Center (VRDC), our objective is to answer concrete clinical questions, taking into account not only the features of each repository (e.g., size and available data elements) but also its limiting characteristics (i.e., data granularity and dataset population). We intend to explore both hypothesis-driven and data-driven approaches to investigating clinical questions. Additionally, by continuing to collaborate with external institutions that have access to rich EHR data (e.g., the Observational Health Data Sciences and Informatics collaborative; OHDSI), we will be able to access a larger set of repositories and investigate a broader set of clinical questions. Finally, by co-investigating data repositories with domain experts from NIH, we will make it possible to test hypotheses arising from preclinical or basic biological research using the appropriate data repositories.

Expertise with available repositories: There has been significant growth in the number of institution-centric or network-centric integrated data repositories (IDRs) and similar growth in the number of available clinical trial repositories. Researchers face the difficult task of choosing the most appropriate repository for a given research question. We plan to acquire practical expertise with the advantages and limitations of both clinical and research datasets, through active use whenever possible and through published reports otherwise.

Characterizing data repositories: To facilitate the choice of an appropriate repository, or the improvement of a repository over time, we plan to develop methods to best characterize repository size, population characteristics, clinical breadth and depth of data, and data quality. In addition to methods, we intend to develop tooling (e.g., code libraries and packages) to support dataset characterization. We also expect to contribute to the development of best practices for repository creation and maintenance through dataset characterization.

Integrating data repositories: While it is valuable to analyze individual repositories, more benefits may come from integrating individual repositories into larger ones, for example to support large-scale analyses, meta-analyses, and comparisons across repositories (e.g., for reproducibility testing). Integrating repositories rests in large part on transforming local repositories that use a homegrown data model into repositories based on a common analytical model, which supports federated queries across repositories. The emergence of common data models (CDMs) for analytic purposes reflects a vision of analytical interoperability. Integrated data repositories not only share a harmonized information model, but also commit to target terminologies for coding biomedical entities (e.g., RxNorm for drugs, SNOMED CT for diagnoses, and LOINC for clinical observations). We intend to keep contributing to the development of common data models, such as the Observational Medical Outcomes Partnership (OMOP) model. Moreover, we want to support the integration of several routine healthcare clinical repositories, research repositories, and repositories across healthcare and research in support of translational research.
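As an illustration only, the transformation of a local repository into a shared model, together with a basic characterization of the result, might be sketched as follows. This is a minimal sketch under stated assumptions, not the project's tooling and not the OMOP schema: the record shape, the function names, and every code in the mapping table are hypothetical placeholders rather than real RxNorm, SNOMED CT, or LOINC identifiers.

```python
from dataclasses import dataclass
from typing import Optional

# Hypothetical local-to-standard code map. The identifiers are placeholders,
# not real RxNorm, SNOMED CT, or LOINC codes.
LOCAL_TO_STANDARD = {
    ("drug", "LOCAL-D1"): ("RxNorm", "PLACEHOLDER-1"),
    ("diagnosis", "LOCAL-X9"): ("SNOMED CT", "PLACEHOLDER-2"),
}


@dataclass
class HarmonizedRecord:
    """Simplified shared record shape (an assumption, not the OMOP model)."""
    person_id: int
    domain: str
    vocabulary: Optional[str]     # None when the local code has no mapping
    standard_code: Optional[str]  # None when the local code has no mapping
    source_code: str              # local code kept for traceability


def harmonize(local_rows):
    """Convert (person_id, domain, local_code) tuples into harmonized records."""
    records = []
    for person_id, domain, local_code in local_rows:
        vocabulary, standard_code = LOCAL_TO_STANDARD.get(
            (domain, local_code), (None, None)
        )
        records.append(
            HarmonizedRecord(person_id, domain, vocabulary, standard_code, local_code)
        )
    return records


def characterize(records):
    """Summarize size, population, and mapping coverage of a harmonized repository."""
    total = len(records)
    mapped = sum(1 for r in records if r.standard_code is not None)
    return {
        "record_count": total,
        "person_count": len({r.person_id for r in records}),
        "mapped_fraction": mapped / total if total else 0.0,
    }


# Toy rows from a hypothetical site; the last local code has no mapping.
site_a = [
    (1, "drug", "LOCAL-D1"),
    (1, "diagnosis", "LOCAL-X9"),
    (2, "drug", "LOCAL-ZZ"),
]
summary = characterize(harmonize(site_a))
print(summary)
```

Keeping the unmapped local code on each record, rather than discarding it, lets the fraction of mapped records serve as one simple data quality indicator when comparing sites.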

National Library of Medicine
Huser, Vojtech; Kahn, Michael G; Brown, Jeffrey S et al. (2018) Methods for examining data quality in healthcare integrated data repositories. Pac Symp Biocomput 23:628-633
Huser, Vojtech; Shmueli-Blumberg, Dikla (2018) Data sharing platforms for de-identified data from human clinical trials. Clin Trials 15:413-423