A modular data analysis ecosystem using portable encapsulated projects

Sheffield, Nathan

Abstract

Overview: As the amount of available data increases, it becomes more challenging to process it. Data processing is simple on the surface: it is a mapping from data to analysis. Unfortunately, too often, this requires a unique structure for each combination of dataset and analysis. This makes it difficult to run several different analyses on one dataset, or to plug several different datasets into one analysis, because each connection structure must be defined manually. To alleviate this challenge of linking data to tools, this proposal develops the concept of Portable Encapsulated Projects (PEP) and a series of tools that read and process such projects. Essentially, the PEP format aims to standardize the description of data collections, enabling both data providers and data users to communicate through the common interface of a standard format. Practically, this means individuals who describe their projects using this format will immediately inherit both greater portability for analysis and greater access to external complementary data. This link operates around a simple, standard, extensible definition of a project. Accompanying this, the proposal develops Python and R packages to provide a modular framework with a low barrier to entry that makes it easy to build robust pipelines and other tools centered around the PEP format. This system presents a new approach to organizing data-intensive biomedical research projects.

Significance and innovation: This proposal sits at the interface of data management and bioinformatics tool development. While significant effort is already dedicated to each of these individually, there has been less focus on connecting the two. This proposal will build a standardized interface between data and tools in bioinformatics, providing practical advances in formats and tools to facilitate this interaction.
This effort approaches computational projects in a novel way, and builds both concepts and tools that can revolutionize bioinformatics research. The goal is not to develop new tools, but to make existing tools more easily applied to existing data. In computational research, a huge amount of effort is spent on data cleanup: preparing data for analysis. By facilitating the connection from data to tools, this work will encourage re-analysis of existing data with novel analysis techniques, leading to new discoveries. It will also make it easier to analyze new data in tandem with existing data, increasing the value of both. It will contribute to reusability, larger-scale analysis, portable computing environments, and data sharing. There is increasing interest in data sharing and accessibility across scientific domains, and this proposal will facilitate both. Early versions have already been adopted for both local and cluster computing at four different research institutions, and as the project matures, it will unite various research environments around a common data description. This will make it easier to share data and tools across users, research groups, and institutions.

Public Health Relevance

The next generation of biomedical insights based on large datasets will require novel computational approaches to biomedical data analysis. This research will provide new ways to explore new data types, to integrate them with existing knowledge, and ultimately to drive biomedical discovery in the complex human system.

Funding Agency

Agency: National Institutes of Health (NIH)
Institute: National Institute of General Medical Sciences (NIGMS)
Type: Unknown (R35)
Project #: 5R35GM128636-03
Application #: 10019399
Study Section: Special Emphasis Panel (ZGM1)
Program Officer: Ravichandran, Veerasamy

Project Start: 2018-08-01
Project End: 2023-07-31
Budget Start: 2020-08-01
Budget End: 2021-07-31
Support Year: 3
Fiscal Year: 2020
Total Cost
Indirect Cost

Institution

Name: University of Virginia
Department: Public Health & Prev Medicine
Type: Schools of Medicine
DUNS #: 065391526

City: Charlottesville
State: VA
Country: United States
Zip Code: 22904

Related projects


NIH 2020 R35 GM	A modular data analysis ecosystem using portable encapsulated projects Sheffield, Nathan / University of Virginia
NIH 2019 R35 GM	A modular data analysis ecosystem using portable encapsulated projects Sheffield, Nathan / University of Virginia
NIH 2018 R35 GM	A modular data analysis ecosystem using portable encapsulated projects Sheffield, Nathan / University of Virginia
