Overview

As the amount of available data increases, processing it becomes more challenging. Data processing is simple on the surface: it is a mapping from data to analysis. Unfortunately, too often this requires a unique structure for each combination of dataset and analysis. This makes it difficult to run several different analyses on one dataset, or to plug several different datasets into one analysis, because each connection structure must be defined manually. To alleviate this challenge of linking data to tools, this proposal develops the concept of Portable Encapsulated Projects (PEP) and a series of tools that read and process such projects. Essentially, the PEP format aims to standardize the description of data collections, enabling both data providers and data users to communicate through the common interface of a standard format. Practically, this means individuals who describe their projects in this format will immediately inherit both greater portability for analysis and greater access to external complementary data. This link operates around a simple, standard, extensible definition of a project. Accompanying this, the proposal develops Python and R packages that provide a modular framework with a low barrier to entry, making it easy to build robust pipelines and other tools centered on the PEP format. This system presents a new approach to organizing data-intensive biomedical research projects.

Significance and innovation

This proposal sits at the interface of data management and bioinformatics tool development. While significant effort is already dedicated to each of these individually, less attention has been paid to connecting the two. This proposal will build a standardized interface between data and tools in bioinformatics, providing practical advances in formats and tools to facilitate this interaction.
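To make the PEP concept from the Overview concrete, the following Python sketch shows how a standard project definition decouples datasets from tools: any analysis that reads samples through one shared interface can consume any dataset described the same way. The sample-sheet columns and the Project class here are illustrative stand-ins, not the actual PEP specification or the proposed packages:

```python
import csv
import io

# Hypothetical tab-separated sample annotation sheet, standing in for a
# standardized project description. In practice this would live in a file
# alongside the project's data.
SAMPLE_SHEET = """\
sample_name\tprotocol\tdata_path
patient1\tRNA-seq\tdata/patient1.fastq.gz
patient2\tATAC-seq\tdata/patient2.fastq.gz
"""

class Project:
    """Minimal project object: loads a sample sheet into sample records."""
    def __init__(self, sheet_text):
        reader = csv.DictReader(io.StringIO(sheet_text), delimiter="\t")
        self.samples = list(reader)

    def samples_by_protocol(self, protocol):
        # Any analysis tool selects its inputs through this one interface,
        # regardless of which dataset the sheet happens to describe.
        return [s for s in self.samples if s["protocol"] == protocol]

prj = Project(SAMPLE_SHEET)
print([s["sample_name"] for s in prj.samples_by_protocol("RNA-seq")])
# → ['patient1']
```

Because the tool sees only the interface, swapping in a different dataset means swapping in a different sheet, with no per-dataset connection code, which is the portability the format is meant to provide.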
This effort approaches computational projects in a novel way, building both concepts and tools that can revolutionize bioinformatics research. The goal is not to develop new analysis tools, but to make existing tools more easily applied to existing data. In computational research, a huge amount of effort is spent on data cleanup: preparing data for analysis. By facilitating the connection from data to tools, this work will encourage re-analysis of existing data with novel analysis techniques, leading to new discoveries. It will also make it easier to analyze new data in tandem with existing data, increasing the value of both. It will contribute to reusability, larger-scale analysis, portable computing environments, and data sharing. There is increasing interest in data sharing and accessibility across scientific domains, and this proposal will facilitate both. Early versions have already been adopted for both local and cluster computing at four different research institutions, and as the project matures, it will unite diverse research environments around a common data description. This will make it easier to share data and tools across users, research groups, and institutions.
The next generation of biomedical insights based on large datasets will require novel computational approaches to biomedical data analysis. This research will provide new ways to explore new data types, to integrate them with existing knowledge, and ultimately to drive biomedical discovery in complex human systems.