The generation of biological data is rapidly presenting us with one of the most demanding data analysis challenges the world has ever faced, not only in terms of storage and accessibility, but perhaps more critically in terms of the data's extensive heterogeneity and variability. In this proposal, we present a new approach to these challenges, which we call "Deep Curation": a large-scale, integrated modeling approach that simultaneously cross-evaluates millions of heterogeneous data points against one another. The word "deep" reflects the multiple layers of curation we perform: layers not only for the data, but also for the parameters derived from these data, the mathematical equations, the unified model, and the simulation output. The deeply curated model thus becomes a tool for processing, curating and analyzing data automatically. Our proposed efforts in Deep Curation are based on a computer model of Escherichia coli that accounts for the function of roughly 40% of the well-annotated genes and draws on an extensive set of diverse measurements compiled from thousands of reports (currently in the 2nd round of review at Science). The goal of this proposal is to expand this model to enable Deep Curation of data related to growth in >100 currently unincorporated environments. We can then assess the cross-consistency of the data sets simultaneously, as a unified whole, identifying critical areas in which data sets are not cross-consistent and further experimental investigation is therefore needed.
The Significance of this proposal is that Deep Curation represents a first-in-kind advance in our ability to exploit massively heterogeneous, variable and complex biological data sets; that it automates and accelerates biomedical discovery; that we will create a bi-directional pipeline between EcoCyc, the most comprehensive database on any organism, and the most complex biological model in existence; and that whole-cell modeling is a rapidly growing field with transformative potential as it advances towards more complex cells and groups of cells. The Innovation associated with this proposal is that Deep Curation is a new approach that is not currently available to any other lab in the world; that the proposed work will produce a dramatically expanded whole-cell model of previously unseen complexity, as well as novel modeling technology; that we include explicit curation of knowledge regarding mechanism in addition to data; and that the automated communication between the EcoCyc database and the E. coli model will expand the capacity, scope and visibility of both in a synergistic way.
Our Specific Aims are:
Aim 1 (Curation), build the Data and Parameter layers related to E. coli growth on diverse environments;
Aim 2 (Modeling), implement the Equation, Model and Simulation layers;
Aim 3 (Deep Curation), use the integrated model to cross-evaluate the unified data set at the whole-organism scale;
and Aim 4 (Distribution), make the model available to the broader community via GitHub (software tools), EcoCyc (data and parameters), and Google Cloud (simulations and interactive visualizations).

Public Health Relevance

Research Narrative: The generation of biological data is rapidly presenting us with one of the most demanding data analysis challenges the world has ever faced, but also with unprecedented opportunities for major new discoveries. We propose a transformative new approach to Big Data in biology, "Deep Curation": we curate not only data but all of the biological knowledge available about a given organism (the bacterium E. coli in this case), and represent all of it as an integrated computer model that can predict cellular behavior. Deep Curation has the potential to revolutionize how biological research is performed, giving scientists complete access to massive, heterogeneous data sets, seamlessly unified into one theoretical framework.

National Institutes of Health (NIH)
National Library of Medicine (NLM)
Research Project (R01)
Study Section
Special Emphasis Panel (ZLM1)
Program Officer
Vanbiervliet, Alan
Stanford University
Biomedical Engineering
Schools of Medicine
United States