Due to recent technological advances, it is now possible to image the high-resolution structure of brain volumes at spatial extents much larger than was previously feasible. Emerging X-ray microtomography (XRM) methods allow whole mouse brains to be imaged in a high-throughput paradigm, generating sub-micron three-dimensional image volumes in less than a day without the alignment challenges or tissue clearing approaches of other methods. Similarly, electron microscopy (EM) efforts now routinely exceed 100 terabytes in scale, and projects are underway to map brain regions spanning cubic millimeters, resulting in petabytes of image and annotation data. Both methods are widely used throughout the BRAIN Initiative and the broader neuroscience community. The instruments required to collect these datasets are becoming more common and higher throughput, increasing the need for data storage and archival solutions that can accommodate larger datasets generated at an increasing pace. Finally, the sample preparation protocols for XRM and EM are compatible with each other and amenable to co-registration, and work is underway to pursue multimodal experiments; new instruments can now perform both XRM and EM data collection on a single sample. Existing paradigms for data storage and access are often insufficient to accommodate the storage, processing, and dissemination needed to fully exploit the generated data. At this scale, traditional analysis approaches are often ineffective; for example, it is difficult for a human to view all of the data collected or to manually annotate more than a small fraction of the volume. Contemporary analysis approaches that leverage automated methods require robust and efficient access to data, which can be challenging when managing massive datasets spread across many files.
Without a standard data storage mechanism, data access is cumbersome, storage is expensive and can lack sufficient durability, metadata is unreliable or unavailable and may not be attributable in useful ways, and file formats and organization often differ across laboratories, resulting in a high barrier to collaboration and sharing. Thus, we propose the Block and Object Storage Service Database (bossDB) to deliver a high-performance, cost-efficient data archive by utilizing a cloud-based tiered storage architecture, where data is seamlessly migrated between low-cost, durable object storage (i.e., S3) and a fast in-memory spatial data store. This system will be developed through an agile process that actively involves community stakeholders in regular reviews and offers continuous opportunities for design input. It will provide and support integration of a robust suite of user-facing tools that are vital to fostering community adoption, such as a web-based management console and visualization tool, a Python SDK for programmatic access, and a client to facilitate large-scale ingest of data into the platform. We will build an integrated, managed framework that will enable users to compute data quality metrics on large datasets, and a metadata store to capture experimental details, dataset properties, and information about available results. Our approach provides a secure, versioned API to facilitate programmatic access to data through a standardized and stable interface. For members of the community who prefer a locally deployed solution, we will additionally ensure that bossDB data storage capabilities exist in a locally deployable (on-premises) version of the archive, and we will integrate with the other major community solution, DVID, which provides complementary capabilities. This proposal will result in a professionally engineered, highly available data archive that provides solutions to many of the barriers associated with large-scale neuroscience discovery.
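As an illustration of the tiered storage idea described above, the following is a minimal sketch, not the bossDB implementation. It assumes a write-through policy and least-recently-used (LRU) eviction, neither of which is specified in this proposal. The class name `TieredCuboidStore`, its methods, and the (x, y, z) block keys are hypothetical, and a plain dictionary stands in for durable object storage such as S3.

```python
from collections import OrderedDict
from typing import Dict, List, Tuple

# Hypothetical key: the (x, y, z) index of a fixed-size block of voxels.
BlockKey = Tuple[int, int, int]


class TieredCuboidStore:
    """Toy two-tier store: a small in-memory cache in front of a durable tier.

    Assumptions (not taken from the proposal): writes go through to the
    durable tier immediately, and the cache evicts least-recently-used blocks.
    """

    def __init__(self, cache_capacity: int) -> None:
        if cache_capacity < 1:
            raise ValueError("cache_capacity must be at least 1")
        self.cache_capacity = cache_capacity
        # Fast tier: ordered so the least-recently-used block is first.
        self._cache: "OrderedDict[BlockKey, bytes]" = OrderedDict()
        # Durable tier: a dict standing in for an object store such as S3.
        self._object_store: Dict[BlockKey, bytes] = {}
        self.cache_hits = 0
        self.cache_misses = 0

    def write(self, key: BlockKey, data: bytes) -> None:
        """Persist a block to the durable tier, then keep it in the cache."""
        self._object_store[key] = data
        self._put_in_cache(key, data)

    def read(self, key: BlockKey) -> bytes:
        """Return a block, promoting it into the cache on a miss."""
        if key in self._cache:
            self.cache_hits += 1
            self._cache.move_to_end(key)  # mark as most recently used
            return self._cache[key]
        self.cache_misses += 1
        data = self._object_store[key]  # raises KeyError if never written
        self._put_in_cache(key, data)
        return data

    def cached_keys(self) -> List[BlockKey]:
        """Keys currently in the fast tier, least recently used first."""
        return list(self._cache)

    def _put_in_cache(self, key: BlockKey, data: bytes) -> None:
        self._cache[key] = data
        self._cache.move_to_end(key)
        # Evict from the fast tier only; the durable copy is untouched.
        while len(self._cache) > self.cache_capacity:
            self._cache.popitem(last=False)


if __name__ == "__main__":
    store = TieredCuboidStore(cache_capacity=2)
    for x in range(3):
        store.write((x, 0, 0), bytes([x]))
    # Block (0, 0, 0) was evicted from the cache but is still readable.
    print(store.read((0, 0, 0)), store.cached_keys())
```

In this sketch, eviction only drops the in-memory copy, so every block remains retrievable from the durable tier; a production system would also need to handle concurrency, partial writes, and spatial indexing, which are out of scope here.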
Through a service-oriented architecture, our approach is flexible and designed to provide many capabilities to the user while abstracting most of the underlying technical details, so that neuroscientists and data analytics researchers can focus on the scientific questions of greatest interest. We believe that providing this archive will enable many new experiments in XRM, EM, and multimodal approaches, and that the archive can be adapted as user and community needs evolve.
We propose the Block and Object Storage Service Database (bossDB), an open, accessible, cloud-based data archive for the NIH electron microscopy and X-ray microtomography communities. bossDB leverages a proven architecture to provide petascale-capable storage, curation, processing, sharing, and visualization of the massive and complex data generated by these communities, and will revolutionize how researchers share and analyze data through a standardized, highly available interface.