Ideally, as neuroscientists collect terabytes of image stacks, the data are automatically processed for open access and analysis. Yet, while several labs around the world are collecting data at unprecedented rates- up to terabytes per day-the computational technologies that facilitate streaming data-intensive computing remain absent. Also deploying data-intensive compute clusters is beyond the means and abilities of most experimental labs. This project will extend, develop, and deploy such technologies. To demonstrate these tools, we will utilize them in support of the ongoing mouse brain architecture (MBA) project, which already has amassed over 0.5 petabytes (PBs) of image data. The main computational challenges posed by these datasets are ones of scale. The tasks that follow remain relatively stereotyped across acquisition modalities. Until now, labs collecting data on this scale have been almost entirely isolated, left to """"""""reinvent the wheel"""""""" for each of these problems. Moreover, the extant solutions are insufficient for a number of reasons: they often include numerous excel spreadsheets that rely on manual data entry, they lack scalable scientific database backends, and they run on ad hoc clusters not specifically designed for the computational tasks at hand.
We aim to augment the current state of the art by implementing the following technological advancements into the MBA project pipeline: (1) Data Management will consist of a unified system that automatically captures metadata, launches processing pipelines, and provides quality control feedback in minutes instead of hours. (2) Data Processing tasks will run algorithms """"""""out-of-core"""""""", appropriate for their computational requirements, including registration, alignment, and semantic segmentation of cell bodies and processes. (3) Data Storage will automatically build databases for storing multimodal image data and extracted annotations learned from the machine vision algorithms. These databases will be spatially co-registered and stored on an optimized heterogeneous compute cluster. (4) Data Access will be automatically available to everyone-including all the image data and data derived products-via Web-services, including 3D viewing, downloading, and further processing. (5) Data Analytics will extend random graph models suitable for multiscale circuit graphs.

Public Health Relevance

Nervous system disorders are responsible for approximately 30% of the total burden of illness in the United States. Whole brain neuroanatomy-available from massive neuroscientific image stacks-is widely believed to be a key missing link in our ability to prevent and treat such illnesses. Thus, this project aims to close this gap via the development and application of BIGDATA tools for management, storage, access, and analytics.

National Institute of Health (NIH)
National Institute on Drug Abuse (NIDA)
Research Project (R01)
Project #
Application #
Study Section
Special Emphasis Panel (ZRG1)
Program Officer
Pollock, Jonathan D
Project Start
Project End
Budget Start
Budget End
Support Year
Fiscal Year
Total Cost
Indirect Cost
Cold Spring Harbor Laboratory
Cold Spring Harbor
United States
Zip Code