Data Science Core: Abstract

Achieving the scientific goals of the Overall Research Strategy requires significant effort and advances in data science for neuroscience. In particular, scientific progress depends on novel experimental design, data collection, and processing (as described in Projects 1 and 2), and on novel analyses and models (as described in Project 3), which lead to general principles to be tested (as described in Project 4). The fundamental goal of the Data Science Core is to accelerate the process connecting the raw data collected in Projects 1, 2, and 4 to the analyses used to obtain data derivatives, which can then be used to build models in Project 3 and extend them in Project 4. The two main challenges in accelerating these links are big data and reproducibility.

First, the data collected are too large to fit into memory, or even on disk, with each experiment on the order of one terabyte (TB) and the entire dataset amassing hundreds of TB or more. The classic paradigm of running all analyses in MATLAB on locally stored data is therefore not sufficient. The solution is twofold: (1) build a cloud data management system, so that all consortium members can quickly access and analyze the data, and (2) build scalable algorithms, so that different individuals can apply them to these big data. The cloud data management system will be built on the infrastructure developed for the Open Connectome Project [1], originally developed to host data on institutional resources. In the last year, the team has matured to become NeuroData (http://neurodata.io), porting all the infrastructure to the commercial cloud and already hosting 20+ datasets comprising 50+ TB, including all three scales of analysis proposed here (http://neurodata.io). The scalable algorithms will be based on another project from NeuroData called FlashX (http://flashx.io). FlashX is a C++ graph analytics and machine learning library designed to run analytics on arbitrarily large data using only a single machine (not a cluster) [2,3], and it recently received a DARPA SBIR award for commercialization. We will use FlashX as a backend to support all the algorithms for processing behavior and imaging data.

Second, this is a team effort, so sharing analyses and derivatives and keeping track of metadata will be important. The solution is to build a comprehensive scientific environment in the cloud that enables sharing of entire "digital experiments", linking to the data and ensuring that the entire analysis pipeline can be trivially run and extended by anyone, anywhere. This system will extend NeuroData's "Science in the Cloud" (http://scienceinthe.cloud) [4,5], which recently received private funding to professionalize it.

Our entire system is, and will continue to be, open source, portable, and reproducible, and will use and extend best practices of data science and FAIR (Findable, Accessible, Interoperable, and Re-usable) data management. Completing all the aims in this Data Science Core will not only enable and accelerate the scientific progress addressed by this proposal; it will also establish new standards in data science that can be immediately applied to all other U19 efforts, as well as many other efforts within and outside NIH, and even the international science effort at large.

Agency: National Institutes of Health (NIH)
Institute: National Institute of Neurological Disorders and Stroke (NINDS)
Type: Research Program--Cooperative Agreements (U19)
Project #: 5U19NS104653-04
Application #: 9988548
Study Section: Special Emphasis Panel (ZNS1)
Project Start: 2017-09-25
Project End: 2022-08-31
Budget Start: 2020-09-01
Budget End: 2021-08-31
Support Year: 4
Fiscal Year: 2020
Total Cost:
Indirect Cost:

Name: Harvard University
Department:
Type:
DUNS #: 082359691
City: Cambridge
State: MA
Country: United States
Zip Code: 02138

Fang, Tao; Lu, Xiaotang; Berger, Daniel et al. (2018) Nanobody immunostaining for correlated light and electron microscopy with preservation of ultrastructure. Nat Methods 15:1029-1032
Berger, Daniel R; Seung, H Sebastian; Lichtman, Jeff W (2018) VAST (Volume Annotation and Segmentation Tool): Efficient Manual and Semi-Automatic Labeling of Large 3D Image Stacks. Front Neural Circuits 12:88
Haesemeyer, Martin; Robson, Drew N; Li, Jennifer M et al. (2018) A Brain-wide Circuit Model of Heat-Evoked Swimming Behavior in Larval Zebrafish. Neuron 98:817-831.e6
Chen, Xiuye; Mu, Yu; Hu, Yu et al. (2018) Brain-wide Organization of Neuronal Activity and Convergent Sensorimotor Transformations in Larval Zebrafish. Neuron 100:876-890.e5
Jordi, Josua; Guggiana-Nilo, Drago; Bolton, Andrew D et al. (2018) High-throughput screening for selective appetite modulators: A multibehavioral and translational drug discovery strategy. Sci Adv 4:eaav1966