The Project aims to create a sustainable collaborative ecosystem built around several large scientific data sets for the broader science community. Building on the expertise developed for the Sloan Digital Sky Survey (SDSS) SkyServer and its associated projects, the Project will formalize the main system components and reengineer them to be much more reusable.
The Project will take full ownership of the SDSS archive and will provide a robust environment for its continued operation, exploiting economies of scale enabled by common, shared building blocks derived from the existing SDSS SkyServer framework, based upon a large, scalable database system.
Using these building blocks, the team will build and operate open data archives from large observations and numerical simulations, including computational fluid dynamics, ocean circulation and astrophysics, reaching petabyte scales. The Project will further extend the tools to the life sciences, such as large-scale, next-generation genome sequencing experiments and high-throughput neuroscience imaging data. The resulting distributed, parallel database framework will be linked to small, user-created data sets that can also be used collaboratively, in conjunction with each other and with the large data collections.
The Project will work with selected communities to help deploy and serve data using our building blocks, demonstrating portability, generality and economies of scale; will help and encourage other institutions and communities to use the tools, while seeking collaborations that result in disruptive changes; and will build tools that shorten the time needed to deploy new services and applications and to rapidly test new ideas.
The Project will enable individual users to bring their "small data" and analyze it collaboratively in the context of the large data. Our particular goals are:
(i) Take full ownership of the SDSS Archive (database and flat files) and ensure a scalable and robust environment for its continued operation;
(ii) Build upon our decade-long effort on SDSS and its ad-hoc spinoffs by reengineering its components into portable and general building blocks;
(iii) Systematically address curation issues arising from using a service-oriented architecture (SOA), and the resulting service life-cycle;
(iv) Work with projects from additional scientific domains to help deploy and serve data using our building blocks, demonstrating portability, generality and economies of scale;
(v) Develop scalable extensions to our database cluster to handle large numerical simulations scaling up to petabytes, and turn these simulations into open numerical laboratories;
(vi) Use our CasJobs Collaborative Environment to address the problem of small but complex data in the "Long Tail" of science.