This planning grant gathers scientific community requirements for a set of capabilities termed Scalable Data Analytics. The project investigates community needs to support scientific discovery by providing an effective interface between extant hardware resources, data sources and repositories, and system software infrastructure. The proposed effort focuses on software environments and tools that give the working scientist data acquisition, management, visualization, sharing, and analysis capabilities that can scale up to massively parallel and cloud fabrics but, crucially, can just as easily scale down to a single laptop.
Software systems for data analytics are integral to the fabric of scientific innovation. The ability to acquire, process, and analyze large amounts of complex structured and unstructured data is at the core of diverse disciplines such as high-energy physics, astronomy, chemistry, biology, economics, and the social sciences. While scientists can exploit large repositories of software tools optimized and refined over the years, the rapidly evolving characteristics of scientific datasets pose significant new challenges. These challenges are addressed by software systems that enable incremental development of new methods, modification of existing algorithms, and integration of off-the-shelf components into pipelines. For such application needs, scientists increasingly rely on dynamic programming languages and systems (Python, Perl, MATLAB, Maple, Mathematica, JavaScript, Octave, Julia, and R). These languages facilitate interactive prototyping, support rapid development, and can be used to manage complex scientific software pipelines. In this role of providing an interface between scientists and computational infrastructure, high-level dynamic languages have proven highly effective. However, their utility is significantly constrained by deficiencies in performance, in their ability to handle data at scale, and in their interfaces to the underlying hardware (parallel and distributed environments) and software ecosystem (libraries for intrinsics, visual analytics). This report is a strategic plan for enabling science and education through an S2I2 institute aimed at supporting a sustained scientific software infrastructure.
We will address the following key issues: the scientific community and the specific grand challenge research questions that the S2I2 will support; the software elements relevant to that community and the sustainability challenges to address; the required organizational, personnel, and management structures and operational processes; the integration of education and training, and the mentoring of students, postdoctoral fellows, and software professionals; approaches for the long-term sustainability of the software infrastructure as well as the software itself; and risks, including those associated with establishment and execution, infrastructure needs, and community engagement.
The activities covered under this S2I2 Conceptualization grant consisted of a series of workshops that assessed the state of dynamic programming languages for scientific computing, the quality of the virtual execution environments that support them, and the degree to which such languages allow scientists to interact with the rest of the software and hardware infrastructure. In particular, we focused on problems related to data analytics at scale. The workshops investigated the need for a software institute that would support rapid scientific advances by acting as a bridge between the broader scientific community and experts in computing, languages, compilers, middleware, and distributed systems. The scientific activities on which we focused our discussion can broadly be termed Scalable Data Analytics. The basic research aim of these activities is to provide scientists with tools for data acquisition, management, and analysis that can scale up to massively parallel and cloud fabrics but, crucially, can just as easily scale down to a single laptop. This smooth scaling from exploratory mode to production is an essential attribute of any viable solution. While a software institute will impact many disciplines, we have identified key stakeholders in statistics, machine learning, physics, and biocomputing.
This report summarizes the clear and present need for common, open-source software components to support data analytics in a variety of scientific communities. In particular, research is needed in high-level, domain-specific, dynamic programming languages, languages that can only be designed, created, and maintained under the auspices of a Software Institute with funding from the NSF and in collaboration with industrial partners and research labs. The Institute should also investigate issues of programmer productivity and correctness. The report defines the key requirements for a software institute and the challenges that must be overcome. We emphasize the importance of dynamic languages in the scientific process, and the cost and complexity of providing support for these languages. Finally, we address community building and organizational issues for the proposed institute. We conclude that a significant investment is required to build a cyberinfrastructure for 21st-century dynamic programming languages, and that NSF support is critical to the success of such an effort.