Computer programming languages play a central role in scientific productivity: they are the interface between domain experts and the machine. The first generation of scientific languages consisted of statically compiled languages such as Fortran, C, and C++. Optimized for efficiency, they stayed close to the hardware at the cost of ease of use. Scientists increasingly choose dynamic languages such as Perl, Python, R, Matlab, Mathematica, Maple, and JavaScript. Educational material in bioinformatics, for example, includes many textbooks for Perl, Python, and R and none for C, C++, or Fortran. But the application of cutting-edge scientific methodology is limited by the capabilities of the system in which it is implemented. Dynamic languages start out as glue for native libraries, but over time more and more of the actual computation is expressed in the dynamic languages themselves. Users then quickly run into the limitations of these languages: performance is inadequate for compute-intensive tasks, space overheads preclude manipulating large datasets, and information security and reproducibility of results must still be addressed.

This workshop explores the need for a Scientific Software Innovation Institute (S2I2) centered around Virtual Execution Infrastructures for Next Generation Scientific Programming Languages. There is a need to build expertise and momentum that encourage the development of open-source, high-performance implementations of productivity-oriented scientific programming languages. To this end, the VEESC workshop brings together experts in virtual machines and compilers, designers and implementors of high-level dynamic languages and related systems, and key representatives from end-user communities in the sciences.

Project Report

Software systems are part of the support fabric of scientific innovation. The ability to acquire, process, simulate, and analyze experimental data quickly is a crosscutting requirement for scientific advances in fields as diverse as high-energy physics, computational chemistry, biology, and astronomy. Research scientists today can exploit large bodies of software components, written in many different programming languages and optimized and refined over the years. New problems, however, often demand the ability to develop new software incrementally, to modify existing methods, or to quickly glue together a pipeline of off-the-shelf components with minor modifications. For these increasingly popular software development practices, scientists are turning to a new breed of dynamic programming languages. Well-known examples include Python, R, Perl, Matlab, Maple, and JavaScript. These languages facilitate interactive prototyping, support rapid development, and can be embedded or used to manage complex scientific software pipelines. They are increasingly assuming the role of interface between scientist and computational infrastructure. Unfortunately, the dynamic languages in use in the sciences have serious deficiencies in performance, in scalability, and in support for modern multicore architectures and cloud fabrics. The performance of programs written in these languages is several orders of magnitude lower than that of similar programs written in static languages like Fortran or C. Scalability to large datasets is also lacking in most dynamic languages. Finally, while dynamic languages make it easier to produce working systems quickly, they do not enhance our ability to reason about the correctness of these systems.
The VEESC workshop assessed the state of dynamic programming languages for scientific computing, the quality of the virtual execution environments that support them, and the degree to which such languages allow scientists to interact with the rest of the software and hardware infrastructure. The workshop also investigated the growing need for a software institute that would support scientific advances by acting as a bridge between the broader scientific community and advances in computing, languages, compilers, middleware, and distributed systems. The scientific activities on which we focused our discussion can broadly be termed Scalable Data Analytics. The basic problem is to provide the working scientist with tools for data acquisition, management, and analysis that scale up to massively parallel and cloud fabrics but, crucially, scale down just as easily to a single laptop. The workshop participants emphasized the importance of smooth scaling of the scientific activity from exploratory mode to production settings. While we view the impact of a software institute as applying to many disciplines, it is also important to start with a focused effort and to involve stakeholder communities that will actively contribute to its success. As such, we have identified statistics, machine learning, and biocomputing as target domains for the proposed effort. There is a clear need for a common, open-source, high-performing execution environment for dynamic languages, one that can only be designed, created, and maintained under the auspices of a software institute with funding from the NSF and collaboration with industrial partners and research labs. The institute should also investigate issues of programmer productivity and correctness linked to dynamic languages. The report defines the key requirements for a software institute and the challenges that have to be overcome.
We emphasize the importance of dynamic languages in the scientific process, and the cost and complexity of providing support for these languages. Finally, we address community building and organizational issues for the proposed institute. The participants strongly agreed that a significant investment is required to build a cyberinfrastructure for 21st-century dynamic programming languages, and that NSF support is critical to the success of such an effort. The most pressing goal identified in the workshop is to support scientists performing data analysis tasks that range in scale from a single node with small data to multiple nodes with massive data.

Agency: National Science Foundation (NSF)
Institute: Division of Advanced CyberInfrastructure (ACI)
Type: Standard Grant (Standard)
Application #: 1042905
Program Officer: Gabrielle D. Allen
Project Start:
Project End:
Budget Start: 2010-09-01
Budget End: 2011-08-31
Support Year:
Fiscal Year: 2010
Total Cost: $46,800
Indirect Cost:
Name: Purdue University
Department:
Type:
DUNS #:
City: West Lafayette
State: IN
Country: United States
Zip Code: 47907