Together with theory and experimentation, computer simulation now constitutes the third pillar of scientific inquiry, enabling researchers to build and test models of complex phenomena that cannot be replicated in the laboratory, or could be replicated only at prohibitive expense. Applications range from the practical, such as designing more efficient aircraft and more effective drugs, to basic research in understanding the molecular basis of diseases such as Alzheimer's. Yet computing capability is currently only a small fraction of what is needed: detailed biological simulations are limited to small numbers of macromolecules; additional factors of millions are needed to simulate cells, and far more than that for larger structures. The overall goal of this work is to give the Scientific Computing user community the capability to conduct transformative research via scalable, cost-effective, high-performance, general-purpose systems built from off-the-shelf components. The particular objective is to build a compute cluster and related infrastructure that facilitate research advancing such computer systems. The unifying technical mechanism to be explored is the integration of communication and computation in accelerator-centric clusters with direct and programmable interconnects.
Three fundamental issues limiting performance are computational efficiency, power density, and communication latency. All of these issues are being addressed through increased heterogeneity, but the last in particular by integrating communication into the accelerator. This integration enables direct and programmable communication among compute components. Direct links allow data to bypass the CPU, the network interface, and even device memory. Programmable communication enables data transfers to proceed with high efficiency even under substantial loads. The proposed infrastructure is a large-scale FPGA-centric cluster with direct and programmable communication (DPC). This server class is referred to as Novo-G#, where # is a placeholder for DPC, because this award will target enhancing and leveraging Novo-G, the reconfigurable supercomputer at the University of Florida. The infrastructure will consist not only of the physical hardware but also of software and configurations to be developed to enhance both general usability and the enabled research projects. Another aspect of this infrastructure, as with Novo-G, is the community of collaborators who will contribute tools, applications, evaluation, and feedback. A number of internal and external collaborators have already been identified, but the proposed infrastructure will enable many more potential research projects in diverse areas of applications, architecture, and systems.
The broader impact of the enabled research is to advance the capability of scientific computing. The technical broader impact of the proposed infrastructure is to provide a system testbed for transformative research in a variety of areas in Computer Science and Engineering, including programmable network components, processor/network interfaces (especially for accelerators), FPGA-based systems, applications in Reconfigurable Computing, architecture of clusters with direct and programmable communication, and libraries and tools to support such clusters. The community of researchers using the infrastructure will consist of the PIs and their collaborators, as well as members of the broader community who commit to contributing to the infrastructure. The infrastructure will also provide a platform for developing novel components for education and outreach.