Together with theory and experimentation, computer simulation now constitutes the "third pillar" of scientific inquiry, enabling researchers to build and test models of complex phenomena that cannot be replicated in the laboratory. The method of Molecular Dynamics simulation in particular is critical: applications range from the practical, e.g., drug design, to basic research in understanding disease processes. The overall goal is to give the Molecular Dynamics user community the compute capability to conduct transformative research via scalable, cost-effective, high-performance, general-purpose systems built from off-the-shelf components. The objective of this research is to bridge the many orders-of-magnitude gap between the largest current simulations and the potential physical systems to be studied.
Three fundamental issues limiting performance are (i) computational efficiency per chip area, (ii) power density, which is reaching the limit of economical cooling methods, and (iii) the bottleneck between processing and communication devices. All three are addressed by Large-Scale Reconfigurable Computing using Field Programmable Gate Arrays (FPGAs). FPGAs are commodity integrated circuits whose circuitry can be configured, or programmed, in the field. Their reconfigurable architecture lets them be tailored for maximum efficiency on a particular application while drawing less than a quarter of the power of a conventional processing device. And because FPGAs are the core components in Internet routers, they are built to handle flexible high-throughput communication.
This planning award investigates a novel system design that uses the same FPGAs for communication and for computation. This approach has several advantages. First, it mitigates the critical bottleneck caused by the separation of functions among multiple devices. Second, FPGA-based communication gives the flexibility either to use standard protocols or to use innovative application-aware routing that enables important patterns, such as the Fast Fourier Transform, to be routed congestion-free. Finally, FPGAs are well suited to Molecular Dynamics computation because they allow hierarchical data movement to be addressed through innovative routing, load-balancing, and arithmetic.
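To illustrate what "congestion-free" routing of an FFT-like pattern can mean, the following is a minimal sketch under stated assumptions. It models the radix-2 butterfly exchange, in which node i trades data with node i XOR 2^s at stage s, and counts how many messages share the busiest link on two hypothetical topologies: a hypercube and a ring with shortest-path routing. The topologies, the routing rule, and all function names are assumptions chosen for illustration; they are not the network design investigated under this award.

```python
from collections import Counter


def butterfly_partner(node: int, stage: int) -> int:
    """Exchange partner of `node` at a given stage of a radix-2 butterfly."""
    return node ^ (1 << stage)


def hypercube_max_load(num_nodes: int, stage: int) -> int:
    """Busiest-link load on an assumed hypercube.

    Partners differ in exactly one bit, so every message crosses a single
    direct link (src, dst).
    """
    loads = Counter()
    for src in range(num_nodes):
        loads[(src, butterfly_partner(src, stage))] += 1
    return max(loads.values())


def ring_max_load(num_nodes: int, stage: int) -> int:
    """Busiest-link load on an assumed ring with shortest-path routing.

    Ties between the two directions are broken clockwise.
    """
    loads = Counter()
    for src in range(num_nodes):
        dst = butterfly_partner(src, stage)
        clockwise = (dst - src) % num_nodes
        step = 1 if clockwise <= num_nodes - clockwise else -1
        node = src
        while node != dst:
            nxt = (node + step) % num_nodes
            loads[(node, nxt)] += 1  # one message traverses this directed link
            node = nxt
    return max(loads.values())


if __name__ == "__main__":
    n = 8  # illustrative node count; must be a power of two
    for s in range(n.bit_length() - 1):
        print(s, hypercube_max_load(n, s), ring_max_load(n, s))
```

In this toy model no hypercube link carries more than one message at any stage, whereas the ring's busiest link carries several messages at the later stages. A programmable interconnect that can match its routing to the application's pattern aims for the first kind of behavior.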
Together with theory and experimentation, computer simulation now constitutes the third pillar of scientific inquiry, enabling researchers to build and test models of complex phenomena that cannot be replicated in the laboratory, or could be replicated only at prohibitive expense. Applications range from the practical, such as designing more efficient aircraft and effective drugs, to basic research in understanding the molecular basis of diseases such as Alzheimer’s. Yet current computing capability is only a small fraction of what is needed: detailed biological simulations are limited to small numbers of macromolecules; additional factors of millions are needed to simulate cells, and far more than that for larger structures.
The overall goal of this work is to give the Scientific Computing user community the capability to conduct transformative research via scalable, cost-effective, high-performance, general-purpose systems built from commercial off-the-shelf (COTS) components. The particular objective is to build a compute cluster and related infrastructure, called Novo-G#, that facilitates research advancing such computer systems. Specifically, Novo-G# will be an enhancement of Novo-G, an accelerator-based reconfigurable compute cluster housed at the University of Florida (UF). The unifying technical mechanism to be explored is the integration of communication and computation in accelerator-centric clusters with direct and programmable interconnects. The specific accelerator technology we are investigating is the FPGA, currently the only COTS component that combines innate communication support, high computational capability, low power, and an installed application base.
The intellectual merit of this work derives from the novel technical mechanism underlying our system: using the same FPGAs for communication as for computation. This approach has several advantages. First, it mitigates the critical bottleneck caused by the separation of functions among multiple devices. Second, FPGA-based communication gives the flexibility either to use standard protocols or to use innovative application-aware routing that enables important patterns, such as the Fast Fourier Transform, to be routed congestion-free. Finally, FPGAs are well suited to Molecular Dynamics and other Scientific Computing computations because they allow hierarchical data movement to be addressed through innovative routing, load-balancing, and arithmetic.
The construction, demonstration, and dissemination of Novo-G#, together with building and organizing the associated communities, will be the primary activities of an Infrastructure Enhancement grant (Phase II). In the current grant (Community Infrastructure Planning, or Phase I), our overall technical goal was to eliminate most of the risk associated with creating this potentially high-impact infrastructure. Our aim was to identify and answer the major technical questions and to gather sufficient information to justify all major design decisions. The overall non-technical goal of Phase I was to begin building and organizing the stakeholder communities: code owners, vendors, end users, and system developers and researchers.
All objectives were fully met. The key finding was that our tentative design for Novo-G# is appropriate for the overall goal of creating an FPGA-centric compute cluster and that this cluster will have the capabilities projected. In addition, several publications and technical components were created; the follow-on (Phase II) proposal was written and largely followed the original plan outlined in the Phase I proposal; and that Phase II proposal has been funded, with work begun on building Novo-G#.
The specific tasks completed include the following: (i) instrumentation of key Scientific Computing programs and their use to derive design requirements for Novo-G#; (ii) design and evaluation of the inter-FPGA direct network; (iii) testing of the network with a critical bottleneck application; and (iv) building a Novo-G# community of end users, researchers, and developers, including representatives from academia, government, industry, and research labs.
The broader impact derives from the infrastructure to be created, which will include the system design, software, and FPGA configurations. The system infrastructure will be self-sustaining through the creation and organization of the stakeholder communities: code owners, users, and developers. A 256-FPGA system is projected to deliver an order-of-magnitude performance improvement over much larger conventional systems, at a fraction of the operating cost. This approach will give more researchers more computational capability and enable them to attack problems that currently can be addressed only by high-end facilities. These applications could include, e.g., finding transition paths between functional states of molecular assemblies. The system will be completely general-purpose, with no application-specific architectural decisions. The resulting design will therefore be generally cost-effective for high-performance scientific and engineering applications, especially those currently limited by the computation-communication bottleneck.