Proposal #: CNS 07-09140   PI(s): Brockman, Jay B.; Barabasi, Albert-Laszlo; Chawla, Nitesh; Kogge, Peter M.   Institution: University of Notre Dame, Notre Dame, IN 46556-5602
Proposal #: CNS 07-08307   PI(s): Bader, David A.; Vetter, Jeffrey S.   Institution: Georgia Institute of Technology, Atlanta, GA 30332-0002
Proposal #: CNS 07-08820   PI(s): Gao, Guang R.   Institution: University of Delaware, Newark, DE 19716-1551
Proposal #: CNS 07-09385   PI(s): Gilbert, John R.; Wolski, Richard   Institution: University of California-Santa Barbara, Santa Barbara, CA 93106-2050
Proposal #: CNS 07-09111   PI(s): Upchurch, Edwin T.   Institution: California Institute of Technology, Pasadena, CA 91125-0600
Proposal #: CNS 07-09254   PI(s): Yelick, Katherine A.   Institution: University of California-Berkeley, Berkeley, CA 94704-5940
Title: Collaborative Research: IAD: Developing Research Infrastructure for Multithreaded Computing Community Using Cray Eldorado Platform
Project Proposed:
This collaborative project brings together a diverse group of researchers from six universities in a joint effort to develop the shared infrastructure needed for software that will run on the next generation of computer hardware. The work responds to the trend toward multicore processors, where developers envision placing tens to hundreds of cores on a single die, each running multiple threads (in contrast to the currently dominant message-passing architectures that grew out of MPI and Linux clusters). Three objectives are proposed:
- Acquiring computer hardware as a shared community resource capable of efficiently running, in experimental and production modes, complex programs with thousands of threads in shared memory;
- Assembling the software infrastructure for developing, and measuring the performance of, programs running on that hardware; and
- Building stronger ties among the people themselves, creating ways for researchers at the partner institutions to collaborate and to communicate their findings to the broader community.
The Cray XMT system, scheduled for delivery in 2007, serves as an ideal platform. The second objective encompasses algorithms, data sets, libraries, languages, tools, and simulators for evaluating the performance of programs running on the hardware, focusing on applications that benefit from large numbers of threads: massively data-intensive, "sparse-graph" problems that are difficult to parallelize using conventional message passing on clusters (a sketch of this style of computation follows below). Each university contributes a piece of the infrastructure and uses it to support its own projects. Sandia National Laboratories has agreed to host the system and provide supplementary funding. Each university will also use the Cray XMT system in courses.
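To make the target workload concrete, the following is a minimal sketch, in portable C rather than XMT-specific code, of the fine-grained, shared-memory style of graph computation described above. C11 atomics and POSIX threads stand in for the XMT's hardware fetch-and-add and full/empty-bit synchronization, and the small CSR graph is an illustrative toy input, not project data.

/* Minimal sketch of fine-grained shared-memory graph traversal of the kind
 * the Cray XMT targets.  C11 atomics and POSIX threads stand in for the
 * XMT's hardware int_fetch_add and full/empty-bit synchronization; the CSR
 * graph below is a hypothetical toy input, not project data. */
#include <pthread.h>
#include <stdatomic.h>
#include <stdio.h>

#define NV 6                      /* number of vertices in the toy graph   */
static const int row[NV + 1] = {0, 2, 4, 6, 7, 8, 8};    /* CSR row offsets */
static const int col[8]      = {1, 2, 3, 4, 5, 0, 5, 3}; /* CSR adjacency   */

static int  frontier[NV] = {0};   /* current frontier (vertex ids)          */
static int  next_f[NV];           /* next frontier                          */
static int  frontier_len = 1;
static atomic_int next_len = 0;
static atomic_int cursor   = 0;   /* shared work counter over the frontier  */
static atomic_int visited[NV];    /* 0 = unvisited, 1 = visited             */

static void *expand(void *arg) {
    (void)arg;
    for (;;) {
        /* Each thread claims one frontier vertex at a time via
         * fetch-and-add -- the fine-grained, word-level style of work
         * sharing that message passing over clusters handles poorly. */
        int i = atomic_fetch_add(&cursor, 1);
        if (i >= frontier_len) break;
        int v = frontier[i];
        for (int e = row[v]; e < row[v + 1]; e++) {
            int w = col[e];
            if (atomic_exchange(&visited[w], 1) == 0)
                next_f[atomic_fetch_add(&next_len, 1)] = w;
        }
    }
    return NULL;
}

int main(void) {
    atomic_store(&visited[0], 1);          /* source vertex 0 */
    pthread_t t[4];
    for (int i = 0; i < 4; i++) pthread_create(&t[i], NULL, expand, NULL);
    for (int i = 0; i < 4; i++) pthread_join(t[i], NULL);
    printf("discovered %d vertices in the next frontier\n",
           atomic_load(&next_len));
    return 0;
}

On the XMT itself, the same pattern would be expressed with the machine's compiler pragmas and intrinsics and scaled to thousands of hardware threads rather than the four POSIX threads used in this sketch.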
Broader Impacts: The infrastructure measures performance, providing a basis for the community to improve sharing and to build strong ties for collaboration and communication. Courses will be created and their materials made available, and workshops for dissemination of the findings are also planned.
The main focus of this grant was to develop an infrastructure for research in the area of shared-memory multithreaded programming, based upon the acquisition of a Cray XMT computer system. The project was a collaboration among six universities: the University of Notre Dame, the University of Delaware, Georgia Tech, the University of California at Berkeley, the University of California at Santa Barbara, and the California Institute of Technology. Sandia National Labs hosted the XMT system in its computing facilities in Albuquerque, NM for use by the university community, providing maintenance and support of the system. In addition to contributing to the research program, the University of Notre Dame served as the administrative lead institution for the project.

Research work on this project at the University of Notre Dame centered on exploring architectural and hardware enhancements to multithreaded systems that extend the capabilities and improve the performance beyond those of the XMT. Work by PI Brockman and his students focused on lightweight chip-level multithreading schemes. Irregular and dynamic applications, such as graph problems and agent-based simulations, often require fine-grained parallelism to achieve good performance. However, current multicore processors provide architectural support only for coarse-grained parallelism, making it necessary to use software-based multithreading environments to implement fine-grained parallelism effectively. Although these software-based environments have demonstrated superior performance over heavyweight, OS-level threads, they are still limited by the significant overhead of thread management and synchronization. To address this, the group developed a Lightweight Chip Multi-Threaded (LCMT) architecture that further exploits thread-level parallelism by incorporating direct architectural support for an 'unlimited' number of dynamically created lightweight threads with very low thread-management and synchronization overhead. The LCMT architecture can be implemented atop a mainstream architecture with minimal extra hardware, leveraging existing legacy software environments. They compared the LCMT architecture with a Niagara-like baseline architecture. Results show up to 1.8X better scalability, 1.91X better performance, and, more importantly, 1.74X better performance per watt using the LCMT architecture for irregular and dynamic benchmarks, while the LCMT architecture delivers performance similar to the baseline for regular benchmarks.

Work by co-PI Kogge and his students investigated the design details of a Light Weight Processing migration-NUMA (LWP-mNUMA) architecture, a novel high-performance system design that provides hardware support for a partitioned global address space, migrating threads, and word-level synchronization primitives (a software sketch of this kind of word-level synchronization appears at the end of this summary). Using the architectural definition, combinations of structures are shown to work together to carry out basic actions such as address translation, migration, in-memory synchronization, and work management. Results from simulation of microkernels show that LWP-mNUMA compensates for latency with far greater memory-access concurrency than is possible on conventional systems. In particular, several microkernels model tough, irregular access patterns that, in certain problem areas, have limited speedups to dozens of conventional processors.
On these, results show speedup continuing to increase out to 1024 multicore mNUMA processing nodes running over 1 million threadlets. Work by co-PI Chawla and his students examined capabilities in 'Big Data' analysis, specifically the constraints of magnetic versus flash disks and the use of small compute clusters for large-scale data analysis.
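The word-level, in-memory synchronization mentioned for both the XMT and the LWP-mNUMA design can be illustrated with a short software sketch. The code below is not the project's implementation: it emulates full/empty-bit readfe/writeef semantics with C11 atomics for a single producer and a single consumer, and the names syncword_t, readfe, and writeef are used here only for illustration.

/* Sketch (not the project's implementation) of the word-level full/empty
 * synchronization that the XMT and the LWP-mNUMA design provide in hardware.
 * Each synchronized word carries a software "full" flag accessed with C11
 * atomics: readfe() waits until the word is full, returns its value, and
 * marks it empty; writeef() waits until the word is empty, stores a value,
 * and marks it full.  One producer and one consumer, for simplicity. */
#include <pthread.h>
#include <sched.h>
#include <stdatomic.h>
#include <stdio.h>

typedef struct {              /* one synchronized memory word               */
    atomic_int full;          /* 0 = empty, 1 = full (the hardware tag bit) */
    long value;
} syncword_t;

static long readfe(syncword_t *w) {          /* block until full, read, set empty */
    while (atomic_load(&w->full) == 0)       /* spin: word still empty             */
        sched_yield();
    long v = w->value;                       /* safe: write preceded full = 1      */
    atomic_store(&w->full, 0);               /* mark empty, releasing the word     */
    return v;
}

static void writeef(syncword_t *w, long v) { /* block until empty, write, set full */
    while (atomic_load(&w->full) != 0)       /* wait for the consumer to drain it  */
        sched_yield();
    w->value = v;
    atomic_store(&w->full, 1);
}

static syncword_t cell = { 0, 0 };           /* starts empty */

static void *producer(void *arg) {
    (void)arg;
    for (long i = 1; i <= 5; i++) writeef(&cell, i);
    return NULL;
}

int main(void) {
    pthread_t p;
    pthread_create(&p, NULL, producer, NULL);
    for (int i = 0; i < 5; i++)
        printf("consumed %ld\n", readfe(&cell));
    pthread_join(p, NULL);
    return 0;
}

In the hardware designs, the full/empty tag is carried on every memory word and the waiting is handled by the memory and thread-scheduling hardware rather than by software spin loops; the sketch above only mimics the semantics.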