The Extensible Terascale Facility (ETF) is the next stage in the evolution of NSF large-scale cyberinfrastructure for enabling high-end computational research. The ETF enables researchers to address the most challenging computational problems by utilizing the integrated resources, data collections, instruments, and visualization capabilities of nine resource partners. On October 1, 2004, the ETF concluded a three-year construction effort to create this distributed environment, called the TeraGrid (TG), and we are now entering the production operations phase.
The TeraGrid resource partners include: the University of Chicago/Argonne National Laboratory, the San Diego Supercomputer Center at UCSD, the Texas Advanced Computing Center at UT-Austin, the National Center for Supercomputing Applications at UIUC, Indiana University, Purdue University, Oak Ridge National Laboratory, and the Pittsburgh Supercomputing Center.
A separate proposal was submitted to NSF on October 19, 2004 for the TeraGrid Grid Infrastructure Group (GIG). Under the direction of Charlie Catlett at UC/ANL, the GIG will be responsible for coordinating TeraGrid development activities, with subcontracts to the partner sites. The resource partners (RPs) will each have independent cooperative agreements with NSF, but will work closely with the GIG to implement the vision of the TeraGrid.
This proposal outlines the plan at PSC to participate as a resource partner in the TeraGrid team to provide the expanding user community with ongoing access to this computational science facility. This proposal covers the period November 1, 2004 through October 31, 2009.
The TeraGrid is one of the first large-scale grids to fully integrate grid capabilities with the policies and procedures of both autonomous open national computing centers and universities. As we move from construction to operation, the TeraGrid resource partners will work with users to ensure that the TeraGrid delivers on its promise of convenient, reliable, persistent capabilities; to advance a set of user priorities that includes new Grid services such as co-scheduling, metascheduling, and parameter-search tools, along with advances in data management and handling; and to implement and support a variety of tools and capabilities for science gateways.
The PSC is dedicated to the TG goal of promoting breakthrough science by U.S. researchers through the use of advanced IT capabilities. Our highly skilled staff will continue to make substantial contributions to the TG in many areas of long-standing PSC excellence, including user support, systems, networking, security, and dissemination, with particular emphasis on creating a leadership-class computing capability within the NSF TG program to implement the NSF HPC+X strategy. Our strong, longstanding, and ongoing relationships with many of the best computational scientists in the country are an integral part of our approach both to providing service and to improving that service. We support their scientific endeavors by enhancing their applications to exploit very large parallel computing resources and by deploying and operating the best available of these resources to maximize their scientific output.
We focus on the most demanding applications. For the past six months, more than half of the TCS processor hours delivered were for jobs using at least 1,024 processors. For the month of September, 30% of the work used at least 2,048 processors. Over the past year, fifty-one PIs ran jobs that required at least 1,024 processors. These impressive results are achieved through our intense focus on user requirements for high-end computing, which has many aspects, including user and applications support staff tightly associated with the research groups, flexible and responsive scheduling, and carefully considered systems design. This focus has served the community well and will be applied to the expanding TG efforts.
PSC has been selected to install, evaluate, and operate a new prototype system that has the potential to become a Leadership Class resource for the NSF program. We are confident that our continued concentration on high-end computing will bring success to this endeavor. We eagerly anticipate the opportunity to advance U.S. scientific and engineering research through our unique capabilities and contributions to the TeraGrid.
This proposal covers plans for the support, further development, and tight integration of our systems into a unified resource for computational science research; the integration of that resource into the TG; and the further development of the TG itself. In addition, we describe the user support required for the growing productive use of all these resources in support of the distributed, national base of IT-dependent research in science and engineering. We cover this very broad ground starting from the local PSC level and working up to the national, distributed TG level. The prescribed headings that follow may suggest that each of these activities stands on its own; in fact, PSC devotes considerable effort to integrating them to be more responsive to users. At all levels, we address the existing capabilities, their continual augmentation and improvement, their assembly into an integrated whole, and the efforts required to make that whole responsive to the needs of our clients, the scientific and engineering research communities. Our activities are highly efficient: our staff and budget are significantly smaller than those of other organizations with comparable responsibilities. In part, this is achieved by targeted development of new technology approaches rather than accepting higher-cost commercial offerings designed for enterprises with different requirements.
During this project, PSC operated five major supercomputers for use by US scientists in their computational studies:

- Blacklight (2010-present), an SGI UV 1000 cc-NUMA system, the world's largest shared-memory system
- BigBen (2004-10), a Cray XT3, a leading massively parallel system
- LeMieux (2001-06), a Compaq AlphaServer ES45, at introduction the world's 2nd most powerful system and the most powerful committed to open research
- Pople (2008-11), an SGI Altix 4700 shared-memory NUMA system, a step toward Blacklight
- Rachel (2003-08), an HP GS1280 AlphaServer, with the separately funded Jonas, the first two GS1280s from HP and PSC's start in shared memory

PSC made numerous improvements to these systems, which their vendors adopted and incorporated into their products. These systems were all supported by various data storage systems and high-performance networking.

With guidance from PSC User Support, researchers who used these systems achieved many scientific accomplishments with great societal benefits. For instance, scientists from the University of Oklahoma used PSC's and other centers' computers, along with Doppler radar and high-speed networks, to improve the forecasting of the thunderstorm supercells that often spawn tornadoes, thus providing earlier warning of possible tornadoes. In chemistry, researchers from PSC and Pitt found a mechanism that may be a key to understanding aldehyde dehydrogenase-related metabolic diseases such as Sjogren-Larsson syndrome, an inherited disorder that leads to skin scaling and mental retardation; and researchers at Pitt made studies of the structure of water that have important implications for atmospheric chemistry. In genomics, researchers from Berkeley showed that differences in social behavior can lead to changes in selection pressures and gene-level evolutionary changes in a species. In the field of Big Data, researchers from the University of Illinois digitized hand-written forms from the 1940 US census and built an electronic database of the census information. With the database, other researchers will be able to query and analyze the data to gain insights into the society of 1940. Graphics for each of these scientific achievements accompany this report.

Articles on many more scientific results obtained by other users of PSC's supercomputers are available at www.psc.edu/science/. This website contains more than 100 articles, which can be located by keyword search or browsed by the topical areas of Life Sciences, Earth & Society, The Universe, Technology & Manufacturing, and Physics & Chemistry.

PSC is active in education, outreach, and training. During this project there were about 7,700 participants in our training and education events; in the last 12 months alone, 1,125 people attended 28 courses, workshops, and lectures taught by PSC staff. PSC has several science education programs, most aimed at K-12 and undergraduate-level teachers and some aimed at students. Better Educators of Science for Tomorrow prepares teachers to refocus their teaching strategies toward encouraging students to become aware of emerging and exciting biomedical careers; it provides a high school-level bioinformatics curriculum. Computational Modules in Science Teaching brings innovative science tutorials into secondary school and college classrooms.
The Computation and Science for Teachers Professional Development Program is an integrated set of modules that trains teachers to incorporate computational reasoning and tools, such as modeling and simulation, into their middle and high school math and science curricula. Safety Awareness For Everyone on the Net focuses on raising the awareness of students, parents, and educators about cyber threats, measures of protection, and cyber ethics. In addition, PSC has a vigorous student intern program that helps students prepare for careers in science and technology.

PSC staff members also create many new technologies; just three are discussed here. PSC staff have obtained a patent for ZEST, a highly scalable parallel file system designed for maximum efficiency with write-intensive application workloads such as checkpointing. PSC staff have patents pending for SLASH2 and the Data Supercell. SLASH2 is a distributed filesystem that incorporates existing storage resources into a common filesystem domain, providing system-managed storage tasks for users who work in widely distributed environments. The SLASH2 metadata controller performs inline replication management, maintains data checksums, and coordinates third-party, parallel data transfer between constituent data systems. The SLASH2 I/O service is a portable, user-space process that does not interfere with the underlying storage system's administrative model. The Data Supercell is a high-performance (high-bandwidth, low-latency), high-reliability, high-capacity, low-power storage system for large data storage requirements, including but not limited to near-line storage, data warehousing, large-scale data analytics, and archiving. Using SLASH2, the Data Supercell couples appropriate hardware and software technologies in a unique way to deliver valuable functionality for meeting large data handling requirements.
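To make the SLASH2 division of labor concrete, the following minimal sketch (in Python; every class, method, and service name here is a hypothetical illustration, not the actual SLASH2 code or API) shows how a metadata controller might track which I/O services hold replicas of a file, verify that replica checksums agree, and decide where data still needs to be moved by third-party transfer.

    # Illustrative sketch only: names and structures are hypothetical,
    # not the actual SLASH2 implementation or API.
    import hashlib

    class MetadataController:
        """Tracks, per file, which I/O services hold a replica and the
        checksum each replica reported, in the spirit of SLASH2's
        metadata controller."""

        def __init__(self):
            self.replicas = {}   # path -> {io_service_name: checksum}

        def register_replica(self, path, io_service, data):
            # Record a checksum for the copy held by this I/O service.
            digest = hashlib.sha256(data).hexdigest()
            self.replicas.setdefault(path, {})[io_service] = digest

        def verify(self, path):
            # All replicas of a file should report the same checksum.
            digests = set(self.replicas.get(path, {}).values())
            return len(digests) == 1

        def plan_replication(self, path, all_services):
            # Inline replication management: list the services that still
            # need a copy, so data can be moved directly between I/O
            # services by third-party transfer.
            have = set(self.replicas.get(path, {}))
            return [s for s in all_services if s not in have]

    # Example: two sites hold consistent copies; a third still needs one.
    mdc = MetadataController()
    data = b"simulation output"
    mdc.register_replica("/proj/run1.dat", "psc-io", data)
    mdc.register_replica("/proj/run1.dat", "ncsa-io", data)
    assert mdc.verify("/proj/run1.dat")
    print(mdc.plan_replication("/proj/run1.dat", ["psc-io", "ncsa-io", "sdsc-io"]))
    # -> ['sdsc-io']

In this sketch the controller handles only metadata (replica locations and checksums); the bulk data itself would travel directly between I/O services, which is the essence of the third-party, parallel transfer that the SLASH2 metadata controller coordinates.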