A scientific program is a mathematical one, so two factors contribute to its performance. One is the time required to perform arithmetic. The other is the time needed to move data through the memory hierarchy of the computer. In today's large applications, the latter cost often dominates. The work funded by this grant focuses on efficient computational methods for solving the problems in matrix algebra that arise in a wide variety of science and engineering applications. Codes for such problems are typically constructed as sequences of calls to the routines known as the Basic Linear Algebra Subprograms (BLAS). Writing programs in this way promotes readability and maintainability but can be costly in terms of memory efficiency, especially for matrices of large order.
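As a minimal illustration of this style of programming, the sketch below computes two matrix-vector products, y = A*x and z = A*w, as two separate calls through the standard CBLAS interface. The specific computation and the routine name two_products are hypothetical, chosen only to make the memory cost visible: each call streams all of A through the memory hierarchy, so A is read from main memory twice.

```c
/* Illustrative example (not from the proposal): two matrix-vector
 * products written as a sequence of BLAS calls. */
#include <cblas.h>

void two_products(int n, const double *A,
                  const double *x, const double *w,
                  double *y, double *z)
{
    /* y = A * x : streams the whole n-by-n matrix A from memory */
    cblas_dgemv(CblasRowMajor, CblasNoTrans, n, n,
                1.0, A, n, x, 1, 0.0, y, 1);
    /* z = A * w : streams A from memory a second time */
    cblas_dgemv(CblasRowMajor, CblasNoTrans, n, n,
                1.0, A, n, w, 1, 0.0, z, 1);
}
```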
The grant will be used primarily to support a Ph.D. student who will study ways to combine multiple BLAS routines into a single routine that performs the work of several. He will examine ways to create these composed BLAS via novel algorithms and performance programming techniques, developing a general methodology for their creation in the process. His work will ultimately form the basis for a tool that creates composed BLAS automatically. Composed routines can significantly reduce the amount of data read from main memory. Preliminary results include speedups as large as 90%.
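To make the idea of a composed BLAS concrete, here is a sketch that fuses the two matrix-vector products from the earlier example into one loop nest. The routine name composed_gemv2 is illustrative and not taken from the project; the point is that each element of A is loaded exactly once and used for both products, roughly halving the main-memory traffic for A relative to two separate GEMV calls.

```c
/* A minimal sketch of a composed BLAS routine that computes
 * y = A*x and z = A*w in a single pass over A. */
void composed_gemv2(int n, const double *A,
                    const double *x, const double *w,
                    double *y, double *z)
{
    for (int i = 0; i < n; i++) {
        double yi = 0.0, zi = 0.0;
        for (int j = 0; j < n; j++) {
            double aij = A[i * n + j];  /* loaded once, used twice */
            yi += aij * x[j];
            zi += aij * w[j];
        }
        y[i] = yi;
        z[i] = zi;
    }
}
```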
A scientist needing high-quality computational tools is tasked with mastering advanced concepts in computer science and carrying out complex programming tasks, in addition to managing the scientific content of the work. The problem is compounded by the constant evolution of computer hardware. Scientists are often forced to choose between investing substantial time in tuning computer codes and accepting performance that can be significantly lower than the best attainable. In either case, the productivity of the scientist degrades. To aid in the process of software tuning, our research concerned tools for automating the development of high-performance software for scientific applications.

A scientific program is a mathematical one, and two factors contribute to its performance. One is the time required to perform arithmetic. The other is the time needed to move data through the memory hierarchy of the computer. In today's large applications, the latter cost often dominates. Our work focused on reducing memory access costs in the solution of the matrix algebra problems that constitute the most expensive part of many scientific computations. Historically, codes for matrix algebra have been constructed as sequences of calls to subprograms that perform fundamental operations. Writing programs in this way promotes readability and maintainability but can be costly in terms of memory efficiency, as the same data are accessed by the routines one after another. One remedy is a powerful optimization known as loop fusion, which combines the subprograms into a single routine that traverses the data only once. Our work began with a study of the performance effects of loop fusion, which ultimately led us to develop a formal, analytical model of the computer's memory hierarchy for fused matrix algebra computations. We integrated the model into the Build to Order (BTO) compiler, which converts high-level descriptions of matrix algebra into optimized implementations. Because the model is accurate yet inexpensive to evaluate, incorporating it reduced BTO's compilation times.

Intellectual Merit: We have improved the understanding of the factors that influence memory traffic in matrix algebra computations, and we have built a tool that aids in the development of high-performance matrix algebra implementations.

Broader Impacts: The model developed under this grant has been released to the public as part of the open-source compiler BTO, which helps relieve scientists of the burden of high-performance programming. The grant also funded a number of educational opportunities: it primarily supported one Ph.D. student, but it also provided several undergraduates with their first research experiences.
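As a back-of-the-envelope illustration of the kind of comparison an analytical memory model can make, the toy calculation below estimates the main-memory traffic attributable to A for the unfused and fused variants above, assuming an n-by-n double-precision matrix too large to fit in cache. This is a deliberate simplification for exposition and is not the model used in BTO, which accounts for the full memory hierarchy.

```c
/* Toy memory-traffic estimate: the unfused code streams A twice,
 * the composed (fused) code streams it once. */
#include <stdio.h>

int main(void)
{
    long n = 10000;  /* hypothetical matrix order */
    double mb = (double)n * (double)n * sizeof(double)
                / (1024.0 * 1024.0);
    printf("A traffic, two GEMV calls: %.0f MB\n", 2.0 * mb);
    printf("A traffic, composed loop : %.0f MB\n", mb);
    return 0;
}
```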