Every aspect of parallel software development is more complicated than for serial programs. This research focuses on one of the primary sources of complexity: intra-application communication. Currently, it is the programmer's responsibility to find an efficient mapping of an application's communication patterns onto the communication infrastructure of the target system. This research flips that responsibility by developing a flexible communication architecture, together with associated tools and algorithms, that allows the target platform to be specialized for a particular application rather than vice versa. In addition to reducing the programmer's burden, specialization has the potential to improve communication efficiency, while the automated techniques can increase portability.
This research poses questions whose answers have consequences at several levels of the traditional system stack: Can programmers be freed from hardware-specific optimization of communication without degrading performance? What abstractions are needed to allow hardware to adapt to the programmer, rather than the other way around? Can communication efficiency be improved when running on an application-specific communication platform? The project answers these questions by exploring abstractions and algorithms to profile a parallel program's communication, synthesize a custom network design, and implement that design on a configurable network architecture substrate. The research methods center on the X10 language and include compiler instrumentation passes, offline analyses of communication profiles, development of a portable network intermediate representation, and place-and-route algorithms for network synthesis.
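As a concrete illustration (a minimal sketch, not the project's actual tooling), the fragment below shows the kind of data an offline communication profile might aggregate and how a simple profile-guided placement step could map heavily communicating X10 places onto adjacent nodes of a target topology. All names (CommProfile, greedy_place, ring_distance) and the ring topology are hypothetical; a real synthesis flow would use heuristic placers rather than exhaustive search.

    # Hypothetical sketch: aggregate a communication profile into a traffic
    # matrix between places, then choose a place-to-node assignment that
    # minimizes total bytes * hops on a small ring topology.
    from collections import defaultdict
    from itertools import permutations

    class CommProfile:
        """Aggregated bytes sent between every pair of places."""
        def __init__(self):
            self.traffic = defaultdict(int)   # (src_place, dst_place) -> bytes

        def record(self, src, dst, nbytes):
            self.traffic[(src, dst)] += nbytes

    def ring_distance(a, b, n):
        """Hop count between nodes a and b on an n-node ring."""
        d = abs(a - b)
        return min(d, n - d)

    def greedy_place(profile, n_places):
        """Exhaustively pick the assignment minimizing bytes * hops.
        Fine for tiny n; real tools would use a heuristic placer."""
        best, best_cost = None, float("inf")
        for perm in permutations(range(n_places)):   # perm[p] = node for place p
            cost = sum(nbytes * ring_distance(perm[s], perm[d], n_places)
                       for (s, d), nbytes in profile.traffic.items())
            if cost < best_cost:
                best, best_cost = perm, cost
        return best, best_cost

    # Example: places 0 and 2 communicate heavily, so a good mapping
    # places them on adjacent nodes of the ring.
    prof = CommProfile()
    prof.record(0, 2, 10_000)
    prof.record(1, 3, 8_000)
    prof.record(0, 1, 500)
    mapping, cost = greedy_place(prof, 4)
    print("place -> node:", mapping, "cost:", cost)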
In order to make efficient use of modern computing hardware, it is often necessary for software to be parallelized. Parallelization can be challenging, particularly with respect to understanding a program's runtime behavior and diagnosing correctness and performance problems. The goal of this project was to develop and apply a new perspective from which to analyze a parallel program's execution. As a result of this award, we have developed the concept of a parallel block vector profile. This profile divides an application into fragments called basic blocks and records the degree of parallelism exhibited by the application, i.e., the number of parallel threads, each time each basic block is executed. There are a number of ways to exploit this information. For example, we have analyzed applications on a per-fragment basis and found, contrary to prevalent models, that basic blocks do not always belong exclusively to the serial or parallel regions of an application. A second analysis explored the scalability of an application on a per-block basis, a much finer granularity than has previously been explored. A third use would be in conjunction with a timing model, to help reveal hidden performance bottlenecks. Because a parallel block vector profile simply reveals information, its uses are potentially both numerous and varied.

A second outcome of this award was the development and release of an open source tool, called Harmony, which can be used to collect parallel block vector profiles. Collecting the fine-grained information demanded by these profiles without incurring substantial overhead is a non-trivial endeavor, and Harmony has been carefully designed and optimized to minimize any perturbation to the application under measurement.
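To make the profile concept concrete, the following minimal sketch (hypothetical, not Harmony's actual implementation) shows one way a parallel block vector could be represented and queried: for each basic block, a vector counts how many times the block executed at each degree of parallelism, from which per-block properties such as the fraction of serial executions can be derived.

    # Hypothetical sketch of a parallel block vector profile:
    # block_id -> vector of execution counts indexed by number of live threads.
    from collections import defaultdict

    class ParallelBlockVectorProfile:
        def __init__(self, max_threads):
            self.max_threads = max_threads
            # index t counts executions observed with t+1 concurrently live threads
            self.vectors = defaultdict(lambda: [0] * max_threads)

        def record(self, block_id, live_threads):
            """Conceptually invoked on every dynamic execution of a basic block."""
            self.vectors[block_id][live_threads - 1] += 1

        def serial_fraction(self, block_id):
            """Fraction of this block's executions that occurred with one live thread;
            illustrates that blocks are rarely purely serial or purely parallel."""
            vec = self.vectors[block_id]
            total = sum(vec)
            return vec[0] / total if total else 0.0

    # Toy usage: a loop body that runs mostly, but not exclusively, in parallel.
    profile = ParallelBlockVectorProfile(max_threads=4)
    for _ in range(3):
        profile.record("loop_body", live_threads=1)   # warm-up iterations, serial
    for _ in range(97):
        profile.record("loop_body", live_threads=4)   # steady state, 4 threads
    print(profile.serial_fraction("loop_body"))        # -> 0.03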