Today, simulation is the de facto method for studying multicore cache hierarchies. But simulation is costly due to the combinatorial design spaces involved, especially as multicore processors scale to 100s of cores and 100+ MB of on-chip cache. Reuse distance (RD) analysis can help architects evaluate multicore memory performance more efficiently. Unfortunately, locality in multicore processors depends on how per-thread memory reference streams interleave. Reliance on memory interleaving makes multicore locality profiles architecture dependent, limiting their ability to analyze different configurations. For loop-based parallel programs, however, threads are typically symmetric and exhibit similar locality characteristics. Such thread symmetry makes multicore RD analysis tractable: locality profiles remain stable with respect to cache capacity scaling, and change systematically with core count and problem size scaling.
This project is exploring several research directions related to multicore RD analysis for loop-based parallel programs. First, it is characterizing how Concurrent RD and per-thread RD profiles for symmetric threads vary with processor and problem scaling. Second, it is developing techniques to predict these profile variations. Simple prediction techniques such as reference groups, as well as more sophisticated parametric and non-parametric learning approaches, are being studied. Finally, it is applying the new RD analysis to explore large-scale multicore design spaces, identifying good cache hierarchy organizations. It is also using the RD analyses to improve existing memory performance enhancement techniques such as multithreading and locality optimization.