Anirban Mandal, Min Yeol Lim, Allan Porterfield, Rob Fowler. Effects of Multi-core Memory Concurrency Limits on Multi-threaded Applications, Technical Report TR-10-03, RENCI, North Carolina, September 2010.
Memory access is becoming an increasingly significant impediment to extracting performance from multi-core systems. More than ever, the effectiveness of an application's memory system use is a critical determinant of performance. In previous work, we demonstrated that explicit consideration of memory concurrency provides a better model for memory performance on multi-socket, multi-core systems than best-case latency and bandwidth alone. This paper investigates some of the implications of this for application structure and compiler optimization. We developed a methodology that uses hardware performance counters in a performance reflection tool, RCRTool, to measure achieved memory concurrency. We applied this methodology to several important memory-bound scientific applications and kernels compiled with varying levels of optimization. We convolve the observed application concurrency with the available system memory concurrency to derive insights for compilers and application tuners. The models provide compilers and runtimes with information about how load on the memory sub-system changes the effectiveness of various optimizations. As the number of hardware cores/threads increases, and as off-chip memory bandwidth per core remains constant or decreases, these measurements and analyses can provide insights to compiler and application writers. For example, on highly-threaded systems the memory system can be saturated even if each software thread offers only 2 or 3 concurrent memory references. The implication is that optimizations to improve cache usage are more important than ever, while program transformations designed to increase memory concurrency may lose their utility.
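To give a rough sense of the saturation claim, the sketch below applies a Little's Law style estimate (sustained bandwidth ≈ concurrent references × cache-line size / memory latency). The latency, line size, per-thread concurrency, and peak-bandwidth figures are illustrative assumptions, not measurements from the report.

/* Little's Law sketch: roughly how many threads saturate the memory system
 * if each thread offers only a few concurrent references?
 * All parameter values below are illustrative assumptions. */
#include <stdio.h>

int main(void) {
    double latency_ns      = 100.0;  /* assumed DRAM access latency (ns)        */
    double line_bytes      = 64.0;   /* cache-line size (bytes)                 */
    double refs_per_thread = 3.0;    /* concurrent references offered per thread */
    double peak_GBps       = 25.0;   /* assumed per-socket memory bandwidth     */

    /* Demand from one thread: refs * line / latency; bytes per ns equals GB/s. */
    double per_thread_GBps = refs_per_thread * line_bytes / latency_ns;

    printf("per-thread demand:            %.2f GB/s\n", per_thread_GBps);
    printf("threads to saturate %.0f GB/s: %.1f\n",
           peak_GBps, peak_GBps / per_thread_GBps);
    return 0;
}

Under these assumed numbers, each thread demands about 1.9 GB/s, so roughly a dozen threads suffice to saturate the memory system, consistent with the observation that highly-threaded systems saturate at only 2 or 3 concurrent references per software thread.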