Improving performance on NUMA systems

Wednesday, March 2, 2016
Speaker(s): Baptiste Lepers, postdoc at EPFL

NUMA systems are characterized by Non-Uniform Memory Access times: accessing data on a remote node takes longer than accessing it locally. NUMA hardware has been built since the late 1980s, and the operating systems designed for it were optimized for access locality. They co-located memory pages with the threads that accessed them, so as to avoid the cost of remote accesses. However, modern NUMA hardware is much more complex, and optimizing for locality is no longer sufficient, for two reasons:
* Remote access costs per se are not the main performance concern. Instead, congestion on memory controllers and interconnects hurts performance far more. Consequently, memory placement algorithms must be redesigned to target traffic congestion. In this talk, I will describe Carrefour, an algorithm that addresses this goal. We implemented Carrefour in Linux and obtained performance improvements of up to 3.6x relative to the default kernel, as well as significant improvements compared to NUMA-aware patchsets available for Linux.
* Modern NUMA hardware has asymmetric interconnects. When the nodes are connected by links of different bandwidths, we must consider not only whether the threads and data are placed on the same or different nodes, but also how these nodes are connected. The key new insight is that the best-performing connectivity is the one with the greatest total bandwidth, as opposed to the smallest number of hops. I will present a dynamic thread and memory placement algorithm for Linux that performs as well as or better than the best static placement, and up to 2x better than a randomly chosen placement.
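
To make the second point concrete, here is a minimal sketch of how ranking node sets by total bandwidth can disagree with ranking them by hop count. It is not the algorithm presented in the talk: the 4-node topology, the numbers in `LINKS`, and the helper names are all hypothetical, and the search is a plain brute force over node subsets.

```python
from itertools import combinations

# Hypothetical measurements for a 4-node machine (made-up numbers, illustration only).
# Keys are node pairs (a, b) with a < b; values are (hops, effective bandwidth in GB/s).
LINKS = {
    (0, 1): (1, 2.0),
    (0, 2): (1, 2.0),
    (1, 2): (1, 2.0),
    (0, 3): (1, 6.0),
    (1, 3): (2, 5.0),
    (2, 3): (2, 1.0),
}


def total_hops(nodes):
    """Sum of hop counts over every pair of nodes in the set."""
    return sum(LINKS[pair][0] for pair in combinations(sorted(nodes), 2))


def total_bandwidth(nodes):
    """Sum of effective bandwidth over every pair of nodes in the set."""
    return sum(LINKS[pair][1] for pair in combinations(sorted(nodes), 2))


def pick_nodes(num_nodes, k, score, best):
    """Brute-force choice of k nodes; `best` is min or max applied to `score`."""
    return best(combinations(range(num_nodes), k), key=score)


# Place a 3-node application using each criterion.
fewest_hops = pick_nodes(4, 3, total_hops, best=min)
most_bandwidth = pick_nodes(4, 3, total_bandwidth, best=max)

print("fewest hops:   ", fewest_hops, total_hops(fewest_hops), "hops,",
      total_bandwidth(fewest_hops), "GB/s")
print("most bandwidth:", most_bandwidth, total_hops(most_bandwidth), "hops,",
      total_bandwidth(most_bandwidth), "GB/s")
```

With these assumed numbers, the hop-count criterion selects a fully one-hop set joined by narrow links, while the bandwidth criterion accepts one two-hop pair in exchange for more than twice the aggregate bandwidth.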

Gilles.Muller (at) lip6.fr