* Remote access costs per se are not the main performance concern on NUMA systems; congestion on memory controllers and interconnects hurts performance far more. Memory placement algorithms must therefore be redesigned to target traffic congestion. In this talk, I will describe Carrefour, an algorithm that addresses this goal. We implemented Carrefour in Linux and obtained performance improvements of up to 3.6x relative to the default kernel, as well as significant improvements over the NUMA-aware patchsets available for Linux.
* Modern NUMA hardware has asymmetric interconnects. When nodes are connected by links of different bandwidths, we must consider not only whether threads and data are placed on the same node or on different nodes, but also how those nodes are connected. The key new insight is that the best-performing connectivity is the one with the greatest total bandwidth, not the one with the fewest hops. I will present a dynamic thread and memory placement algorithm for Linux that performs as well as or better than the best static placement, and up to 2x better than a randomly chosen placement.
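
The bandwidth-versus-hops insight in the second abstract can be illustrated with a small sketch. This is not the algorithm presented in the talk. It is a toy brute-force comparison under stated assumptions: the four-node machine, the `BW` and `HOPS` matrices, and the scoring of a node set by summing pairwise values are all hypothetical and chosen only to show that the two criteria can disagree.

```python
from itertools import combinations

# Hypothetical 4-node machine (all numbers invented for illustration).
# BW[a][b]: effective bandwidth between nodes a and b, in GB/s.
BW = [
    [0, 3, 2, 8],
    [3, 0, 2, 3],
    [2, 2, 0, 2],
    [8, 3, 2, 0],
]
# HOPS[a][b]: number of interconnect hops between nodes a and b.
HOPS = [
    [0, 1, 1, 1],
    [1, 0, 1, 2],
    [1, 1, 0, 2],
    [1, 2, 2, 0],
]


def total_bandwidth(nodes, bw):
    """Sum the pairwise bandwidth over every pair of nodes in the set."""
    return sum(bw[a][b] for a, b in combinations(nodes, 2))


def total_hops(nodes, hops):
    """Sum the pairwise hop count over every pair of nodes in the set."""
    return sum(hops[a][b] for a, b in combinations(nodes, 2))


def best_by_bandwidth(num_nodes, k, bw):
    """Return the k-node set with the greatest total pairwise bandwidth."""
    return max(combinations(range(num_nodes), k),
               key=lambda s: total_bandwidth(s, bw))


def best_by_hops(num_nodes, k, hops):
    """Return the k-node set with the smallest total pairwise hop count."""
    return min(combinations(range(num_nodes), k),
               key=lambda s: total_hops(s, hops))


if __name__ == "__main__":
    by_bw = best_by_bandwidth(4, 3, BW)
    by_hops = best_by_hops(4, 3, HOPS)
    print("bandwidth-first:", by_bw, total_bandwidth(by_bw, BW), "GB/s")
    print("hops-first:     ", by_hops, total_bandwidth(by_hops, BW), "GB/s")
```

With these invented numbers, the hop-minimizing choice is nodes (0, 1, 2), which are all directly connected but offer only 7 GB/s in total, while the bandwidth-maximizing choice is nodes (0, 1, 3), which includes a two-hop pair yet offers 14 GB/s. A real placement algorithm would measure bandwidth on the actual machine and would not enumerate all subsets.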