Our paper on CPU cache prefetching was accepted at the DaMoN workshop, co-located with SIGMOD.
Title: Fetch Me If You Can: Evaluating CPU Cache Prefetching and Its Reliability on High Latency Memory
Authors: Fabian Mahling, Marcel Weisgut, Tilmann Rabl
Abstract:
Memory can be located close to a CPU, at remote sockets, or on devices connected via interconnects such as CXL or NVLink. A larger distance between memory and a core accessing the memory usually results in higher access latency. Software prefetching algorithms claim to hide memory access latencies by moving data to the CPU cache before a core accesses the data. In this work, we analyze to what extent software prefetching can hide increased memory access latencies. We evaluate software prefetching on seven systems, each offering different memory technologies and access latencies. We show that prefetching can increase performance by up to 2.6x and 2.8x for B+-Tree and binary search workloads, respectively. We find that CPU fill buffers, which track L1 cache misses, and a workload's memory intensity dictate how much access latency can be hidden. CPUs implement prefetches differently. We introduce microbenchmarks that identify concrete target cache and eviction strategies for different prefetch localities across x86 and ARM architectures. When the fill buffers are full, CPUs either drop prefetches or halt until all can be executed. We refer to these behaviors as weak and strong prefetching reliability. We introduce microbenchmarks identifying a CPU's reliability. When prefetching 8 KiB B+-Tree nodes, weak reliability achieves a speedup of 2x while strong reliability degrades performance with a slowdown of 2.5x for lookup workloads.
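To give a rough idea of what software prefetching in a binary search can look like, here is a minimal sketch in C. It is not the implementation evaluated in the paper. It assumes GCC or Clang, whose `__builtin_prefetch(addr, rw, locality)` builtin issues a prefetch hint; the function name `lower_bound_prefetch` and the choice to prefetch both candidate midpoints of the next iteration are illustrative assumptions. The third argument (0 to 3) is the locality hint, which the compiler maps to a platform-specific prefetch instruction.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical helper, not taken from the paper.
 * Returns the index of the first element >= key in the sorted array
 * `data` of length `n`, or `n` if every element is smaller than key. */
size_t lower_bound_prefetch(const int64_t *data, size_t n, int64_t key) {
    assert(data != NULL || n == 0);

    size_t lo = 0;
    size_t hi = n;
    while (lo < hi) {
        size_t mid = lo + (hi - lo) / 2;

        /* The next iteration will probe one of these two positions,
         * depending on the outcome of the comparison below. */
        size_t left_mid = lo + (mid - lo) / 2;
        size_t right_mid = (mid + 1) + (hi - (mid + 1)) / 2;

        /* Read prefetch (rw = 0) with the highest locality hint (3).
         * Prefetches are hints: the CPU may drop them. */
        __builtin_prefetch(&data[left_mid], 0, 3);
        if (right_mid < n) {
            __builtin_prefetch(&data[right_mid], 0, 3);
        }

        if (data[mid] < key) {
            lo = mid + 1;
        } else {
            hi = mid;
        }
    }
    return lo;
}
```

Because both candidates are requested before the comparison resolves, one of the two prefetches is always wasted. Whether this trade-off pays off depends on the memory latency and on how many outstanding misses the CPU can track, which is the kind of question the paper investigates.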