Our paper "RMG Sort: Radix-Partitioning-Based Multi-GPU Sorting" by Ivan Ilic, Ilin Tolovski and Tilmann Rabl was accepted at BTW '23.
Abstract: In recent years, graphics processing units (GPUs) emerged as database accelerators due to their massive parallelism and high-bandwidth memory. Sorting is a core database operation with many applications, such as output ordering, index creation, grouping, and sort-merge joins. Many single-GPU sorting algorithms have been shown to outperform highly parallel CPU algorithms. Today’s systems include multiple GPUs with direct high-bandwidth peer-to-peer (P2P) interconnects. However, previous multi-GPU sorting algorithms do not efficiently harness the P2P transfer capability of modern interconnects, such as NVLink and NVSwitch. In this paper, we propose RMG sort, a novel radix partitioning-based multi-GPU sorting algorithm. We present a most-significant-bit partitioning strategy that efficiently utilizes high-speed P2P interconnects while reducing inter-GPU communication. Independent of the number of GPUs, we exchange radix partitions between the GPUs in one all-to-all P2P key swap and achieve nearly-perfect load balancing. We evaluate RMG sort on two modern multi-GPU systems. Our experiments show that RMG sort scales well with the input size and the number of GPUs, outperforming a parallel CPU-based sort by up to 20×. Compared to two state-of-the-art, merge-based, multi-GPU sorting algorithms, we achieve speedups of up to 1.3× and 1.8× across both systems. Excluding the CPU-GPU data transfer times and on eight GPUs, RMG sort outperforms the two merge-based multi-GPU sorting algorithms up to 2.7× and 9.2×.
Our extended abstract "What We Can Learn from Persistent Memory for CXL" by Lawrence Benson, Marcel Weisgut, and Tilmann Rabl was accepted at the NoDMC workshop at BTW '23.
Abstract: With high-capacity Persistent Memory (PMem) entering the long-established data center memory hierarchy, various assumptions about the performance and granularity of memory access have been disrupted. To adapt existing applications and design new systems, research focused on how to efficiently move data between different types of memory, how to handle varying access latency, and how to trade off price for performance. Even though Optane is now discontinued, we expect that the insights gained from previous PMem research apply to future work on Compute Express Link (CXL) attached memory. In this paper, we discuss how limited hardware availability impacts the performance generalization of new designs, how existing CPU components are not adapted towards different access characteristics, and how multi-tier memory setups offer different price-performance trade-offs. To support future CXL research in each of these areas, we discuss how our insights apply to CXL and which problems researchers may encounter along the way.