Hasso-Plattner-Institut
Prof. Dr. Tilmann Rabl
 

Andreas Zimmerer

Affiliation: UTN
Title: Pruning in Snowflake: Working Smarter, Not Harder

 

Abstract

Modern cloud-based data analytics systems must efficiently process petabytes of data residing on cloud storage. A key optimization technique in state-of-the-art systems like Snowflake is partition pruning, which skips chunks of data that do not contain relevant information for computing query results.
While partition pruning based on query predicates is a well-established technique, we present new pruning techniques that extend the scope of partition pruning to LIMIT, top-k, and JOIN operations, significantly expanding the opportunities for pruning across diverse query types. We detail the implementation of each method and examine their impact on real-world workloads.
Our analysis of Snowflake's production workloads reveals that real-world analytical queries exhibit much higher selectivity than commonly assumed, yielding effective partition pruning and highlighting the need for more realistic benchmarks. We show that this high selectivity can be harnessed with the min/max metadata available in modern data analytics systems and data lake formats like Apache Iceberg, reducing the number of processed micro-partitions by 99.4% across the Snowflake data platform.
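
To illustrate the basic idea behind predicate-based pruning with min/max metadata, here is a minimal sketch in Python. It is not Snowflake's implementation; the names PartitionStats and prune_for_range, the single integer column, and the sample values are all assumptions made for this example. A partition can be skipped whenever its [min, max] interval does not overlap the range requested by the query predicate.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class PartitionStats:
    """Hypothetical per-partition metadata for a single integer column."""
    name: str
    min_value: int
    max_value: int


def prune_for_range(partitions, low, high):
    """Return the partitions that may contain rows with low <= value <= high.

    A partition is skipped when its [min_value, max_value] interval
    does not overlap the query range [low, high].
    """
    return [
        p for p in partitions
        if p.max_value >= low and p.min_value <= high
    ]


# Illustrative metadata only; real systems keep such statistics per column.
partitions = [
    PartitionStats("p0", 1, 100),
    PartitionStats("p1", 101, 200),
    PartitionStats("p2", 201, 300),
    PartitionStats("p3", 150, 250),
]

# Predicate: WHERE value BETWEEN 120 AND 160
survivors = prune_for_range(partitions, 120, 160)
print([p.name for p in survivors])  # ['p1', 'p3']
```

Only the metadata is consulted here, so the two pruned partitions never have to be read from storage. The surviving partitions may still contain no matching rows, since min/max metadata can only rule partitions out, not confirm matches.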

Short CV

Andi Zimmerer is a doctoral researcher in the Data Systems Lab at the University of Technology Nuremberg, working with Prof. Andreas Kipf.
Before that, he worked for three years at Snowflake Inc. as a database software engineer on search optimization (an index for lookup queries), top-k query optimization [1][2], and data partition pruning.
He earned his M.Sc. with honors as part of the Software Engineering Elite Graduate Program from the Technical University of Munich (TUM), the University of Augsburg, and the Ludwig Maximilian University of Munich (LMU), where he specialized in data processing systems. For his thesis, he worked at MIT’s Data Systems Group under the supervision of Prof. Tim Kraska.