Introduction of Speaker
The presenter for this session was Andreas Kipf, a professor at the University of Technology Nuremberg (UTN) and head of the Data Systems Lab. This lab focuses on exploring the convergence of hardware and data management systems. Before his position he was a researcher, at Amazon Web Services (AWS) working on the Amazon Redshift team a prominent cloud data warehouse offering. During his tenure at AWS he gained insight into internal data and system designs typically unavailable to scholars. Kipfs experience covers both research and practical cloud computing in industry. This blend is crucial for the talk as it enables him to contrast the approach of academic research, with the real-world challenges companies encounter. He is well-equipped to connect frameworks with the complexities found in production settings. The session concentrated on his research about database workloads and methods to enhance performance, via targeted optimizations.
Goal, Problem, Solution
Summary written by Erion Leka
The central objective of the study is to enhance the performance of data warehouses by bridging the divide between theoretical premises and practical application. Often, research relies on assumptions about how data is spread, based on ideas that have stayed the same for decades. The researchers sought to fix the mismatch where databases are designed around one pattern of usage but operated by clients in different manners. This mismatch may result in optimizations that appear effective in publications yet do not provide real benefits, in operational settings. The goal was to comprehend these usage behaviors to develop quicker systems.
The primary issue highlighted is that common benchmarks employed in research, like TPC-H or TPC-DS lack realism. These benchmarks are rigid and don’t match how modern cloud analytics actually works. Researchers have devoted years to tuning databases to excel at these particular evaluations. Kipf noted that these benchmarks miss the complexity of real situations. For example, synthetic benchmarks usually have simple, uniform queries, while real-world queries are often repeated a lot. This means traditional tests often ignore the ways a real system could be optimized. Moreover standard benchmarks emphasize integer computation whereas actual users handle volumes of text (strings). Although CPUs can process integers easily string manipulations demand more intricate logic and memory handling. As a result a database engine tuned for TPC-H could underperform when managing a real-world workload, on Amazon Redshift.
The suggested solution is composed of three components. This three-part approach tackles the problem from the data layer up to the execution engine. First, the team analyzed metadata from customer workloads and released a dataset called "RedSet" with over 400 instances to encourage other researchers to test their ideas using real data. Next, they built a benchmarking tool called "RedBench." This tool leverages data insights to create workloads that replicate the actions of real users. which uses data insights to create workloads that mimic real user behavior. Finally, they introduced workload-driven improvements, "Predicate Caching" and "Parachute," to make use of the patterns they discovered.
Background
To understand the background, the Amazon team analyzed the RedSet data. Access to a production dataset like this is rare in the database field and offers a unique view. The dataset contains metadata from queries run on over 400 Amazon Redshift instances in the US East 1 region. They examined details such as arrival timestamps, compilation wait times, and specific SQL commands. This detailed analysis provides insights into system performance that summary statistics often hide.
An important finding concerns query repetition. The data showed that users are very consistent in the queries they send to the database. In one cluster, about 75 percent of queries repeat ones from the past month. Such a high repetition rate suggests the system behaves as if it forgets previous computations. In some clusters, this rate exceeds 95 percent. This differs from benchmarks, where each query is treated as unique. In reality, users often run the same reports and dashboards daily, so treating every query as new usually causes redundant work. This insight challenges the assumption of query optimizers that every request is fresh.
The study also revealed the pattern of query arrivals. Traditional capacity planning usually assumes a smooth, sinusoidal activity pattern, but the real behavior is far more irregular. While it is often thought that databases are busy on weekdays and slow on weekends, the data showed this isn’t always true. Some clusters have steady traffic all day, while others show weekly spikes or unpredictable patterns. This makes scaling more difficult and complicates fixed resource allocation.
Regarding query types, the team found there isn’t a single common pattern. This diversity suggests that a universal setup probably won’t work well for the majority of users. Certain users mainly execute "Select" queries whereas one notable example revealed a cluster that ran Delete" queries. Moreover data changes tend to be "infrequent " with intervals where data is accessed but not altered offering potential caching advantages. In read-heavy situations, the cost of maintaining a cache is shared across many subsequent reads, making it an efficient approach.
In conclusion the background analysis highlighted the use of string data. Strings are famously challenging to optimize due to their length and encoding unlike fixed-size integers. While studies typically emphasize integers users commonly utilize string columns for identifiers and logs sometimes conducting "Join" operations on these strings. This implies that the database community must prioritize efficient string handling. Tackling this string- reality demands a change, in the design of storage engines and execution pipelines.
Main Ideas, Methodology, Approach
The approach was outlined in two parts: workload synthesis (RedBench) and enhancements (Predicate Caching and Parachute). This segment explained the particular technical difficulties encountered when converting the findings from RedSet into software solutions.
To carry out Workload Synthesis the team developed RedBench. Privacy remains an issue in cloud settings preventing the straightforward sharing of customer logs. Because actual customer SQL queries cannot be disclosed (because of privacy rules), RedBench functions in two modes. The "Matching" mode extracts metadata from a query and locates a similar "twin" within a public dataset such as IMDB. By matching queries with public datasets, they can reproduce the complexity of the query plan without revealing any sensitive company information. The "Generating" mode produces queries following specific rules to guarantee that the workload reflects the statistical properties of the actual data, including repetition cofrequencies, read/write and the general timing of query requests.
The initial enhancement, Predicate Caching tackles the issue of repeated queries. Given that queries recur 75% of the time recalculating results proves inefficient. Conventional caching methods tend to be too broad to work in this particular scenario. Unlike a "Result Cache" which is invalidated if even one row changes, Predicate Caching acts as a flexible, temporary index. It stores "predicates." Rather than keeping the entire result set, this method identifies which parts of the storage blocks are relevant for a specific filter. For example, when a query looks for orders with a discount, the system tracks which rows match.
To save memory the system was designed using a bit-vector (for example 1 bit per 1,000–16,000 tuples based on the setup). Maintaining details for each row would consume too much RAM. A set bit signals data, whereas a zero lets the system bypass the block. This approach is efficient, in terms of memory and additionally aids "Joins" by keeping track of Bloom filter outcomes enabling the system to omit rows that would not succeed in a join operation. While this estimation brings a risk of false positives the reduction in memory usage is significant and the impact, on performance is minimal.
The second enhancement, Parachute deals with "dangling tuples"—rows that are evaluated but ultimately rejected since they fail to find a match in a join. During processing, the database system naively examines rows only to later discover they don’t match any record. This method employs "Sideways Information Passing." Parachute enables information to move downward through the query plan by pre-calculating compact structures such as histograms, during data loading. For example when combining a "Movies" table (data spanning 1990-2020) with a "Cast Info" table Parachute applies a filter to exclude actor details outside that time frame prior to initiating the join. This early filtering step significantly narrows the search scope before the costly join operation starts. It demands, about 15% disk storage but cuts down the processing duration.
Main Results
The professor showed findings to confirm these methods. These tests were essential to prove that the theoretical improvements result in real-world time savings, for users.
Regarding Predicate Caching evaluations on benchmarks such as TPC-H demonstrated that the system is capable of reaching up to a 10x acceleration in optimal cases. Even typical cases exhibited enhancements. The primary reason for this performance boost is the decrease in I/O since the system avoids accessing data chunks from the disk by relying on the cached bit-vector. This significant acceleration validates that I/O frequently represents the bottleneck, in data warehousing. Nevertheless the build overhead resulted in a 1.5 percent decrease in speed. This overhead shows that a careful balance is needed, and the feature may need to be switchable depending on the workload. Additionally, caching join filters provided an extra 10 percent performance boost on Amazon Redshift.
For Parachute the team quantified the decrease in dangling tuples. Dangling tuples signify wasted CPU effort and memory bandwidth that contribute no benefit to the output. In a system such as DuckDB, roughly 25 percent of tuples processed are dangling. The findings indicated that Parachute alone could cut dangling tuples from 25% to 12% and down to, around 3% when supplemented with further sideways information passing. By excluding these early on the engine can concentrate its resources on the data that truly counts. This decrease results in execution speeds. The findings also showed that Parachute is more effective than methods such, as " Predicate Transfer" since it needs just one pass through the data.
During the Q&A session it was verified that Predicate Caching performs with string data since the method is independent of data type. It simply logs matches within a bit vector, which is advantageous due to the common occurrence of strings in actual workloads. This flexibility ensures that the optimization stays applicable, even as data schemas change over time.
Summary
Andreas Kipf’s lecture highlighted the gap between database assumptions and real-world industrial conditions. It served as a call for the database community to move beyond artificial benchmarks and concentrate on real applications. Analysis of the "RedSet" dataset showed that real workloads have repeated queries, heavy use of strings, and irregular timing. The "RedBench" tool was presented to assist developers in mimicking these circumstances.
The lecture showed that these trends can be used for enhancing performance. These particular cases reveal a concept: knowing your workload allows for customized, rather than generic solutions. "Predicate Caching" exploits repetition to avoid scanning data whereas "Parachute" relies on -calculated statistics to minimize redundant tasks, in joins. These represent examples of "workload-driven optimization."
Looking forward, the Data Systems Lab at UTN is exploring how Artificial Intelligence can improve data systems. Their focus is on "Semantic Query Processing ". It is designed to enable users to pose natural language inquiries that the database responds to by interpreting the data using AI models. This approach represents a shift from structural enhancements to a deeper, semantic understanding of the data.
Summary of lecture From Workload Synthesis to Workload-Driven Optimizationsy held by Prof. Dr. Andreas Kipf - Dec 2th 2025