Experimentally Evaluating Systems Performance

About the Speaker

Marcel Weisgut is a PhD student in Professor Rabl's Data Engineering Systems group at HPI. His research interests lie in the area of efficient data management with heterogeneous hardware. He is particularly interested in memory technologies such as CXL memory, memory management, and in-memory database systems.

Overview

Evaluating systems performance through benchmarks is essential to understand how well a system operates. Benchmarking is a vital task in computer science research as it enables researchers to validate new designs or identify bottlenecks. The lecture is based on the book Systems Benchmarking by Kounev et al. [1].

Benchmarking

A benchmark is a tool that, together with a methodology, is used to assess and compare the characteristics and performance of systems or components. Benchmarks in computer science have historically measured performance by evaluating the amount of useful work completed in relation to the time and resources spent. As a result of the demands of modern systems, their scope has expanded to include further aspects such as reliability, security, and energy efficiency. The entity being assessed, whether it is a complete system or a single component, is referred to as the system under test (SUT).

Types of Benchmarks

Benchmarks are categorised according to their methodology and scope. The primary distinction between these categories lies in the level of flexibility offered and the amount of effort needed to implement the benchmark. In general, benchmarks fall into two categories, specification-based and kit-based:

Specification-based benchmarks specify the necessary functions, input parameters, and anticipated results while defining a business problem. It is up to the user to implement these benchmarks. This method encourages creativity and innovation by allowing users to tackle problems in novel ways. However, it frequently requires a large amount of development work before it can be implemented. In academic research, for instance, a benchmark could specify a set of computational tasks, such as assessing the efficiency of a new algorithm, while keeping the implementation specifics open to promote creative solutions catered to the particular research requirements.
Kit-based benchmarks offer pre-built solutions that users can run immediately, trading flexibility for faster setup and lower costs. They make cross-system comparisons easier but are generally harder to customize and adapt to unique scenarios. For instance, industry-standard benchmarks such as SPEC CPU offer comprehensive, ready-to-run test suites that allow fast and consistent evaluation of system performance.

Every benchmark needs a program to run. Ideally that would be the real user’s application, since it would offer the most realistic representation of performance and usage patterns of the real world. However, this is mostly not feasible because of problems including hardware variations, restrictions, and the challenges of accurately replicating the same exact environment. Hence there are different types of benchmarks that vary in scope and focus:

Synthetic Benchmarks use artificial programs that are designed to replicate the features of an application. Although they are flexible and are able to test the system's limits in a controlled way, their simplified representation of memory and operation interactions may prevent them from accurately representing workloads encountered in the real world.
Microbenchmarks are designed to assess very specific components of a system, for example the memory management unit or the CPU's floating point performance. They are used to identify the maximum performance that a specific system component may achieve by isolating out small pieces of code or operations. They don't, however, depict how these separate parts work together and how they affect the system's overall performance.
Kernel Benchmarks highlight the most time-consuming part of an application, usually the core code segment. Although these benchmarks are small and lightweight, they could overlook how one component of the system interacts with another.
Application Benchmarks give a more realistic view of performance in real-world situations by using actual applications. However, these benchmarks frequently rely on smaller datasets or scaled-down versions of actual workloads because of constraints such as time and resource limits. Because of this reduction, the benchmark might not always accurately represent the demands and complexities found in the real environment, such as extensive memory usage or high network traffic.
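To make the microbenchmark idea more concrete, the following minimal sketch uses the timeit module from the Python standard library to time one small, isolated operation. The kernel `float_multiply_add` and the repetition counts are assumptions chosen for illustration and do not come from the lecture.

```python
import timeit


def float_multiply_add(n: int = 1_000) -> float:
    """Tiny floating-point kernel: the isolated operation under test (hypothetical)."""
    acc = 0.0
    for i in range(n):
        acc += i * 1.5
    return acc


CALLS_PER_REPETITION = 100

# repeat() returns one total duration in seconds per repetition;
# each repetition calls the kernel CALLS_PER_REPETITION times.
durations = timeit.repeat(float_multiply_add, number=CALLS_PER_REPETITION, repeat=5)

# The minimum is a common choice here, because larger values are usually
# caused by interference from other activity on the machine.
best_per_call = min(durations) / CALLS_PER_REPETITION
print(f"best time per call: {best_per_call * 1e6:.1f} microseconds")
```

A result like this describes only the isolated loop. As noted above, it says nothing about how the operation behaves inside a full application.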

Performance Benchmarking Strategies

Performance benchmarking aims to compare system performance under controlled conditions to ensure meaningful comparisons. A standardized environment typically includes consistent hardware, software, configurations, and workloads. In this context, "standardized" means that the conditions under which the benchmark is executed, such as the hardware, operating system, software stack, and configurations, are carefully defined and kept consistent across different tests. These are the two main strategies used:

Fixed-work benchmarks measure how long it takes to finish a set amount of work. These are suitable for evaluating efficiency by calculating execution rates. However, improving only one component of the system, such as the processor speed, memory bandwidth, or disk speed, often leads to diminishing returns because of bottlenecks in other areas. For example, if a CPU is made significantly faster but the system still relies on slow memory access, the overall performance gain will be limited by memory speed. This idea is captured by Amdahl's Law, which states that the maximum speedup is bounded by the fraction of time the improved component is in use.
Fixed-time benchmarks calculate how much work is finished in a specific length of time. They can be used to check the responsiveness and throughput of a system. Fixed-time benchmarks do not have an upper limit on performance improvements, because as the system becomes more efficient, it can process a greater volume of work within the same time period. However, the overall gains are still subject to system-wide bottlenecks.
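The bound described by Amdahl's Law can be made concrete with a small calculation. The sketch below uses the commonly stated form of the law, speedup = 1 / ((1 - p) + p / s), where p is the fraction of execution time affected by the improvement and s is the speedup of the improved component. The function names and the example values are assumptions for illustration.

```python
import math


def amdahl_speedup(fraction_improved: float, component_speedup: float) -> float:
    """Overall speedup when only part of the execution time is accelerated."""
    if not 0.0 <= fraction_improved <= 1.0:
        raise ValueError("fraction_improved must be between 0 and 1")
    if component_speedup <= 0.0:
        raise ValueError("component_speedup must be positive")
    return 1.0 / ((1.0 - fraction_improved) + fraction_improved / component_speedup)


def amdahl_limit(fraction_improved: float) -> float:
    """Upper bound on the overall speedup as the component speedup grows without limit."""
    if fraction_improved >= 1.0:
        return math.inf
    return 1.0 / (1.0 - fraction_improved)


# Hypothetical example: the CPU accounts for 60% of the run time.
for s in (2, 10, 1000):
    print(f"CPU {s}x faster -> overall {amdahl_speedup(0.6, s):.2f}x")
print(f"upper bound: {amdahl_limit(0.6):.2f}x")
```

Under these assumed numbers, even a very large CPU improvement cannot push the overall speedup beyond 2.5x, because the remaining 40% of the time is unaffected.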

Quality Criteria

For a benchmark to provide meaningful and reliable insights, it must meet several key quality criteria. These criteria help ensure that benchmarking results are accurate, fair, and applicable to real-world scenarios.

One of the most important quality criteria is relevance. A benchmark may be relevant in one scenario while being irrelevant in another. To provide useful information, benchmark designers should design benchmarks to fit intended use cases, while users must assess relevance based on their specific context. This assessment should include the breadth of applicability of the benchmark, as well as how relevant the benchmarked workload really is in the user's context. For instance, an XML parsing benchmark is highly relevant for XML parsing performance, but less so for enterprise server applications, and irrelevant for 3D graphics.

Another criterion is reproducibility. To provide credible evidence, a benchmark must produce consistent outcomes when run under the same conditions. This includes consistency between multiple runs in the same test environment and also the ability to replicate results in another, identical setup. Achieving that requires detailed descriptions of hardware, software, and configurations to allow others to recreate the test environment. Ideally, just using the same hardware and software should lead to perfectly consistent benchmark results. However, in reality there are additional factors that cause variability, such as thread scheduling, power management, or temperature changes.
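One practical way to support reproducibility is to record the test environment alongside the results and to quantify the run-to-run variability. The following is a minimal sketch using only the Python standard library; the stand-in `workload` function, the chosen environment fields, and the number of runs are assumptions, and a real report would typically need far more detail (hardware model, firmware, configuration).

```python
import platform
import statistics
import sys
import time


def describe_environment() -> dict:
    """Collect a few basic facts about the test environment."""
    return {
        "os": platform.platform(),
        "machine": platform.machine(),
        "python": sys.version.split()[0],
    }


def workload() -> int:
    """Hypothetical stand-in for the benchmarked workload."""
    return sum(i * i for i in range(50_000))


def run_repeated(runs: int = 10) -> list[float]:
    """Run the workload several times and return one duration per run (seconds)."""
    durations = []
    for _ in range(runs):
        start = time.perf_counter()
        workload()
        durations.append(time.perf_counter() - start)
    return durations


durations = run_repeated()
mean = statistics.mean(durations)
# Coefficient of variation: run-to-run spread relative to the mean.
cov = statistics.stdev(durations) / mean

print(describe_environment())
print(f"mean {mean * 1e3:.2f} ms, coefficient of variation {cov:.1%}")
```

A non-zero coefficient of variation on identical hardware and software reflects the additional factors mentioned above, such as thread scheduling or power management.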

The fairness criterion ensures that systems compete solely on their performance, under equal conditions and with comparable results. Since benchmarks always involve some simplifications, systems should not be allowed to exploit these simplifications in order to optimize specifically for the benchmark rather than for real-world usage. For this purpose, a benchmark can specify run rules that the SUT must fulfill. For instance, run rules often require hardware and software to meet certain standards. Portability is also a key aspect, requiring benchmarks to run across different systems or environments while still maintaining consistency and fairness. For instance, it might be necessary to restrict the compiler flags that can be used, reducing potential differences in the results. However, the number of run rules must be balanced. Too many constraints can exclude valid scenarios, while too few can lead to misleading results, where performance gains observed in the benchmark do not reflect real-world improvements because the benchmark's simplifications were exploited.

Verifiability ensures that benchmark results are accurate and trustworthy. That means the workload is carried out correctly and in compliance with the run rules. Well-designed benchmarks often include a self-validation feature to confirm that the workload behaves as expected. They may also verify that the outputs are actually correct, giving developers confidence that they did not accidentally break the correctness of a feature while trying to optimize it further.
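A self-validation step can be as simple as checking properties of the output before a result is accepted. The sketch below assumes a hypothetical sorting workload; the function names and the checks are illustrative and not taken from any particular benchmark.

```python
from collections import Counter


def generate_input(n: int) -> list[int]:
    """Deterministic pseudo-random input so that every run sees the same data."""
    return [(i * 7919) % 10007 for i in range(n)]


def workload(data: list[int]) -> list[int]:
    """Hypothetical workload under test: sorting."""
    return sorted(data)


def validate(data: list[int], result: list[int]) -> bool:
    """Self-validation: the output is ordered and holds exactly the input values."""
    ordered = all(a <= b for a, b in zip(result, result[1:]))
    return ordered and Counter(result) == Counter(data)


def run_benchmark(n: int = 1_000) -> list[int]:
    """Run the workload and reject the run if the output is not valid."""
    data = generate_input(n)
    result = workload(data)
    if not validate(data, result):
        raise RuntimeError("benchmark result failed self-validation")
    return result


result = run_benchmark()
print(f"validated {len(result)} output values")
```

With a check like this in place, an optimization that makes the workload faster but produces wrong output is rejected instead of being reported as an improvement.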

The last quality criterion is usability. Running a benchmark should be as easy and as practical as possible for the user. Self-validation can be an important feature as it gives the user confidence and trust in the outcomes. Usability also means that the test environments are practical and reproducible. This is essential because overly complicated or expensive setups can limit a benchmark's accessibility.

Metrics

To quantify the performance of the SUT and accurately interpret benchmarking results, appropriate metrics are required. There is a distinction between measurements, measures and metrics. Measurements are the raw, individual outcomes of the benchmark experiment, also called sample points. They may vary when the experiment is repeated. A measure captures the outcome of the experiment by mapping sample points to real numbers. Measurements typically represent some type of size, quantity, or count. Metrics are then computed from one or more measurements using statistics. They summarize measurements to reflect specific performance characteristics of the system.

Basic Performance Metrics

There are some fundamental measurements commonly used in benchmark experiments. These include count, which measures how often a specific event occurs, for example the number of database transactions executed within a specific time frame. Duration tracks the time interval required for a process or event, for instance the time taken to complete a single transaction. Another measurement could be the size of a parameter, such as the amount of data written to a database during a transaction.
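The three kinds of measurement can be collected in a single pass over a workload. The following minimal sketch assumes a hypothetical in-memory list standing in for a database; `run_transactions` and the payloads are illustrative names and values.

```python
import time


def run_transactions(payloads: list[bytes]) -> dict:
    """Execute stand-in transactions and record count, duration, and size."""
    store = []            # hypothetical stand-in for a database
    count = 0             # count: number of transactions executed
    durations_s = []      # duration: time taken by each transaction
    size_bytes = 0        # size: amount of data written

    for payload in payloads:
        start = time.perf_counter()
        store.append(payload)
        durations_s.append(time.perf_counter() - start)
        count += 1
        size_bytes += len(payload)

    return {"count": count, "durations_s": durations_s, "size_bytes": size_bytes}


measurements = run_transactions([b"x" * 100, b"y" * 250, b"z" * 50])
print(measurements["count"], "transactions,", measurements["size_bytes"], "bytes written")
```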

These fundamental measurements are typically not used directly, but are combined into more meaningful metrics using specific formulas. One example is the rate metric, which is calculated by normalizing event counts over a common time interval. Rate metrics allow a fair comparison of performance across different time frames. They typically represent some type of speed. In this context, speed reflects the amount of work completed within a given measurement interval. Examples of work could be web requests sent through a browser, database transactions, or network operations, such as transferring data packets. To compare the speed of two systems, the speedup or the relative change can be calculated. Speedup measures how much faster one system is compared to another. Relative change quantifies the percentage improvement of one system over another.
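As a small worked example of these formulas, the sketch below computes rates from counts and intervals, and then the speedup and relative change between two systems. The transaction counts and intervals are made-up values, and the function names are assumptions for illustration.

```python
def rate(event_count: int, interval_s: float) -> float:
    """Normalize an event count over the measurement interval (events per second)."""
    return event_count / interval_s


def speedup(rate_new: float, rate_old: float) -> float:
    """How many times faster the new system is than the old one."""
    return rate_new / rate_old


def relative_change(rate_new: float, rate_old: float) -> float:
    """Improvement of the new system over the old one, as a fraction of the old rate."""
    return (rate_new - rate_old) / rate_old


# Hypothetical measurements taken over different time frames.
rate_a = rate(12_000, 60.0)   # system A: 12,000 transactions in 60 s
rate_b = rate(4_500, 15.0)    # system B: 4,500 transactions in 15 s

print(f"A: {rate_a:.0f} tx/s, B: {rate_b:.0f} tx/s")
print(f"speedup of B over A: {speedup(rate_b, rate_a):.2f}")
print(f"relative change: {relative_change(rate_b, rate_a):.0%}")
```

Although system A completed more transactions in total, normalizing by the interval shows that system B is the faster one under these assumed numbers.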

There are additional metrics critical for performance evaluation. Response time is defined as the time it takes a system to react to a request and provide a response. It includes any time spent waiting to access resources such as the CPU, storage devices, or network links, which is also called congestion time. Throughput is defined as the rate at which requests are processed. Utilization is defined as the fraction of time a resource such as the CPU, a network link, or a storage device is actively engaged in processing requests.
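These three metrics can be derived from a simple request log. The sketch below assumes a hypothetical log for a single resource in which every request records its arrival, the start of service, and its completion; the `Request` type, the observation interval, and the timestamps are illustrative.

```python
from dataclasses import dataclass


@dataclass
class Request:
    arrival: float  # time the request enters the system (s)
    start: float    # time the resource begins serving it (s)
    finish: float   # time the response is complete (s)


# Hypothetical log for one resource observed for 10 seconds.
OBSERVATION_S = 10.0
log = [Request(0.0, 0.0, 2.0), Request(1.0, 2.0, 3.0), Request(5.0, 5.0, 6.0)]

# Response time covers waiting (congestion) time plus service time.
response_times = [r.finish - r.arrival for r in log]
waiting_times = [r.start - r.arrival for r in log]
busy_time = sum(r.finish - r.start for r in log)

mean_response_time = sum(response_times) / len(log)
throughput = len(log) / OBSERVATION_S      # requests per second
utilization = busy_time / OBSERVATION_S    # fraction of time the resource is busy

print(f"mean response time: {mean_response_time:.2f} s")
print(f"throughput: {throughput:.1f} requests/s, utilization: {utilization:.0%}")
```

In this assumed log, the second request waits one second for the resource, so its response time is longer than its service time.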

From Measurements to Metrics

When measuring data, results can often vary due to several factors such as environmental conditions, system workload, or inherent randomness in the system being evaluated. This variability is a key characteristic of measurements, and it means that a single measurement might not fully capture the behavior of the system. Instead, multiple measurements are taken to form a sample, which provides a better representation of the underlying system property. The sample then needs to be summarized by an average value. There are different types of averages, most commonly mean, median, and mode. The mean, if not further specified, usually refers to the arithmetic mean, which is sensitive to outliers. The median, which is the middle value of the sorted sample, is less influenced by outliers. Finally, the mode identifies the most frequently occurring value.
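The difference between the three averages is easy to see on a small sample with one outlier. The sketch below uses the statistics module from the Python standard library; the response times are made-up values.

```python
import statistics

# Hypothetical response times in milliseconds; the last value is an outlier.
samples_ms = [10, 11, 10, 12, 10, 11, 95]

mean_ms = statistics.mean(samples_ms)      # pulled upwards by the outlier
median_ms = statistics.median(samples_ms)  # middle value of the sorted sample
mode_ms = statistics.mode(samples_ms)      # most frequently occurring value

print(f"mean {mean_ms:.1f} ms, median {median_ms} ms, mode {mode_ms} ms")
```

Here a single slow request roughly doubles the mean, while the median and mode stay close to the typical values.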

Composite metrics aggregate multiple metrics into a single value. These metrics can either represent multiple system properties, or the same property as measured under different conditions, for example response times under different workloads. They are often defined as the mean value of the underlying metrics. Three types of mean are most common for composite metrics. The arithmetic mean is suitable when the sum of the raw values has a physical meaning, as with durations. The harmonic mean is used to summarize rates, for example throughput. The geometric mean is appropriate when the product of the values has a physical meaning, for example when averaging speedups.
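The three means are available in the Python standard library, which makes it easy to compare them on small examples. The values below are made up for illustration, one set per use case named above.

```python
import statistics

# Durations: the sum (total time) is meaningful, so the arithmetic mean fits.
durations_s = [2.0, 4.0, 6.0]
arithmetic = statistics.fmean(durations_s)

# Rates: throughput in operations per second, summarized with the harmonic mean.
throughputs = [100.0, 400.0]
harmonic = statistics.harmonic_mean(throughputs)

# Speedups: the product is meaningful, so the geometric mean fits.
# A 2x speedup and a 0.5x slowdown cancel out.
speedups = [2.0, 0.5]
geometric = statistics.geometric_mean(speedups)

print(f"arithmetic mean of durations: {arithmetic:.1f} s")
print(f"harmonic mean of throughputs: {harmonic:.1f} ops/s")
print(f"geometric mean of speedups: {geometric:.2f}")
```

Note that the arithmetic mean of the two speedups would be 1.25, suggesting an overall improvement that the geometric mean does not report.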

While composite metrics simplify performance evaluation by aggregating multiple metrics into a single value, designing a good one can be challenging. Combining multiple metrics often results in a loss of detailed information. Also, appropriate weights for the individual performance metrics need to be defined carefully, as the resulting metric will be very sensitive to these weights. Another challenge is that systems are often specialized for specific types of workloads. Performance metrics for different workloads may capture distinct aspects of system behavior, and aggregating metrics across diverse workloads can lead to an oversimplified view in which critical differences between workloads are lost.

Often this aggregation happens on ratio metrics. Ratio metrics are defined as the ratio of two measured quantities A and B, expressed as A/B. Examples of ratio metrics are million instructions per second (MIPS), speedup, or cache miss rate. When aggregating such ratio metrics across benchmarks, the simple harmonic mean can be used when the numerator A (e.g., total work or instructions) is equally important across the benchmarks. The arithmetic mean, on the other hand, can be used when the denominator B (e.g., time or workload contribution) is equally important. However, simply averaging absolute execution times across workloads can lead to incorrect conclusions because longer-running workloads can disproportionately influence the average, skewing the results. This can mask the performance differences between systems, making one system appear better or worse than it actually is. To avoid bias from absolute values, relative performance improvements can be evaluated using speedup metrics. Speedup normalizes execution times by comparing them to a reference, making relative differences the focus. For example, doubling the speed of a workload has the same proportional impact on performance, no matter how long the workload originally took.
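The choice between harmonic and arithmetic mean can be checked against the overall rate, computed as total work divided by total time. The sketch below uses made-up numbers for two benchmarks; `overall_rate` is an illustrative helper, not a function from any library.

```python
import statistics


def overall_rate(work: list[float], time_s: list[float]) -> float:
    """Aggregate rate A/B over all benchmarks: total work divided by total time."""
    return sum(work) / sum(time_s)


# Case 1: equal numerator A. Both benchmarks perform 1000 operations.
work_equal = [1000.0, 1000.0]
time_case1 = [10.0, 2.5]
rates_case1 = [w / t for w, t in zip(work_equal, time_case1)]   # 100 and 400 ops/s

# Case 2: equal denominator B. Both benchmarks run for 10 seconds.
work_case2 = [1000.0, 4000.0]
time_equal = [10.0, 10.0]
rates_case2 = [w / t for w, t in zip(work_case2, time_equal)]   # 100 and 400 ops/s

print("equal work:", overall_rate(work_equal, time_case1),
      "harmonic mean:", statistics.harmonic_mean(rates_case1))
print("equal time:", overall_rate(work_case2, time_equal),
      "arithmetic mean:", statistics.fmean(rates_case2))
```

The individual rates are the same in both cases, yet the correct aggregate differs: with equal work it matches the harmonic mean, and with equal time it matches the arithmetic mean.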

But aggregating normalized values introduces another pitfall: rankings comparing multiple systems can change depending on which system is chosen as the baseline, which can make it difficult to determine which system performs best. The solution for this is the geometric mean. The geometric mean provides a consistent method for aggregating normalized values, as the resulting ranking is independent of which system serves as the baseline.
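The baseline pitfall can be reproduced with two systems and two workloads. The sketch below uses made-up execution times and, as an assumption, normalizes execution times (lower is better) rather than speedups; the same effect occurs with either. The names `times`, `normalized`, and `ranking` are illustrative.

```python
import statistics

# Hypothetical execution times in seconds for two workloads.
times = {"A": [2.0, 40.0], "B": [10.0, 10.0]}


def normalized(system: str, baseline: str) -> list[float]:
    """Execution times of a system relative to the baseline system."""
    return [t / b for t, b in zip(times[system], times[baseline])]


def ranking(mean_fn, baseline: str) -> list[str]:
    """Systems ordered from best to worst (lower normalized time is better)."""
    scores = {s: mean_fn(normalized(s, baseline)) for s in times}
    return sorted(scores, key=scores.get)


for baseline in times:
    print(f"baseline {baseline}:",
          "arithmetic", ranking(statistics.fmean, baseline),
          "geometric", ranking(statistics.geometric_mean, baseline))
```

With the arithmetic mean, whichever system is chosen as the baseline comes out on top. With the geometric mean, system A is ranked first under both baselines.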

Summary

Benchmarks are a way of evaluating and comparing the performance of different systems. There are various types of benchmarks. They can differ in their implementation approach, such as specification-based or kit-based benchmarks, or in the abstraction level on which they test a system. To measure performance there are different strategies, such as fixed-work and fixed-time. Several quality criteria describe what characterizes a good benchmark, including relevance, reproducibility, fairness, verifiability, and usability. Benchmarking results are given as metrics, which aggregate the experiment's measurements into meaningful and interpretable values. Calculating metrics can involve different types of averages and means. It is also possible to summarize metrics from multiple benchmarks into one composite metric.

References

[1] S. Kounev, K. D. Lange, and J. Von Kistowski, Systems Benchmarking: For Scientists and Engineers. Springer, 2020.
