
Results

Currently, the cluster benchmark results of this project can be found in our paper (TBD). In addition, we evaluated some further individual benchmarks on the Pis and want to share those results here.

I/O Benchmarks

I/O is a key factor for the performance of a cluster that deals with data-intensive tasks. If the bandwidth is too low, transferring data into memory can become a huge bottleneck. Thus, we want to gain an understanding of the I/O performance of our cluster. As a first benchmark of read speed, we employ hdparm. While it is mainly a tool for HDD configuration, it also offers several benchmarking options. hdparm can measure the Linux buffer cache read performance (using option -T), but we are interested in disk performance. Thus, we run two different benchmarks:

  • hdparm -t: In this case, we measure buffered disk reads, i.e., the speed of reading from the disk through the buffer cache without any prior caching of data. The kernel's page cache is still used.
  • hdparm -t --direct: Using the additional --direct flag, we measure O_DIRECT disk reads. This option employs the kernel's O_DIRECT flag, which bypasses the page cache, leading to raw I/O into hdparm's buffers. An example invocation for each mode is shown below.
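To give an impression of how these measurements are run, the invocations look roughly as follows; the device path /dev/mmcblk0 is the usual name of the SD card on a Raspberry Pi and may differ on other systems:

  sudo hdparm -t /dev/mmcblk0           # buffered disk reads (through the page cache)
  sudo hdparm -t --direct /dev/mmcblk0  # O_DIRECT reads, bypassing the page cache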

We test the Raspberry Pi 4 as well as the Raspberry Pi 3B+. For each model, we test two kinds of MicroSD cards to see whether the choice of SD card makes a difference: the SanDisk Ultra and the Samsung Evo. For comparison, we also measure the performance with a USB 3.0 MicroSD card reader on a ThinkPad X1 Carbon (2017). The measurements give the following results:

Please note that the standard deviation is so low that we leave it out of the plot. We can see that the read speed measured by hdparm does not depend on the SD card used, but only on the hardware (Pi 3/Pi 4). We are probably saturating the Pi's available bandwidth with the sequential reads. The differences between the SD cards only become visible on the X1, where the Samsung Evo outperforms the SanDisk Ultra.

While hdparm offers interesting insights into read speed, we also want to measure some real-world performance. Thus, we measure the write speed using dd. Since we want to test random writes/reads later on, for this benchmark we write 50000 blocks of 8k size from /dev/zero to our SD card and see how the system behaves. Furthermore, the fsync option makes sure that we flush the buffers and really write the data to the disk.
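A sketch of the corresponding dd invocation; the output file name is just a placeholder and should point to a file on the SD card:

  dd if=/dev/zero of=./ddtest bs=8k count=50000 conv=fsync   # 50000 x 8k blocks, flushed to disk before dd exits

We get the following results: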

We can now see the same behaviour on the Pis that we previously saw only on the X1: the Samsung Evo outperforms the SanDisk Ultra. Moreover, the Pi 4 is a little faster than the Pi 3 on the same SD card.

Lastly, random read and write performance, which we measure using the tool iozone, is another useful performance indicator, especially when running on-disk databases on the hardware. In this benchmark, we combine several options of iozone to ensure a meaningful measurement:

  • -e: This forces iozone and the OS to flush the cache onto the disk.
  • -I: This - analogously to O_DIRECT with hdparm - tries to use direct I/O, if possible.
  • -a -s 100M -r 4k -i 0 -i 2: We use iozone's automatic mode, but with some restrictions. Our test file has a size of 100M, and our measurements use a record size of 4k. With the -i options, we specify that we are only interested in random reads/writes: 0 is always necessary to create the test file, 2 selects the random read/write test. The resulting full invocation is shown after this list.
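Combining these options, the full invocation looks as follows; since iozone creates its test file in the current working directory, it should be run from a directory on the SD card:

  iozone -e -I -a -s 100M -r 4k -i 0 -i 2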

The benchmark gives the following results:
 

As expected, the random reads are always a lot faster than the random writes. The Samsung Evo once again outperforms the SanDisk Ultra, and the Pi 4 outperforms the Pi 3 on read speed, while the bottleneck for writes seems to lie in the SD cards themselves. Even the X1 has a hard time achieving faster write speeds.
 

Memory Benchmarks

We also want to compare the memory speed of the Raspberry Pis. For this, we employ sysbench, a well-known benchmarking tool. In our configuration, we write/read 2G of data in chunks of 1M to/from RAM. Please note that different versions of sysbench are available; we ran the benchmarks using version 1.1.0-bcd950b (with the bundled LuaJIT 2.1.0-beta3), compiled directly from the git repository. For comparison, we also run the memory benchmark on a 5th gen ThinkPad X1 Carbon.
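A sketch of the matching invocations, assuming the option names of sysbench 1.x:

  sysbench memory --memory-block-size=1M --memory-total-size=2G --memory-oper=write run
  sysbench memory --memory-block-size=1M --memory-total-size=2G --memory-oper=read run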

The Pis offer the following memory read/write speeds:
 

Computational Benchmarks

Lastly, we want to gain an understanding of the computational power of the Raspberry Pis. For this, we first use the CPU benchmark of sysbench to compare the performance of the Pi 3B+ and Pi 4 in single- and multithreaded scenarios. For reference, we again also run the benchmark on a 5th gen ThinkPad X1 Carbon. After that, we use the well-known Linpack benchmark to obtain a floating point performance measurement.

For the sysbench benchmark, we calculate the prime numbers up to 20000 with either one or eight threads. Since sysbench 1.0, all tests run for a fixed duration of 10 seconds, so we do not compare execution times; instead, the events per second are the important measurement.
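A sketch of the two invocations, assuming sysbench 1.x option names (the 10-second duration is the default and therefore not passed explicitly):

  sysbench cpu --cpu-max-prime=20000 --threads=1 run
  sysbench cpu --cpu-max-prime=20000 --threads=8 run

We receive the following results: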

 

 

Now, we come to the Linpack benchmark. We use Roy Longbottom's port of Linpack to ARM. While mainly the port of the High Performance Linpack benchmark is used, Longbottom also ported Linpack to single precision. This is interesting because the SIMD units on the Pis only work in single precision mode here (the Pi 3B+ runs a 32-bit system, and on the Pi 4, Raspbian is a 32-bit OS that only supports SIMD in single precision).