Hasso-Plattner-Institut
Prof. Dr. Felix Naumann
 

Bachelor-Project ProCSIA: Column Store Benchmarks

Fabian Tschirschnitz, January 6, 2011

Authors: Philipp Langer, Florian Westphal, Marian Gawron, Andreas Henning, Fabian Tschirschnitz, Patrick Schulze, Gary Yao, Michael Wolowyk

Introduction

Roughly one week ago the bachelor project team “ProCSIA” (Profiling Column Stores with IBM’s Information Analyzer) evaluated six mainly column-oriented DBMSs in order to find out which one of them works fastest on a given OLAP workload and on simple profiling operations.
These were mostly open source projects (in the case of Sybase IQ 15.2 a free evaluation copy was used):

This test was conducted using a TPC-H benchmark (size: 1 GiB) as well as our own simple benchmark for data profiling-like queries. The second benchmark contains column-based operations, such as calculating a minimum (select min attr from table) or finding out about a column’s frequency distribution (select attr, count(*) from table group by attr). The TPC-H SQL queries as well as the queries used in our own benchmark are attached at the end of this blog post.
Please be aware of the fact that the trial versions of the DBMSs listed above are not comprehensive enough to make solid comparisons of actual performance. However, these trials provided us a hint on which column stores are suited best for our profiling tasks.
We ran the benchmark on a Dell Optiplex 745 with an Intel Core 2 CPU 6600 @ 2,40 GHz and 4 GB of RAM of which 3,25 GB could actually be used on a 32-bit operating system.

DISCLAIMER: The bachelor project team performed these tests to the best of their knowledge. However, they are no many-years database performance tweaking engineers. Also, they aimed to find out how fast the systems are in a none-tweaked mode.

Results

Benchmark 1

We did not execute the TPC-H benchmark on MetaKit as it does not provide an SQL interface. In the figures below you can see how each DBMS performed (for detailed information about the queries, see the end of this blog post). It is interesting to see how different DBMSs perform on this small amount of data. We plan to test the DBMSs with a 100 GiB TPC-H database to find out about scaling issues.

Figure 1: TPC-H 1G load times

Figure 2: Fast running TPC-H queries

Figure 3: Fast running TPC-H queries

TPC-H overall results:

MonetDB:14,571 sec | Power@Size: 8853,5
Vectorwise: 15,38 sec | Power@Size: 8252.1
InfiniDB: 108 sec | Power@Size: 890,5

Sybase: 262 sec | Power@Size: 448,4
Infobright: 10,36 h| Power@Size: 230,20

Infobright’s poor performance is due to the TPC-H query 21, which took 9 hours to execute. Query 21 contains complex subqueries demanding too much of the MySQL optimizer. This is an issue related to the MySQL optimizer rather than to Infobright’s implementation.

Benchmark 2

The positive impression MonetDB and Vectorwise made in the TPC-H benchmark was confirmed in our own benchmark. Infobright is very good at aggregations as it builds up a knowledge grid saving e.g. the maximum and minimum of most columns.

Figure 4: Calculating the average of a column

We then let the DBMSs calculate the minimum of a column (type varchar) directly followed by the calculation of a maximum. For example, this column was not in Infobright’s knowledge grid so the calculation of the minimum took rather long but the maximum was calculated very fast because of caching. The same phenomenon can be observed with the other databases, too (except MetaKit which works on flat files and does not provide any further optimizing).

Figure 5: Calculating the min/max of a column

The last test we conducted was a frequency distribution. For unknown reasons MonetDB took so long to execute this quite simple query that we had to abort the query.

Figure 6: Calculating the frequency-distribution of a column

Conclusion

In conclusion, we found that MonetDB, Vectorwise, and Infobright provided the most promising results for our application (while being open source products). As mentioned above we are still in the process of evaluating how the DBMSs work on a 100 GiB TPC-H database.