Hasso-Plattner-Institut
Prof. Dr. Tilmann Rabl

In-process OLAP

Hannes Mühleisen, CWI

Introduction to the speaker

Dr. Hannes Mühleisen is a senior researcher at the Centrum Wiskunde & Informatica (CWI) in Amsterdam and part of its Database Architectures group. His research focuses on OLAP systems and on making database systems easier to apply in practice. Together with his CWI colleague Mark Raasveldt, he created DuckDB, an open-source in-process SQL OLAP database management system. Besides developing DuckDB, he shares his knowledge by giving lectures on Big Data Systems and Data Engineering at CWI and holds a side position at MonetDB Solutions.

Summary

Written by David Matuschek, Selina Raschack & Julian Zabbarov

Introduction

To introduce what in-process Online Analytical Processing (OLAP) is, Dr. Hannes Mühleisen started his talk with a reflection on the client-server setup of database systems. He explained that it is reasonable nowadays to assume that most database management systems (DBMS) implement the so-called client-server model, in which data is stored in an external database (DB) while the application that processes and retrieves the information runs on the client side. So why is the client-server setup so common?


In Dr. Hannes Mühleisen's opinion, the reason for the dominance of the client-server model lies in the history of databases. After Sybase introduced its client-server system in 1984, this setup was the state of the art in database system design for years. The main reason is that the client-server model excels at handling Online Transaction Processing (OLTP) tasks, which require a DBMS to process a large number of simple queries at the same time. To address this high velocity of data in OLTP, DBMS like PostgreSQL implement the client-server model and are hence able to simultaneously handle requests from thousands of clients in real time. As many use cases today require serving multiple users at the same time, many DBMS implement the described client-server model.


Still, the client-server setup is not “god-given”, as Dr. Hannes Mühleisen put it. Due to the overhead of network communication between client and server, client-server systems are, for instance, not the fastest when it comes to running read queries. This is why in some OLTP use cases so-called in-process DBMS like SQLite are the preferred choice. In contrast to the client-server approach, in-process systems integrate the DBMS into the client application, so the communication between application and database requires no network connection. This way, the time for saving and retrieving information is reduced considerably.
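
To make the contrast concrete, here is a minimal Python sketch of the in-process model using SQLite (the file name is chosen for illustration): the database engine runs inside the application process, so no server and no network round trips are involved.

    # Minimal illustration of an in-process DBMS: SQLite runs inside this
    # Python process and stores everything in a local file.
    import sqlite3

    con = sqlite3.connect("example.db")  # illustrative file name
    con.execute("CREATE TABLE IF NOT EXISTS users (id INTEGER PRIMARY KEY, name TEXT)")
    con.execute("INSERT INTO users (name) VALUES (?)", ("Alice",))
    con.commit()
    print(con.execute("SELECT id, name FROM users").fetchall())
    con.close()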

Why Machine Learning needs in-process OLAP

The high query processing speed of in-process databases led Dr. Hannes Mühleisen to consider this approach for OLAP tasks. In contrast to OLTP, OLAP workloads consist of fewer but complex and long-running queries that analyse large amounts of data. For such tasks, the client-server setup is, in Dr. Mühleisen's opinion, “very problematic”. He argues that the communication protocols between the analysing processes on the client side and the data handling on the server side are not made for transferring “serious” amounts of data and are therefore not suitable for OLAP. This communication overhead is, in Dr. Mühleisen's opinion, also a major reason why the machine learning (ML) community, in which OLAP tasks are needed on a regular basis, has so far avoided DBMS. By storing and accessing the data of ML tasks in external databases, ML engineers would introduce a huge runtime overhead into their pipelines. Consequently, databases were often seen as bottlenecks and, hence, removed from ML workflows.


Past attempts to address this problem by “absorbing” machine learning functionality into the DB were, in the speaker's opinion, not successful. The main reason is that ML libraries for databases lagged years behind the state-of-the-art algorithms. An in-process integration of databases, on the contrary, would in Dr. Hannes Mühleisen's opinion allow the “DB/ML marriage” for OLAP tasks. By running the database inside the client process, ML engineers could use state-of-the-art ML algorithms while simultaneously accessing data without having to copy it from any server. This zero-copy data integration brings, besides the faster query speeds described above, a great amount of convenience with it. As in-process databases require no server management whatsoever, the hurdles for ML engineers to get a DBMS running and maintained are reduced dramatically.
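
As an illustration of this zero-copy integration, the following is a minimal sketch using DuckDB's Python API, which can query a pandas DataFrame that already lives in the application process (the data is made up for the example).

    # Minimal sketch of in-process analytics on data that already lives in the
    # ML pipeline: DuckDB resolves the DataFrame 'df' directly from the Python
    # scope, so nothing is shipped to a database server.
    import duckdb
    import pandas as pd

    df = pd.DataFrame({"label": [0, 1, 1, 0], "score": [0.1, 0.9, 0.8, 0.3]})
    result = duckdb.sql("SELECT label, avg(score) FROM df GROUP BY label").df()
    print(result)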

DuckDB

Convinced by the advantages of in-process databases for OLAP tasks, the possible scope of application in the ML community, and the lack of existing solutions in that field, Dr. Hannes Mühleisen and his colleague Dr. Mark Raasveldt created DuckDB. DuckDB is the first in-process OLAP DBMS and was initially released in 2018. Three years after its release, DuckDB is downloaded around 100,000 times per week and is starting to establish itself as the go-to choice for an in-process OLAP DBMS.

Characteristics of DuckDB

There are several reasons why DuckDB has become a popular choice when it comes to in-process OLAP DBMS. The most important reasons and characteristics are the following.

One-file-to-save-it-all

One of the biggest advantages of the SQLite system is that the whole database is stored within a single file. Inspired by this concept, the DuckDB team decided to also store the whole database within a single file.

Simple integration

The installation and integration of DuckDB should be as easy as possible to reduce the overhead for developers. For this reason, DuckDB needs no external dependencies, no compilation by the user, and no runtime setup. Each release bundles the whole DuckDB system into two files. When the system is integrated into an application, the database is not executed as a separate process but fully embedded into the host process.
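
A minimal sketch of both points, single-file storage and full embedding, using DuckDB's Python package (file and table names are chosen for illustration):

    # DuckDB is embedded in the host process; the whole database is one file.
    import duckdb  # installable via pip, no server setup required

    con = duckdb.connect("analytics.duckdb")  # the single database file
    con.execute("CREATE TABLE IF NOT EXISTS events (ts TIMESTAMP, value DOUBLE)")
    con.execute("INSERT INTO events VALUES (now()::TIMESTAMP, 42.0)")
    print(con.execute("SELECT count(*) FROM events").fetchone())
    con.close()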

Vectorized Processing

As already mentioned, DuckDB is designed to perform OLAP tasks. To meet the requirements of OLAP queries, DuckDB implements vectorized processing, which brings considerable advantages, especially in query execution time. How it works and how significant the advantages are is, amongst other things, discussed in the following section.

Technical concepts used by DuckDB

In the previous section we have seen that the primary intended use case for DuckDB is embedded analytics. Therefore, DuckDB stores information column-wise, which yields the following two benefits [1]: (i) OLAP queries typically touch only a few columns while fetching a large number of tuples for those columns, so only the relevant columns need to be read, and (ii) the storage format itself allows lightweight compression on a per-column basis.

Data storage and Vectorized query execution

Let's take a step back and briefly discuss the differences between row-based and column-based storage and the benefits of query processing that is optimized for each approach.

Row-oriented relational database systems like SQLite store data in tables that support row-based data fetching. This so-called tuple-at-a-time execution suits OLTP tasks but leads to interpretation overhead in OLAP tasks, and queries might not be compiled in a way that exploits CPU parallelism. Column-at-a-time execution, on the other hand, may be constrained by memory bandwidth, since full columns are materialized between operators. In their research published in 2005, Boncz et al. [2] analysed both approaches and identified their respective benefits and drawbacks. To overcome these challenges they introduced MonetDB/X100, which combines both concepts. This led to a pipelined query processing model in which vertical data fragments, so-called “vectors”, are passed between the operators during query execution.


DuckDB follows this vectorized processing, or “vector volcano”, architecture, as Dr. Hannes Mühleisen likes to summarize it. Internally, a table in DuckDB is a list of lists of vectors, where each list of vectors represents a logical column of data.
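
The following is a simplified, illustrative sketch of vector-at-a-time execution in Python, not DuckDB's actual code: an operator is interpreted once per vector instead of once per tuple, so the interpretation overhead is amortized over many values.

    # Illustrative sketch of vector-at-a-time ("vector volcano") execution.
    # DuckDB processes fixed-size vectors (2048 values by default) instead of
    # interpreting the query plan once per tuple.
    VECTOR_SIZE = 2048

    def add_constant(column, constant):
        # A logical column is a list of vectors; one interpretation step per
        # vector is amortized over VECTOR_SIZE values.
        for vector in column:
            yield [value + constant for value in vector]

    column = [list(range(VECTOR_SIZE)), list(range(VECTOR_SIZE))]  # two vectors
    result = list(add_constant(column, 1))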

Canonical Vector Format

One of the challenges the DuckDB team is working on right now is enabling their system to operate directly on compressed data. The major difficulty is that there are many different compression methods and many operators and functions supported by DuckDB. Even the vectors themselves can be stored differently to suit the underlying data, so database operators would have to handle each of the possible combinations.


Therefore, DuckDB transforms vectors into a canonical form. They call this method “orrification canonicalisation”, which normalizes a vector into the following format: (Define, Offsets, Data). The define mask is a boolean filter that marks where actual data is present (“true”) and where null values are (“false”). The offsets and data arrays contain the value offsets and the plain data values, respectively. Full details on the “orrification canonicalisation” will be published in an upcoming paper by the team behind DuckDB.
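
As a rough, hypothetical sketch of the idea, the names and layout below are assumptions for illustration (the precise format is only specified in the upcoming paper): a constant-compressed vector can be expanded into the canonical (Define, Offsets, Data) form as follows.

    # Hypothetical sketch of canonicalising a constant-compressed vector into
    # the (Define, Offsets, Data) format described above. Names and layout are
    # assumptions for illustration, not DuckDB's actual implementation.
    from dataclasses import dataclass
    from typing import List, Optional

    @dataclass
    class CanonicalVector:
        define: List[bool]   # True where a real value exists, False for NULL
        offsets: List[int]   # per-position offset into the data array
        data: List[int]      # plain data values

    def canonicalise_constant(value: Optional[int], length: int) -> CanonicalVector:
        if value is None:
            # All positions are NULL: define mask is all False, no data needed.
            return CanonicalVector([False] * length, [0] * length, [])
        # Every position points at the single stored value.
        return CanonicalVector([True] * length, [0] * length, [value])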

Concurrency Control

DuckDB implements a modified version of HyPer's fully serializable Multi-Version Concurrency Control (MVCC). With MVCC, each update results in a new version of the respective data object. Old versions of that object remain available and visible to readers while the update transaction proceeds concurrently. Read-only transactions therefore never have to wait for other transactions to finish. Full serializability ensures that two concurrent transactions do not update the same data object [3]. With this, DuckDB provides ACID compliance [4].


HyPer is a hybrid OLTP & OLAP database system that uses a tuple-at-a-time JIT-compiling query engine optimized for both OLTP and OLAP workloads [5]. It therefore keeps undo versions per data tuple. DuckDB, on the other hand, keeps undo versions per column chunk. This allows OLAP transactions to update column subsets efficiently in bulk.
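
A highly simplified sketch of this difference (hypothetical names, not DuckDB's actual implementation): an update on a column chunk records a single undo version holding the chunk's previous values, instead of one undo entry per tuple.

    # Hypothetical sketch of per-column-chunk undo versioning.
    class ColumnChunk:
        def __init__(self, values):
            self.values = values   # current version of this chunk
            self.undo = []         # previous versions, one entry per bulk update

        def bulk_update(self, fn):
            # One undo entry for the whole chunk, not one per tuple; readers of
            # older transactions can still see the previous version.
            self.undo.append(list(self.values))
            self.values = [fn(v) for v in self.values]

        def rollback(self):
            self.values = self.undo.pop()

    chunk = ColumnChunk(list(range(2048)))
    chunk.bulk_update(lambda v: v + 1)   # e.g. UPDATE ... SET i = i + 1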


To demonstrate DuckDB’s strength, Dr. Hannes Mühleisen presented a small MVCC benchmark for the query UPDATE tbl SET i=i+1, comparing DuckDB with MonetDB, HyPer, and SQLite. The benchmark shows not only that updates in DuckDB are far faster than in the other systems but also that their cost is independent of the number of columns that are updated: the updates execute in a near-constant amount of time. Full details on the MVCC implementation will be published in an upcoming paper by the team behind DuckDB.
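
The benchmark query can be re-created in miniature with DuckDB's Python API; the table size and shape below are assumptions, as the exact benchmark setup was not given.

    # Miniature re-creation of the benchmark query using DuckDB's Python API.
    import duckdb

    con = duckdb.connect()  # in-memory database
    con.execute("CREATE TABLE tbl AS SELECT range AS i, range AS j FROM range(1000000)")
    con.execute("UPDATE tbl SET i = i + 1")  # bulk update, versioned per column chunk
    print(con.execute("SELECT min(i), max(i) FROM tbl").fetchone())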

Challenges for in-process DBMS

Despite its advantages in processing speed and convenience, DuckDB also comes with challenges. Most of them stem from the fact that DuckDB, like all in-process DBMS solutions, is part of the application process and runs on the same hardware as the host process. The following problem is not limited to in-process DBMS, but since such systems are likely to run on consumer hardware rather than only on server hardware, it is a particular challenge for in-process systems.

Distrusting Hardware

The most important aspect of the environment in which a DBMS runs is the hardware it relies on. While CPUs and hard drives can check or correct errors themselves, components like RAM are likely to cause misbehavior. Unless users buy expensive error-correcting RAM, errors within the RAM are quite common. To process the data within the database, the DBMS is forced to load data into RAM, yet the RAM is unreliable. For example, 1 in 1700 crashes in a Microsoft system was caused by a single bit flip within the RAM [6]. Such uncorrectable errors directly lead to crashes or misbehavior like false results if they are not handled correctly. To test the behavior of a system under bit flips, techniques like memory fault injection can be used: bit flips are introduced on purpose and the resulting behavior of the DBMS is measured. While crashes or errors of the DBMS are part of the expected behavior, incorrect results for given queries would be the worst case, as the user of the DBMS has almost no chance of detecting them. To prevent this behavior, the inventors of DuckDB introduced the concept of memory fault mitigation.

Memory Fault Mitigation

The idea behind memory fault mitigation is to create checksums for intermediate data formats like vectors. Checksums are created and compared before and after an operation is performed. To gain even more control over the data in RAM, existing hashes can be reused: since DuckDB stores the database within a single file, the file is divided into parts, so-called “blocks”, and a hash is created for each of those blocks. After a block has been read by the system, its data is hashed again and the two hashes are compared. If they are not the same, the data in RAM does not match the data on the hard drive.
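
A minimal, hypothetical sketch of this block-level verification (block size and hash function are assumptions; DuckDB's actual scheme may differ):

    # Hypothetical sketch of block-level checksum verification.
    import hashlib

    BLOCK_SIZE = 256 * 1024  # assumed block size

    def split_into_blocks(file_bytes: bytes):
        # Divide the single database file into fixed-size blocks.
        return [file_bytes[i:i + BLOCK_SIZE]
                for i in range(0, len(file_bytes), BLOCK_SIZE)]

    def checksum(block: bytes) -> bytes:
        return hashlib.sha256(block).digest()

    def read_block_verified(block: bytes, stored_checksum: bytes) -> bytes:
        # Re-hash the block after reading it and compare with the stored hash.
        if checksum(block) != stored_checksum:
            raise IOError("checksum mismatch: data in RAM differs from data on disk")
        return block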

Summary

In ML processes, large amounts of data have to be consolidated and processed. The respective queries tend to be rather complex and long-running. In the field of database systems, this translates to so-called Online Analytical Processing. The nature of ML processes, usually being one big and long-running task themselves, makes the advantages of so-called in-process DBMS desirable, while the advantages of DBMS that implement the client-server model are less relevant. DuckDB therefore fills a gap as the first fully in-process DBMS optimized for OLAP. It features well-researched concepts like column-based data storage, vectorized query processing, and multi-version concurrency control. It also follows an architecture that makes it easy to integrate and use within ML processes. Apart from that, DuckDB addresses challenges like distrusting hardware by introducing checksums for intermediate data formats like vectors as a memory fault mitigation technique.


[1] https://www.infoq.com/articles/columnar-databases-and-vectorization/

[2] Peter A. Boncz, Marcin Zukowski, and Niels Nes. 2005. MonetDB/X100: Hyper-Pipelining Query Execution. In CIDR 2005, Second Biennial Conference on Innovative Data Systems Research, Asilomar, CA, USA, January 4-7, 2005. 225–237.

[3] Thomas Neumann, Tobias Mühlbauer, and Alfons Kemper. 2015. Fast Serializable Multi-Version Concurrency Control for Main-Memory Database Systems. In Proceedings of the 2015 ACM SIGMOD International Conference on Management of Data, Melbourne, Victoria, Australia, May 31 - June 4, 2015. 677–689. DOI: https://doi.org/10.1145/2723372.2749436

[4] Mark Raasveldt and Hannes Mühleisen. 2019. DuckDB: an Embeddable Analytical Database. In Proceedings of the 2019 International Conference on Management of Data (SIGMOD '19). Association for Computing Machinery, New York, NY, USA, 1981–1984. DOI: https://doi.org/10.1145/3299869.3320212

[5] Harald Lang, Tobias Mühlbauer, Florian Funke, Peter A. Boncz, Thomas Neumann, and Alfons Kemper. 2016. Data Blocks: Hybrid OLTP and OLAP on Compressed Storage using both Vectorization and Compilation. In Proceedings of the 2016 International Conference on Management of Data (SIGMOD '16). Association for Computing Machinery, New York, NY, USA, 311–326. DOI: https://doi.org/10.1145/2882903.2882925

[6] https://www.microsoft.com/en-us/research/wp-content/uploads/2016/02/eurosys84-nightingale.pdf