Prof. Dr. Tilmann Rabl

About the Speaker

Prof. Dr. Hannes Mühleisen is a co-creator of the DuckDB database management system and co-founder and CEO of DuckDB Labs, a consulting company providing services around DuckDB. Hannes is also Professor of Data Engineering at Radboud Universiteit Nijmegen. His main interest is - shockingly - analytical data management systems.

About the Talk

Analytical data management systems have long been monolithic monsters, far removed from the action by ancient protocols. Redesigning them to move into the application process greatly streamlines data transfer, deployment, and management. This new class of systems enables a whole new class of use cases, for example in-browser or edge OLAP, running SQL queries in lambdas, and Big Data on laptops.

Two Tier Architectures are Anachronistic

Summary written by Matthias Kind and Isabel Kurth

Goal, problem, solution

Today, databases commonly use so-called two-tier architectures with network-based client-server protocols to transfer queries, transactions, and data. Dr. Hannes Mühleisen sees the history of databases as the reason for the dominance of client-server models. Sybase, founded in 1984 (and part of SAP since 2010), built what can be seen as the first client/server database system, split across two machines. This structure can handle the high velocity of data in OLTP tasks and was therefore the dominant pattern in database management system design for years. However, these systems have some problems, such as long runtimes when transferring data and a difficult, lengthy setup for users. The client-server setup is slow for read queries because of the communication overhead between client and server. The aim of DuckDB is to reduce these runtimes and make the database system intuitive, simple, and uncomplicated to use for data scientists who need to run large queries. Dr. Hannes Mühleisen and his colleague Dr. Mark Raasveldt tackle this goal with a single-node, in-process OLAP database supporting the well-known SQL query language.


Database systems using a client-server protocol have the problem that transferring large amounts of data from the database to a client program over a network takes a lot of time. Mühleisen and Raasveldt conducted a study comparing the wall-clock time for retrieving a table over a loopback connection. [1] They tested popular DBMSs such as MySQL, PostgreSQL, MonetDB, and MongoDB. As a baseline, the authors measured the wall-clock time for netcat to transfer a CSV file of the same data. The study showed that the netcat transfer (10.25 s) is at least an order of magnitude faster than the common data management systems (MySQL: 101.3 s, MongoDB: 686.5 s).

To investigate the reasons for the long wall-clock times, the researchers looked at the protocols and the data serialization formats they use. In the PostgreSQL protocol, for instance, each row is transmitted as a separate control message. Analysis of the result-set serialization reveals that the actual data makes up only a small portion of the bytes on the wire, with the remainder being protocol overhead.
This overhead includes the total message length, the number of fields, and the length of each field, transmitted before the actual data and repeated for every row. Such a process is notably inefficient. [1]
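The per-row overhead can be illustrated with a short sketch that builds one message roughly the way the PostgreSQL wire format describes its DataRow message (this is a simplified reconstruction for illustration, not PostgreSQL's actual implementation):

```python
import struct

def pg_data_row(values: list[bytes]) -> bytes:
    """Serialize one row roughly like PostgreSQL's DataRow ('D') message:
    a 1-byte type tag, a 4-byte message length, a 2-byte field count,
    and a 4-byte length prefix before every field value."""
    body = struct.pack("!h", len(values))          # number of fields
    for v in values:
        body += struct.pack("!i", len(v)) + v      # per-field length + data
    # The message length counts itself (4 bytes) but not the type tag.
    return b"D" + struct.pack("!i", len(body) + 4) + body

row = [b"1", b"duckdb"]
msg = pg_data_row(row)
payload = sum(len(v) for v in row)
print(len(msg), payload)  # 22 bytes on the wire for 7 bytes of data
```

For this tiny row, less than a third of the transmitted bytes are actual data, and the framing is repeated for every single row in the result set.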

There are other approaches that aim to improve serialization speed and transfer size by various methods; examples are Apache Arrow, Protobuf, and Apache Avro. They often transfer a schema for the data once at handshake time and then use more efficient encodings because the data types are already known, as well as other tricks like variable-width integer encoding.
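Variable-width integer encoding can be sketched in a few lines; the snippet below uses the common base-128 varint scheme (as found in Protobuf, for instance) as a simplified illustration:

```python
def encode_varint(n: int) -> bytes:
    """Base-128 varint: 7 data bits per byte, high bit marks continuation."""
    out = bytearray()
    while True:
        byte = n & 0x7F
        n >>= 7
        if n:
            out.append(byte | 0x80)  # more bytes follow
        else:
            out.append(byte)
            return bytes(out)

# Small values need a single byte instead of a fixed 4 or 8:
print(len(encode_varint(5)), len(encode_varint(300)))  # 1 2
```

Since most integers in real data sets are small, this saves a large share of the bytes a fixed-width encoding would spend on leading zeros.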

Main ideas, methodology, approach

DuckDB was released in 2018 and has since become widely popular (400k views a month, 13k stars on GitHub) among in-process OLAP database management systems. Several characteristics, tools, and solutions are responsible for this success:


DuckDB has no external dependencies, and its Python client can be installed locally on any device with a single command: pip install duckdb. When used from one of its many supported programming languages, it runs directly in the host process and does not need to be executed separately.


DuckDB offers a rich feature set. It supports SQL as its query language, with advanced features like constraints and window functions and useful utilities for dates, timestamps, and text patterns.
Since importing large amounts of data plays a big role in OLAP workloads, there is a strong focus on making data import as easy and fast as possible: DuckDB supports CSV, JSON, and Parquet files as well as standard SQL INSERTs.

Stored in a single file

SQLite is a popular database system widely used for embedded OLTP tasks. It has no client-server structure and also works in-process. It served as the role model for a similar system aimed at OLAP tasks: DuckDB. In the SQLite system, the whole database is stored in a single file, an approach that DuckDB has adopted. This makes it easy to move, distribute, or back up the database and reduces the overhead for the library when reading the data.

Vectorized processing

However, the main problem the founders wanted to tackle was that conventional client-server systems were too slow due to serialization, while embedded solutions like SQLite were not suited for OLAP workloads. It was therefore important for DuckDB to be fast, which was achieved with several approaches. Since it is an in-process management tool, client-server protocols are completely eliminated, which also reduces the time needed to pull data. Furthermore, the authors built on state-of-the-art research on vectorized processing (like the X100 query engine [3]). During query execution, vertical fragments of columns (so-called vectors) are passed between the operators. In DuckDB, a table consists of a list of lists of vectors: each logical column of data is represented by a list of vectors.
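The idea of passing vectors between operators can be sketched in plain Python (greatly simplified compared to DuckDB's actual C++ engine): each operator consumes and produces fixed-size chunks of column values instead of single rows, amortizing per-call overhead over thousands of values.

```python
VECTOR_SIZE = 2048  # a fixed chunk size, in the spirit of X100/DuckDB

def scan(column, vector_size=VECTOR_SIZE):
    """Source operator: emit the column as a sequence of vectors."""
    for i in range(0, len(column), vector_size):
        yield column[i:i + vector_size]

def filter_gt(vectors, threshold):
    """Filter operator: processes a whole vector per call, not one row."""
    for vec in vectors:
        yield [v for v in vec if v > threshold]

def sum_agg(vectors):
    """Sink operator: aggregate over all incoming vectors."""
    return sum(sum(vec) for vec in vectors)

column = list(range(10_000))
result = sum_agg(filter_gt(scan(column), 9_000))
print(result)  # 9490500, the sum of 9001..9999
```

A row-at-a-time (Volcano-style) engine would instead make one function call per row per operator; batching the calls per vector is what makes the vectorized model cache- and interpreter-friendly.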

Use cases

As already mentioned, DuckDB aims at OLAP workloads that fit onto a single machine. In times of easy access to scalable distributed database offerings in the cloud, it is tempting to just go with an online SaaS solution. However, this can often be excessive and adds a lot of complexity as well as network and storage overhead.

The traditional paradigm of a centralized database infrastructure is undergoing a significant transformation due to the increasing power of devices such as laptops and smartphones. The emergence of powerful, portable computing devices has shifted where data processing occurs, challenging the conventional notion of a centralized database. This decentralization is reshaping Extract, Transform, Load (ETL) pipelines, as the traditional model designed for centralized data processing is adapted to distributed computing environments. The implications of this shift extend beyond the mere location of data storage, influencing data processing, accessibility, and the architecture of database systems. As a result, there is a growing need for innovative solutions to address the demands of a decentralized data ecosystem. DuckDB offers such a solution for decentralized data management.

With the increasing capabilities and storage capacities of personal devices, many workloads can be done locally. DuckDB is executed either via the CLI or linked into a variety of programming languages from a single dynamic library. Serialization for network or IPC transfer is not a problem here, as the database runs in the same memory space as the embedding program.
In Python, for example, it is possible to write a query result directly into a Pandas data frame without serialization. This eliminates a whole range of the aforementioned protocol problems and overheads. If several users want to use the same DuckDB database, read-only access is possible. What is currently not possible is write access from multiple processes, as this would require file-system-level syncing and locking mechanisms.

DuckDB is so lightweight that it can even run in a browser. For this, it is compiled from C++ to WASM (WebAssembly) and executed directly on the client without the need to install anything. This opens up interesting new opportunities: because the computation is offloaded to the clients, the provider of an OLAP tool does not need to run expensive servers.

Common tasks include analyzing or comparing local files, aggregations and joins over large tables, and easy sharing of datasets, as well as more uncommon tasks like stubbing databases such as BigQuery for local development [2].


In client-server architectures, it is always assumed that the database management system is in a fixed location. An in-process database management system, in contrast, lives on the local device. The database runs directly on the client's device, and there is no need for inefficient client-server protocols.
This in-process, local setup allows for data and compute decentralization. The data can be kept directly on the user's end device, processed locally using a well-known SQL dialect, and then used by the embedding process without serialization or network cost. From a privacy perspective, this is a promising approach: the data stays local instead of being centralized in the hands of companies.


  • [1] Raasveldt, Mark, and Hannes Mühleisen. "Don't hold my data hostage: a case for client protocol redesign." Proceedings of the VLDB Endowment 10.10 (2017): 1022-1033.
  • [2] Reddit: https://www.reddit.com/r/dataengineering/comments/zp6ai6/comment/j0rkjgu/?utm_source=share&utm_medium=web2x&context=3
  • [3] Boncz, P. A., Zukowski, M., & Nes, N. (2005, January). MonetDB/X100: Hyper-Pipelining Query Execution. In CIDR (Vol. 5, pp. 225-237).