Hyrise: The Open-Source In-Memory Research DBMS

General Information

Hyrise is the research in-memory database system that has been developed by our group since 2009 and has been entirely rewritten since 2016. Our goal is to provide a clean and flexible platform for research in the area of in-memory data management. Its architecture allows us, our students, and other researchers to conduct experiments around new data management concepts. To enable realistic experiments, Hyrise features comprehensive SQL support and performs powerful query plan optimizations. Well-known benchmarks, such as TPC-H or TPC-DS, can be executed with a single command and without any preparation.

Above the DBMS foundation of Hyrise, we build the “autonomous database”. The vision is to support database administrators (DBAs) in handling the growing complexity of not only the database systems themselves, but also that of the stored data and the workload. For this, Hyrise monitors runtime parameters, predicts the impact of possible configuration changes, and automatically applies those changes that are deemed beneficial. These changes may include the creation of secondary indexes, the (re-)encoding of columns, or the eviction of unused data to lower memory and storage tiers.

To foster reuse and reproducibility, Hyrise is completely open source and available on GitHub. We value high-quality code (C++20), as documented by strict code reviews, a test coverage of 90%, various linting and static code analysis tools, and a comment-to-code ratio of 1:4.

You are invited to read our Step-by-Step Guide and to contact us with any questions that may arise.

The project team consists of Martin Boissier, Stefan Halfpap, Keven Richly, Marcel Weisgut, Daniel Lindner, Dr. Michael Perscheid, and Prof. Dr. h.c. Hasso Plattner. We thank all student contributors, without whom this work would not have been possible.

Database Architecture

Hyrise consists of two parts. Firstly, the DBMS Foundation comprises the components that are necessary to store data and execute queries. Secondly, the Autonomous Database, which will be described below, is responsible for automatically tuning the system. The architecture diagram above visualizes these two parts.

Users can interact with Hyrise using one of three interfaces: First, the CLI Console offers features beyond those traditionally known from command line clients. These include the inline visualization of query plans in the form of annotated graphs. Second, Hyrise supports the PostgreSQL wire protocol and can thus be accessed using the psql client or compatible libraries. Finally, the benchmark binaries are a one-stop solution for executing different benchmarks and obtaining human- and machine-readable benchmark results.

Regardless of the interface used, SQL queries enter the SQL Pipeline, which transforms the query string into a logical query plan. This plan is then optimized, translated into a physical plan, and finally executed. We discussed the different optimization steps and quantified their impact here.

Hyrise stores table data in so-called chunks. A chunk is a fine-granular, horizontal partition of the table with a predefined number of rows. New rows are inserted into the last chunk of the table. Once this chunk reaches its target size, it is marked as immutable and a new mutable chunk is appended. Chunks are used as a flexible basis for indexes, filters, and statistics. Internally, chunks hold one segment per column of the table. This makes Hyrise a primarily column-oriented DBMS. Segments that are part of an immutable chunk may be encoded (i.e., compressed) asynchronously using one of several encoding schemes. By default, dictionary encoding is used.
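
The chunk and segment layout described above can be illustrated with a minimal Python sketch. It is not Hyrise code: the class names, the target_chunk_size parameter, and the sorted-dictionary layout in dictionary_encode are assumptions made for illustration only.

```python
from dataclasses import dataclass, field


@dataclass
class Chunk:
    """Horizontal partition holding one segment (a list of values) per column."""
    column_count: int
    segments: list = field(default_factory=list)
    immutable: bool = False

    def __post_init__(self):
        if not self.segments:
            self.segments = [[] for _ in range(self.column_count)]

    def row_count(self):
        return len(self.segments[0])


def dictionary_encode(values):
    """Return (sorted dictionary, list of dictionary offsets), an assumed layout."""
    dictionary = sorted(set(values))
    offsets = {value: i for i, value in enumerate(dictionary)}
    return dictionary, [offsets[v] for v in values]


class Table:
    def __init__(self, column_count, target_chunk_size):
        self.column_count = column_count
        self.target_chunk_size = target_chunk_size  # hypothetical parameter name
        self.chunks = [Chunk(column_count)]

    def append(self, row):
        chunk = self.chunks[-1]  # new rows always go to the last chunk
        for segment, value in zip(chunk.segments, row):
            segment.append(value)
        if chunk.row_count() == self.target_chunk_size:
            chunk.immutable = True  # a full chunk is frozen
            self.chunks.append(Chunk(self.column_count))  # and a new mutable one is opened


table = Table(column_count=2, target_chunk_size=3)
for row in [(1, "a"), (2, "b"), (3, "a"), (4, "c")]:
    table.append(row)

# Only segments of immutable chunks are candidates for encoding.
encoded = dictionary_encode(table.chunks[0].segments[1])
```

After the four inserts, the first chunk is immutable and its second segment can be encoded, while the fourth row sits in a new mutable chunk.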

Autonomous Database

The DBMS Foundation is the basis for our autonomous database. We support a number of tuning options that can be used to optimize the system’s performance. Among them are the automatic selection of encoding mechanisms, data-driven partitioning, and the automatic migration of data between tiers. Many of these are developed as parts of individual research projects. As such, they are subject to frequent changes. To facilitate the independent development of these tuning options, we have decoupled them from the Hyrise core and implement them in the form of plugins.

At the same time, many tuning options have shared requirements. For an efficient selection of encoding mechanisms, where less frequently accessed segments are compressed more heavily, the number of accesses to these segments has to be tracked. The same information is needed by the automatic tiering plugin. Overlaps between plugins can be found not only in their input data, but also in their internal mechanisms. For example, the mentioned plugins both aim at balancing two competing goals, i.e., reducing the DRAM footprint without negatively affecting the query throughput. In the long run, we plan for these shared requirements to be fulfilled by the driver in the Hyrise core.

The driver takes input parameters from runtime KPIs (e.g., the system utilization, the number of accesses to individual segments, and more), the constraints defined by the DBA, and the options provided by the different plugins. Based on these parameters, it makes decisions in a centralized manner. These decisions can then be realized by the different plugins.
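
To make the idea of a centralized decision concrete, here is a minimal Python sketch under stated assumptions. The Option fields, the decide function, and the single DBA constraint max_total_slowdown are hypothetical and do not reflect the actual driver interface. The sketch combines per-segment access counts (a runtime KPI), one constraint, and the options offered by two plugins.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Option:
    plugin: str                 # plugin that proposed (and would apply) the change
    segment: str                # hypothetical segment identifier
    bytes_saved: int            # estimated DRAM reduction, must be positive
    slowdown_per_access: float  # estimated extra cost per access


def decide(options, access_counts, max_total_slowdown):
    """Pick at most one option per segment, lowest slowdown per saved byte first."""
    def total_slowdown(option):
        return option.slowdown_per_access * access_counts.get(option.segment, 0)

    ranked = sorted(options, key=lambda o: total_slowdown(o) / o.bytes_saved)
    chosen, used, touched = [], 0.0, set()
    for option in ranked:
        cost = total_slowdown(option)
        if option.segment in touched or used + cost > max_total_slowdown:
            continue  # conflicting option or DBA constraint would be violated
        chosen.append(option)
        touched.add(option.segment)
        used += cost
    return chosen


access_counts = {"orders.comment": 2, "orders.price": 1000, "lineitem.note": 0}
options = [
    Option("encoding", "orders.comment", 4000, 0.5),
    Option("tiering", "orders.comment", 8000, 2.0),
    Option("encoding", "orders.price", 2000, 0.5),
    Option("tiering", "lineitem.note", 1000, 2.0),
]
chosen = decide(options, access_counts, max_total_slowdown=10.0)
```

Because the decision is made in one place, the two plugins cannot both act on the same segment, and the frequently accessed orders.price segment is left untouched.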

Hyrise Cockpit Demonstration

In practice, automated database administration approaches are often distrusted. The Hyrise cockpit provides means for database administrators and developers to experiment with plugins that tune database systems autonomously. These experiments should lead to a better understanding of how such approaches work and, in the end, increase trust in such solutions. The Hyrise cockpit aims to create confidence in autonomous solutions by allowing users to compare the performance of conventionally and autonomously configured systems side by side under complex workloads. The cockpit is demonstrated in the video below, which was presented at ICDE 2021.

Embedded Vimeo video

Research Activities

The scope of our research includes (I) data compression and tiering, (II) data replication and scale-out, (III) index selection, and (IV) the joint tuning of these approaches for both relational and spatio-temporal workloads. We also explore (V) the usage of data dependencies in the context of query optimization.

I Data Compression & Tiering

Data compression and tiering are powerful methods to address the memory bottleneck and cost inefficiencies of in-memory databases. Automatically deciding which data compression technique to use in in-memory column stores is challenging due to trade-offs and non-obvious impacts on large workloads. We propose a solution for the automatic selection of budget-constrained encodings in Hyrise, based on linear programming (LP) and greedy heuristics. The encoding configurations are robust with respect to runtime performance, adaptable, and workload-aware. To ensure performance robustness, LP techniques are applied to achieve equally distributed performance gains over all queries. The results show the potential for significant memory budget reductions without a deterioration of runtime performance.
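
A budget-constrained greedy selection can be sketched as follows. This is a generic heuristic written for illustration, not the LP model or the heuristic from our publications; the encoding names, sizes, and runtime costs are made-up inputs, and the runtime cost is assumed to be already weighted by access frequency.

```python
def select_encodings(candidates, budget):
    """candidates: {segment: {encoding: (size_bytes, runtime_cost)}}.

    Start from the fastest encoding per segment, then repeatedly apply the
    switch that saves the most bytes per unit of added runtime until the
    configuration fits into the memory budget.
    """
    config = {seg: min(encs, key=lambda e: encs[e][1]) for seg, encs in candidates.items()}

    def total_size():
        return sum(candidates[seg][enc][0] for seg, enc in config.items())

    while total_size() > budget:
        best = None
        for seg, encs in candidates.items():
            cur_size, cur_cost = encs[config[seg]]
            for enc, (size, cost) in encs.items():
                saved = cur_size - size
                if saved <= 0:
                    continue  # only consider switches that free memory
                ratio = saved / max(cost - cur_cost, 1e-9)
                if best is None or ratio > best[0]:
                    best = (ratio, seg, enc)
        if best is None:
            raise ValueError("budget cannot be met with the given encodings")
        config[best[1]] = best[2]
    return config


# Hypothetical inputs: a frequently accessed and a rarely accessed segment.
candidates = {
    "hot": {"unencoded": (100, 1.0), "dictionary": (60, 1.2), "lz4": (30, 5.0)},
    "cold": {"unencoded": (100, 0.1), "dictionary": (60, 0.12), "lz4": (30, 0.5)},
}
config = select_encodings(candidates, budget=100)
```

With these inputs, the rarely accessed segment ends up with the heaviest compression while the frequently accessed one keeps a cheaper encoding, which matches the intuition that less frequently accessed segments are compressed more heavily.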

Similarly, data tiering promises to reduce the amount of data in main memory by moving infrequently used data to cheaper and more elastic lower memory and secondary storage tiers. The challenge is to find an optimal balance for the trade-off between performance and costs. We propose an automatic tiering approach for Hyrise, based on LP, that addresses this challenge. Our approach tracks the frequency and pattern of data accesses to identify rarely used data, which is moved to secondary memory tiers (e.g., NVM or SSDs). This method is applicable to column selection problems in general and ensures Pareto-efficiency for varying memory budgets. Since aspects like selectivity, size, and frequency of queries are taken into account, the resulting performance is optimized and outperforms other heuristics.

II Replication & Scale-out

Database replication and query load-balancing are important mechanisms to scale query throughput. The analysis of workloads allows load-balancing queries to replica nodes according to their accessed data. As a result, replica nodes must only store and synchronize subsets of the data. However, evenly balancing the load of large-scale workloads while minimizing the memory footprint is complex and challenging. Moreover, state-of-the-art allocation approaches are either time-consuming or produce allocations that are not memory-efficient. In our work, we used LP-based decomposition techniques to determine optimized data placements and workload distributions. We extended these solutions to account for potential node failures. Further, we derived a heuristic to compute robust solutions for large, real-life workload instances, providing competitive performance for different potential as well as uncertain workload scenarios.
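
The following Python sketch illustrates the underlying allocation problem with a simple greedy strategy. It is not the LP-based decomposition described above; the allocate function, the query loads (in percent), and the fragment sizes are assumptions chosen for illustration.

```python
def allocate(queries, sizes, node_count, slack=0):
    """queries: {name: (load_share, set_of_fragments)}; sizes: {fragment: bytes}.

    Assign queries (heaviest first) to the node that needs the least additional
    data, considering only nodes that stay within an even load share.
    """
    limit = sum(load for load, _ in queries.values()) / node_count + slack
    nodes = [{"load": 0, "data": set(), "queries": []} for _ in range(node_count)]
    for name, (load, fragments) in sorted(queries.items(), key=lambda q: -q[1][0]):
        def extra_bytes(node):
            return sum(sizes[f] for f in fragments - node["data"])

        # Fall back to all nodes if none can take the query within the limit.
        fitting = [n for n in nodes if n["load"] + load <= limit] or nodes
        target = min(fitting, key=lambda n: (extra_bytes(n), n["load"]))
        target["load"] += load
        target["data"] |= fragments
        target["queries"].append(name)
    return nodes


sizes = {"orders": 40, "lineitem": 100, "customer": 10}
queries = {
    "q1": (40, {"lineitem"}),
    "q2": (30, {"orders", "customer"}),
    "q3": (20, {"orders"}),
    "q4": (10, {"lineitem"}),
}
nodes = allocate(queries, sizes, node_count=2)
stored = sum(sizes[f] for node in nodes for f in node["data"])
```

In this toy instance, both nodes carry half of the load, but together they store 150 bytes instead of the 300 bytes that full replication would require.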

III Index Selection

Indexes are essential for the efficient processing of database workloads. However, as we found in our survey, which evaluates state-of-the-art approaches using our open-source evaluation platform, some index selection algorithms are either not fast or not highly competitive. To overcome the observed limitations of existing approaches, we developed three new index selection algorithms, serving different purposes: (i) EXTEND determines (near-)optimal solutions with an iterative heuristic. The produced solutions outperform others in most evaluated cases while the selection runtime is up to 10× lower. (ii) SWIRL is based on reinforcement learning (RL) and, after training, delivers solutions instantaneously. SWIRL decreases selection runtimes by orders of magnitude, while the solution quality is within 2% of the best solutions. While EXTEND is universally applicable with a high solution quality, SWIRL requires training but reduces runtimes. (iii) Our decomposition concept for solver-based index selection approaches makes it possible to deal with larger candidate sets and to address risk-averse problem versions, where multiple potential future workloads are taken into account.
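
For readers unfamiliar with the problem, the following Python sketch shows a generic benefit-per-size greedy index selection under a storage budget. It is neither EXTEND nor SWIRL; the toy cost model in workload_cost, the column names, and all numbers are assumptions made for illustration.

```python
def workload_cost(workload, indexes):
    """Toy what-if cost model: a query costs `scan` without a usable index, `seek` with one."""
    return sum(
        freq * (seek if column in indexes else scan)
        for column, freq, scan, seek in workload
    )


def select_indexes(workload, index_sizes, budget):
    """Repeatedly add the index with the highest cost reduction per byte."""
    chosen, used = set(), 0
    while True:
        base = workload_cost(workload, chosen)
        best = None
        for column, size in index_sizes.items():
            if column in chosen or used + size > budget:
                continue
            gain_per_byte = (base - workload_cost(workload, chosen | {column})) / size
            if gain_per_byte > 0 and (best is None or gain_per_byte > best[0]):
                best = (gain_per_byte, column, size)
        if best is None:
            return chosen  # nothing fits or nothing helps anymore
        chosen.add(best[1])
        used += best[2]


# Hypothetical workload: (filtered column, frequency, scan cost, index seek cost).
workload = [
    ("o_orderdate", 100, 50.0, 1.0),
    ("l_shipdate", 10, 200.0, 2.0),
    ("c_name", 1, 20.0, 1.0),
]
index_sizes = {"o_orderdate": 40, "l_shipdate": 100, "c_name": 10}
selected = select_indexes(workload, index_sizes, budget=120)
```

Here the index on l_shipdate is skipped because it no longer fits into the budget after the more beneficial index on o_orderdate was chosen, which hints at why simple greedy strategies can miss better combinations.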

IV Joint Tuning & Spatio-Temporal Decisions

Challenges for self-driving database systems, which tune their physical design and configuration autonomously, are manifold: such systems have to anticipate future workloads, find robust configurations efficiently, and incorporate knowledge gained by previous actions into later decisions. We present a theoretical, component-based framework for self-driving database systems that enables database integration and development of self-managing functionality with low overhead, by relying on separation of concerns. We started to implement joint tuning approaches in Hyrise, accounting for combined indexing, sorting, and compression configurations for spatio-temporal applications.

V Data Dependencies

Efficient query optimization is usually based on metadata, such as cardinalities and other basic statistics. More advanced techniques consider data dependency types, such as functional, uniqueness, order, or inclusion constraints / dependencies. We identified 60 query optimization techniques for application areas like join, selection, sorting and set operations in the literature that are based on data dependencies.

Toward an efficient implementation and integration into commercial database systems, we laid out a vision for a workload-driven discovery system for query optimization. The dependency discovery is considered “lazy” since only those data dependency candidates are considered that are relevant for the observed workload. Our prototypical implementation in Hyrise identifies relevant data dependency candidates based on executed query plans and dynamically validates the candidates against the database, leading to performance improvements.
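
The validation step can be sketched in a few lines of Python. This is an illustrative assumption of how candidates might be checked, not the Hyrise implementation: the candidate tuples, the validate function, and the sample rows are hypothetical.

```python
def is_unique(rows, column):
    """A column is unique if no value occurs twice."""
    values = [row[column] for row in rows]
    return len(values) == len(set(values))


def holds_fd(rows, determinant, dependent):
    """Check the functional dependency determinant -> dependent."""
    seen = {}
    for row in rows:
        # setdefault stores the first dependent value seen for each determinant value.
        if seen.setdefault(row[determinant], row[dependent]) != row[dependent]:
            return False
    return True


def validate(rows, candidates):
    """Keep only the candidates that hold on the current data.

    candidates: tuples such as ("unique", col) or ("fd", lhs, rhs), assumed to
    have been derived from observed query plans.
    """
    checks = {"unique": is_unique, "fd": holds_fd}
    return [c for c in candidates if checks[c[0]](rows, *c[1:])]


rows = [
    {"id": 1, "zip": "14482", "city": "Potsdam"},
    {"id": 2, "zip": "14482", "city": "Potsdam"},
    {"id": 3, "zip": "10115", "city": "Berlin"},
]
candidates = [
    ("unique", "id"),
    ("unique", "zip"),
    ("fd", "zip", "city"),
    ("fd", "city", "id"),
]
valid = validate(rows, candidates)
```

Only the candidates that the workload made relevant are checked, which is what makes the discovery lazy compared to profiling every column combination up front.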

Additional Resources

Besides our publications (see below), we are also documenting our progress with Hyrise in the Hyrise Wiki, on our Medium Blog and on the Hyrise Twitter channel.

Publications

Riekenbrauck, N., Weisgut, M., Lindner, D., Rabl, T.: A Three-Tier Buffer Manager Integrating CXL Device Memory for Database Systems. Joint International Workshop on Big Data Management on Emerging Hardware and Data Management on Virtualized Active Systems @ ICDE 2024 (2024).

Richly, K., Schlosser, R., Boissier, M.: Budget-Conscious Fine-Grained Configuration Optimization for Spatio-Temporal Applications. Proceedings of the VLDB Endowment. pp. 4079–4092 (2022).

Weisgut, M., Ritter, D., Boissier, M., Perscheid, M.: Separated Allocator Metadata in Disaggregated In-Memory Databases: Friend or Foe? 1st Workshop on Composable Systems (COMPSYS@IPDPS), awarded as best paper (2022).

Kossmann, J., Lindner, D., Naumann, F., Papenbrock, T.: Workload-driven, Lazy Discovery of Data Dependencies for Query Optimization. Proceedings of the Conference on Innovative Data Systems Research (CIDR) (2022).

Kossmann, J., Papenbrock, T., Naumann, F.: Data dependencies for query optimization: a survey. VLDB Journal (2021).

Lindner, D., Loeser, A., Kossmann, J.: Learned What-If Cost Models for Autonomous Clustering. New Trends in Database and Information Systems - ADBIS 2021 Short Papers, Doctoral Consortium and Workshops, Tartu, Estonia. pp. 3–13 (2021).

Kossmann, J., Boissier, M., Dubrawski, A., Heseding, F., Mandel, C., Pigorsch, U., Schneider, M., Schniese, T., Sobhani, M., Tsayun, P., Wille, K., Perscheid, M., Uflacker, M., Plattner, H.: A Cockpit for the Development and Evaluation of Autonomous Database Systems. 37th IEEE International Conference on Data Engineering, ICDE. pp. 2685–2688 (2021).

@inproceedings{kossmann2021cockpit,
  abstract = {Databases are highly optimized complex systems with a multitude of configuration options. Especially in cloud scenarios with thousands of database deployments, determining optimized database configurations in an automated fashion is of increasing importance for database providers. At the same time, due to increased system complexity, it becomes more challenging to identify well-performing configurations. Therefore, research interest in autonomous or self-driving database systems has increased enormously in recent years. Such systems promise both performance improvements and cost reductions. In the literature, various fully or partially autonomous optimization mechanisms exist that optimize single aspects, e.g., index selection. However, database administrators and developers often distrust autonomous approaches, and there is a lack of practical experimentation opportunities that could create a better understanding. Moreover, the interplay of different autonomous mechanisms under complex workloads remains an open question. The presented cockpit enables an interactive assessment of the impact of autonomous components for database systems by comparing (autonomous) systems with different configurations side by side. Thereby, the cockpit enables users to build trust in autonomous solutions by experimenting with such technologies and observing their effects in practice.},
  author = {Kossmann, Jan and Boissier, Martin and Dubrawski, Alexander and Heseding, Fabian and Mandel, Caterina and Pigorsch, Udo and Schneider, Max and Schniese, Til and Sobhani, Mona and Tsayun, Petr and Wille, Katharina and Perscheid, Michael and Uflacker, Matthias and Plattner, Hasso},
  booktitle = {37th IEEE International Conference on Data Engineering, ICDE},
  keywords = {adm autonomous database hyrise in-memory_database mboissierselected myown self-driving self-managing},
  pages = {2685-2688},
  title = {A Cockpit for the Development and Evaluation of Autonomous Database Systems},
  year = 2021
}

Kossmann, J., Halfpap, S., Jankrift, M., Schlosser, R.: Magic mirror in my hand, which is the best in the land? An Experimental Evaluation of Index Selection Algorithms. Proceedings of the VLDB Endowment. pp. 2382–2395 (2020).

Schlosser, R., Halfpap, S.: A Decomposition Approach for Risk-Averse Index Selection. 32nd International Conference on Scientific and Statistical Database Management (SSDBM 2020). pp. 16:1–16:4 (2020).

Kossmann, J., Schlosser, R.: Self-driving database systems: a conceptual approach. Distributed and Parallel Databases. 38 (4), 795–817 (2020).

Dreseler, M., Boissier, M., Rabl, T., Uflacker, M.: Quantifying TPC-H Choke Points and Their Optimizations. Proceedings of the VLDB Endowment. pp. 1206–1220 (2020).

Dreseler, M., Kossmann, J., Boissier, M., Klauck, S., Uflacker, M., Plattner, H.: Hyrise Re-engineered: An Extensible Database System for Research in Relational In-Memory Data Management. 22nd International Conference on Extending Database Technology (EDBT). pp. 313–324 (2019).

@inproceedings{dreseler2018,
  abstract = {Research in data management profits when the performance evaluation is based not only on individual components in isolation, but uses an actual DBMS end-to-end. Facilitating the integration and benchmarking of new concepts within a DBMS requires a simple setup process, well-documented code, and the possibility to execute both standard and custom benchmarks without tedious preparation. Fulfilling these requirements also makes it easy to reproduce the results later on. The relational open-source database Hyrise (VLDB, 2010) was presented to make the case for hybrid row- and column-format data storage. Since then, it has evolved from being a single-purpose research DBMS towards becoming a platform for various projects, including research in the areas of indexing, data partitioning, and non-volatile memory. With a growing diversity of topics, we have found that the original code base grew to a point where new experimentation became unnecessarily difficult. Over the last two years, we have re-written Hyrise from scratch and built an extensible multi-purpose research DBMS that can serve as an easy-to-extend platform for a variety of experiments and prototyping in database research. In this paper, we discuss how our learnings from the previous version of Hyrise have influenced our re-write. We describe the new architecture of Hyrise and highlight the main components. Afterwards, we show how our extensible plugin architecture facilitates research on diverse DBMS-related aspects without compromising the architectural tidiness of the code. In a first performance evaluation, we show that the execution time of most TPC-H queries is competitive to that of other research databases.},
  author = {Dreseler, Markus and Kossmann, Jan and Boissier, Martin and Klauck, Stefan and Uflacker, Matthias and Plattner, Hasso},
  booktitle = {22nd International Conference on Extending Database Technology (EDBT)},
  keywords = {adm hyrise mboissierselected myown},
  month = 3,
  pages = {313-324},
  title = {Hyrise Re-engineered: An Extensible Database System for Research in Relational In-Memory Data Management},
  year = 2019
}

Boissier, M., Jendruk, M.: Workload-Driven and Robust Selection of Compression Schemes for Column Stores. 22nd International Conference on Extending Database Technology, EDBT. pp. 674–677 (2019).

Dreseler, M.: Storing STL Containers on NVM. Persistent Programming in Real Life (2019).

Schmidt, C., Dreseler, M., Akin, B., Roy, A.: A Case for Hardware-Supported Sub-Cache Line Accesses. Data Management on New Hardware (DaMoN), in conjunction with SIGMOD (2018).

Dreseler, M., Kossmann, J., Frohnhofen, J., Uflacker, M., Plattner, H.: Fused Table Scans: Combining AVX-512 and JIT to Double the Performance of Multi-Predicate Scans. Joint Workshop of HardBD (International Workshop on Big Data Management on Emerging Hardware) and Active (Workshop on Data Management on Virtualized Active Systems), in conjunction with ICDE (2018).

Kossmann, J., Dreseler, M., Gasda, T., Uflacker, M., Plattner, H.: Visual Evaluation of SQL Plan Cache Algorithms. Australasian Database Conference (ADC) (2018).

Dreseler, M., Gasda, T., Kossmann, J., Uflacker, M., Plattner, H.: Adaptive Access Path Selection for Hardware-Accelerated DRAM Loads. Australasian Database Conference (ADC) (2018).

Schwalb, D., Bk, G.K., Dreseler, M., S, A., Faust, M., Hohl, A., Berning, T., Makkar, G., Plattner, H., Deshmukh, P.: Hyrise-NV: Instant Recovery for In-Memory Databases using Non-Volatile Memory. International Conference on Database Systems for Advanced Applications (DASFAA) (2016).

Schwalb, D., Dreseler, M., Uflacker, M., Plattner, H.: NVC-Hashmap: A Persistent and Concurrent Hashmap For Non-Volatile Memories. In-Memory Data Management Workshop (IMDM), in conjunction with VLDB (2015).

Schwalb, D., Kossmann, J., Faust, M., Klauck, S., Uflacker, M., Plattner, H.: Hyrise-R: Scale-out and Hot-Standby through Lazy Master Replication for Enterprise Applications. Proceedings of the 3rd VLDB Workshop on In-Memory Data Management and Analytics (IMDM), in conjunction with VLDB 2015, Kohala Coast, Hawaii (2015).

Faust, M., Schwalb, D., Plattner, H.: Composite Group-Keys: Space-efficient Indexing of Multiple Columns for Compressed In-Memory Column Stores. IMDM in conjunction with VLDB (2014).

Schwalb, D., Faust, M., Wust, J., Grund, M., Plattner, H.: Efficient Transaction Processing for Hyrise in Mixed Workload Environments. IMDM in conjunction with VLDB (2014).

Grund, M., Cudre-Mauroux, P., Krüger, J., Madden, S., Plattner, H.: An overview of HYRISE - a Main Memory Hybrid Storage Engine. IEEE Data Engineering Bulletin (2012).

Faust, M., Krüger, J., Schwalb, D., Plattner, H.: Fast Lookups for In-Memory Column Stores: Group-Key Indices, Lookup and Maintenance. ADMS (in conjunction with VLDB) (2012).

Grund, M., Krüger, J., Plattner, H., Zeier, A., Cudre-Mauroux, P., Madden, S.: HYRISE - A Hybrid Main Memory Storage Engine. Proceedings of the VLDB Endowment Volume 4 Issue 2. pp. 105–116 (2011).

Grund, M., Cudre-Mauroux, P., Madden, S.: A Demonstration of HYRISE - A Main Memory Hybrid Storage Engine. VLDB (2011).

News

22.09.2023 | Trends and Concepts in the Software Industry Seminar Offered in WiSe 2023/2024

22.05.2023 | Christopher Hagedorn Successfully Defended His PhD Thesis

03.03.2023 | Last Trends and Concepts course of Prof. Hasso Plattner

01.03.2023 | Jan Kossmann Successfully Defended His PhD Thesis

26.02.2023 | Paper on Data Tiering in Hyrise Published in BTW Proceedings

24.02.2023 | Paper on EPIC Research Group Published in SIGMOD Record

30.11.2022 | Paper on Database Optimizations for Spatio-Temporal Data published in PVLDB

04.10.2022 | Günter Hesse Successfully Defended His PhD Thesis

08.07.2022 | Successful PhD Defense by Markus Dreseler
