Hasso-Plattner-Institut
Prof. Dr. h.c. mult. Hasso Plattner
 

Hyrise: The Open-Source In-Memory Research DBMS

General Information

Hyrise is a research in-memory database system that our group has been developing since 2009 and has entirely rewritten since 2016. Our goal is to provide a clean and flexible platform for research in the area of in-memory data management. Its architecture allows us, our students, and other researchers to conduct experiments around new data management concepts. To enable realistic experiments, Hyrise features comprehensive SQL support and performs powerful query plan optimizations. Well-known benchmarks, such as TPC-H or TPC-DS, can be executed with a single command and without any preparation.

Above the DBMS foundation of Hyrise, we build the “autonomous database”. The vision is to support database administrators (DBAs) in handling the growing complexity of not only the database systems themselves, but also that of the stored data and the workload. For this, Hyrise monitors runtime parameters, predicts the impact of possible configuration changes, and automatically applies those changes that are deemed beneficial. These changes may include the creation of secondary indexes, the (re-)encoding of columns, or the eviction of unused data to lower memory and storage tiers.

To foster reuse and reproducibility, Hyrise is completely open source and available on GitHub. We value high-quality code (C++20), as documented by strict code reviews, a test coverage of 90%, various linting and static code analysis tools, and a comment-to-code ratio of 1:4.

You are invited to read our Step-by-Step Guide and to contact us with any questions that may arise.

The project team consists of Martin Boissier, Stefan Halfpap, Keven Richly, Marcel Weisgut, Daniel Lindner, Dr. Michael Perscheid, and Prof. Dr. h.c. Hasso Plattner. We thank all student contributors, without whom this work would not have been possible.

Database Architecture

Hyrise consists of two parts. First, the DBMS Foundation comprises the components necessary to store data and execute queries. Second, the Autonomous Database, described below, is responsible for automatically tuning the system. The architecture diagram above visualizes these two parts.

Users can interact with Hyrise through one of three interfaces. First, the CLI Console offers features beyond those traditionally known from command-line clients, such as the inline visualization of query plans as annotated graphs. Second, Hyrise supports the PostgreSQL wire protocol and can thus be accessed using the psql client or compatible libraries. Finally, the benchmark binaries are a one-stop solution for executing different benchmarks and obtaining human- and machine-readable results.
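As a minimal usage sketch (the binary names and the default port follow the Hyrise step-by-step guide and may differ in your build), a typical session could look like this:

```shell
# Start the Hyrise server (assumed binary name from a standard CMake build);
# it speaks the PostgreSQL wire protocol, listening on port 5432 by default.
./hyriseServer &

# Connect with the stock PostgreSQL client (or any compatible library) ...
psql -h localhost -p 5432

# ... or run a complete benchmark with a single command.
./hyriseBenchmarkTPCH
```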

Regardless of the interface used, SQL queries enter the SQL Pipeline, which transforms the query string into a logical query plan; this plan is then optimized, transformed into a physical plan, and finally executed. We have discussed the different optimization steps and quantified their impact in our publications (see below).

Hyrise stores table data in so-called chunks. A chunk is a fine-grained, horizontal partition of the table with a predefined number of rows. New rows are inserted into the last chunk of the table. Once this chunk reaches its target size, it is marked as immutable and a new mutable chunk is appended. Chunks serve as a flexible basis for indexes, filters, and statistics. Internally, chunks hold one segment per column of the table, which makes Hyrise a primarily column-oriented DBMS. Segments that are part of an immutable chunk may be encoded (i.e., compressed) asynchronously using one of several encoding schemes; by default, dictionary encoding is used.

Autonomous Database

The DBMS Foundation is the basis for our autonomous database. We support a number of tuning options that can be used to optimize the system's performance, among them the automatic selection of encoding mechanisms, data-driven partitioning, and the automatic migration of data between storage tiers. Many of these options are developed as parts of individual research projects and are thus subject to frequent changes. To facilitate their independent development, we have decoupled them from the Hyrise core and implemented them as plugins.

At the same time, many tuning options have shared requirements. For an efficient selection of encoding mechanisms, where less frequently accessed segments are compressed more heavily, the number of accesses to these segments has to be tracked. The same information is needed by the automatic tiering plugin. Overlaps between plugins can be found not only in their input data, but also in their internal mechanisms. For example, both of the mentioned plugins aim at balancing two competing goals, namely reducing the DRAM footprint without negatively affecting query throughput. In the long run, we plan for these shared requirements to be fulfilled by the driver in the Hyrise core.

The driver takes as input runtime KPIs (e.g., the system utilization, the number of accesses to individual segments, and more), the constraints defined by the DBA, and the options provided by the different plugins. Based on these parameters, it makes decisions in a centralized manner. These decisions can then be realized by the different plugins.

Hyrise Cockpit Demonstration

In practice, automated database administration approaches are often distrusted. The Hyrise cockpit provides means for database administrators and developers to experiment with plugins that tune database systems autonomously. By allowing users to compare the performance of conventionally and autonomously configured systems side by side for complex workloads, these experiments lead to a better understanding of how such approaches work and, in the end, increase trust in autonomous solutions. The cockpit is demonstrated in the video below, which was presented at ICDE 2021.

Research Activities

The scope of our research includes (I) data compression and tiering, (II) data replication and scale-out, (III) index selection, and (IV) the joint tuning of these approaches for both relational and spatio-temporal workloads. We also explore (V) the usage of data dependencies in the context of query optimization.

I Data Compression & Tiering

Data compression and tiering are powerful methods to address the memory bottleneck and cost inefficiencies of in-memory databases. Automatically deciding which compression technique to use in an in-memory column store is challenging due to trade-offs and non-obvious impacts on large workloads. We propose a solution for the automatic selection of a budget-constrained encoding configuration in Hyrise, based on linear programming (LP) and greedy heuristics. The resulting encoding configurations are robust with respect to runtime performance, adaptable, and workload-aware. To ensure performance robustness, LP techniques are applied to achieve equally distributed performance gains over all queries. The results show the potential for significant memory budget reductions without a deterioration of runtime performance.

Similarly, data tiering promises to reduce the amount of data in main memory by moving infrequently used data to cheaper and more elastic lower memory and secondary storage tiers. The challenge is to find an optimal balance in the trade-off between performance and costs. We propose an automatic tiering approach for Hyrise, using LP, that addresses this challenge. Our approach tracks the frequency and pattern of data accesses to identify rarely used data, which is moved to secondary memory tiers (e.g., NVM or SSDs). This method is applicable to column selection problems in general and ensures Pareto efficiency for varying memory budgets. Since aspects such as the selectivity, size, and frequency of queries are taken into account, the resulting performance is optimized and outperforms other heuristics.

II Replication & Scale-out

Database replication and query load-balancing are important mechanisms to scale query throughput. Analyzing the workload allows load-balancing queries to replica nodes according to the data they access. As a result, replica nodes only have to store and synchronize subsets of the data. However, evenly balancing the load of large-scale workloads while minimizing the memory footprint is complex and challenging. Moreover, state-of-the-art allocation approaches are either time-consuming or their resulting allocations are not memory-efficient. In our work, we used LP-based decomposition techniques to determine optimized data placements and workload distributions. We extended these solutions to consider potential node failures. Further, we derived a heuristic to compute robust solutions for large, real-life workload instances that provide competitive performance for different potential as well as uncertain workload scenarios.

III Index Selection

Indexes are essential for the efficient processing of database workloads. However, existing index selection algorithms are either slow or not highly competitive, as we found in our survey, which evaluates state-of-the-art approaches using our open-source evaluation platform. To overcome the observed limitations, we developed three new index selection algorithms serving different purposes: (i) EXTEND determines (near-)optimal solutions with an iterative heuristic. The produced solutions outperform others in most evaluated cases, while the selection runtime is up to 10× lower. (ii) SWIRL is based on reinforcement learning (RL) and, after training, delivers solutions instantaneously. SWIRL decreases selection runtimes by orders of magnitude, while the solution quality stays within 2% of the best solutions. While EXTEND is universally applicable with a high solution quality, SWIRL requires training but reduces runtimes. (iii) Our decomposition concept for solver-based index selection makes it possible to handle larger candidate sets and to address risk-averse problem variants, in which multiple potential future workloads are taken into account.

IV Joint Tuning & Spatio-Temporal Decisions

Challenges for self-driving database systems, which tune their physical design and configuration autonomously, are manifold: such systems have to anticipate future workloads, find robust configurations efficiently, and incorporate knowledge gained by previous actions into later decisions. We present a theoretical, component-based framework for self-driving database systems that enables database integration and the development of self-managing functionality with low overhead by relying on a separation of concerns. We have started to implement joint tuning approaches in Hyrise, accounting for combined indexing, sorting, and compression configurations for spatio-temporal applications.

V Data Dependencies

Efficient query optimization is usually based on metadata, such as cardinalities and other basic statistics. More advanced techniques consider data dependency types, such as functional, uniqueness, order, or inclusion constraints and dependencies. In the literature, we identified 60 dependency-based query optimization techniques for application areas such as joins, selections, sorting, and set operations.

Toward an efficient implementation and integration into commercial database systems, we laid out a vision for a workload-driven dependency discovery system for query optimization. The discovery is considered “lazy” because only those data dependency candidates are considered that are relevant for the observed workload. Our prototypical implementation in Hyrise identifies relevant candidates based on executed query plans and dynamically validates them against the database, leading to performance improvements.

Additional Resources

Besides our publications (see below), we also document our progress with Hyrise in the Hyrise Wiki, on our Medium blog, and on the Hyrise Twitter channel.

Publications

1. Richly, K., Schlosser, R., Boissier, M.: Budget-Conscious Fine-Grained Configuration Optimization for Spatio-Temporal Applications. Proceedings of the VLDB Endowment, pp. 4079–4092 (2022).
2. Kossmann, J., Lindner, D., Naumann, F., Papenbrock, T.: Workload-driven, Lazy Discovery of Data Dependencies for Query Optimization. Proceedings of the Conference on Innovative Data Systems Research (CIDR) (2022).
3. Weisgut, M., Ritter, D., Boissier, M., Perscheid, M.: Separated Allocator Metadata in Disaggregated In-Memory Databases: Friend or Foe? 1st Workshop on Composable Systems (COMPSYS@IPDPS), awarded best paper (2022).
4. Kossmann, J., Papenbrock, T., Naumann, F.: Data dependencies for query optimization: a survey. The VLDB Journal (2021).
5. Lindner, D., Loeser, A., Kossmann, J.: Learned What-If Cost Models for Autonomous Clustering. New Trends in Database and Information Systems - ADBIS 2021 Short Papers, Doctoral Consortium and Workshops, Tartu, Estonia, pp. 3–13 (2021).
6. Kossmann, J., Boissier, M., Dubrawski, A., Heseding, F., Mandel, C., Pigorsch, U., Schneider, M., Schniese, T., Sobhani, M., Tsayun, P., Wille, K., Perscheid, M., Uflacker, M., Plattner, H.: A Cockpit for the Development and Evaluation of Autonomous Database Systems. 37th IEEE International Conference on Data Engineering (ICDE), pp. 2685–2688 (2021).
7. Kossmann, J., Halfpap, S., Jankrift, M., Schlosser, R.: Magic mirror in my hand, which is the best in the land? An Experimental Evaluation of Index Selection Algorithms. Proceedings of the VLDB Endowment, pp. 2382–2395 (2020).
8. Schlosser, R., Halfpap, S.: A Decomposition Approach for Risk-Averse Index Selection. 32nd International Conference on Scientific and Statistical Database Management (SSDBM), pp. 16:1–16:4 (2020).
9. Kossmann, J., Schlosser, R.: Self-driving database systems: a conceptual approach. Distributed and Parallel Databases 38(4), pp. 795–817 (2020).
10. Dreseler, M., Boissier, M., Rabl, T., Uflacker, M.: Quantifying TPC-H Choke Points and Their Optimizations. Proceedings of the VLDB Endowment, pp. 1206–1220 (2020).
11. Dreseler, M.: Storing STL Containers on NVM. Persistent Programming in Real Life (2019).
12. Boissier, M., Jendruk, M.: Workload-Driven and Robust Selection of Compression Schemes for Column Stores. 22nd International Conference on Extending Database Technology (EDBT), pp. 674–677 (2019).
13. Dreseler, M., Kossmann, J., Boissier, M., Klauck, S., Uflacker, M., Plattner, H.: Hyrise Re-engineered: An Extensible Database System for Research in Relational In-Memory Data Management. 22nd International Conference on Extending Database Technology (EDBT), pp. 313–324 (2019).
14. Dreseler, M., Gasda, T., Kossmann, J., Uflacker, M., Plattner, H.: Adaptive Access Path Selection for Hardware-Accelerated DRAM Loads. Australasian Database Conference (ADC) (2018).
15. Schmidt, C., Dreseler, M., Akin, B., Roy, A.: A Case for Hardware-Supported Sub-Cache Line Accesses. Data Management on New Hardware (DaMoN), in conjunction with SIGMOD (2018).
16. Kossmann, J., Dreseler, M., Gasda, T., Uflacker, M., Plattner, H.: Visual Evaluation of SQL Plan Cache Algorithms. Australasian Database Conference (ADC) (2018).
17. Dreseler, M., Kossmann, J., Frohnhofen, J., Uflacker, M., Plattner, H.: Fused Table Scans: Combining AVX-512 and JIT to Double the Performance of Multi-Predicate Scans. Joint Workshop of HardBD (International Workshop on Big Data Management on Emerging Hardware) and Active (Workshop on Data Management on Virtualized Active Systems), in conjunction with ICDE (2018).
18. Schwalb, D., Bk, G.K., Dreseler, M., S, A., Faust, M., Hohl, A., Berning, T., Makkar, G., Plattner, H., Deshmukh, P.: Hyrise-NV: Instant Recovery for In-Memory Databases using Non-Volatile Memory. International Conference on Database Systems for Advanced Applications (DASFAA) (2016).
19. Schwalb, D., Dreseler, M., Uflacker, M., Plattner, H.: NVC-Hashmap: A Persistent and Concurrent Hashmap for Non-Volatile Memories. In-Memory Data Management Workshop (IMDM), in conjunction with VLDB (2015).
20. Schwalb, D., Kossmann, J., Faust, M., Klauck, S., Uflacker, M., Plattner, H.: Hyrise-R: Scale-out and Hot-Standby through Lazy Master Replication for Enterprise Applications. Proceedings of the 3rd VLDB Workshop on In-Memory Data Management and Analytics (IMDM), in conjunction with VLDB 2015, Kohala Coast, Hawaii (2015).
21. Faust, M., Schwalb, D., Plattner, H.: Composite Group-Keys: Space-efficient Indexing of Multiple Columns for Compressed In-Memory Column Stores. IMDM, in conjunction with VLDB (2014).
22. Schwalb, D., Faust, M., Wust, J., Grund, M., Plattner, H.: Efficient Transaction Processing for Hyrise in Mixed Workload Environments. IMDM, in conjunction with VLDB (2014).
23. Grund, M., Cudre-Mauroux, P., Krüger, J., Madden, S., Plattner, H.: An Overview of HYRISE - a Main Memory Hybrid Storage Engine. IEEE Data Engineering Bulletin (2012).
24. Faust, M., Krüger, J., Schwalb, D., Plattner, H.: Fast Lookups for In-Memory Column Stores: Group-Key Indices, Lookup and Maintenance. ADMS, in conjunction with VLDB (2012).
25. Grund, M., Krüger, J., Plattner, H., Zeier, A., Cudre-Mauroux, P., Madden, S.: HYRISE - A Hybrid Main Memory Storage Engine. Proceedings of the VLDB Endowment 4(2), pp. 105–116 (2011).
26. Grund, M., Cudre-Mauroux, P., Madden, S.: A Demonstration of HYRISE - A Main Memory Hybrid Storage Engine. VLDB (2011).