[1]
Hurdelhey, B., Weisgut, M. and Boissier, M. Workload-Driven Data Placement for Tierless In-Memory Database Systems. Datenbanksysteme für Business, Technologie und Web, BTW (2023), 47–70.
High main memory consumption is a significant cost factor for in-memory database systems. Tiering, i.e., placing parts of the data on memory or storage devices other than DRAM, reduces the main memory footprint. A controlled data placement can assign rarely accessed data to slow devices while frequently used data remains on fast devices, such as main memory, to maintain acceptable query latencies. We present an automatic data placement decision system for the in-memory database Hyrise. The system organizes the memory and storage devices in a tierless pool, with no fixed device class categorization or performance order. The system supports data placement use cases, such as minimizing end-to-end query latencies and making cost-optimal purchase recommendations in cloud environments. In this paper, we introduce an efficient calibration process to derive cost models for various storage devices. To determine data placements, we introduce a linear programming-based approach, which yields optimal configurations, and an efficient heuristic. With a set of main memory and SSD devices, we can reduce the main memory consumption for base table data of the TPC-DS benchmark by 74 percent when accepting a workload latency increase of 52 percent. In a comparison of data placement algorithms and cost models, we find that simplistic algorithms (e.g., greedy algorithms) can present viable alternatives to optimal linear programming algorithms, especially under cost prediction inaccuracies.
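The paper's models are not reproduced here, but the core of such an LP-based placement decision can be sketched as a small integer program: binary variables assign each table segment to exactly one device, subject to per-device capacity limits, minimizing a cost-model estimate of workload latency. The sketch below uses the PuLP library; all segment names, sizes, capacities, and access costs are invented for illustration.

```python
# Illustrative sketch of an ILP data placement model (not the paper's code).
# Segment sizes, device capacities, and access costs are made-up values.
import pulp

segments = ["orders_c1", "orders_c2", "lineitem_c1"]
devices = ["dram", "ssd"]
size_mb = {"orders_c1": 400, "orders_c2": 250, "lineitem_c1": 900}
capacity_mb = {"dram": 1000, "ssd": 4000}
# Estimated workload cost of serving a segment from a device (cost-model output).
access_cost = {
    ("orders_c1", "dram"): 1.0, ("orders_c1", "ssd"): 6.0,
    ("orders_c2", "dram"): 0.2, ("orders_c2", "ssd"): 0.9,
    ("lineitem_c1", "dram"): 3.0, ("lineitem_c1", "ssd"): 14.0,
}

prob = pulp.LpProblem("data_placement", pulp.LpMinimize)
x = pulp.LpVariable.dicts("place", (segments, devices), cat="Binary")

# Objective: minimize the total estimated workload cost.
prob += pulp.lpSum(access_cost[s, d] * x[s][d] for s in segments for d in devices)
# Each segment lives on exactly one device.
for s in segments:
    prob += pulp.lpSum(x[s][d] for d in devices) == 1
# Device capacities must not be exceeded.
for d in devices:
    prob += pulp.lpSum(size_mb[s] * x[s][d] for s in segments) <= capacity_mb[d]

prob.solve(pulp.PULP_CBC_CMD(msg=False))
for s in segments:
    chosen = next(d for d in devices if x[s][d].value() > 0.5)
    print(f"{s} -> {chosen}")
```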
[2]
Richly, K., Schlosser, R. and Boissier, M. Budget-Conscious Fine-Grained Configuration Optimization for Spatio-Temporal Applications. Proceedings of the VLDB Endowment 15 (13) (2022), 4079–4092.
Based on the performance requirements of modern spatio-temporal data mining applications, in-memory database systems are often used to store and process the data. To efficiently utilize the scarce DRAM capacities, modern database systems support various tuning possibilities to reduce the memory footprint (e.g., data compression) or increase performance (e.g., additional indexes). However, the selection of cost and performance balancing configurations is challenging due to the vast number of possible setups consisting of mutually dependent individual decisions. In this paper, we introduce a novel approach to jointly optimize the compression, sorting, indexing, and tiering configuration for spatio-temporal workloads. Further, we consider horizontal data partitioning, which enables the independent application of different tuning options on a fine-grained level. We propose different linear programming (LP) models addressing cost dependencies at different levels of accuracy to compute optimized tuning configurations for a given workload and memory budget. To yield maintainable and robust configurations, we extend our LP-based approach to incorporate reconfiguration costs as well as a worst-case optimization for potential workload scenarios. Further, we demonstrate on a real-world dataset that our models allow us to significantly reduce the memory footprint at equal performance, or to increase performance at equal memory size, compared to existing tuning heuristics.
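At its core, such a configuration-selection problem can be written as a compact ILP. The following is a generic sketch with invented notation, not the paper's actual formulation (which additionally models cost dependencies between decisions): each partition p receives exactly one candidate configuration c, i.e., one combination of compression, sorting, indexing, and tiering choices, under a memory budget B.

```latex
% Generic sketch of a configuration-selection ILP (notation invented for
% illustration; the paper's models add dependency constraints).
\begin{align*}
\min_{x} \quad & \sum_{p \in P} \sum_{c \in C} \mathit{cost}_{p,c}\, x_{p,c}
  && \text{(estimated workload cost)} \\
\text{s.t.} \quad & \sum_{c \in C} x_{p,c} = 1 && \forall p \in P
  \quad \text{(one configuration per partition)} \\
& \sum_{p \in P} \sum_{c \in C} \mathit{mem}_{p,c}\, x_{p,c} \le B
  && \text{(memory budget)} \\
& x_{p,c} \in \{0, 1\} && \forall p \in P,\ c \in C
\end{align*}
```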
[3]
Boissier, M. Robust and Budget-Constrained Encoding Configurations for In-Memory Database Systems. Proceedings of the VLDB Endowment 15 (4) (2022), 780–793.
Data encoding has been applied to database systems for decades as it mitigates bandwidth bottlenecks and reduces storage requirements. Despite these advantages, most in-memory database systems use data encoding only conservatively because the negative impact on runtime performance can be severe. Real-world systems in which large parts of the data are infrequently accessed, as well as cost-efficiency constraints in cloud environments, call for solutions that automatically and efficiently select encoding techniques, including heavyweight compression. In this paper, we introduce workload-driven approaches to automatically determine memory budget-constrained encoding configurations using greedy heuristics and linear programming. We show for TPC-H, TPC-DS, and the Join Order Benchmark that optimized encoding configurations can reduce the main memory footprint significantly without a loss in runtime performance over state-of-the-art dictionary encoding. To yield robust selections, we extend the linear programming-based approach to incorporate query runtime constraints and mitigate unexpected performance regressions.
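As one illustration of the greedy side of such approaches, the sketch below frees a target amount of memory by repeatedly applying the encoding change with the best savings-to-penalty ratio. It is not the paper's algorithm; the per-candidate memory savings and runtime penalties are assumed to come from a cost model and are invented here.

```python
# Hedged sketch of a greedy, budget-constrained encoding selection
# (illustrative only, not the paper's implementation).
from dataclasses import dataclass

@dataclass
class Candidate:
    column: str
    encoding: str
    memory_saved_mb: float     # vs. a dictionary-encoded baseline
    runtime_penalty_ms: float  # estimated workload slowdown (cost-model output)

def greedy_select(candidates, budget_to_free_mb):
    """Pick encoding changes until enough memory is freed."""
    chosen, freed = [], 0.0
    # Most memory saved per unit of runtime penalty first.
    ranked = sorted(candidates,
                    key=lambda c: c.memory_saved_mb / max(c.runtime_penalty_ms, 1e-9),
                    reverse=True)
    encoded_columns = set()
    for cand in ranked:
        if freed >= budget_to_free_mb:
            break
        if cand.column in encoded_columns:
            continue  # at most one re-encoding per column
        chosen.append(cand)
        encoded_columns.add(cand.column)
        freed += cand.memory_saved_mb
    return chosen, freed

candidates = [
    Candidate("lineitem.comment", "LZ4", 320.0, 40.0),
    Candidate("lineitem.quantity", "FrameOfReference", 90.0, 2.0),
    Candidate("orders.status", "RunLength", 60.0, 1.0),
]
picked, freed = greedy_select(candidates, budget_to_free_mb=150.0)
print([c.encoding for c in picked], f"freed {freed} MB")
```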
[4]
Heinzl, L., Hurdelhey, B., Boissier, M., Perscheid, M. and Plattner, H. Evaluating Lightweight Integer Compression Algorithms in Column-Oriented In-Memory DBMS. International Workshop on Accelerating Analytics and Data Management Systems Using Modern Processor and Storage Architectures, ADMS@VLDB (2021).
Lightweight data compression algorithms are often used to decrease the memory consumption of in-memory databases. In recent years, various integer compression techniques have been proposed that focus on sequential encoding and decoding and exploit modern CPUs' vectorization capabilities. Interestingly, another dominant access pattern in database systems has seen little attention: random access decoding. In this paper, we compare end-to-end database performance for various integer compression codecs on three recent CPU architectures. Our evaluation suggests that random access performance is often more relevant than vectorization capabilities for sequential accesses. Before integrating selected encodings into the database core, we benchmarked seven libraries in an exhaustive standalone comparison. We integrated the most promising techniques into the relational in-memory database system Hyrise and evaluated their performance for TPC-H, TPC-DS, and the Join Order Benchmark on three different CPU architectures. Our results emphasize the importance of random access decoding. Compared to state-of-the-art dictionary encoding in TPC-H, alternatives allow reducing the memory consumption of integer columns by up to 53% while improving runtime performance by 5% on an Intel CPU and by over 16% on an Apple M1.
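To see why random access decoding matters, consider fixed-width bit-packing: the i-th value can be decoded directly from its bit offset without decompressing a whole block. The toy sketch below illustrates only this property; real codecs operate on word-aligned, often SIMD-friendly layouts.

```python
# Minimal sketch of why fixed-width bit-packing supports O(1) random access
# (illustrative only; real codecs use SIMD and word-aligned layouts).
def pack(values, bit_width):
    """Pack non-negative ints into a single big integer (toy bit stream)."""
    stream = 0
    for i, v in enumerate(values):
        assert v < (1 << bit_width)
        stream |= v << (i * bit_width)
    return stream

def get(stream, i, bit_width):
    """Decode the i-th value directly -- no sequential decompression needed."""
    return (stream >> (i * bit_width)) & ((1 << bit_width) - 1)

values = [7, 0, 5, 3, 6, 1]
stream = pack(values, bit_width=3)
assert all(get(stream, i, 3) == v for i, v in enumerate(values))
print(get(stream, 4, 3))  # -> 6, fetched without touching its neighbors
```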
[5]
Kossmann, J., Boissier, M., Dubrawski, A., Heseding, F., Mandel, C., Pigorsch, U., Schneider, M., Schniese, T., Sobhani, M., Tsayun, P., Wille, K., Perscheid, M., Uflacker, M. and Plattner, H. A Cockpit for the Development and Evaluation of Autonomous Database Systems. 37th IEEE International Conference on Data Engineering, ICDE (2021), 2685–2688.
Databases are highly optimized complex systems with a multitude of configuration options. Especially in cloud scenarios with thousands of database deployments, determining optimized database configurations in an automated fashion is of increasing importance for database providers. At the same time, due to increased system complexity, it becomes more challenging to identify well-performing configurations. Therefore, research interest in autonomous or self-driving database systems has increased enormously in recent years. Such systems promise both performance improvements and cost reductions. In the literature, various fully or partially autonomous optimization mechanisms exist that optimize single aspects, e.g., index selection. However, database administrators and developers often distrust autonomous approaches, and there is a lack of practical experimentation opportunities that could create a better understanding. Moreover, the interplay of different autonomous mechanisms under complex workloads remains an open question. The cockpit presented here enables an interactive assessment of the impact of autonomous components for database systems by comparing (autonomous) systems with different configurations side by side. In this way, it enables users to build trust in autonomous solutions by experimenting with such technologies and observing their effects in practice.
[6]
Richly, K., Schlosser, R. and Boissier, M. Joint Index, Sorting, and Compression Optimization for Memory-Efficient Spatio-Temporal Data Management. 37th IEEE International Conference on Data Engineering, ICDE (2021), 1901–1906.
The wide distribution of location-acquisition technologies has led to large volumes of spatio-temporal data, which are the foundation for a broad spectrum of applications. Based on these applications' performance requirements, in-memory databases are used to store and process the data. As DRAM capacities are limited and expensive, modern database systems apply various configuration optimizations (e.g., compression) to reduce the memory footprint. The selection of cost and performance balancing configurations is challenging due to the vast number of possible setups consisting of mutually dependent individual decisions. In this paper, we present a linear programming approach to determine fine-grained configuration decisions for spatio-temporal workloads. By dividing the data into partitions of fixed size, we can apply the compression, sorting, and index selections on a fine-grained level to reflect spatio-temporal access patterns. Our approach jointly optimizes these configurations to maximize performance under a given memory budget. We demonstrate on a real-world dataset that models specifically optimized for spatio-temporal data characteristics allow us to reduce the memory footprint (up to 60% at equal performance) and increase the performance (up to 80% at equal memory size) compared to established rule-based heuristics.
[7]
Dreseler, M., Boissier, M., Rabl, T. and Uflacker, M. Quantifying TPC-H Choke Points and Their Optimizations. Proceedings of the VLDB Endowment 13 (8) (2020), 1206–1220.
TPC-H continues to be the most widely used benchmark for relational OLAP systems. It poses a number of challenges, also known as “choke points”, which database systems have to solve in order to achieve good benchmark results. Examples include joins across multiple tables, correlated subqueries, and correlations within the TPC-H data set. Knowing the impact of such optimizations helps in developing optimizers as well as in interpreting TPC-H results across database systems. This paper provides a systematic analysis of choke points and their optimizations. It complements previous work on TPC-H choke points by providing a quantitative discussion of their relevance. It focuses on eleven choke points where the optimizations are beneficial independently of the database system. Of these, the flattening of subqueries and the placement of predicates have the biggest impact. Three queries (Q2, Q17, and Q21) are strongly influenced by the choice of an efficient query plan; three others (Q1, Q13, and Q18) are less influenced by plan optimizations and more dependent on an efficient execution engine.
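The predicate placement choke point can be made concrete with a toy example: filtering before a join avoids materializing joined rows that would be discarded immediately afterwards. The row counts below are invented for illustration.

```python
# Toy illustration of the "predicate placement" choke point (invented data):
# pushing a filter below the join avoids materializing rows that the
# filter would discard right away.
orders = [(o_id, "HIGH" if o_id % 10 == 0 else "LOW") for o_id in range(100_000)]
lineitems = [(i % 100_000, i) for i in range(400_000)]  # (order id, line id)

# Filter above the join: materializes all 400,000 joined rows first.
prio_by_order = dict(orders)
joined = [(line, prio_by_order[o_id]) for o_id, line in lineitems]
result_late = [line for line, prio in joined if prio == "HIGH"]

# Filter below the join: only 10,000 order ids survive into the join.
high_orders = {o_id for o_id, prio in orders if prio == "HIGH"}
result_early = [line for o_id, line in lineitems if o_id in high_orders]

assert sorted(result_late) == sorted(result_early)
print(f"materialized {len(joined)} joined rows vs. "
      f"probing {len(high_orders)} order ids for the same result")
```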
[8]
Boissier, M. and Jendruk, M. Workload-Driven and Robust Selection of Compression Schemes for Column Stores. 22nd International Conference on Extending Database Technology, EDBT (2019), 674–677.
Modern main memory-optimized column stores employ a variety of compression techniques. Choosing one compression technique over another for a given memory budget can be challenging since each technique has different trade-offs whose impact on large workloads is not obvious. We present an automated selection framework for compression configurations. Most database systems provide means to automatically choose a compression configuration but lack two crucial properties: the compression selection cannot be constrained (e.g., by a given storage budget), and the robustness of the compression configuration is not considered. Our approach uses workload information to determine robust configurations under the given constraints. The runtime performance of the various compression techniques is estimated using adapted regression models.
[9]
Schlosser, R., Kossmann, J. and Boissier, M. Efficient Scalable Multi-Attribute Index Selection Using Recursive Strategies. 35th IEEE International Conference on Data Engineering, ICDE (2019), 1238–1249.
An efficient selection of indexes is indispensable for database performance. For large problem instances with hundreds of tables, existing approaches are not suitable: they either exhibit prohibitive runtimes or yield far-from-optimal index configurations because they strongly limit the set of index candidates or do not handle index interaction explicitly. We introduce a novel recursive strategy that does not exclude index candidates in advance and effectively accounts for index interaction. Using large real-world workloads, we demonstrate the applicability of our approach. Further, we evaluate our solution end to end with a commercial database system using a reproducible setup. We show that our solutions are near-optimal for small index selection problems. For larger problems, our strategy outperforms state-of-the-art approaches in both scalability and solution quality.
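Index interaction, the key difficulty here, can be shown with a minimal cost model (invented numbers, not the paper's recursive algorithm): because a query only benefits from the best index it can use, the marginal benefit of an index shrinks once a better-suited one is selected.

```python
# Toy illustration of index interaction: the marginal benefit of an index
# depends on which other indexes are already in the configuration.
query_cost_with_index = {
    # query -> {index: cost if that index exists}; None = full-scan cost
    "q1": {None: 100, "idx_a": 30, "idx_ab": 20},
    "q2": {None: 80,  "idx_ab": 25},
}

def workload_cost(selected):
    total = 0
    for costs in query_cost_with_index.values():
        usable = [c for idx, c in costs.items() if idx is None or idx in selected]
        total += min(usable)  # each query uses its best applicable index
    return total

print(workload_cost(set()))                # 180: full scans only
print(workload_cost({"idx_a"}))            # 110: q1 improves by 70
print(workload_cost({"idx_a", "idx_ab"}))  # 45: idx_a is now nearly redundant
```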
[10]
Dreseler, M., Kossmann, J., Boissier, M., Klauck, S., Uflacker, M. and Plattner, H. Hyrise Re-engineered: An Extensible Database System for Research in Relational In-Memory Data Management. 22nd International Conference on Extending Database Technology, EDBT (2019), 313–324.
Research in data management profits when the performance evaluation is based not only on individual components in isolation, but uses an actual DBMS end-to-end. Facilitating the integration and benchmarking of new concepts within a DBMS requires a simple setup process, well-documented code, and the possibility to execute both standard and custom benchmarks without tedious preparation. Fulfilling these requirements also makes it easy to reproduce the results later on. The relational open-source database Hyrise (VLDB, 2010) was presented to make the case for hybrid row- and column-format data storage. Since then, it has evolved from being a single-purpose research DBMS towards becoming a platform for various projects, including research in the areas of indexing, data partitioning, and non-volatile memory. With a growing diversity of topics, we have found that the original code base grew to a point where new experimentation became unnecessarily difficult. Over the last two years, we have rewritten Hyrise from scratch and built an extensible multi-purpose research DBMS that can serve as an easy-to-extend platform for a variety of experiments and prototyping in database research. In this paper, we discuss how our learnings from the previous version of Hyrise have influenced our rewrite. We describe the new architecture of Hyrise and highlight the main components. Afterwards, we show how our extensible plugin architecture facilitates research on diverse DBMS-related aspects without compromising the architectural tidiness of the code. In a first performance evaluation, we show that the execution time of most TPC-H queries is competitive with that of other research databases.
[11]
Ruhrländer, R.P., Boissier, M. and Uflacker, M. Improving Box Office Result Predictions for Movies Using Consumer-Centric Models. Proceedings of the 24th ACM SIGKDD International Conference on Knowledge Discovery & Data Mining, KDD (2018), 655–664.
Recent progress in machine learning and related fields like recommender systems opens up new possibilities for data-driven approaches. One example is the prediction of a movie's box office revenue, which is highly relevant for optimizing production and marketing. We use individual recommendations and user-based forecast models in a system that forecasts revenue and additionally provides actionable insights for industry professionals. In contrast to most existing models that completely neglect user preferences, our approach allows us to model the most important source for movie success: moviegoer taste and behavior. We divide the problem into three distinct stages: (i) we use matrix factorization recommenders to model each user's taste, (ii) we then predict the individual consumption behavior, and (iii) eventually aggregate users to predict the box office result. We compare our approach to the current industry standard and show that the inclusion of user rating data reduces the error by a factor of two and outperforms recently published research.
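Stage (i) can be sketched with plain matrix factorization: learn low-dimensional taste vectors for users and movies from observed ratings via stochastic gradient descent. The toy data and hyperparameters below are invented; the paper builds on full-scale recommenders.

```python
# Stage (i) sketch: matrix factorization on a toy rating matrix via SGD
# (illustrative only, not the paper's production models).
import numpy as np

rng = np.random.default_rng(0)
ratings = [  # (user, movie, rating) triples
    (0, 0, 5.0), (0, 1, 3.0), (1, 0, 4.0), (1, 2, 1.0), (2, 1, 2.0), (2, 2, 5.0),
]
n_users, n_movies, k = 3, 3, 2
U = rng.normal(scale=0.1, size=(n_users, k))   # user taste vectors
M = rng.normal(scale=0.1, size=(n_movies, k))  # movie factor vectors

lr, reg = 0.05, 0.02
for _ in range(200):
    for u, m, r in ratings:
        pu, qm = U[u].copy(), M[m].copy()
        err = r - pu @ qm  # prediction error on this rating
        U[u] += lr * (err * qm - reg * pu)
        M[m] += lr * (err * pu - reg * qm)

# Predicted affinity of user 1 for the unseen movie 1; such per-user
# predictions feed the later consumption and box-office stages.
print(U[1] @ M[1])
```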
[12]
Schlosser, R. and Boissier, M. Dynamic Pricing under Competition on Online Marketplaces: A Data-Driven Approach. Proceedings of the 24th ACM SIGKDD International Conference on Knowledge Discovery & Data Mining, KDD (2018), 705–714.
Most online markets are characterized by competitive settings and limited demand information. Due to the complexity of such markets, efficient pricing strategies are hard to derive. We analyze stochastic dynamic pricing models in competitive markets with multiple offer dimensions, such as price, quality, and rating. In a first step, we use a simulated test market to study how sales probabilities are affected by specific customer behaviors and the strategic interaction of price reaction strategies. Further, we show how different state-of-the-art learning techniques can be used to estimate sales probabilities from partially observable market data. In a second step, we use a dynamic programming model to compute an effective pricing strategy which circumvents the curse of dimensionality. We demonstrate that the strategy is applicable even if the number of competitors is large and their strategies are unknown. We show that our heuristic can be tuned to smoothly balance profitability and speed of sales. Further, our approach is currently applied by a large seller on Amazon for the sale of used books. Sales results show that our data-driven strategy outperforms the rule-based strategy of an experienced seller, increasing profit by more than 20%.
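The dynamic-programming step can be sketched as a finite-horizon Bellman recursion over (time, remaining inventory), choosing in each period the price that maximizes expected discounted profit. The demand curve, horizon, and price grid below are invented; the paper estimates sale probabilities from partially observable market data instead.

```python
# Hedged sketch of the dynamic-programming idea: Bellman recursion over
# (time, inventory) with a toy sale-probability model (invented numbers).
def sale_prob(price):
    return max(0.0, min(1.0, 1.0 - price / 60.0))  # toy demand curve

prices = [10, 20, 30, 40, 50]
T, N, delta = 20, 5, 0.99  # horizon, initial inventory, discount factor

V = [[0.0] * (N + 1) for _ in range(T + 1)]  # V[t][n]: value with n items left
policy = [[None] * (N + 1) for _ in range(T)]
for t in range(T - 1, -1, -1):
    for n in range(1, N + 1):
        best = None
        for p in prices:
            q = sale_prob(p)
            # Sell one unit with probability q, otherwise keep the inventory.
            value = q * (p + delta * V[t + 1][n - 1]) + (1 - q) * delta * V[t + 1][n]
            if best is None or value > best:
                best, policy[t][n] = value, p
        V[t][n] = best

print(policy[0][N], V[0][N])  # first-period price and expected profit
```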
[13]
Boissier, M., Schlosser, R. and Uflacker, M. Hybrid Data Layouts for Tiered HTAP Databases with Pareto-Optimal Data Placements. 34th IEEE International Conference on Data Engineering, ICDE (2018), 209–220.
Recent developments in database research introduced HTAP systems that are capable of handling both transactional and analytical workloads. These systems achieve their performance by storing the full data set in main memory. An open research question is how far one can reduce the main memory footprint without losing the performance superiority of main memory-resident databases. In this paper, we present a hybrid main memory-optimized database for mixed workloads that evicts cold data to less expensive storage tiers. It adapts the storage layout to mitigate the negative performance impact of secondary storage. A key challenge is to determine which data to place on which storage tier. We introduce a novel workload-driven model that determines Pareto-optimal allocations while also considering reallocation costs. We evaluate our concept for a production enterprise system as well as reproducible data sets.
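A Pareto-optimal allocation in this sense is one that no other allocation beats in both memory footprint and estimated workload cost. A minimal dominance filter, with invented candidate allocations and without the paper's reallocation-cost modeling, looks as follows:

```python
# Sketch of the Pareto filter behind such placement decisions (illustrative):
# keep an allocation only if no other one is at least as good in both
# dimensions and strictly better in at least one.
def pareto_front(allocations):
    """allocations: list of (name, dram_mb, workload_cost) tuples."""
    front = []
    for name, mem, cost in allocations:
        dominated = any(
            m2 <= mem and c2 <= cost and (m2 < mem or c2 < cost)
            for _, m2, c2 in allocations
        )
        if not dominated:
            front.append((name, mem, cost))
    return sorted(front, key=lambda a: a[1])

candidates = [
    ("all_dram", 2000, 100), ("hot_dram", 900, 130),
    ("half_ssd", 900, 160),  ("all_ssd", 300, 400),
]
print(pareto_front(candidates))
# half_ssd is dominated by hot_dram (same footprint, higher cost) and dropped
```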