Hurdelhey, B., Weisgut, M., Boissier, M.: Workload-Driven Data Placement for Tierless In-Memory Database Systems. BTW 2023. pp. 47–70. Gesellschaft für Informatik e.V (2023).
High main memory consumption is a significant cost factor for in-memory database systems. Tiering, i.e., placing parts of the data on memory or storage devices other than DRAM, reduces the main memory footprint. A controlled data placement can assign rarely accessed data to slow devices while frequently used data remains on fast devices, such as main memory, to maintain acceptable query latencies. We present an automatic data placement decision system for the in-memory database Hyrise. The system organizes the memory and storage devices in a tierless pool, with no fixed device class categorization or performance order. The system supports data placement use cases, such as minimizing end-to-end query latencies and making cost-optimal purchase recommendations in cloud environments. In this paper, we introduce an efficient calibration process to derive cost models for various storage devices. To determine data placements, we introduce a linear programming-based approach, which yields optimal configurations, and an efficient heuristic. With a set of main memory and SSD devices, we can reduce the main memory consumption for base table data of the TPC-DS benchmark by 74 percent when accepting a workload latency increase of 52 percent. In a comparison of data placement algorithms and cost models, we find that simplistic algorithms (e.g., greedy algorithms) can present viable alternatives to optimal linear programming algorithms, especially under cost prediction inaccuracies.
Richly, K., Schlosser, R., Boissier, M.: Budget-Conscious Fine-Grained Configuration Optimization for Spatio-Temporal Applications. Proceedings of the VLDB Endowment. pp. 4079–4092 (2022).
Based on the performance requirements of modern spatio-temporal data mining applications, in-memory database systems are often used to store and process the data. To efficiently utilize the scarce DRAM capacities, modern database systems support various tuning possibilities to reduce the memory footprint (e.g., data compression) or increase performance (e.g., additional indexes). However, the selection of cost and performance balancing configurations is challenging due to the vast number of possible setups consisting of mutually dependent individual decisions. In this paper, we introduce a novel approach to jointly optimize the compression, sorting, indexing, and tiering configuration for spatio-temporal workloads. Further, we consider horizontal data partitioning, which enables the independent application of different tuning options on a fine-grained level. We propose different linear programming (LP) models addressing cost dependencies at different levels of accuracy to compute optimized tuning configurations for a given workload and memory budgets. To yield maintainable and robust configurations, we extend our LP-based approach to incorporate reconfiguration costs as well as a worst-case optimization for potential workload scenarios. Further, we demonstrate on a real-world dataset that our models allow to significantly reduce the memory footprint with equal performance or increase the performance with equal memory size compared to existing tuning heuristics.
Boissier, M.: Robust and Budget-Constrained Encoding Configurations for In-Memory Database Systems. Proceedings of the VLDB Endowment. pp. 780–793 (2022).
Data encoding has been applied to database systems for decades as it mitigates bandwidth bottlenecks and reduces storage requirements. But even in the presence of these advantages, most in-memory database systems use data encoding only conservatively as the negative impact on runtime performance can be severe. Real-world systems with large parts being infrequently accessed and cost efficiency constraints in cloud environments require solutions that automatically and efficiently select encoding techniques, including heavy-weight compression. In this paper, we introduce workload-driven approaches to automatically determine memory budget-constrained encoding configurations using greedy heuristics and linear programming. We show for TPC-H, TPC-DS, and the Join Order Benchmark that optimized encoding configurations can reduce the main memory footprint significantly without a loss in runtime performance over state-of-the-art dictionary encoding. To yield robust selections, we extend the linear programming-based approach to incorporate query runtime constraints and mitigate unexpected performance regressions.
Heinzl, L., Hurdelhey, B., Boissier, M., Perscheid, M., Plattner, H.: Evaluating Lightweight Integer Compression Algorithms in Column-Oriented In-Memory DBMS. 12th International Workshop on Accelerating Analytics and Data Management Systems Using Modern Processor and Storage Architectures, ADMS@VLDB 2021, Copenhagen, Denmark, August 16, 2021 (2021).
Lightweight data compression algorithms are often used to decrease memory consumption of in-memory databases. In recent years, various integer compression techniques have been proposed that focus on sequential encoding and decoding and exploit modern CPUs' vectorization capabilities. Interestingly, another dominant access pattern in databases systems has seen little attention: random access decoding. In this paper, we compare end-to-end database performance for various integer compression codecs on three recent CPU architectures. Our evaluation suggests that random access performance is often more relevant than vectorization capabilities for sequential accesses. Before integrating selected encodings in the database core, we benchmarked seven libraries in an exhaustive standalone comparison. We integrated the most promising techniques into the relational in-memory database system Hyrise and evaluated their performance for TPC-H, TPC-DS, and the Join Order Benchmark on three different CPU architectures. Our results emphasize the importance of random access decoding. Compared to state-of-the-art dictionary encoding in TPC-H, alternatives allow reducing memory consumption of integer columns by up to 53 % while improving runtime performance by 5 % on an Intel CPU and over 16 % on an Apple M1.
Kossmann, J., Boissier, M., Dubrawski, A., Heseding, F., Mandel, C., Pigorsch, U., Schneider, M., Schniese, T., Sobhani, M., Tsayun, P., Wille, K., Perscheid, M., Uflacker, M., Plattner, H.: A Cockpit for the Development and Evaluation of Autonomous Database Systems. 37th IEEE International Conference on Data Engineering, ICDE. pp. 2685–2688 (2021).
Databases are highly optimized complex systems with a multitude of configuration options. Especially in cloud scenarios with thousands of database deployments, determining optimized database configurations in an automated fashion is of increasing importance for database providers. At the same time, due to increased system complexity, it becomes more challenging to identify well-performing configurations. Therefore, research interest in autonomous or self-driving database systems has increased enormously in recent years. Such systems promise both performance improvements and cost reductions. In the literature, various fully or partially autonomous optimization mechanisms exist that optimize single aspects, e.g., index selection. However, database administrators and developers often distrust autonomous approaches, and there is a lack of practical experimentation opportunities that could create a better understanding. Moreover, the interplay of different autonomous mechanisms under complex workloads remains an open question. The presented cockpit enables an interactive assessment of the impact of autonomous components for database systems by comparing (autonomous) systems with different configurations side by side. Thereby, the cockpit enables users to build trust in autonomous solutions by experimenting with such technologies and observing their effects in practice.
Richly, K., Schlosser, R., Boissier, M.: Joint Index, Sorting, and Compression Optimization for Memory-Efficient Spatio-Temporal Data Management. 37th IEEE International Conference on Data Engineering, ICDE 2021, Chania, Greece, April 19-22, 2021. pp. 1901–1906 (2021).
The wide distribution of location-acquisition technologies has led to large volumes of spatio-temporal data, which are the foundation for a broad spectrum of applications. Based on these applications' performance requirements, in-memory databases are used to store and process the data. As DRAM capacities are limited and expensive, modern database systems apply various configuration optimizations (e.g., compression) to reduce the memory footprint. The selection of cost and performance balancing configurations is challenging due to the vast amount of possible setups consisting of mutually dependent individual decisions. In this paper, we present a linear programming approach to determine fine-grained configuration decisions for spatio-temporal workloads. By dividing the data into partitions of fixed size, we can apply the compression, sorting, and index selections on a fine-grained level to reflect spatio-temporal access patterns. Our approach jointly optimizes these configurations to maximize performance under a given memory budget. We demonstrate on a real-world dataset that models specifically optimized for spatio-temporal data characteristics allow us to reduce the memory footprint (up to 60% by equal performance) and increase the performance (up to 80% by equal memory size) compared to established rule-based heuristics.
Dreseler, M., Boissier, M., Rabl, T., Uflacker, M.: Quantifying TPC-H Choke Points and Their Optimizations. Proceedings of the VLDB Endowment. pp. 1206–1220 (2020).
Boissier, M., Jendruk, M.: Workload-Driven and Robust Selection of Compression Schemes for Column Stores. 22nd International Conference on Extending Database Technology (EDBT). pp. 674–677 (2019).
Schlosser, R., Kossmann, J., Boissier, M.: Efficient Scalable Multi-Attribute Index Selection Using Recursive Strategies. IEEE 35th International Conference on Data Engineering (ICDE 2019). pp. 1238–1249. IEEE (2019).
Dreseler, M., Kossmann, J., Boissier, M., Klauck, S., Uflacker, M., Plattner, H.: Hyrise Re-engineered: An Extensible Database System for Research in Relational In-Memory Data Management. 22nd International Conference on Extending Database Technology (EDBT). pp. 313–324 (2019).
Ruhrländer, R.P., Boissier, M., Uflacker, M.: Improving Box Office Result Predictions for Movies Using Consumer-Centric Models. KDD ’18 Proceedings of the 24th ACM SIGKDD International Conference on Knowledge Discovery & Data Mining. pp. 655–664 (2018).
Recent progress in machine learning and related fields like recommender systems open up new possibilities for data-driven approaches. One example is the prediction of a movie’s box office revenue, which is highly relevant for optimizing production and marketing. We use individual recommendations and user-based forecast models in a system that forecasts revenue and additionally provides actionable insights for industry professionals. In contrast to most existing models that completely neglect user preferences, our approach allows us to model the most important source for movie success: moviegoer taste and behavior. We divide the problem into three distinct stages: (i) we use matrix factorization recommenders to model each user’s taste, (ii) we then predict the individual consumption behavior, and (iii) eventually aggregate users to predict the box office result. We compare our approach to the current industry standard and show that the inclusion of user rating data reduces the error by a factor of 2x and outperforms recently published research.
Schlosser, R., Boissier, M.: Dynamic Pricing under Competition on Online Marketplaces: A Data-Driven Approach. KDD ’18 Proceedings of the 24th ACM SIGKDD International Conference on Knowledge Discovery & Data Mining. pp. 705–714 (2018).
Most online markets are characterized by competitive settings and limited demand information. Due to the complexity of such markets, efficient pricing strategies are hard to derive. We analyze stochastic dynamic pricing models in competitive markets with multiple offer dimensions, such as price, quality, and rating. In a first step, we use a simulated test market to study how sales probabilities are affected by specific customer behaviors and the strategic interaction of price reaction strategies. Further, we show how different state-of-the-art learning techniques can be used to estimate sales probabilities from partially observable market data. In a second step, we use a dynamic programming model to compute an effective pricing strategy which circumvents the curse of dimensionality. We demonstrate that the strategy is applicable even if the number of competitors is large and their strategies are unknown. We show that our heuristic can be tuned to smoothly balance profitability and speed of sales. Further, our approach is currently applied by a large seller on Amazon for the sale of used books. Sales results show that our data-driven strategy outperforms the rule-based strategy of an experienced seller by a profit increase of more than 20%.
Boissier, M., Schlosser, R., Uflacker, M.: Hybrid Data Layouts for Tiered HTAP Databases with Pareto-Optimal Data Placements. IEEE 34th International Conference on Data Engineering (ICDE 2018). pp. 209–220 (2018).
Recent developments in database research intro- duced HTAP systems that are capable of handling both transactional and analytical workloads. These systems achieve their performance by storing the full data set in main memory. An open research question is how far one can reduce the main memory footprint without losing the performance superiority of main memory-resident databases. In this paper, we present a hybrid main memory-optimized database for mixed workloads that evicts cold data to less expensive storage tiers. It adapts the storage layout to mitigate the negative performance impact of secondary storage. A key challenge is to determine which data to place on which storage tier. We introduce a novel workload-driven model that determines Pareto-optimal allocations while also considering reallocation costs. We evaluate our concept for a production enterprise system as well as reproducible data sets.