Hasso-Plattner-Institut
Prof. Dr. Tilmann Rabl
 

About the Speaker

Viktor Leis is a Professor of Computer Science at the Technical University of Munich (TUM). His research revolves around designing high-performance data management systems and includes core database systems topics such as query processing, query optimization, transaction processing, index structures, and storage. Another major research area is designing cloud-native, cost-efficient information systems. Viktor Leis received his doctoral degree in 2016 from TUM and was a professor at the Friedrich-Schiller-Universität Jena and Friedrich-Alexander-Universität Erlangen-Nürnberg before returning to TUM in 2022. He received the ACM SIGMOD dissertation award, the IEEE TCDE Rising Star Award, best paper awards at IEEE ICDE and ACM SIGMOD, and an ERC Starting Grant.

About the Talk

Data analytics is moving into the cloud and many highly successful commercial cloud-native systems exist. However, existing systems are either expensive, lock in user data, or do both. In this talk, I will present ideas for how to commoditize large-scale data analytics in the cloud. Commoditization involves (a) reducing query processing cost to the theoretical hardware performance limits and (b) avoiding vendor lock-in by making it easy to move data between different clouds and systems. The architecture of an open and cost-efficient data analytics system can be split into three main components: First, an intelligent control component that automatically and transparently selects and manages the cheapest hardware instances for the given workload and makes migration to other cloud vendors possible. Second, a highly-efficient and scalable query processing engine that is capable of fully exploiting modern cloud hardware. Third, a data lake storage abstraction using open data formats that enables cheap storage as well as modularity and interoperability across different data systems.

Commoditizing Data Analytics in the Cloud

written by Konrad Nareike

Motivation

When designing a database system, the traditional approach has been to optimize for performance. Since the hardware of a physical server cannot easily be replaced, data analytics systems were tuned for a specific hardware configuration that was expected to stay in use for a rather long lifespan. Examples are the HyPer DBMS, which is optimized for machines with fast main memory [2], and LeanStore, which is optimized for machines with fast SSDs [5].

In recent years, moving data and computation to the cloud has become a trend, invalidating some of these previous assumptions and principles. For instance, hardware is no longer a constant factor: it takes only a few clicks to exchange the hardware on which data is stored or computations are run. Moreover, instead of paying for hardware acquisition, operation, and maintenance, database owners nowadays pay the cloud provider either for the amount of data scanned or for the time a query execution takes. Thus, the question of what makes a good data analytics system needs to be raised again with regard to the possibilities of the cloud infrastructure available today.

This report presents a new benchmark designed specifically for database systems in the cloud, as well as cost optimization as an alternative design principle for analytical database systems in the cloud. Since this practice depends heavily on the business models offered by cloud providers, it is necessary to use efficient open data formats instead of proprietary structures, so that cloud storage providers can be switched whenever a cheaper offer is available, without having to optimize for a new data format.

Currently, the pricing models of cloud data analytics services appear rather arbitrary. For example, scanning a terabyte of data costs several dollars on services such as Google BigQuery or Amazon Athena. However, a rough calculation of how expensive such a pure scanning operation would be on a plain cloud computing platform such as Amazon EC2 yields only a few cents, which is orders of magnitude cheaper than any cloud data analytics service. This raises the question of whether and how the same results can actually be achieved at the lower price.
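To make the gap concrete, here is a minimal back-of-the-envelope sketch. All prices and the network bandwidth in it are rough, assumed ballpark values rather than figures from the talk, but they suffice to illustrate a difference of roughly two orders of magnitude.

```python
# Back-of-the-envelope comparison of scanning one terabyte. All prices and
# bandwidth figures are rough, assumed ballpark values, not exact quotes.

TB = 1e12  # bytes

# Managed query service billed per byte scanned (several dollars per TB).
price_per_tb_scanned = 5.0
managed_cost = price_per_tb_scanned * (TB / 1e12)
print(f"managed query service: ~${managed_cost:.2f}")

# Plain EC2 instance billed per hour: assume a network-optimized instance
# with ~100 Gbit/s (about 12.5 GB/s) of bandwidth at ~4 USD per hour.
bandwidth_bytes_per_s = 12.5e9
price_per_hour = 4.0
scan_seconds = TB / bandwidth_bytes_per_s            # ~80 s
ec2_cost = price_per_hour * scan_seconds / 3600.0    # ~0.09 USD
print(f"plain EC2 instance: ~${ec2_cost:.2f} for ~{scan_seconds:.0f} s of compute")
```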

Benchmarking for Cloud Analytics

The issue with current benchmarks for data analytics is that they do not consider the conditions that are typical for the cloud. For instance, the majority of benchmarks optimize for runtime, assuming the ideal conditions of a system that processes nothing besides the query being measured. A cloud infrastructure, however, is usually shared between multiple clients. Therefore, it makes more sense to optimize for query latency under such shared conditions than for isolated runtime.

Only recently did van Renen & Leis introduce the Cloud Analytics Benchmark (CAB), which is specialized for data analytics systems in the cloud [7]. It is based on the TPC-H benchmark, which is commonly used as a standard benchmark for data analytics. Unlike TPC-H, it introduces multi-tenancy: the hardware holds multiple differently sized instances of a TPC-H database that are queried by different clients, representing the users of the cloud service. On top of that, individual queries arrive asynchronously and at varying times rather than one after another. Thus, the metric to be minimized is the latency of each query, as it is the cost-driving factor in the business models of most cloud service providers.
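The workload idea can be illustrated with a small simulation. The sketch below is not the actual CAB implementation: the tenant runtimes, the arrival process, and the single shared worker are simplifying assumptions, chosen only to show why latency under contention differs from isolated runtime.

```python
import random

# Illustrative multi-tenant workload (not the actual CAB implementation):
# tenants of different sizes submit queries asynchronously, a shared worker
# processes them one at a time, and the metric is per-query latency
# (completion time minus arrival time), not the isolated runtime.

random.seed(0)
tenants = {"small": 0.2, "medium": 1.0, "large": 5.0}  # assumed runtime per query (s)

queries = []  # (arrival_time, runtime)
for runtime in tenants.values():
    t = 0.0
    for _ in range(20):
        t += random.expovariate(0.5)   # random inter-arrival gap, mean 2 s
        queries.append((t, runtime))
queries.sort()

clock, latencies = 0.0, []
for arrival, runtime in queries:
    start = max(clock, arrival)        # queries queue up while the worker is busy
    clock = start + runtime
    latencies.append(clock - arrival)  # latency includes the queueing delay

print(f"mean latency: {sum(latencies) / len(latencies):.2f} s, "
      f"mean isolated runtime: {sum(r for _, r in queries) / len(queries):.2f} s")
```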

CAB is also able to measure how much it would cost to run a workload on a certain cloud system. This can be useful for testing whether a provider with a lower price per unit of time is actually cheaper than a provider who charges more but might run the workload faster and thus end up cheaper overall. It can also be used to compare different pricing models, such as price per time versus price per query. It is, however, pointless to compare two providers who both charge per query, as the number of queries in the benchmark is constant.
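A toy calculation with made-up numbers illustrates the point: under a per-time model, the measured runtime determines the cost, whereas under a per-query model the cost is fixed by the benchmark's query count, so comparing two per-query providers reduces to comparing their price lists.

```python
# Toy cost comparison of two pricing models on the same benchmark run
# (all numbers are made up for illustration).
benchmark_queries = 1200        # fixed by the benchmark definition
measured_runtime_hours = 2.5    # depends on how fast the system is

price_per_hour = 3.0            # provider A: billed per unit of time
price_per_query = 0.01          # provider B: billed per query

cost_a = measured_runtime_hours * price_per_hour  # a faster system is cheaper here
cost_b = benchmark_queries * price_per_query      # constant for a fixed query count
print(f"per-time provider: ${cost_a:.2f}, per-query provider: ${cost_b:.2f}")
```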

Cost-Efficient Cloud Computing

When using a cloud computing service for query processing, the hardware can be switched with only a couple of clicks. This means that system design is no longer constrained by the long-term decision of hardware acquisition. Moreover, instead of a fixed acquisition cost, cloud computing services are usually billed based on the duration of the computation.

Cloud computing services such as Amazon EC2 offer many different instance types. Some are optimized for fast computation, others for high network bandwidth, and yet others come with fast local SSDs to better handle large amounts of data that do not fit into main memory. As outlined by Leis & Kuschewski, comparing these three kinds of instances for different workload sizes shows that the cost per workload is CPU-bound for small workloads, disk-bound for medium workloads, and eventually network-bound for large workloads. Moreover, although network-optimized instances outperform compute-optimized and storage-optimized instances mainly for large workloads, they also come fairly close to the best-performing instances for small workloads [6].

    In general, compute-optimized instances are cheapest for small workloads, whereas storage-optimized instances (i3) are better for medium workloads and network-optimized instances (c5n) for large workloads [6].
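The intuition behind these crossovers can be expressed as a simple cost model: every byte passes through the CPU, bytes that are not cached in memory must come from local disk or the network, the slowest resource determines the runtime, and cost is runtime multiplied by the hourly instance price. The function below is an illustrative sketch of this reasoning with made-up parameters, not the actual model from [6].

```python
def scan_cost_usd(workload_gb, price_per_hour, cpu_gb_s, disk_gb_s, net_gb_s,
                  frac_from_memory=0.0, frac_from_disk=0.0):
    """Toy cost model (illustrative only, not the model from [6]): every byte
    passes through the CPU, bytes that are not cached in memory are read from
    local disk or fetched over the network, the slowest resource determines
    the runtime, and cost is runtime times the hourly instance price."""
    cpu_s = workload_gb / cpu_gb_s
    disk_s = workload_gb * frac_from_disk / disk_gb_s
    net_s = workload_gb * (1.0 - frac_from_memory - frac_from_disk) / net_gb_s
    runtime_s = max(cpu_s, disk_s, net_s)
    return price_per_hour * runtime_s / 3600.0

# Example with assumed figures: a 10 TB scan fetched entirely over a
# ~12.5 GB/s (100 Gbit/s) network on a ~4 USD/h network-optimized instance.
print(f"~{scan_cost_usd(10_000, 4.0, cpu_gb_s=10.0, disk_gb_s=2.0, net_gb_s=12.5):.2f} USD")
```

Plugging realistic prices and bandwidths for compute-, storage-, and network-optimized instances into such a model is what produces the crossovers described in [6]: whichever resource saturates first for a given workload size determines which instance type is cheapest.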

The bottom line is that no single instance type outperforms all others for every workload size. Furthermore, future innovations in hardware engineering can quickly shift these bounds and introduce new instance types that outperform current ones. This has indeed happened in the past, for example with the introduction of 100 Gbit networks, which lowered the cost per runtime of network-bound instances by about 70% [6]. The main advantage of using a cloud computing service is that committing to a particular piece of hardware is not necessary and that hardware can easily be switched based on workload size or as soon as cheaper hardware becomes available.

    Network-optimized instances have not always been the cheapest option [6].

A remaining question is what it implies to optimize for cost instead of runtime or latency. Assuming that cost correlates with resource consumption, the principle of cost optimization can ultimately lead to more resource efficiency, for instance by using less energy per computation. Thus, cost efficiency may well have more beneficial side effects than simply being cheap. Nonetheless, there might also be negative side effects. Suppose the majority of customers switches to the currently most cost-efficient instance type, which is a network-optimized one. First of all, the rising demand could shift prices in the opposite direction again. Moreover, it could push the focus of future research towards optimizing for network bandwidth, neglecting the possibly yet undiscovered potential of compute-optimized or storage-optimized instances.

Exploiting Cloud Storage Bandwidth

The previous considerations provided insights into how much cost efficiency can be reached in theory when processing queries on cloud computing services. The question that follows is whether this potential can actually be exploited when reading data from cloud storage.

Combining a network-optimized cloud computing instance on Amazon EC2 with a cloud storage service such as Amazon S3, it can be shown that the network bandwidth is exploited best when issuing a large number of concurrent requests [1]. The reason is that these requests are processed in parallel, which increases the amount of data delivered to the computing instance at a time.

However, this does not imply that requests should be partitioned as finely as possible. The reason is the pricing model of Amazon S3, which charges by the number of requests sent, while the size of each request is irrelevant. This makes larger requests cheaper when measured in cost per processed data. Once a request size of 16 MiB is reached, the request cost for Amazon S3 is low enough to be dominated by the cost of the Amazon EC2 instance [1]. It is therefore indeed possible to read data cheaply from cloud storage while exploiting as much bandwidth as possible. The only precondition is that requests are large enough for the request cost to be negligible, while the data is still partitioned into enough requests that the offered bandwidth is fully exploited.

    The network bandwidth can be saturated with a high number of concurrent requests, while the request cost for Amazon S3 becomes negligible for request sizes of 16 MiB and above [1].
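The trade-off can be made concrete with a small calculation. The GET-request price, instance price, and bandwidth below are assumed ballpark values rather than current quotes, but they show why per-request fees dominate for tiny requests and become negligible once requests are on the order of 16 MiB.

```python
# Illustrative cost of downloading 1 TiB from object storage with different
# request sizes. The GET price, instance price, and bandwidth are rough
# assumed values, not current quotes.

TIB = 1024**4                      # bytes
GET_PRICE = 0.4 / 1_000_000        # assumed: ~0.40 USD per million GET requests
EC2_PRICE_PER_HOUR = 4.0           # assumed network-optimized instance
NET_BYTES_PER_S = 12.5e9           # ~100 Gbit/s

ec2_cost = EC2_PRICE_PER_HOUR * (TIB / NET_BYTES_PER_S) / 3600.0
print(f"EC2 compute cost while downloading 1 TiB: ~${ec2_cost:.3f}")

for size_mib in (1, 4, 16, 64):
    num_requests = TIB / (size_mib * 1024**2)
    s3_request_cost = num_requests * GET_PRICE
    print(f"{size_mib:>3} MiB requests: ~${s3_request_cost:.3f} in request fees")
```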

Going one step further, Durner, Leis & Neumann introduced a download manager called AnyBlob that exploits this bandwidth even more effectively. In fact, AnyBlob reduces CPU usage by about 30% while increasing the average bandwidth by about 50% compared to the native Amazon S3 download manager [1].

    AnyBlob performs far better than the native Amazon S3 download manager [1].

Data Encoding and Compression

Cloud services typically store data in proprietary formats. This becomes an issue because the format an analytics tool requires is usually a different one. As long as proprietary data formats are used, the data can only be accessed through interfaces offered by the provider, causing additional overhead whenever the encoding has to be changed to be compatible with a particular analytics tool. This favors analytics tools that support the proprietary format directly. Such tools are usually offered by the same provider, and relying on them can lead to vendor lock-in. Proprietary data formats also make migration to another cloud storage provider much harder, as different providers usually use different formats.

The solution to this problem is open data formats such as Apache ORC or Apache Parquet. However, these two well-established formats come with the downside of not compressing data very well, as they feature only a few ways of encoding data. This becomes an issue when data is transferred from cloud storage to a cloud computing service, since the available bandwidth is then not fully exploited. Hence, open data formats are usually combined with an additional compression algorithm on top. While this allows better exploitation of the network bandwidth, it adds computational overhead due to the need for decompression. As a result, query execution becomes CPU-bound again, negating the advantages of a high network bandwidth.

To address all of these issues at once, Kuschewski et al. developed a new open data format named BtrBlocks, which has built-in compression aiming for low computational overhead [3]. Unlike Apache ORC or Apache Parquet, it offers a large variety of encodings for each data type, so that the best one can be chosen. Additionally, encodings can be applied recursively up to three times. For instance, if a sequence of integers is compressed with run-length encoding (RLE), the output consists of two new integer sequences (the run values and the run lengths), which are ideally much smaller in total than the original sequence. These two sequences can then recursively be encoded again using RLE or any other encoding, as illustrated in the sketch below.
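A minimal sketch of this idea in plain Python (illustrative only, not the actual BtrBlocks implementation):

```python
# Run-length encode an integer column and recurse on its outputs
# (illustrative only, not the actual BtrBlocks implementation).

def rle_encode(values):
    """Return (run_values, run_lengths) for a sequence of integers."""
    run_values, run_lengths = [], []
    for v in values:
        if run_values and run_values[-1] == v:
            run_lengths[-1] += 1
        else:
            run_values.append(v)
            run_lengths.append(1)
    return run_values, run_lengths

column = [7] * 1000 + [8] * 500 + [7] * 250
run_values, run_lengths = rle_encode(column)
print(run_values, run_lengths)    # [7, 8, 7] [1000, 500, 250]

# Both outputs are themselves integer columns, so they can be fed into RLE
# (or any other encoding) again; BtrBlocks caps this recursion at a small
# depth of up to three levels [3].
print(rle_encode(run_lengths))    # ([1000, 500, 250], [1, 1, 1]) - no gain here
```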

BtrBlocks does in fact always recurse as soon as there is potential for compression. However, the number of recursion levels also increases the computational cost of decoding. There might be cases where BtrBlocks would actually achieve better results if it stopped recursing instead of adding another layer of compression with a compression factor of only 1.1.

The only remaining question is how the encoding with the highest compression is selected. Obviously, trying out all possible encodings on the entire data is not feasible. Instead, BtrBlocks pursues a sampling approach. In order to obtain a sample that is not too large while still being representative and preserving spatial locality, each block of data is split into non-overlapping parts (addressing representativeness). From each part, a run of consecutive values is selected at random (addressing spatial locality). Each run is one hundredth of the size of the part it belongs to (addressing sample size). Larger sample sizes are possible and lead to better results, but they come at the price of more encoding effort. The runs of all parts together form the sample for the block, which is then compressed with all available encodings to find the most effective one; that encoding is then used to compress the entire block. A rough sketch of the procedure is given below.

    Sampling approach of BtrBlocks [3]
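The following sketch of the sampling procedure assumes 16 parts and one-percent runs; it is illustrative only and not the actual BtrBlocks code.

```python
import random

def sample_block(column, num_parts=16, run_fraction=0.01):
    """Split the block into non-overlapping parts and take one random run of
    consecutive values per part; the concatenated runs form the sample on
    which all candidate encodings are tried out."""
    part_size = len(column) // num_parts
    sample = []
    for p in range(num_parts):
        part_start = p * part_size
        run_len = max(1, int(part_size * run_fraction))
        start = part_start + random.randrange(part_size - run_len + 1)
        sample.extend(column[start:start + run_len])   # one consecutive run per part
    return sample

block = list(range(65536))
print(len(sample_block(block)), "values sampled from", len(block))   # roughly 1%
```

The encoding that compresses this sample best is then applied to the whole block.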

When tested on the Public BI Benchmark, BtrBlocks reaches an average compression factor of almost 7.5, which is of the same magnitude as data stored in Apache ORC or Apache Parquet with additional compression. At the same time, its low computational overhead lets BtrBlocks achieve more than twice the decoding bandwidth of these state-of-the-art open formats, with or without additional compression [3].

    BtrBlocks reaches a far higher decompression bandwidth than Apache ORC or Apache Parquet, whether compressed with Snappy, Zstd, or not at all [3].

Depending on how important fast encoding is, BtrBlocks could be improved even further. Right now, recursive encoding is done greedily, i.e. recursive compression is only tested after a first compression scheme has been chosen. If the tree of possible compression schemes were traversed exhaustively, an even better scheme might be found, but at the cost of far higher compression times.

Summary

The availability of cloud storage and cloud computing offers a variety of new ways to design a data analytics system. Benchmarks that purely optimize systems for runtime no longer measure the right metrics, since latency is the actual sign of quality when hardware is shared. The possibility of quickly switching the hardware used for query processing enables new design principles such as cost efficiency. Nonetheless, engineering effort is still necessary to exploit close to the full potential of the cloud services that are currently available and to ensure flexibility and low switching costs in case of technical breakthroughs and changes in market offers.

References

1. D. Durner, V. Leis, and T. Neumann. "Exploiting Cloud Object Storage for High-Performance Analytics". In: Proceedings of the VLDB Endowment 16.11 (2023), pages 2769-2782.
2. A. Kemper and T. Neumann. HyPer: HYbrid OLTP & OLAP High PERformance Database System. Technical report. 2010.
3. M. Kuschewski, D. Sauerwein, A. Alhomssi, and V. Leis. "BtrBlocks: Efficient Columnar Compression for Data Lakes". In: Proceedings of the ACM on Management of Data 1.2 (2023), pages 1-26.
4. V. Leis. Commoditizing Data Analytics in the Cloud. 2024.
5. V. Leis, M. Haubenschild, A. Kemper, and T. Neumann. "LeanStore: In-Memory Data Management beyond Main Memory". In: 2018 IEEE 34th International Conference on Data Engineering (ICDE). IEEE. 2018, pages 185-196.
6. V. Leis and M. Kuschewski. "Towards Cost-Optimal Query Processing in the Cloud". In: Proceedings of the VLDB Endowment 14.9 (2021), pages 1606-1612.
7. A. van Renen and V. Leis. "Cloud Analytics Benchmark". In: Proceedings of the VLDB Endowment 16.6 (2023), pages 1413-1425.
8. TU München. (02/13/2024). Prof. Dr. Viktor Leis. https://www.professoren.tum.de/leis-viktor