1.
Menon, P., Qadah, T.M., Rabl, T., Sadoghi, M., Jacobsen, H.-A.: LogStore: A Workload-aware, Adaptable Key-Value Store on Hybrid Storage Systems. IEEE Transactions on Knowledge and Data Engineering (2020).
2.
Silva, P., Wang, Y., Rabl, T.: Grand Challenge: Incremental Stream Query Analytics. Proceedings of the 14th ACM International Conference on Distributed and Event-based Systems (DEBS ’20). p. 6 (2020).
Applications in the Internet of Things (IoT) create many data processing challenges because they have to deal with massive amounts of data under low-latency constraints. The DEBS Grand Challenge 2020 specifies an IoT problem whose objective is to identify a special type of event in a stream of electricity smart meter data. In this work, we present the Sequential Incremental DBSCAN-based Event Detection Algorithm (SINBAD), a solution based on an incremental version of the clustering algorithm DBSCAN and scenario-specific data processing optimizations. SINBAD computes solutions up to 7 times faster and up to 26% more accurately than the baseline provided by the DEBS Grand Challenge.
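For illustration, a minimal Python sketch of the incremental-clustering idea behind SINBAD follows. The class name, parameters, and the simplified merge and noise handling are ours, not the paper's; a real implementation would use a spatial index and full core/border bookkeeping.

import math

class IncrementalDBSCAN:
    def __init__(self, eps=0.5, min_pts=3):
        self.eps = eps
        self.min_pts = min_pts
        self.points = []          # all points seen so far
        self.labels = []          # cluster id per point, -1 = noise
        self.next_cluster = 0

    def _neighbors(self, p):
        # Linear scan for simplicity; a spatial index would be used in practice.
        return [i for i, q in enumerate(self.points)
                if math.dist(p, q) <= self.eps]

    def insert(self, p):
        nbrs = self._neighbors(p)
        if len(nbrs) + 1 < self.min_pts:
            self.points.append(p)
            self.labels.append(-1)          # too sparse: noise for now
            return
        # Cluster ids among neighbors; merge them if the new core point bridges them.
        ids = {self.labels[i] for i in nbrs if self.labels[i] != -1}
        if not ids:
            cid = self.next_cluster
            self.next_cluster += 1
        else:
            cid = min(ids)
            for i, lbl in enumerate(self.labels):
                if lbl in ids:
                    self.labels[i] = cid    # merge bridged clusters
        self.points.append(p)
        self.labels.append(cid)
        for i in nbrs:
            if self.labels[i] == -1:
                self.labels[i] = cid        # noise absorbed by new core point

stream = [(0.0, 0.0), (0.1, 0.0), (0.0, 0.1), (5.0, 5.0)]
clusterer = IncrementalDBSCAN(eps=0.3, min_pts=3)
for point in stream:
    clusterer.insert(point)
print(clusterer.labels)   # first three points form a cluster, last one is noise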
3.
Dreseler, M., Boissier, M., Rabl, T., Uflacker, M.: Quantifying TPC-H Choke Points and Their Optimizations [Experiments and Analyses]. Proceedings of the VLDB Endowment. pp. 1206–1220 (2020).
4.
Derakhshan, B., Mahdiraji, A.R., Abedjan, Z., Rabl, T., Markl, V.: Optimizing Machine Learning Workloads in Collaborative Environments. ACM SIGMOD/PODS International Conference on Management of Data, Portland, OR, USA (2020).
Effective collaboration among data scientists results in high-quality and efficient machine learning (ML) workloads. In a collaborative environment, such as Kaggle or Google Colaboratory, users typically re-execute or modify published scripts to recreate or improve the result. This introduces many redundant data processing and model training operations. Reusing the data generated by these redundant operations leads to more efficient execution of future workloads. However, existing collaborative environments lack a data management component for storing and reusing the results of previously executed operations. In this paper, we present a system to optimize the execution of ML workloads in collaborative environments by reusing previously performed operations and their results. We utilize a so-called Experiment Graph (EG) to store the artifacts, i.e., raw and intermediate data or ML models, as vertices and the operations of ML workloads as edges. In theory, the size of EG can become unnecessarily large, while the storage budget might be limited. At the same time, for some artifacts, the overall storage and retrieval cost might outweigh the recomputation cost. To address this issue, we propose two algorithms for materializing artifacts based on their likelihood of future reuse. Given the materialized artifacts inside EG, we devise a linear-time reuse algorithm to find the optimal execution plan for incoming ML workloads. Our reuse algorithm incurs only a negligible overhead and scales to the large number of incoming ML workloads in collaborative environments. Our experiments show that we improve the run-time by one order of magnitude for repeated execution of workloads and by 50% for the execution of modified workloads in collaborative environments.
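A toy sketch of the reuse idea follows, assuming a dictionary keyed by (input artifact, operation) stands in for the Experiment Graph; the materialization-budget algorithms and the linear-time plan search from the paper are omitted, and all names are illustrative.

class ExperimentGraph:
    def __init__(self):
        self.materialized = {}   # (input_artifact_id, op_name) -> artifact

    def execute(self, input_id, input_data, op_name, op_fn):
        key = (input_id, op_name)
        if key in self.materialized:
            return self.materialized[key]   # reuse, skip recomputation
        result = op_fn(input_data)          # cache miss: run the operation
        self.materialized[key] = result     # materialize (budget checks omitted)
        return result

eg = ExperimentGraph()
raw = [3, 1, 2]
a = eg.execute("raw", raw, "sorted", sorted)   # computed
b = eg.execute("raw", raw, "sorted", sorted)   # reused from the graph
assert a is b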
5.
Grulich, P.M., Breß, S., Zeuch, S., Traub, J., von Bleichert, J., Chen, Z., Rabl, T., Markl, V.: Grizzly: Efficient Stream Processing Through Adaptive Query Compilation. ACM SIGMOD/PODS International Conference on Management of Data, Portland, OR, USA (2020).
Stream Processing Engines (SPEs) execute long-running queries on unbounded data streams. They rely on managed runtimes and an interpretation-based processing model, and they do not perform runtime optimizations. Recent research states that this limits the utilization of modern hardware and neglects changing data characteristics at runtime. In this paper, we present Grizzly, a novel adaptive, query-compilation-based SPE that enables highly efficient query execution on modern hardware. We extend query compilation and task-based parallelization for the unique requirements of stream processing and apply adaptive compilation to enable runtime re-optimizations. The combination of lightweight statistics gathering and just-in-time compilation enables Grizzly to adjust dynamically to changing data characteristics at runtime. Our experiments show that Grizzly achieves up to an order of magnitude higher throughput and lower latency compared to state-of-the-art interpretation-based SPEs.
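A hedged, Python-only illustration of the adaptive-compilation idea: predicates are fused into one generated function and recompiled in a better order when sampled selectivities drift. Grizzly itself generates native code and gathers statistics far more cheaply; every name below is invented.

def compile_pipeline(preds):
    # "Compile" by fusing all predicates into a single generated function.
    body = " and ".join(f"preds[{i}](x)" for i in range(len(preds)))
    scope = {"preds": preds}
    exec(f"def pipeline(batch):\n    return [x for x in batch if {body}]", scope)
    return scope["pipeline"]

def run_adaptive(batches, preds):
    passes = [0] * len(preds)
    pipeline = compile_pipeline(preds)
    for batch in batches:
        for x in batch[:8]:                      # lightweight per-batch sampling
            for i, p in enumerate(preds):
                passes[i] += bool(p(x))
        yield pipeline(batch)
        order = sorted(range(len(preds)), key=lambda i: passes[i])
        if order != list(range(len(preds))):     # selectivity drift observed
            preds = [preds[i] for i in order]    # most selective predicate first
            passes = [passes[i] for i in order]
            pipeline = compile_pipeline(preds)   # just-in-time recompilation

coarse_first = [lambda x: x < 90, lambda x: x % 7 == 0]
for out in run_adaptive([list(range(100))] * 3, coarse_first):
    print(len(out))   # same results; later batches run the reordered pipeline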
6.
Lutz, C., Breß, S., Zeuch, S., Rabl, T., Markl, V.: Pump Up the Volume: Processing Large Data on GPUs with Fast Interconnects. ACM SIGMOD/PODS International Conference on Management of Data, Portland, OR, USA (2020).
GPUs have long been discussed as accelerators for database query processing because of their high processing power and memory bandwidth. However, two main challenges limit the utility of GPUs for large-scale data processing: (1) the on-board memory capacity is too small to store large data sets, and (2) the interconnect bandwidth to CPU main memory is insufficient for ad-hoc data transfers. As a result, GPU-based systems and algorithms run into a transfer bottleneck and do not scale to large data sets. In practice, CPUs process large-scale data faster than GPUs with current technology. In this paper, we investigate how a fast interconnect can resolve these scalability limitations using the example of NVLink 2.0, a new interconnect technology that links dedicated GPUs to a CPU. The high bandwidth of NVLink 2.0 enables us to overcome the transfer bottleneck and to efficiently process large data sets stored in main memory on GPUs. We perform an in-depth analysis of NVLink 2.0 and show how we can scale a no-partitioning hash join beyond the limits of GPU memory. Our evaluation shows speedups of up to 18× over PCIe 3.0 and up to 7.3× over an optimized CPU implementation. Fast GPU interconnects thus enable GPUs to efficiently accelerate query processing.
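The out-of-core pattern behind such a join can be sketched schematically: the build side's hash table stays resident in (simulated) device memory while the larger probe side streams over the interconnect in chunks. This CPU-only Python sketch shows only the chunking logic; the chunk size and names are illustrative.

def hash_join_streaming(build, probe, chunk_tuples=4):
    # Build phase: one global hash table, as in a no-partitioning join.
    table = {}
    for key, payload in build:
        table.setdefault(key, []).append(payload)
    # Probe phase: stream the probe relation chunk by chunk, so data sets
    # larger than device memory can still be processed.
    for start in range(0, len(probe), chunk_tuples):
        chunk = probe[start:start + chunk_tuples]   # stands in for a DMA transfer
        for key, payload in chunk:
            for match in table.get(key, ()):
                yield (key, match, payload)

build = [(1, "a"), (2, "b")]
probe = [(1, "x"), (3, "y"), (2, "z"), (1, "w"), (2, "v")]
print(list(hash_join_streaming(build, probe)))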
7.
Del Monte, B., Zeuch, S., Rabl, T., Markl, V.: Rhino: Efficient Management of Very Large Distributed State for Stream Processing Engines. ACM SIGMOD/PODS International Conference on Management of Data, Portland, OR, USA (2020).
Scale-out stream processing engines (SPEs) power large big data applications on high-velocity data streams. Industrial setups require SPEs to sustain outages, varying data rates, and low-latency processing, and SPEs need to transparently reconfigure stateful queries during runtime. However, state-of-the-art SPEs are not yet ready to handle on-the-fly reconfigurations of queries with terabytes of state due to three problems: network overhead for state migration, consistency, and overhead on data processing. In this paper, we propose Rhino, a library for efficient reconfigurations of running queries in the presence of very large distributed state. Rhino provides a handover protocol and a state migration protocol to consistently and efficiently migrate stream processing among servers. Overall, our evaluation shows that Rhino scales to state sizes of up to terabytes, reconfigures a running query 15 times faster than the state of the art, and reduces latency by three orders of magnitude upon a reconfiguration.
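A control-flow skeleton of a handover-style migration, loosely inspired by the paper's description: state moves incrementally while the old instance keeps running, and processing switches over once the remaining delta is small. All class and method names are invented, and the consistency machinery is omitted.

class Operator:
    def __init__(self):
        self.state = {}       # keyed operator state (e.g., window contents)
        self.offset = 0       # position in the input stream
        self.running = True

    def pause(self):
        self.running = False

    def resume_from(self, offset):
        self.offset = offset
        self.running = True

def handover(old, new, batch_size=1000):
    # Phase 1: copy state in batches while `old` continues processing.
    migrated = set()
    while len(old.state) - len(migrated) > batch_size:
        for k in [k for k in old.state if k not in migrated][:batch_size]:
            new.state[k] = old.state[k]
            migrated.add(k)
    # Phase 2: short pause, ship the remaining (and any updated) entries,
    # then the new instance takes over at the old one's stream position.
    old.pause()
    new.state.update(old.state)
    new.resume_from(old.offset)

old, new = Operator(), Operator()
old.state = {i: i * i for i in range(2500)}
old.offset = 42
handover(old, new)
assert new.state == old.state and new.offset == 42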
8.
Kaitoua, A., Rabl, T., Markl, V.: A Distributed Data Exchange Engine for Polystores. it - Information Technology (2020).
There is increasing interest in fusing data from heterogeneous sources. Combining data sources increases the utility of existing datasets, generates new information, and creates services of higher quality. A central issue in working with heterogeneous sources is data migration: in order to share and process data in different engines, resource-intensive and complex movements and transformations between computing engines, services, and stores are necessary. Muses is a distributed, high-performance data migration engine that is able to interconnect distributed data stores by forwarding, transforming, repartitioning, or broadcasting data among distributed engines’ instances in a resource-, cost-, and performance-adaptive manner. As such, it performs seamless information sharing across all participating resources in a standard, modular manner. We show an overall improvement of 30% for pipelining jobs across multiple engines, even when we count the overhead of Muses in the execution time. This performance gain implies that Muses can be used to optimize large pipelines that leverage multiple engines.
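One exchange pattern Muses supports, hash repartitioning between differently sized sets of engine instances, can be sketched as follows; the function and its signature are invented for illustration.

def repartition(source_partitions, num_targets, key_fn=hash):
    # Route each record from its source partition to a target instance
    # chosen by hashing, decoupling upstream and downstream parallelism.
    targets = [[] for _ in range(num_targets)]
    for partition in source_partitions:
        for record in partition:
            targets[key_fn(record) % num_targets].append(record)
    return targets

sources = [[1, 2, 3], [4, 5, 6]]                      # two upstream instances
print(repartition(sources, 3, key_fn=lambda r: r))    # three downstream instances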
9.
Benson, L., Grulich, P.M., Zeuch, S., Markl, V., Rabl, T.: Disco: Efficient Distributed Window Aggregation. Proceedings of the 23rd International Conference on Extending Database Technology (EDBT). OpenProceedings.org (2020).
Many business applications benefit from fast analysis of online data streams. Modern stream processing engines (SPEs) provide complex window types and user-defined aggregation functions to analyze streams. While SPEs run in central data centers, wireless sensor networks (WSNs) perform distributed aggregations close to the data sources, which is especially beneficial in modern IoT setups. However, WSNs support only basic aggregations and windows. To bridge the gap between complex central aggregations and simple distributed analysis, we propose Disco, a distributed complex window aggregation approach. Disco processes complex window types on multiple independent nodes while efficiently aggregating incoming data streams. Our evaluation shows that Disco's throughput scales linearly with the number of nodes and that Disco already outperforms a centralized solution in a two-node setup. Furthermore, Disco significantly reduces network cost compared to the centralized approach. Disco's tree-like topology handles thousands of nodes per level and scales to support future data-intensive streaming applications.
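The core saving can be sketched in a few lines: child nodes pre-aggregate their local streams per window, and parents merge partial aggregates instead of raw tuples. The sketch below assumes tumbling windows and sums only; Disco supports far richer window types and aggregation functions, and all names are illustrative.

from collections import defaultdict

def local_partials(events, window_size):
    # events: (timestamp, value) pairs observed at one sensor node
    partials = defaultdict(int)
    for ts, v in events:
        partials[ts // window_size] += v    # per-window partial sum
    return partials

def merge(*node_partials):
    # Executed at a parent in the tree: combine children's partials;
    # only one small value per window crosses the network per child.
    merged = defaultdict(int)
    for partials in node_partials:
        for window, total in partials.items():
            merged[window] += total
    return merged

node_a = local_partials([(1, 5), (2, 3), (11, 7)], window_size=10)
node_b = local_partials([(4, 1), (12, 2)], window_size=10)
print(dict(merge(node_a, node_b)))   # {0: 9, 1: 9}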
10.
Makait, H.: Rethinking Message Brokers on RDMA and NVM. ACM SIGMOD/PODS International Conference on Management of Data, Portland, OR, USA (2020).
11.
Karimov, J., Rabl, T., Markl, V.: AJoin: Ad-hoc Stream Joins at Scale. Proceedings of the VLDB Endowment (2020).
The processing model of state-of-the-art stream processing engines is designed to execute long-running queries one at a time. However, with the advance of cloud technologies and multi-tenant systems, multiple users share the same cloud for stream query processing, which results in many ad-hoc stream queries sharing common stream sources. Many of these queries include joins. Two main limitations hinder ad-hoc stream join processing: missed optimization potential in both the stream data processing and query optimization layers, and the lack of dynamicity in query execution plans. We present AJoin, a dynamic and incremental ad-hoc stream join framework. AJoin consists of an optimization layer and a stream data processing layer. The optimization layer periodically reoptimizes the query execution plan, performing join reordering and vertical and horizontal scaling at run-time without stopping the execution. The data processing layer implements a pipeline-parallel join architecture. This layer enables incremental and consistent query processing, supporting all actions triggered by the optimizer. We implement AJoin on top of Apache Flink, an open-source data processing framework. AJoin outperforms Flink not only on ad-hoc multi-query workloads but also on single-query workloads.
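The sharing opportunity AJoin exploits can be illustrated with a toy batch example: the common equi-join on the shared sources is evaluated once, and each ad-hoc query applies only its residual predicate to the shared matches. AJoin's reoptimization and scaling logic is far more involved; the names below are invented.

def shared_join(left, right, queries):
    # Build once for the equi-join key shared by all registered queries.
    index = {}
    for l_key, l_val in left:
        index.setdefault(l_key, []).append(l_val)
    results = {q_id: [] for q_id, _ in queries}
    for r_key, r_val in right:
        for l_val in index.get(r_key, ()):
            # One join match, fanned out to every registered query.
            for q_id, predicate in queries:
                if predicate(l_val, r_val):
                    results[q_id].append((r_key, l_val, r_val))
    return results

left = [(1, 10), (2, 20)]
right = [(1, 100), (2, 5)]
queries = [("q1", lambda l, r: r > 50), ("q2", lambda l, r: True)]
print(shared_join(left, right, queries))
# q1 sees only (1, 10, 100); q2 sees both matches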
The processing model of state-of-the-art stream processing engines is designed to execute long-running queries one at a time. However, with the advance of cloud technologies and multi-tenant systems, multiple users share the same cloud for stream query processing. This results in many ad-hoc stream queries sharing common stream sources. Many of these queries include joins. There are two main limitations that hinder performing ad-hoc stream join processing. The first limitation is missed optimization potential both in stream data processing and query optimization layers. The second limitation is the lack of dynamicity in query execution plans. We present AJoin, a dynamic and incremental ad-hoc stream join framework. AJoin consists of an optimization layer and a stream data processing layer. The optimization layer periodically reoptimizes the query execution plan, performing join reordering and vertical and horizontal scaling at run-time without stopping the execution. The data processing layer implements pipeline-parallel join architecture. This layer enables incremental and consistent query processing supporting all the actions triggered by the optimizer. We implement AJoin on top of Apache Flink, an open-source data processing framework. AJoin outperforms Flink not only at ad-hoc multi-query workloads but also at single-query workloads.