[1]
Gévay, G.E., Rabl, T., Breß, S., Madai-Tahy, L. and Markl, V. Labyrinth: Compiling Imperative Control Flow to Parallel Dataflows. CoRR abs/1809.06845 (2018).
[2]
Poess, M., Ren, D.Q., Rabl, T. and Jacobsen, H.-A. Methods for Quantifying Energy Consumption in TPC-H. Proceedings of the 2018 ACM/SPEC International Conference on Performance Engineering, ICPE 2018, Berlin, Germany, April 09-13, 2018 (2018), 293–304.
Historically, performance and price-performance of computer systems have been the key purchasing arguments for customers. However, with rising energy costs and increasing power consumption due to the ever-growing demand for compute power (servers, storage, networks), electricity bills have become a significant expense for today’s data centers. In order to measure energy consumption in standardized ways, the Standard Performance Evaluation Corporation (SPEC) has developed a benchmark dedicated to measuring the power consumption of single servers (SPECpower_ssj2008), while the Transaction Processing Performance Council (TPC) and the Storage Performance Council (SPC) have developed general specifications that govern how energy is measured for any of their benchmarks. Energy reporting is optional in TPC and SPC results. While there are close to 600 SPECpower_ssj2008 results, there have been only three TPC and no SPC benchmark results published that report energy consumption. In this paper, we argue that the low number of TPC publications is due to the large setups required in TPC benchmarks and the consequently complicated measurement setup. Running on a typical big data setup, we evaluate two alternative methods to quantify energy consumption during TPC-H’s multi-user runs, namely by taking measurements of on-chip power sensors controlled through the Intelligent Platform Management Interface (IPMI) and by estimating power consumption via the nameplate power consumption method. We compare these latter two methods with power measurements taken from external power meters as required by SPEC and TPC benchmarks.
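As a rough illustration of the two alternatives compared in the paper, the following sketch contrasts a nameplate-style estimate (rated power times node count times runtime) with integrating sampled power readings over a run. The function names and all numbers are hypothetical and only meant to show the arithmetic, not the paper's measurement setup.

```python
# Hypothetical sketch: estimating energy for a benchmark run from nameplate
# ratings vs. integrating sampled power readings. Values are illustrative only.

def nameplate_energy_wh(nameplate_watts_per_node, num_nodes, run_hours, derating=1.0):
    """Upper-bound style estimate: rated power x node count x runtime."""
    return nameplate_watts_per_node * num_nodes * run_hours * derating

def metered_energy_wh(power_samples_w, sample_interval_s):
    """Integrate discrete power readings (e.g., from on-chip sensors or an
    external meter) over the run: sum(P_i) * dt, converted to watt-hours."""
    return sum(power_samples_w) * sample_interval_s / 3600.0

# Toy example: a 4-node cluster rated at 750 W per node during a 2-hour run.
estimate = nameplate_energy_wh(750, 4, 2.0)
samples = [420.0, 510.0, 495.0, 480.0]          # watts, read every 30 s (toy data)
measured = metered_energy_wh(samples, 30)
print(f"nameplate estimate: {estimate:.0f} Wh, metered: {measured:.1f} Wh")
```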
[3]
Boden, C., Rabl, T. and Markl, V. The Berlin Big Data Center (BBDC). it - Information Technology 60 (5-6) (2018), 321–326.
The last decade has been characterized by the collection and availability of unprecedented amounts of data due to rapidly decreasing storage costs and the omnipresence of sensors and data-producing global online services. In order to process and analyze this data deluge, novel distributed data processing systems resting on the paradigm of data flow, such as Apache Hadoop, Apache Spark, or Apache Flink, were built and have been scaled to tens of thousands of machines. However, writing efficient implementations of data analysis programs on these systems requires a deep understanding of systems programming, prohibiting large groups of data scientists and analysts from efficiently using this technology. In this article, we present some of the main achievements of the research carried out by the Berlin Big Data Center (BBDC). We introduce the two domain-specific languages Emma and LARA, which are deeply embedded in Scala and enable declarative specification and automatic parallelization of data analysis programs; the PEEL Framework for transparent and reproducible benchmark experiments on distributed data processing systems; and approaches to foster the interpretability of machine learning models. Finally, we provide an overview of the challenges to be addressed in the second phase of the BBDC.
[4]
Karimov, J., Rabl, T. and Markl, V. PolyBench: The First Benchmark for Polystores. Performance Evaluation and Benchmarking for the Era of Artificial Intelligence (2018), 24–41.
Modern business intelligence requires data processing not only across a huge variety of domains but also across different paradigms, such as relational, stream, and graph models. This variety is a challenge for existing systems, which typically support only a single or a few different data models. Polystores were proposed as a solution for this challenge and have received wide attention both in academia and in industry. These are systems that integrate different specialized data processing engines to enable fast processing of a large variety of data models. Yet, there is no standard to assess the performance of polystores. The goal of this work is to develop the first benchmark for polystores. To capture the flexibility of polystores, we focus on high-level features in order to enable execution of our benchmark suite on a large set of polystore solutions.
[5]
Boden, C., Rabl, T., Schelter, S. and Markl, V. Benchmarking Distributed Data Processing Systems for Machine Learning Workloads. Performance Evaluation and Benchmarking for the Era of Artificial Intelligence - 10th TPC Technology Conference, TPCTC 2018, Rio de Janeiro, Brazil, August 27-31, 2018, Revised Selected Papers (2018), 42–57.
[6]
Liu, Y., Guo, S., Hu, S., Rabl, T., Jacobsen, H.-A., Li, J. and Wang, J. Performance Evaluation and Optimization of Multi-Dimensional Indexes in Hive. IEEE Trans. Services Computing 11 (5) (2018), 835–849.
Apache Hive has been widely used for big data processing over large scale clusters by many companies. It provides a declarative query language called HiveQL. The efficiency of filtering out query-irrelevant data from HDFS closely affects the performance of query processing. This is especially true for multi-dimensional, highly selective queries involving few columns, which provide sufficient information to reduce the number of bytes read. Indexing (Compact Index, Aggregate Index, Bitmap Index, DGFIndex, and the index in the ORC file) and columnar storage (RCFile, ORC file, and Parquet) are powerful techniques to achieve this. However, it is not trivial to choose a suitable index and columnar storage format based on data and query features. In this paper, we compare the data filtering performance of the above indexes with different columnar storage formats by conducting comprehensive experiments using uniform and skewed TPC-H data sets and various multi-dimensional queries, and suggest best practices for improving multi-dimensional queries in Hive under different conditions.
[7]
Karimov, J., Rabl, T., Katsifodimos, A., Samarev, R., Heiskanen, H. and Markl, V. Benchmarking Distributed Stream Data Processing Engines. 34th IEEE International Conference on Data Engineering, ICDE 2018, Paris, France, April 16-19, 2018 (2018), 1507–1518.
The need for scalable and efficient stream analysis has led to the development of many open-source streaming data processing systems (SDPSs) with highly diverging capabilities and performance characteristics. While first initiatives try to compare the systems for simple workloads, there is a clear lack of detailed analyses of the systems’ performance characteristics. In this paper, we propose a framework for benchmarking distributed stream processing engines. We use our suite to evaluate the performance of three widely used SDPSs in detail, namely Apache Storm, Apache Spark, and Apache Flink. Our evaluation focuses in particular on measuring the throughput and latency of windowed operations, which are the basic type of operations in stream analytics. For this benchmark, we design workloads based on real-life, industrial use cases inspired by the online gaming industry. The contribution of our work is threefold. First, we give a definition of latency and throughput for stateful operators. Second, we carefully separate the system under test from the driver in order to correctly represent the open-world model of typical stream processing deployments and, therefore, measure system performance under realistic conditions. Third, we build the first benchmarking framework to define and test the sustainable performance of streaming systems. Our detailed evaluation highlights the individual characteristics and use cases of each system.
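The notion of sustainable performance can be illustrated with a small sketch: a driver searches for the highest ingestion rate at which the system under test keeps its event-time lag bounded. This is not the paper's benchmarking suite; the workload hook, the rate bounds, and the lag threshold below are hypothetical placeholders, and the "system" is only simulated so the example runs stand-alone.

```python
# Illustrative sketch (not the paper's driver): find the highest ingestion
# rate a streaming system can sustain without accumulating backlog.

SIMULATED_CAPACITY = 120_000   # events/s the toy "system" can keep up with

def run_workload(events_per_second: int) -> float:
    """Placeholder: return the event-time lag (seconds) observed when driving
    the system at the given rate for a fixed interval. Here we only simulate
    a system whose lag grows once its capacity is exceeded."""
    overload = max(0, events_per_second - SIMULATED_CAPACITY)
    return overload / 1_000.0

def sustainable_throughput(low: int, high: int, max_lag_s: float = 1.0) -> int:
    """Binary-search the largest rate whose lag stays below max_lag_s."""
    best = low
    while low <= high:
        rate = (low + high) // 2
        if run_workload(rate) <= max_lag_s:   # system keeps up at this rate
            best, low = rate, rate + 1
        else:                                  # backlog grows: rate too high
            high = rate - 1
    return best

print(sustainable_throughput(10_000, 1_000_000))   # highest sustainable rate
```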
[8]
Böhm, A. and Rabl, T. eds. Proceedings of the 7th International Workshop on Testing Database Systems, DBTest@SIGMOD 2018, Houston, TX, USA, June 15, 2018. ACM.
[9]
Traub, J., Grulich, P.M., Cuellar, A.R., Breß, S., Katsifodimos, A., Rabl, T. and Markl, V. Scotty: Efficient Window Aggregation for Out-of-Order Stream Processing. 34th IEEE International Conference on Data Engineering, ICDE 2018, Paris, France, April 16-19, 2018 (2018), 1300–1303.
Computing aggregates over windows is at the core of virtually every stream processing job. Typical stream processing applications involve overlapping windows and, therefore, cause redundant computations. Several techniques prevent this redundancy by sharing partial aggregates among windows. However, these techniques do not support out-of-order processing and session windows. Out-of-order processing is a key requirement to deal with delayed tuples in case of source failures such as temporary sensor outages. Session windows are widely used to separate different periods of user activity from each other. In this paper, we present Scotty, a high throughput operator for window discretization and aggregation. Scotty splits streams into non-overlapping slices and computes partial aggregates per slice. These partial aggregates are shared among all concurrent queries with arbitrary combinations of tumbling, sliding, and session windows. Scotty introduces the first slicing technique which (1) enables stream slicing for session windows in addition to tumbling and sliding windows and (2) processes out-of-order tuples efficiently. Our technique is generally applicable to a broad group of dataflow systems which use a unified batch and stream processing model. Our experiments show that we achieve a throughput an order of magnitude higher than alternative state-of-the-art solutions.
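To make the slicing idea concrete, here is a minimal sketch of slice-based aggregation for an associative function (sum): tuples are folded into non-overlapping slices, and a window result is obtained by combining the slices it covers, so a late tuple only touches its own slice. This is an illustration of the general technique, not Scotty's implementation or API; the slice length and the sample events are made up.

```python
from collections import defaultdict

# Slice-based window aggregation sketch: one partial aggregate per
# non-overlapping slice, windows answered by combining covered slices.

SLICE_MS = 1_000   # slice length, chosen for this toy example

def slice_index(event_time_ms: int) -> int:
    return event_time_ms // SLICE_MS

def ingest(partials: dict, event_time_ms: int, value: float) -> None:
    """Fold a tuple into the partial aggregate of its slice; an out-of-order
    tuple simply lands in an earlier slice instead of forcing recomputation."""
    partials[slice_index(event_time_ms)] += value

def window_result(partials: dict, start_ms: int, length_ms: int) -> float:
    """Combine the partial aggregates of all slices covered by the window."""
    first, last = start_ms // SLICE_MS, (start_ms + length_ms - 1) // SLICE_MS
    return sum(partials[s] for s in range(first, last + 1))

partials = defaultdict(float)
for t, v in [(100, 1.0), (2_500, 2.0), (1_200, 3.0), (900, 4.0)]:  # last two arrive late
    ingest(partials, t, v)
print(window_result(partials, 0, 3_000))   # one 3-second window -> 10.0
```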
[10]
Abedjan, Z., Breß, S., Markl, V., Rabl, T. and Soto, J. Data Management Systems Research at TU Berlin. SIGMOD Record 47 (4) (2018), 23–28.
Data management systems research at TU Berlin is spearheaded by the Database Systems and Information Management (DIMA) Group, the Big Data Management (BigDaMa) Group, as well as the affiliated Intelligent Analytics for Massive Data (IAM) Research Group at the German Research Center for Artificial Intelligence (DFKI). Jointly, our research activities encompass a wide variety of database topics, including benchmarking, data integration, modern hardware, and scalable data processing. As of Fall 2018, the team comprises three university professors, thirteen senior and postdoc researchers, twenty PhD students, and several research assistants. Among our notable accomplishments is the DFG-funded Stratosphere Research Unit, which laid the groundwork for what would later become Apache Flink. DIMA has also been leading the Berlin Big Data Center, one of only two BMBF-funded Big Data Competence Centers in Germany, since 2014. In addition, DIMA is co-directing the Berlin Center for Machine Learning, one of four BMBF-funded Machine Learning Competence Centers in Germany.
[11]
Grulich, P.M., Saitenmacher, R., Traub, J., Breß, S., Rabl, T. and Markl, V. Scalable Detection of Concept Drifts on Data Streams with Parallel Adaptive Windowing. Proceedings of the 21st International Conference on Extending Database Technology, EDBT 2018, Vienna, Austria, March 26-29, 2018 (2018), 477–480.
Machine learning techniques for data stream analysis suffer from concept drifts such as changed user preferences, varying weather conditions, or economic changes. These concept drifts cause wrong predictions and lead to incorrect business decisions. Concept drift detection methods such as adaptive windowing (Adwin) allow for adapting to concept drifts on the fly. In this paper, we examine Adwin in detail and point out its throughput bottlenecks. We then introduce several parallelization alternatives to address these bottlenecks. Our optimizations lead to a speedup of two orders of magnitude over the original Adwin implementation. Thus, we explore parallel adaptive windowing to provide scalable concept drift detection for high-velocity data streams with millions of tuples per second.
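For context, the sketch below shows the core adaptive-windowing idea in a simplified, sequential form: a window of recent values is cut whenever two sub-windows differ by more than a Hoeffding-style threshold, signalling a drift. It is deliberately naive (it rescans all split points) and does not reflect the exponential histograms of the original algorithm or the parallel variants proposed in the paper.

```python
import math
from collections import deque

# Simplified, sequential ADWIN-style drift detector (illustration only).
class SimpleAdwin:
    def __init__(self, delta: float = 0.002):
        self.delta = delta
        self.window = deque()

    def _cut_threshold(self, n0: int, n1: int) -> float:
        # Hoeffding-style bound on the allowed difference of sub-window means.
        m = 1.0 / (1.0 / n0 + 1.0 / n1)
        return math.sqrt((1.0 / (2.0 * m)) * math.log(4.0 / self.delta))

    def add(self, value: float) -> bool:
        """Insert a value; return True if a drift was detected (window shrunk)."""
        self.window.append(value)
        for split in range(1, len(self.window)):
            left = list(self.window)[:split]
            right = list(self.window)[split:]
            mean_gap = abs(sum(left) / len(left) - sum(right) / len(right))
            if mean_gap > self._cut_threshold(len(left), len(right)):
                for _ in range(split):          # drop the outdated prefix
                    self.window.popleft()
                return True
        return False
```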
[12]
Lutz, C., Breß, S., Rabl, T., Zeuch, S. and Markl, V. Efficient k-Means on GPUs. Proceedings of the 14th International Workshop on Data Management on New Hardware, Houston, TX, USA, June 11, 2018 (2018), 1–3.
k-Means is a versatile clustering algorithm widely used in practice. To cluster large data sets, state-of-the-art implementations use GPUs to shorten the data-to-knowledge time. These implementations commonly assign points on a GPU and update centroids on a CPU. We show that this approach has two main drawbacks. First, it separates the two algorithm phases over different processors, which requires an expensive data exchange between devices. Second, even when both phases are computed on the GPU, the same data are read twice per iteration, leading to inefficient use of memory bandwidth. In this paper, we describe a new approach that executes k-means in a single data pass per iteration. We propose a new algorithm to update centroids that allows us to perform both phases efficiently on GPUs. Thereby, we remove data transfers within each iteration. We fuse both phases to eliminate artificial synchronization barriers, and thus compute k-means in a single data pass. Overall, we achieve up to 20× higher throughput compared to the state-of-the-art approach.
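The fused single-pass iteration can be sketched on the CPU with NumPy (the paper's implementation targets GPUs): per-cluster sums and counts are accumulated while points are assigned, so each point is read only once per iteration. The code below is an illustrative reimplementation of the general idea, not the authors' kernel.

```python
import numpy as np

# Fused k-means iteration: assignment and centroid update in one data pass.
def fused_kmeans_iteration(points: np.ndarray, centroids: np.ndarray) -> np.ndarray:
    k, dim = centroids.shape
    sums = np.zeros((k, dim))
    counts = np.zeros(k, dtype=np.int64)
    for p in points:                                   # single pass over the data
        dists = np.linalg.norm(centroids - p, axis=1)  # assignment phase
        c = int(np.argmin(dists))
        sums[c] += p                                   # update phase, fused in
        counts[c] += 1
    new_centroids = centroids.copy()
    nonempty = counts > 0
    new_centroids[nonempty] = sums[nonempty] / counts[nonempty, None]
    return new_centroids

rng = np.random.default_rng(0)
pts = rng.normal(size=(1_000, 2))
cents = pts[rng.choice(len(pts), size=4, replace=False)]
for _ in range(10):
    cents = fused_kmeans_iteration(pts, cents)
```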
[13]
Kunft, A., Stadler, L., Bonetta, D., Basca, C., Meiners, J., Breß, S., Rabl, T., Fumero, J.J. and Markl, V. ScootR: Scaling R Dataframes on Dataflow Systems. Proceedings of the ACM Symposium on Cloud Computing, SoCC 2018, Carlsbad, CA, USA, October 11-13, 2018 (2018), 288–300.
To cope with today’s large scale of data, parallel dataflow engines such as Hadoop, and more recently Spark and Flink, have been proposed. They offer scalability and performance, but require data scientists to develop analysis pipelines in unfamiliar programming languages and abstractions. To overcome this hurdle, dataflow engines have introduced some forms of multi-language integration, e.g., for Python and R. However, this results in data exchange between the dataflow engine and the integrated language runtime, which requires inter-process communication and causes high runtime overheads. In this paper, we present ScootR, a novel approach to execute R in dataflow systems. ScootR tightly integrates the dataflow and R language runtimes by using the Truffle framework and the Graal compiler. As a result, ScootR executes R scripts directly in the Flink data processing engine, without serialization and inter-process communication. Our experimental study reveals that ScootR outperforms state-of-the-art systems by up to an order of magnitude.
[14]
Breß, S., Köcher, B., Funke, H., Zeuch, S., Rabl, T. and Markl, V. Generating Custom Code for Efficient Query Execution on Heterogeneous Processors. VLDB J. 27 (6) (2018), 797–822.
Processor manufacturers build increasingly specialized processors to mitigate the effects of the power wall in order to deliver improved performance. Currently, database engines have to be manually optimized for each processor, which is a costly and error-prone process. In this paper, we propose concepts to adapt to and to exploit the performance enhancements of modern processors automatically. Our core idea is to create processor-specific code variants and to learn a well-performing code variant for each processor. These code variants leverage various parallelization strategies and apply both generic and processor-specific code transformations. Our experimental results show that the performance of code variants may diverge by up to two orders of magnitude. In order to achieve peak performance, we generate custom code for each processor. We show that our approach finds an efficient custom code variant for multi-core CPUs, GPUs, and MICs.
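The variant-learning idea can be illustrated with a toy sketch: candidate implementations are micro-benchmarked once on the target processor and the fastest is kept for subsequent use. In the paper the variants are generated, processor-specific code for CPUs, GPUs, and MICs; here plain Python callables merely stand in for them, and all names are hypothetical.

```python
import time

# Toy illustration of learning a well-performing variant per processor:
# time each candidate once on a sample input and cache the winner.

def variant_loop(data):        # stand-in for e.g. a single-threaded code variant
    return sum(x * x for x in data)

def variant_builtin(data):     # stand-in for e.g. a vectorized/parallel variant
    return sum(map(lambda x: x * x, data))

def learn_best_variant(variants, sample):
    timings = {}
    for name, fn in variants.items():
        start = time.perf_counter()
        fn(sample)
        timings[name] = time.perf_counter() - start
    return min(timings, key=timings.get)

variants = {"loop": variant_loop, "builtin": variant_builtin}
best = learn_best_variant(variants, list(range(100_000)))
print("selected variant:", best)
```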
[15]
Sakr, S., Rabl, T., Hirzel, M., Carbone, P. and Strohbach, M. Dagstuhl Seminar on Big Stream Processing. SIGMOD Record 47 (3) (2018), 36–39.
Stream processing can generate insights from big data in real time as it is being produced. This paper reports findings from a 2017 seminar on big stream processing, focusing on applications, systems, and languages.
[16]
Lutz, C., Breß, S., Rabl, T., Zeuch, S. and Markl, V. Efficient and Scalable k-Means on GPUs. Datenbank-Spektrum 18 (3) (2018), 157–169.
k-Means is a versatile clustering algorithm widely used in practice. To cluster large data sets, state-of-the-art implementations use GPUs to shorten the data-to-knowledge time. These implementations commonly assign points on a GPU and update centroids on a CPU. We identify two main shortcomings of this approach. First, it requires expensive data exchange between processors when switching between the two processing steps, point assignment and centroid update. Second, even when processing both steps of k-means on the same processor, points still need to be read two times within an iteration, leading to inefficient use of memory bandwidth. In this paper, we present a novel approach for centroid update that allows us to efficiently process both phases of k-means on GPUs. We fuse point assignment and centroid update to execute one iteration with a single pass over the points. Our evaluation shows that our k-means approach scales to very large data sets. Overall, we achieve up to 20× higher throughput compared to the state-of-the-art approach.
[17]
Poess, M., Nambiar, R., Kulkarni, K., Narasimhadevara, C., Rabl, T. and Jacobsen, H.-A. Analysis of TPCx-IoT: The First Industry Standard Benchmark for IoT Gateway Systems. 34th IEEE International Conference on Data Engineering, ICDE 2018, Paris, France, April 16-19, 2018 (2018), 1519–1530.
By 2020 it is estimated that 20 billion devices will be connected to the Internet. While the initial hype around this Internet of Things (IoT) stems from consumer use cases, the number of devices and data from enterprise use cases is significant in terms of market share. With companies being challenged to choose the right digital infrastructure from different providers, there is a pressing need to objectively measure the hardware, operating system, data storage, and data management systems that can ingest, persist, and process the massive amounts of data arriving from sensors (edge devices). The Transaction Processing Performance Council (TPC) recently released the first industry standard benchmark for measuring the performance of gateway systems, TPCx-IoT. In this paper, we provide a detailed description of TPCx-IoT, discuss the design decisions behind key elements of this benchmark, and experimentally analyze how TPCx-IoT measures the performance of IoT gateway systems.