We try to keep an up-to-date list of all our publications. If you are interested in a PDF that we have not uploaded yet, feel free to send us an email to get a copy. You will find all recent publications below; for older ones, please click the appropriate year.
The digital revolution, rapidly decreasing storage cost, and remarkable results achieved by state-of-the-art machine learning (ML) methods are driving widespread adoption of ML approaches. While notable recent efforts to benchmark ML methods for canonical tasks exist, none of them address the challenges arising with the increasing pervasiveness of end-to-end ML deployments. The challenges involved in successfully applying ML methods in diverse enterprise settings extend far beyond efficient model training. In this paper, we present our work in benchmarking advanced data analytics systems and lay the foundation for an industry-standard machine learning benchmark. Unlike previous approaches, we aim to cover the complete end-to-end ML pipeline for diverse, industry-relevant application domains rather than evaluating only training performance. To this end, we present reference implementations of complete ML pipelines including corresponding metrics and run rules, and evaluate them at different scales in terms of hardware, software, and problem size.
Kaitoua, A., Rabl, T., Katsifodimos, A., Markl, V.: Muses: Distributed Data Migration System for Polystores. 35th IEEE International Conference on Data Engineering, ICDE 2019, Macao, Macao, April 8-11, 2019. pp. 1602-1605. IEEE (2019).
Large datasets can originate from various sources and are stored in heterogeneous formats, schemas, and locations. Typical data science tasks need to combine those datasets in order to increase their value and extract knowledge. This is done in various data processing systems with diverse execution engines. In order to take advantage of each execution engine's characteristics and APIs, data scientists need to migrate and transform their datasets at a very high cost in computation and manual labor. Data migration is challenging for two main reasons: i) execution engines expect specific types/shapes of the data as input; ii) there are various physical representations of the data (e.g., partitions). Therefore, migrating data efficiently requires knowledge of system internals and assumptions. In this paper, we present Muses, a distributed, high-performance data migration engine that is able to forward, transform, repartition, and broadcast data between distributed engines' instances efficiently. Muses does not require any changes in the underlying execution engines. In an experimental evaluation, we show that migrating data from one execution engine to another (in order to take advantage of faster, native operations) can increase a pipeline's performance by 30%.
Karimov, J., Rabl, T., Markl, V.: AStream: Ad-hoc Shared Stream Processing. Proceedings of the 2019 International Conference on Management of Data. pp. 607-622. ACM (2019).
In the last decade, many distributed stream processing engines (SPEs) were developed to perform continuous queries on massive online data. The central design principle of these engines is to handle queries that potentially run forever on data streams with a query-at-a-time model, i.e., each query is optimized and executed separately. In many real applications, streams are not only processed with long-running queries, but also thousands of short-running ad-hoc queries. To support this efficiently, it is essential to share resources and computation for stream ad-hoc queries in a multi-user environment. The goal of this paper is to bridge the gap between stream processing and ad-hoc queries in SPEs by sharing computation and resources. We define three main requirements for ad-hoc shared stream processing: (1) Integration: Ad-hoc query processing should be a composable layer which can extend stream operators, such as join, aggregation, and window operators; (2) Consistency: Ad-hoc query creation and deletion must be performed in a consistent manner and ensure exactly-once semantics and correctness; (3) Performance: In contrast to state-of-the-art SPEs, an ad-hoc SPE should not only maximize data throughput but also query throughput via incremental computation and resource sharing. Based on these requirements, we have developed AStream, an ad-hoc, shared computation stream processing framework. To the best of our knowledge, AStream is the first system that supports distributed ad-hoc stream processing. AStream is built on top of Apache Flink. Our experiments show that AStream matches Flink for single-query deployments and outperforms it by orders of magnitude with multiple queries.
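The shared-computation idea behind the abstract can be illustrated with a toy sketch (this is not AStream's actual implementation; all function and query names are hypothetical): a single pass over each window routes every tuple to all active ad-hoc queries, so registering an additional query does not add another full scan of the data.

```python
def shared_window_count(batch, queries):
    """Process the window once: route each tuple to every active
    ad-hoc query whose predicate it satisfies, instead of running
    one full scan per query."""
    counts = {qid: 0 for qid in queries}
    for tup in batch:                      # single pass over the data
        for qid, predicate in queries.items():
            if predicate(tup):
                counts[qid] += 1
    return counts

window = list(range(100))
queries = {
    "q_even": lambda x: x % 2 == 0,
    "q_big":  lambda x: x >= 90,
}
result = shared_window_count(window, queries)

# ad-hoc queries can be added (or removed) between windows
# without restarting the long-running computation
queries["q_small"] = lambda x: x < 10
result2 = shared_window_count(window, queries)
```

The sketch ignores consistency and exactly-once semantics, which the paper identifies as separate requirements; it only shows why sharing the scan makes query throughput scale better than a query-at-a-time model.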
Kunft, A., Katsifodimos, A., Schelter, S., Breß, S., Rabl, T., Markl, V.: An Intermediate Representation for Optimizing Machine Learning Pipelines. Proceedings of the VLDB Endowment. 12, 1553-1567 (2019).
Machine learning (ML) pipelines for model training and validation typically include preprocessing, such as data cleaning and feature engineering, prior to training an ML model. Preprocessing combines relational algebra and user-defined functions (UDFs), while model training uses iterations and linear algebra. Current systems are tailored to either of the two. As a consequence, preprocessing and ML steps are optimized in isolation. To enable holistic optimization of ML training pipelines, we present Lara, a declarative domain-specific language for collections and matrices. Lara's intermediate representation (IR) reflects on the complete program, i.e., UDFs, control flow, and both data types. Two views on the IR enable diverse optimizations. Monads enable operator pushdown and fusion across type and loop boundaries. Combinators provide the semantics of domain-specific operators and optimize data access and cross-validation of ML algorithms. Our experiments on preprocessing pipelines and selected ML algorithms show the effects of our proposed optimizations on dense and sparse data, achieving speedups of up to an order of magnitude.
Rosenfeld, V., Breß, S., Zeuch, S., Rabl, T., Markl, V.: Performance Analysis and Automatic Tuning of Hash Aggregation on GPUs.International Workshop on Data Management on New Hardware (DaMoN). ACM (2019).
Hash aggregation is an important data processing primitive which can be significantly accelerated by modern graphics processors (GPUs). Previous work derived heuristics for GPU-accelerated hash aggregation from the study of a particular GPU. In this paper, we examine the influence of different execution parameters on GPU-accelerated hash aggregation on four NVIDIA and two AMD GPUs based on six different microarchitectures. While we are able to replicate some of the previous results, our main finding is that optimal execution parameters are highly GPU-dependent. Most importantly, execution parameters optimized for a specific GPU are up to 21× slower on other GPUs. Given this hardware dependency, we present an algorithm to optimize execution parameters at runtime. On average, our algorithm converges on a result in less than 1% of the time required for a full evaluation of the search space. In this time, it finds execution parameters that are at most 1% slower than the optimum in 90% of our experiments. In the worst case, our algorithm finds execution parameters that are at most 1.29× slower than the optimum.
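The paper's tuning algorithm is not reproduced here, but a coordinate-wise greedy search conveys the general idea of converging on good execution parameters while evaluating only a small fraction of the search space. The parameter names and the synthetic cost function below are purely illustrative; a real tuner would time actual GPU kernel launches instead.

```python
def measure(params):
    """Stand-in for a real benchmark run. In practice this would
    launch the hash-aggregation kernel with the given parameters
    and return the measured runtime; here it is a synthetic cost
    with its minimum at (64, 256)."""
    wg, tpw = params
    return abs(wg - 64) + abs(tpw - 256) / 8

def greedy_tune(space, measure):
    """Coordinate-wise greedy search: optimize one parameter at a
    time, keeping the others fixed, so only a tiny fraction of the
    full cross product is ever evaluated."""
    current = [dim[0] for dim in space]
    evaluated = 0
    for i, dim in enumerate(space):
        best_val, best_cost = current[i], float("inf")
        for v in dim:
            current[i] = v
            cost = measure(tuple(current))
            evaluated += 1
            if cost < best_cost:
                best_val, best_cost = v, cost
        current[i] = best_val
    return tuple(current), evaluated

space = [[16, 32, 64, 128, 256],         # e.g., number of work groups
         [32, 64, 128, 256, 512, 1024]]  # e.g., threads per work group
params, runs = greedy_tune(space, measure)
# full grid has 5 * 6 = 30 points; the greedy search evaluates 11
```

Greedy coordinate search can miss the optimum when parameters interact, which is one reason an adaptive runtime algorithm like the one the paper proposes is needed rather than a fixed heuristic.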
Grulich, P.M., Traub, J., Breß, S., Katsifodimos, A., Markl, V., Rabl, T.: Poster: Generating Reproducible Out-of-Order Data Streams (2019).
Evaluating modern stream processing systems in a reproducible manner requires data streams with different data distributions, data rates, and real-world characteristics such as delayed and out-of-order tuples. In this paper, we present an open source stream generator which generates reproducible and deterministic out-of-order streams based on real data files, simulating arbitrary fractions of out-of-order tuples and their respective delays.
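A minimal sketch of the core mechanism, seeded delay assignment so the resulting stream is deterministic and thus reproducible (this is an illustration of the idea, not the actual generator; names and parameters are hypothetical):

```python
import random

def out_of_order_stream(tuples, fraction, max_delay, seed=42):
    """Reassign arrival order so that roughly `fraction` of the
    tuples is delayed by up to `max_delay` positions. A fixed seed
    makes the output deterministic across runs."""
    rng = random.Random(seed)
    arrivals = []
    for pos, t in enumerate(tuples):
        delay = rng.randint(1, max_delay) if rng.random() < fraction else 0
        arrivals.append((pos + delay, pos, t))
    arrivals.sort()                       # sort by (arrival, event) order
    return [t for _, _, t in arrivals]

events = list(range(20))                  # tuples in event-time order
stream = out_of_order_stream(events, fraction=0.3, max_delay=5)
```

Because the generator is seeded, two runs with the same inputs produce byte-identical streams, which is what makes benchmark results comparable across systems.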
Margara, A., Rabl, T.: Definition of Data Streams. Encyclopedia of Big Data Technologies (2019).
Today machine learning is entering many business and scientific applications. The life cycle of machine learning applications consists of data preprocessing for transforming the raw data into features, training a model using the features, and deploying the model for answering prediction queries. In order to guarantee accurate predictions, one has to continuously monitor and update the deployed model and pipeline. Current deployment platforms update the model using online learning methods. When online learning alone is not adequate to guarantee the prediction accuracy, some deployment platforms provide a mechanism for automatic or manual retraining of the model. While the online training is fast, the retraining of the model is time-consuming and adds extra overhead and complexity to the process of deployment. We propose a novel continuous deployment approach for updating the deployed model using a combination of the incoming real-time data and the historical data. We utilize sampling techniques to include the historical data in the training process, thus eliminating the need for retraining the deployed model. We also offer online statistics computation and dynamic materialization of the preprocessed features, which further reduces the total training and data preprocessing time. In our experiments, we design and deploy two pipelines and models to process two real-world datasets. The experiments show that continuous deployment reduces the total training cost up to 15 times while providing the same level of quality when compared to the state-of-the-art deployment approaches.
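The sampling idea can be sketched with standard reservoir sampling (Algorithm R): keep a fixed-size uniform sample over all historical data, and build each update batch from fresh tuples plus that sample instead of re-reading the full history. This illustrates the general technique only; it is not the paper's actual sampling method, and the class and function names are hypothetical.

```python
import random

class Reservoir:
    """Fixed-size uniform sample over all items seen so far
    (Algorithm R). Lets an update step train on a mix of fresh
    tuples and a representative slice of history without a full
    retraining pass over the historical data."""
    def __init__(self, size, seed=0):
        self.size, self.items, self.seen = size, [], 0
        self.rng = random.Random(seed)

    def add(self, item):
        self.seen += 1
        if len(self.items) < self.size:
            self.items.append(item)
        else:
            j = self.rng.randrange(self.seen)   # replace with prob size/seen
            if j < self.size:
                self.items[j] = item

def training_batch(reservoir, fresh):
    """One continuous-deployment update: incoming real-time tuples
    combined with sampled historical tuples."""
    return fresh + list(reservoir.items)

res = Reservoir(size=100)
for x in range(10_000):                    # historical stream
    res.add(x)
batch = training_batch(res, fresh=list(range(10_000, 10_016)))
```

The reservoir is updated in O(1) per tuple, so maintaining the historical sample adds negligible overhead to the ingestion path.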
Traub, J., Grulich, P.M., Cuellar, A.R., Breß, S., Katsifodimos, A., Rabl, T., Markl, V.: Efficient Window Aggregation with General Stream Slicing. Advances in Database Technology - 22nd International Conference on Extending Database Technology, EDBT 2019, Lisbon, Portugal, March 26-29, 2019. pp. 97-108 (2019).
Window aggregation is a core operation in data stream processing. Existing aggregation techniques focus on reducing latency, eliminating redundant computations, and minimizing memory usage. However, each technique operates under different assumptions with respect to workload characteristics such as properties of aggregation functions (e.g., invertible, associative), window types (e.g., sliding, sessions), windowing measures (e.g., time- or count-based), and stream (dis)order. Violating the assumptions of a technique can render it unusable or drastically reduce its performance. In this paper, we present the first general stream slicing technique for window aggregation. General stream slicing automatically adapts to workload characteristics to improve performance without sacrificing its general applicability. As a prerequisite, we identify workload characteristics which affect the performance and applicability of aggregation techniques. Our experiments show that general stream slicing outperforms alternative concepts by up to one order of magnitude.
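The core idea of stream slicing, aggregating each tuple exactly once into a non-overlapping slice and then assembling overlapping windows from the slice aggregates, can be sketched for an associative function like sum. This toy example covers only time-based sliding windows, not the general framework of the paper; names and parameters are illustrative.

```python
def slice_aggregate(events, slice_size):
    """Partial aggregates per non-overlapping slice: each input
    tuple is touched exactly once, regardless of window overlap."""
    slices = {}
    for ts, value in events:
        key = ts // slice_size
        slices[key] = slices.get(key, 0) + value
    return slices

def sliding_windows(slices, slice_size, window_slices):
    """Each sliding window (length = window_slices * slice_size,
    slide = slice_size) is the sum of adjacent slice aggregates,
    so partial results are shared across overlapping windows."""
    out = {}
    for k in sorted(slices):
        start = k - window_slices + 1
        window_end = (k + 1) * slice_size
        out[window_end] = sum(slices.get(i, 0)
                              for i in range(start, k + 1))
    return out

# window length 10, slide 5 -> slice size 5, two slices per window
events = [(t, 1) for t in range(20)]       # one tuple per time unit
slices = slice_aggregate(events, slice_size=5)
windows = sliding_windows(slices, slice_size=5, window_slices=2)
```

With slicing, each tuple is aggregated once instead of once per overlapping window, which is where the performance gains over naive per-window recomputation come from; handling non-associative functions, sessions, and out-of-order streams is what the general technique in the paper adds on top.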