We try to keep an up-to-date list of all our publications. If you are interested in a PDF that we have not uploaded yet, feel free to send us an email to get a copy. All recent publications are listed below; for older ones, please select the appropriate year.
Large datasets can originate from various sources and are stored in heterogeneous formats, schemas, and locations. Typical data science tasks need to combine these datasets in order to increase their value and extract knowledge. This is done in various data processing systems with diverse execution engines. To take advantage of each execution engine's characteristics and APIs, data scientists need to migrate and transform their datasets, at high computational cost and with considerable manual labor. Data migration is challenging for two main reasons: i) execution engines expect specific types/shapes of the data as input; ii) there are various physical representations of the data (e.g., partitions). Therefore, migrating data efficiently requires knowledge of system internals and assumptions. In this paper we present Muses, a distributed, high-performance data migration engine that can forward, transform, repartition, and broadcast data between distributed engines' instances efficiently. Muses does not require any changes to the underlying execution engines. In an experimental evaluation, we show that migrating data from one execution engine to another (in order to take advantage of faster, native operations) can increase a pipeline's performance by 30%.
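To illustrate the kind of work such a migration engine performs, here is a minimal, hypothetical sketch (not Muses' actual implementation) of hash-repartitioning records from a source engine's partition layout into the layout a target engine expects; the function name, `key_fn`, and the partition counts are assumptions for illustration.

```python
from collections import defaultdict

def repartition(source_partitions, num_target_partitions, key_fn):
    """Redistribute records from the source engine's partitions into the
    hash-partitioned layout the target engine expects."""
    targets = defaultdict(list)
    for partition in source_partitions:
        for record in partition:
            # route each record by the hash of its key
            targets[hash(key_fn(record)) % num_target_partitions].append(record)
    return [targets[i] for i in range(num_target_partitions)]

# Example: migrate three source partitions into two target partitions, keyed by user id.
source = [[("alice", 3), ("bob", 1)], [("carol", 7)], [("alice", 2)]]
print(repartition(source, 2, key_fn=lambda record: record[0]))
```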
In the last decade, many distributed stream processing engines (SPEs) were developed to perform continuous queries on massive online data. The central design principle of these engines is to handle queries that potentially run forever on data streams with a query-at-a-time model, i.e., each query is optimized and executed separately. In many real applications, streams are not only processed with long-running queries, but also with thousands of short-running ad-hoc queries. To support this efficiently, it is essential to share resources and computation for ad-hoc stream queries in a multi-user environment. The goal of this paper is to bridge the gap between stream processing and ad-hoc queries in SPEs by sharing computation and resources. We define three main requirements for ad-hoc shared stream processing: (1) Integration: ad-hoc query processing should be a composable layer which can extend stream operators, such as join, aggregation, and window operators; (2) Consistency: ad-hoc query creation and deletion must be performed in a consistent manner, ensuring exactly-once semantics and correctness; (3) Performance: in contrast to state-of-the-art SPEs, an ad-hoc SPE should maximize not only data throughput but also query throughput via incremental computation and resource sharing. Based on these requirements, we have developed AStream, an ad-hoc, shared computation stream processing framework. To the best of our knowledge, AStream is the first system that supports distributed ad-hoc stream processing. AStream is built on top of Apache Flink. Our experiments show that AStream achieves results comparable to Flink for single-query deployments and outperforms it by orders of magnitude with multiple queries.
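The core idea of shared computation can be sketched as follows (a simplified illustration, not AStream's actual design): rather than running N ad-hoc aggregation queries as N separate jobs, evaluate all active query predicates in a single pass over each window's data. The query set and predicates below are made up.

```python
from collections import defaultdict

# hypothetical set of currently active ad-hoc queries, each a filter predicate
active_queries = {
    "q1": lambda t: t["price"] > 10,
    "q2": lambda t: t["symbol"] == "ACME",
}

def shared_window_count(window_tuples):
    """Compute a per-query count aggregate with one pass over the window,
    shared across all active queries, instead of one pass per query."""
    counts = defaultdict(int)
    for t in window_tuples:
        for qid, predicate in active_queries.items():
            if predicate(t):
                counts[qid] += 1
    return dict(counts)

window = [{"symbol": "ACME", "price": 12}, {"symbol": "INIT", "price": 5}]
print(shared_window_count(window))  # {'q1': 1, 'q2': 1}
```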
An Intermediate Representation for Optimizing Machine Learning Pipelines. Kunft, Andreas; Katsifodimos, Asterios; Schelter, Sebastian; Breß, Sebastian; Rabl, Tilmann; Markl, Volker in Proceedings of the VLDB Endowment (2019). 12(11) 1553-1567.
Machine learning (ML) pipelines for model training and validation typically include preprocessing, such as data cleaning and feature engineering, prior to training an ML model. Preprocessing combines relational algebra and user-defined functions (UDFs), while model training uses iterations and linear algebra. Current systems are tailored to either of the two. As a consequence, preprocessing and ML steps are optimized in isolation. To enable holistic optimization of ML training pipelines, we present Lara, a declarative domain-specific language for collections and matrices. Lara's intermediate representation (IR) reflects on the complete program, i.e., UDFs, control flow, and both data types. Two views on the IR enable diverse optimizations. Monads enable operator pushdown and fusion across type and loop boundaries. Combinators provide the semantics of domain-specific operators and optimize data access and cross-validation of ML algorithms. Our experiments on preprocessing pipelines and selected ML algorithms show the effects of our proposed optimizations on dense and sparse data, achieving speedups of up to an order of magnitude.
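The following toy example illustrates the kind of rewrite such an IR enables (this is an illustration of operator fusion in plain Python, not Lara itself): a preprocessing pipeline written as separate cleaning and feature-extraction passes can be fused into a single traversal before the data is handed to the ML step.

```python
def unfused(rows):
    cleaned = [r for r in rows if r["age"] is not None]    # pass 1: data cleaning
    features = [[r["age"], r["income"]] for r in cleaned]  # pass 2: feature extraction
    return features

def fused(rows):
    # one traversal: filter and projection fused across the operator boundary
    return [[r["age"], r["income"]] for r in rows if r["age"] is not None]

rows = [{"age": 34, "income": 55000}, {"age": None, "income": 42000}]
assert unfused(rows) == fused(rows)  # same result, one pass over the data
```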
Performance Analysis and Automatic Tuning of Hash Aggregation on GPUs. Rosenfeld, Viktor; Breß, Sebastian; Zeuch, Steffen; Rabl, Tilmann; Markl, Volker (2019).
Hash aggregation is an important data processing primitive which can be significantly accelerated by modern graphics processors (GPUs). Previous work derived heuristics for GPU-accelerated hash aggregation from the study of a particular GPU. In this paper, we examine the influence of different execution parameters on GPU-accelerated hash aggregation on four NVIDIA and two AMD GPUs based on six different microarchitectures. While we are able to replicate some of the previous results, our main finding is that optimal execution parameters are highly GPU-dependent. Most importantly, execution parameters optimized for a specific GPU are up to 21× slower on other GPUs. Given this hardware dependency, we present an algorithm to optimize execution parameters at runtime. On average, our algorithm converges on a result in less than 1% of the time required for a full evaluation of the search space. In this time, it finds execution parameters that are at most 1% slower than the optimum in 90% of our experiments. In the worst case, our algorithm finds execution parameters that are at most 1.29× slower than the optimum.
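As a rough illustration of runtime parameter tuning (a generic coordinate-descent-style search, not the paper's algorithm), the sketch below optimizes one execution parameter at a time while holding the others fixed; `measure` is a stand-in for timing a kernel launch, and the parameter names and value ranges are made up.

```python
import random

def measure(work_groups, items_per_thread):
    # stand-in for launching the hash aggregation kernel and timing it
    return abs(work_groups - 64) + abs(items_per_thread - 4) + random.random()

space = {"work_groups": [8, 16, 32, 64, 128, 256],
         "items_per_thread": [1, 2, 4, 8, 16]}

def tune(space, rounds=2):
    current = {name: values[0] for name, values in space.items()}
    for _ in range(rounds):
        for name, values in space.items():
            # optimize one parameter at a time, keeping the others fixed
            current[name] = min(values, key=lambda v: measure(**{**current, name: v}))
    return current

print(tune(space))  # evaluates far fewer configurations than the full 6x5 grid
```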
Poster: Generating Reproducible Out-of-Order Data Streams. Grulich, Philipp M.; Traub, Jonas; Breß, Sebastian; Katsifodimos, Asterios; Markl, Volker; Rabl, Tilmann (2019). 256-257.
Evaluating modern stream processing systems in a reproducible manner requires data streams with different data distributions, data rates, and real-world characteristics such as delayed and out-of-order tuples. In this paper, we present an open source stream generator which generates reproducible and deterministic out-of-order streams based on real data files, simulating arbitrary fractions of out-of-order tuples and their respective delays.
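A minimal sketch of the underlying idea, assuming a seeded random generator for determinism (this is an illustration, not the authors' generator): a fixed seed makes both the choice of delayed tuples and their delays reproducible across runs.

```python
import random

def out_of_order_stream(tuples, fraction=0.2, max_delay=5, seed=42):
    """Delay a given fraction of tuples deterministically; the fixed seed
    makes the generated disorder reproducible across runs."""
    rng = random.Random(seed)
    delayed = []
    for event_time, payload in tuples:
        if rng.random() < fraction:                  # this tuple arrives late
            emit_time = event_time + rng.randint(1, max_delay)
        else:
            emit_time = event_time
        delayed.append((emit_time, event_time, payload))
    # sort by emission time: late tuples now appear out of event-time order
    return sorted(delayed, key=lambda t: t[0])

stream = [(i, f"tuple-{i}") for i in range(10)]
for emit_time, event_time, payload in out_of_order_stream(stream):
    print(emit_time, event_time, payload)
```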
Definition of Data Streams. Margara, Alessandro; Rabl, Tilmann in Encyclopedia of Big Data Technologies (2019).
Today machine learning is entering many business and scientific applications. The life cycle of machine learning applications consists of data preprocessing for transforming the raw data into features, training a model using the features, and deploying the model for answering prediction queries. In order to guarantee accurate predictions, one has to continuously monitor and update the deployed model and pipeline. Current deployment platforms update the model using online learning methods. When online learning alone is not adequate to guarantee the prediction accuracy, some deployment platforms provide a mechanism for automatic or manual retraining of the model. While online training is fast, retraining the model is time-consuming and adds extra overhead and complexity to the deployment process. We propose a novel continuous deployment approach for updating the deployed model using a combination of the incoming real-time data and the historical data. We utilize sampling techniques to include the historical data in the training process, thus eliminating the need for retraining the deployed model. We also offer online statistics computation and dynamic materialization of the preprocessed features, which further reduces the total training and data preprocessing time. In our experiments, we design and deploy two pipelines and models to process two real-world datasets. The experiments show that continuous deployment reduces the total training cost by up to 15 times while providing the same level of quality when compared to state-of-the-art deployment approaches.
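A hedged sketch of the sampling idea (the class, its interface, and the `partial_fit` hook are assumptions, not the paper's implementation): keep a reservoir sample of historical training data and mix it into each incremental update, so the deployed model is refreshed without a full retraining pass over all historical data.

```python
import random

class ReservoirTrainer:
    def __init__(self, model, capacity=1000, seed=7):
        self.model, self.capacity = model, capacity
        self.reservoir, self.seen = [], 0
        self.rng = random.Random(seed)

    def _sample(self, item):
        # classic reservoir sampling: keep a uniform sample of all history
        self.seen += 1
        if len(self.reservoir) < self.capacity:
            self.reservoir.append(item)
        else:
            j = self.rng.randrange(self.seen)
            if j < self.capacity:
                self.reservoir[j] = item

    def update(self, batch):
        for item in batch:
            self._sample(item)
        # train on fresh data plus a sample of history instead of all history
        self.model.partial_fit(batch + self.reservoir)

class DummyModel:
    def partial_fit(self, data):       # stand-in for an online learner
        print(f"updating on {len(data)} records")

trainer = ReservoirTrainer(DummyModel(), capacity=3)
trainer.update([1, 2, 3, 4])
trainer.update([5, 6])
```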
Efficient Window Aggregation with General Stream Slicing. Traub, Jonas; Grulich, Philipp M.; Cuellar, Alejandro Rodriguez; Breß, Sebastian; Katsifodimos, Asterios; Rabl, Tilmann; Markl, Volker (2019). 97-108.
Window aggregation is a core operation in data stream processing. Existing aggregation techniques focus on reducing latency, eliminating redundant computations, and minimizing memory usage. However, each technique operates under different assumptions with respect to workload characteristics such as properties of aggregation functions (e.g., invertible, associative), window types (e.g., sliding, sessions), windowing measures (e.g., time- or count-based), and stream (dis)order. Violating the assumptions of a technique can render it unusable or drastically reduce its performance. In this paper, we present the first general stream slicing technique for window aggregation. General stream slicing automatically adapts to workload characteristics to improve performance without sacrificing its general applicability. As a prerequisite, we identify workload characteristics which affect the performance and applicability of aggregation techniques. Our experiments show that general stream slicing outperforms alternative concepts by up to one order of magnitude.
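The basic slicing idea can be sketched as follows (a simplified illustration of stream slicing for tumbling slices and sums, not the paper's general framework): each tuple is aggregated into exactly one non-overlapping slice, and each window result is composed from a few slice aggregates, so overlapping sliding windows share partial aggregates instead of recomputing them.

```python
def slice_stream(tuples, slice_size):
    """Aggregate each tuple into its non-overlapping slice (per-slice sums)."""
    slices = {}
    for event_time, value in tuples:
        idx = event_time // slice_size
        slices[idx] = slices.get(idx, 0) + value
    return slices

def window_sums(slices, slice_size, window_size, slide):
    """Compose sliding-window sums from shared slice aggregates.
    Assumes slide and window_size are multiples of slice_size."""
    slices_per_window = window_size // slice_size
    end = (max(slices) + 1) * slice_size
    for start in range(0, end - window_size + 1, slide):
        first = start // slice_size
        yield start, sum(slices.get(first + i, 0) for i in range(slices_per_window))

tuples = [(t, 1) for t in range(20)]            # one tuple per time unit
slices = slice_stream(tuples, slice_size=5)
for start, total in window_sums(slices, 5, window_size=10, slide=5):
    print(f"window [{start}, {start + 10}): sum = {total}")
```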
Explanation of Air Pollution Using External Data Sources. Esmailoghli, Mahdi; Redyuk, Sergey; Martinez, Ricardo; Abedjan, Ziawasch; Rabl, Tilmann; Markl, Volker (2019). 297-300.
Stream Processing Engines (SPEs) must tolerate the dynamic nature of unbounded data streams and provide means to quickly adapt to fluctuations in the data rate. However, many major SPEs provide very little functionality to adjust the execution of a potentially infinite streaming query at runtime. Each modification requires a complete query restart, which involves an expensive redistribution of the query's state and may require external systems to guarantee correct processing semantics. This results in significant downtime, which increases the operational cost of those SPEs. We present a modification protocol that enables modifying specific operators as well as the data flow of a running query while ensuring exactly-once processing semantics. We provide an implementation for Apache Flink, which enables stateful operator migration across machines, the introduction of new operators into a running query, and changes to a specific operator based on external triggers. Our results on two benchmarks show that migrating operators of queries with small state is as fast as using the savepoint mechanism of Flink. Migrating operators in the presence of large state even outperforms the savepoint mechanism by a factor of more than 2.3. Introducing and replacing operators at runtime is performed in less than 10 seconds. Our modification protocol demonstrates the general feasibility of runtime modifications and opens the door for many other modification use cases, such as online algorithm tweaking and up- or down-scaling of operator instances.
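A simplified sketch of stateful operator migration (hypothetical, not the paper's protocol): pause the upstream channel, drain in-flight tuples, hand the operator state to the new instance, then replay what was buffered. A real protocol must additionally coordinate markers across parallel channels to preserve exactly-once semantics.

```python
class Operator:
    """A toy stateful operator: keyed running sums."""
    def __init__(self, state=None):
        self.state = state or {}
    def process(self, key, value):
        self.state[key] = self.state.get(key, 0) + value

def migrate(old_op, upstream_buffer):
    # 1. pause: upstream tuples accumulate in the buffer instead of being pushed
    # 2. drain: the old instance finishes all in-flight tuples (omitted here)
    # 3. transfer state and instantiate the operator on the new machine
    new_op = Operator(state=dict(old_op.state))
    # 4. resume: replay buffered tuples on the new instance
    for key, value in upstream_buffer:
        new_op.process(key, value)
    return new_op

op = Operator()
op.process("a", 1)
buffered = [("a", 2), ("b", 5)]   # tuples that arrived during the migration
new_op = migrate(op, buffered)
print(new_op.state)               # {'a': 3, 'b': 5}
```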
An Overview of Hawk: A Hardware-Tailored Code Generator for the Heterogeneous Many Core Age. Breß, Sebastian; Funke, Henning; Zeuch, Steffen; Rabl, Tilmann; Markl, Volker (2019). 87-90.
Processor manufacturers build increasingly specialized processors to mitigate the effects of the power wall in order to deliver improved performance. Currently, database engines have to be manually optimized for each processor, which is a costly and error-prone process. In this paper, we provide a summary of our recent VLDB Journal publication, where we propose concepts to adapt to performance enhancements of modern processors and to exploit their capabilities automatically. Our key idea is to create processor-specific code variants and to learn a well-performing code variant for each processor. These code variants leverage various parallelization strategies and apply both generic and processor-specific code transformations. We observe that the performance of code variants may diverge by up to two orders of magnitude. Thus, we need to generate custom code for each processor for peak performance. Hawk automatically finds efficient code variants for CPUs, GPUs, and MICs.
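The variant-learning idea can be illustrated as follows (a toy stand-in: Hawk generates actual device code, whereas here plain Python functions play the role of variants, and the variant dimensions are made up): enumerate variants, benchmark each on the target device, and keep the fastest.

```python
import itertools, time

def make_variant(parallelization, predication):
    def kernel(data):
        # stand-in for generated device code; real variants differ in codegen
        if predication:
            return sum(x for x in data if x % 2 == 0)
        return sum(filter(lambda x: x % 2 == 0, data))
    kernel.name = f"{parallelization}/predication={predication}"
    return kernel

def learn_best_variant(device_data):
    variants = [make_variant(p, b) for p, b in
                itertools.product(["thread-per-tuple", "block-wide"], [True, False])]
    timings = {}
    for variant in variants:
        start = time.perf_counter()
        variant(device_data)                 # benchmark on the target device
        timings[variant.name] = time.perf_counter() - start
    return min(timings, key=timings.get)     # cache this winner per processor

print(learn_best_variant(list(range(100_000))))
```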
Analyzing Efficient Stream Processing on Modern Hardware. Zeuch, Steffen; Breß, Sebastian; Rabl, Tilmann; Monte, Bonaventura Del; Karimov, Jeyhun; Lutz, Clemens; Renz, Manuel; Traub, Jonas; Markl, Volker in PVLDB (2019). 12(5) 516-530.
Modern Stream Processing Engines (SPEs) process large data volumes under tight latency constraints. Many SPEs execute processing pipelines using message passing on shared-nothing architectures and apply a partition-based scale-out strategy to handle high-velocity input streams. Furthermore, many state-of-the-art SPEs rely on a Java Virtual Machine to achieve platform independence and speed up system development by abstracting from the underlying hardware. In this paper, we show that taking the underlying hardware into account is essential to exploit modern hardware efficiently. To this end, we conduct an extensive experimental analysis of current SPEs and SPE design alternatives optimized for modern hardware. Our analysis highlights potential bottlenecks and reveals that state-of-the-art SPEs are not capable of fully exploiting current and emerging hardware trends, such as multi-core processors and high-speed networks. Based on our analysis, we describe a set of design changes to the common architecture of SPEs to scale up on modern hardware. We show that the single-node throughput can be increased by up to two orders of magnitude compared to state-of-the-art SPEs by applying specialized code generation, fusing operators, batch-style parallelization strategies, and optimized windowing. This speedup allows for deploying typical streaming applications on a single or a few nodes instead of large clusters.
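One of the advocated design changes, operator fusion, can be illustrated with a toy example (illustrative only, not the authors' code): instead of materializing intermediate results between operators, the fused version evaluates filter, map, and aggregation in a single pass per batch, avoiding per-operator overhead.

```python
import time

def pipelined(batch):
    # operator-at-a-time: each operator materializes output for the next one
    filtered = [t for t in batch if t % 2 == 0]
    mapped = [t * 3 for t in filtered]
    return sum(mapped)

def fused(batch):
    # operator fusion: filter, map, and aggregate in one tight loop
    total = 0
    for t in batch:
        if t % 2 == 0:
            total += t * 3
    return total

data = list(range(1_000_000))
for fn in (pipelined, fused):
    start = time.perf_counter()
    result = fn(data)
    print(f"{fn.__name__}: result={result}, {time.perf_counter() - start:.3f}s")
```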