1.
Meyer, H.J., Grunert, H., Waizenegger, T., Woltmann, L., Hartmann, C., Lehner, W., Esmailoghli, M., Redyuk, S., Martinez, R., Abedjan, Z., Ziehn, A., Rabl, T., Markl, V., Schmitz, C., Devinder Serai, D., Escobar Gava, T.: Particulate Matter Matters - The Data Science Challenge @ BTW 2019. (2019).
2.
Rabl, T., Brücke, C., Härtling, P., Stars, S., Palacios, R.E., Patel, H., Srivastava, S., Boden, C., Meiners, J., Schelter, S.: ADABench - Towards an Industry Standard Benchmark for Advanced Analytics. (2019).
The digital revolution, rapidly decreasing storage cost, and remarkable results achieved by state-of-the-art machine learning (ML) methods are driving widespread adoption of ML approaches. While notable recent efforts to benchmark ML methods for canonical tasks exist, none of them address the challenges arising with the increasing pervasiveness of end-to-end ML deployments. The challenges involved in successfully applying ML methods in diverse enterprise settings extend far beyond efficient model training. In this paper, we present our work in benchmarking advanced data analytics systems and lay the foundation for an industry-standard machine learning benchmark. Unlike previous approaches, we aim to cover the complete end-to-end ML pipeline for diverse, industry-relevant application domains rather than evaluating only training performance. To this end, we present reference implementations of complete ML pipelines including corresponding metrics and run rules, and evaluate them at different scales in terms of hardware, software, and problem size.
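The end-to-end scope described above can be illustrated with a stage-wise timing harness. The following sketch is not the ADABench reference implementation; it is a minimal illustration, assuming a scikit-learn-style pipeline on synthetic data, of how preprocessing, training, and serving (scoring) stages can each be timed separately rather than benchmarking model training alone.

import time
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler
from sklearn.linear_model import LogisticRegression

def timed(label, fn, *args, **kwargs):
    # Run one pipeline stage and report its wall-clock time.
    start = time.perf_counter()
    result = fn(*args, **kwargs)
    print(f"{label}: {time.perf_counter() - start:.3f}s")
    return result

# Synthetic stand-in for an industry-relevant dataset; a real benchmark
# run would scale the problem size up and down.
X, y = make_classification(n_samples=100_000, n_features=50, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

scaler = StandardScaler()
X_train = timed("preprocess", scaler.fit_transform, X_train)  # data prep stage
model = LogisticRegression(max_iter=1000)
timed("train", model.fit, X_train, y_train)                   # training stage
X_test = scaler.transform(X_test)
timed("serve/score", model.predict, X_test)                   # inference stage
print("accuracy:", model.score(X_test, y_test))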
3.
Kaitoua, A., Rabl, T., Katsifodimos, A., Markl, V.: Muses: Distributed Data Migration System for Polystores. 35th IEEE International Conference on Data Engineering, ICDE 2019, Macao, China, April 8-11, 2019. pp. 1602–1605. IEEE (2019).
Large datasets can originate from various sources and are stored in heterogeneous formats, schemas, and locations. Typical data science tasks need to combine those datasets in order to increase their value and extract knowledge. This is done in various data processing systems with diverse execution engines. To take advantage of each execution engine's characteristics and APIs, data scientists need to migrate and transform their datasets, at a high cost in computation and manual labor. Data migration is challenging for two main reasons: i) execution engines expect specific types/shapes of the data as input; ii) there are various physical representations of the data (e.g., partitions). Therefore, migrating data efficiently requires knowledge of the systems' internals and assumptions. In this paper, we present Muses, a distributed, high-performance data migration engine that is able to forward, transform, repartition, and broadcast data between distributed engines' instances efficiently. Muses does not require any changes in the underlying execution engines. In an experimental evaluation, we show that migrating data from one execution engine to another (in order to take advantage of faster, native operations) can increase a pipeline's performance by 30%.
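As a rough illustration of the repartitioning and broadcasting operations named above (not Muses' actual API, which the abstract does not show; all names here are hypothetical), consider a hash-based shuffle between a source engine and a target engine that expects a different partition count:

from collections import defaultdict

def repartition(source_partitions, num_target, key):
    # Hash-shuffle records from the source engine's partitions into the
    # number of partitions the target engine expects.
    target = defaultdict(list)
    for part in source_partitions:
        for record in part:
            target[hash(key(record)) % num_target].append(record)
    return [target[i] for i in range(num_target)]

def broadcast(source_partitions, num_target):
    # Replicate the full dataset to every target instance (e.g., to feed
    # a broadcast join on the target engine).
    data = [r for part in source_partitions for r in part]
    return [list(data) for _ in range(num_target)]

# Source engine produced 3 partitions; target engine runs 2 instances.
source = [[("a", 1), ("b", 2)], [("a", 3)], [("c", 4), ("b", 5)]]
print(repartition(source, 2, key=lambda r: r[0]))
print(broadcast(source, 2))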
4.
Karimov, J., Rabl, T., Markl, V.: AStream: Ad-hoc Shared Stream Processing. Proceedings of the 2019 International Conference on Management of Data. pp. 607–622. ACM (2019).
In the last decade, many distributed stream processing engines (SPEs) were developed to perform continuous queries on massive online data. The central design principle of these engines is to handle queries that potentially run forever on data streams with a query-at-a-time model, i.e., each query is optimized and executed separately. In many real applications, streams are not only processed with long-running queries, but also with thousands of short-running ad-hoc queries. To support this efficiently, it is essential to share resources and computation for ad-hoc stream queries in a multi-user environment. The goal of this paper is to bridge the gap between stream processing and ad-hoc queries in SPEs by sharing computation and resources. We define three main requirements for ad-hoc shared stream processing: (1) Integration: ad-hoc query processing should be a composable layer which can extend stream operators, such as join, aggregation, and window operators; (2) Consistency: ad-hoc query creation and deletion must be performed in a consistent manner, ensuring exactly-once semantics and correctness; (3) Performance: in contrast to state-of-the-art SPEs, an ad-hoc SPE should maximize not only data throughput but also query throughput via incremental computation and resource sharing. Based on these requirements, we have developed AStream, an ad-hoc, shared computation stream processing framework. To the best of our knowledge, AStream is the first system that supports distributed ad-hoc stream processing. AStream is built on top of Apache Flink. Our experiments show that AStream achieves performance comparable to Flink for single-query deployments and outperforms it by orders of magnitude with multiple queries.
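The computation-sharing idea can be sketched as follows (a toy model, not AStream's implementation; the class and method names are hypothetical). Each tuple updates one shared tumbling-window aggregate exactly once, and every active ad-hoc query reads from that shared state instead of maintaining its own copy:

from collections import defaultdict

class SharedWindowCounter:
    # One shared per-window, per-key count maintained for all queries.
    def __init__(self, window_size):
        self.window_size = window_size
        self.state = defaultdict(lambda: defaultdict(int))  # window -> key -> count
        self.queries = {}  # query_id -> key predicate

    def register(self, query_id, predicate):
        # Ad-hoc query creation: no new state, just a new consumer.
        self.queries[query_id] = predicate

    def deregister(self, query_id):
        self.queries.pop(query_id, None)

    def process(self, timestamp, key):
        # Each incoming tuple updates the shared state exactly once.
        window = timestamp // self.window_size
        self.state[window][key] += 1

    def results(self, window):
        # Every query filters the same shared aggregate.
        return {qid: {k: c for k, c in self.state[window].items() if pred(k)}
                for qid, pred in self.queries.items()}

shared = SharedWindowCounter(window_size=10)
shared.register("q1", lambda k: k == "click")
shared.register("q2", lambda k: True)  # second ad-hoc query, added at runtime
for ts, key in [(1, "click"), (3, "view"), (7, "click")]:
    shared.process(ts, key)
print(shared.results(0))  # {'q1': {'click': 2}, 'q2': {'click': 2, 'view': 1}}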
5.
Kunft, A., Katsifodimos, A., Schelter, S., Breß, S., Rabl, T., Markl, V.: An Intermediate Representation for Optimizing Machine Learning Pipelines. Proceedings of the VLDB Endowment. 12, 1553–1567 (2019).
Machine learning (ML) pipelines for model training and validation typically include preprocessing, such as data cleaning and feature engineering, prior to training an ML model. Preprocessing combines relational algebra and user-defined functions (UDFs), while model training uses iterations and linear algebra. Current systems are tailored to either of the two. As a consequence, preprocessing and ML steps are optimized in isolation. To enable holistic optimization of ML training pipelines, we present Lara, a declarative domain-specific language for collections and matrices. Lara's intermediate representation (IR) reflects on the complete program, i.e., UDFs, control flow, and both data types. Two views on the IR enable diverse optimizations. Monads enable operator pushdown and fusion across type and loop boundaries. Combinators provide the semantics of domain-specific operators and optimize data access and cross-validation of ML algorithms. Our experiments on preprocessing pipelines and selected ML algorithms show the effects of our proposed optimizations on dense and sparse data, which achieve speedups of up to an order of magnitude.
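A toy example of the kind of rewrite such an IR enables (illustrative only; Lara's actual monad and combinator views are far richer): representing a pipeline as a list of IR nodes lets consecutive map operators be fused into a single pass over the data, eliminating a materialized intermediate result.

def fuse(pipeline):
    # Rewrite rule: map(f) followed by map(g) becomes map(g . f),
    # avoiding a second pass over the collection.
    fused = []
    for op, fn in pipeline:
        if fused and op == "map" and fused[-1][0] == "map":
            prev_fn = fused[-1][1]
            fused[-1] = ("map", lambda x, f=prev_fn, g=fn: g(f(x)))
        else:
            fused.append((op, fn))
    return fused

def execute(pipeline, data):
    for op, fn in pipeline:
        if op == "map":
            data = [fn(x) for x in data]
        elif op == "filter":
            data = [x for x in data if fn(x)]
    return data

# A preprocessing fragment: two maps and a filter.
ir = [("map", lambda x: x * 2), ("map", lambda x: x + 1), ("filter", lambda x: x > 3)]
optimized = fuse(ir)
assert execute(ir, [1, 2, 3]) == execute(optimized, [1, 2, 3]) == [5, 7]
print(len(ir), "->", len(optimized), "operators")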
6.
Rosenfeld, V., Breß, S., Zeuch, S., Rabl, T., Markl, V.: Performance Analysis and Automatic Tuning of Hash Aggregation on GPUs. International Workshop on Data Management on New Hardware (DaMoN). ACM (2019).
Hash aggregation is an important data processing primitive which can be significantly accelerated by modern graphics processors (GPUs). Previous work derived heuristics for GPU-accelerated hash aggregation from the study of a particular GPU. In this paper, we examine the influence of different execution parameters on GPU-accelerated hash aggregation on four NVIDIA and two AMD GPUs based on six different microarchitectures. While we are able to replicate some of the previous results, our main finding is that optimal execution parameters are highly GPU-dependent. Most importantly, execution parameters optimized for a specific GPU are up to 21× slower on other GPUs. Given this hardware dependency, we present an algorithm to optimize execution parameters at runtime. On average, our algorithm converges on a result in less than 1% of the time required for a full evaluation of the search space. In this time, it finds execution parameters that are at most 1% slower than the optimum in 90% of our experiments. In the worst case, our algorithm finds execution parameters that are at most 1.29× slower than the optimum.
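The runtime tuning approach can be approximated with a simple neighborhood search over the execution-parameter space (a sketch under assumptions: the parameter names and the cost function below are simulated stand-ins, and the paper's actual search algorithm may differ). The key property is that only a small fraction of the configurations is ever timed:

import itertools

# Hypothetical execution parameters for a GPU hash-aggregation kernel.
SPACE = {"work_groups": [32, 64, 128, 256, 512],
         "group_size": [64, 128, 256, 512, 1024],
         "tuples_per_thread": [1, 2, 4, 8, 16]}

def measure(config):
    # Stand-in for timing the kernel with this config on the actual GPU;
    # a real tuner would launch and time the kernel here.
    return abs(config["work_groups"] - 128) + abs(config["group_size"] - 256) \
         + 10 * abs(config["tuples_per_thread"] - 4)

def tune(space, start):
    # Greedy hill climbing: vary one parameter at a time, keep improvements.
    best, best_cost, evaluated = dict(start), measure(start), 1
    improved = True
    while improved:
        improved = False
        for param, values in space.items():
            for v in values:
                cand = dict(best, **{param: v})
                cost = measure(cand)
                evaluated += 1
                if cost < best_cost:
                    best, best_cost, improved = cand, cost, True
    return best, evaluated

start = {p: vs[0] for p, vs in SPACE.items()}
best, evaluated = tune(SPACE, start)
total = len(list(itertools.product(*SPACE.values())))
print(f"best={best}, timed {evaluated} of {total} configurations")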
7.
Grulich, P.M., Traub, J., Breß, S., Katsifodimos, A., Markl, V., Rabl, T.: Poster: Generating Reproducible Out-of-Order Data Streams. (2019).
Evaluating modern stream processing systems in a reproducible manner requires data streams with different data distributions, data rates, and real-world characteristics such as delayed and out-of-order tuples. In this paper, we present an open source stream generator which generates reproducible and deterministic out-of-order streams based on real data files, simulating arbitrary fractions of out-of-order tuples and their respective delays.
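A minimal sketch of the idea (the generator's actual interface is not shown in the abstract; the parameter names here are assumptions): a seeded random number generator delays a configurable fraction of tuples by a bounded amount, so the same seed always yields the same out-of-order stream.

import random

def out_of_order_stream(tuples, fraction, max_delay, seed):
    # tuples: list of (event_time, payload) pairs from a real data file.
    # A seeded RNG makes the delays deterministic and thus reproducible.
    rng = random.Random(seed)
    emitted = []
    for event_time, payload in tuples:
        delay = rng.uniform(0, max_delay) if rng.random() < fraction else 0.0
        emitted.append((event_time + delay, event_time, payload))
    emitted.sort(key=lambda t: t[0])  # order by emission time, not event time
    return [(event_time, payload) for _, event_time, payload in emitted]

data = [(t, f"tuple-{t}") for t in range(10)]
stream = out_of_order_stream(data, fraction=0.3, max_delay=5.0, seed=42)
print(stream)  # identical on every run with seed=42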