Distributed Machine Learning - but at what COST? Boden, Christoph; Rabl, Tilmann; Markl, Volker in Machine Learning Systems Workshop at the 2017 Conference on Neural Information Processing Systems (2017).
Big Stream Processing Systems (Dagstuhl Seminar 17441). Rabl, Tilmann; Sakr, Sherif; Hirzel, Martin in Dagstuhl Reports (2017). 7(10) 111--138.
PEEL: A Framework for Benchmarking Distributed Systems and Algorithms. Boden, Christoph; Alexandrov, Alexander; Kunft, Andreas; Rabl, Tilmann; Markl, Volker (2017). 9-24.
During the last decade, a multitude of novel systems for scalable and distributed data processing has been proposed in both academia and industry. While there are published results of experimental evaluations for nearly all systems, it remains a challenge to objectively compare the performance of different systems. It is thus imperative to enable and establish benchmarks for these systems. However, even if workloads and data sets or data generators are fixed, orchestrating and executing benchmarks can be a major obstacle. Worse, many systems come with hardware-dependent parameters that have to be tuned and spawn a diverse set of configuration files. This impedes portability and reproducibility of benchmarks. To address these problems and to foster reproducible and portable experiments and benchmarks of distributed data processing systems, we present PEEL, a framework to define, execute, analyze, and share experiments. PEEL enables the transparent specification of benchmarking workloads and system configuration parameters. It orchestrates the systems involved and automatically runs experiments and collects all associated logs. PEEL currently supports Apache HDFS, Hadoop, Flink, and Spark and can easily be extended to include further systems.
Query Centric Partitioning and Allocation for Partially Replicated Database Systems. Rabl, Tilmann; Jacobsen, Hans-Arno (2017). 315-330.
A key feature of database systems is to provide transparent access to stored data. In distributed database systems, this includes data allocation and fragmentation. Transparent access introduces data dependencies and increases system complexity and inter-process communication. Therefore, many developers are exchanging transparency for better scalability using sharding and similar techniques. However, explicitly managing data distribution and data flow requires a deep understanding of the distributed system and the data access, and it reduces the possibilities for optimizations. To address this problem, we present an approach for efficient data allocation that features good scalability while keeping the data distribution transparent. We propose a workload-aware, query-centric, heterogeneity-aware analytical model. We formalize our approach and present an efficient allocation algorithm. The algorithm optimizes the partitioning and data layout for local query execution and balances the workload on homogeneous and heterogeneous systems according to the query history. In the evaluation, we demonstrate that our approach scales well in performance for OLTP- and OLAP-style workloads and reduces storage requirements significantly over replicated systems while guaranteeing configurable availability.
Benchmarking Data Flow Systems for Scalable Machine Learning. Boden, Christoph; Spina, Andrea; Rabl, Tilmann; Markl, Volker (2017). 1-10.
Distributed data flow systems such as Apache Spark or Apache Flink are popular choices for scaling machine learning algorithms in production. Industry applications of large scale machine learning such as click-through rate prediction rely on models trained on billions of data points which are both highly sparse and high dimensional. Existing benchmarks attempt to assess the performance of data flow systems such as Apache Flink, Spark or Hadoop with non-representative workloads such as WordCount, Grep or Sort. They only evaluate scalability with respect to data set size and fail to address the crucial requirement of handling high dimensional data. We introduce a representative set of distributed machine learning algorithms suitable for large scale distributed settings which have close resemblance to industry-relevant applications and provide generalizable insights into system performance. We implement mathematically equivalent versions of these algorithms in Apache Flink and Apache Spark, tune relevant system parameters and run a comprehensive set of experiments to assess their scalability with respect to both data set size and dimensionality of the data. We evaluate the systems for data up to four billion data points and 100 million dimensions. Additionally, we compare the performance to single-node implementations to put the scalability results into perspective. Our results indicate that while being able to robustly scale with increasing data set sizes, current state of the art data flow systems are surprisingly inefficient at coping with high dimensional data, which is a crucial requirement for large scale machine learning algorithms.
I²: Interactive Real-Time Visualization for Streaming Data. Traub, Jonas; Steenbergen, Nikolaas; Grulich, Philipp; Rabl, Tilmann; Markl, Volker (2017). 526-529.
Developing scalable real-time data analysis programs is a challenging task. Developers need insights from the data to define meaningful analysis flows, which often makes the development a trial-and-error process. Data visualization techniques can provide insights to aid the development, but the sheer amount of available data frequently makes it impossible to visualize all data points at the same time. We present I², an interactive development environment that coordinates running cluster applications and corresponding visualizations such that only the currently depicted data points are processed and transferred. To this end, we present an algorithm for the real-time visualization of time series, which is proven to be correct and minimal in terms of transferred data. Moreover, we show how cluster programs can adapt to changed visualization properties at runtime to allow interactive data exploration on data streams.
PROTEUS: Scalable Online Machine Learning for Predictive Analytics and Real-Time Interactive Visualization. Monte, Bonaventura Del; Karimov, Jeyhun; Mahdiraji, Alireza Rezaei; Rabl, Tilmann; Markl, Volker (2017).
Big data analytics is a critical and unavoidable process in any business and industrial environment. Nowadays, companies that do exploit big data’s inner value get more economic revenue than the ones which do not. Once companies have determined their big data strategy, they face another serious problem: in-house designing and building of a scalable system that runs their business intelligence is difficult. The PROTEUS project aims to design, develop, and provide an open ready-to-use big data software architecture which is able to handle extremely large historical data and data streams and supports online machine learning predictive analytics and real-time interactive visualization. The overall evaluation of PROTEUS is carried out using a real industrial scenario.
STREAMLINE - Streamlined Analysis of Data at Rest and Data in Motion. Grulich, Philipp; Rabl, Tilmann; Markl, Volker; Sidló, Csaba István; Benczúr, András A. (2017).
STREAMLINE aims to improve the overall workflow of big data analytics systems. To this end, it combines research in different areas to reduce the complexity of working with data at rest and data in motion in a unified fashion. As a foundation, STREAMLINE offers a uniform programming model on top of Apache Flink, for which it drives innovations in a wide range of areas, such as interactive data in motion visualization and advanced window aggregation techniques.
Optimized On-Demand Data Streaming from Sensor Nodes. Traub, Jonas; Breß, Sebastian; Rabl, Tilmann; Katsifodimos, Asterios; Markl, Volker (2017). 586-597.
Real-time sensor data enables diverse applications such as smart metering, traffic monitoring, and sport analysis. In the Internet of Things, billions of sensor nodes form a sensor cloud and offer data streams to analysis systems. However, it is impossible to transfer all available data with maximal frequencies to all applications. Therefore, we need to tailor data streams to the demand of applications. We contribute a technique that optimizes communication costs while maintaining the desired accuracy. Our technique schedules reads across a huge number of sensors based on the data demands of a huge number of concurrent queries. We introduce user-defined sampling functions that define the data demand of queries and facilitate various adaptive sampling techniques, which decrease the amount of transferred data. Moreover, we share sensor reads and data transfers among queries. Our experiments with real-world data show that our approach saves up to 87% in data transmissions.
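The idea of sharing sensor reads among queries can be illustrated with a small Python sketch. It is not the algorithm from the paper: it assumes, purely for illustration, that each query states a tolerance window (earliest, latest) for its next read of one sensor, and it uses the classic greedy interval-stabbing strategy to serve all windows with as few reads as possible. The function name `schedule_shared_reads` and the window representation are hypothetical.

```python
from typing import List, Tuple


def schedule_shared_reads(windows: List[Tuple[float, float]]) -> List[float]:
    """Return read times such that every (earliest, latest) window contains one.

    Assumption for this sketch: each query accepts any read time inside its
    window. Windows are processed by increasing deadline, and a new read is
    scheduled only when the most recent read falls before the window starts.
    """
    reads: List[float] = []
    for earliest, latest in sorted(windows, key=lambda w: w[1]):
        if not reads or reads[-1] < earliest:
            # No existing read serves this query: read as late as allowed,
            # which maximizes the chance that later windows can share it.
            reads.append(latest)
    return reads


if __name__ == "__main__":
    # Four queries on the same sensor, two reads are enough.
    print(schedule_shared_reads([(0, 4), (2, 6), (5, 9), (8, 10)]))
```

In this toy setting, four queries are served by two sensor reads instead of four, which is the kind of saving that shared reads aim for.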
Analysis of TPC-DS: the First Standard Benchmark for SQL-Based Big Data Systems. Poess, Meikel; Rabl, Tilmann; Jacobsen, Hans-Arno (2017). 573-585.
The advent of Web 2.0 companies, such as Facebook, Google, and Amazon with their insatiable appetite for vast amounts of structured, semi-structured, and unstructured data, triggered the development of Hadoop and related tools, e.g., YARN, MapReduce, and Pig, as well as NoSQL databases. These tools form an open source software stack to support the processing of large and diverse data sets on clustered systems to perform decision support tasks. Recently, SQL has seen a resurgence in many of these solutions, e.g., Hive, Stinger, Impala, Shark, and Presto. At the same time, RDBMS vendors are adding Hadoop support into their SQL engines, e.g., IBM’s Big SQL, Actian’s Vortex, Oracle’s Big Data SQL, and SAP’s HANA. Because there was no industry standard benchmark that could measure the performance of SQL-based big data solutions, marketing claims were mostly based on “cherry picked” subsets of the TPC-DS benchmark to suit individual companies’ strengths, while masking their weaknesses. In this paper, we present and analyze our work on modifying TPC-DS to fill the void for an industry standard benchmark that is able to measure the performance of SQL-based big data solutions. The new benchmark was ratified by the TPC in early 2016. To show the significance of the new benchmark, we analyze performance data obtained on four different systems running big data, traditional RDBMS, and columnar in-memory architectures.
Gilbert: Declarative Sparse Linear Algebra on Massively Parallel Dataflow Systems. Rohrmann, Till; Schelter, Sebastian; Rabl, Tilmann; Markl, Volker (2017). 269-288.
In recent years, the amount of generated and collected data has been increasing at an almost exponential rate. At the same time, the data’s value has been identified in terms of insights that can be provided. However, retrieving the value requires powerful analysis tools, since valuable insights are buried deep in large amounts of noise. Unfortunately, analytic capacities did not scale well with the growing data. Many existing tools run only on a single computer and are limited in terms of data size by its memory. A very promising solution to deal with large-scale data is scaling systems and exploiting parallelism. In this paper, we propose Gilbert, a distributed sparse linear algebra system, to decrease the imminent lack of analytic capacities. Gilbert offers a MATLAB-like programming language for linear algebra programs, which are automatically executed in parallel. Transparent parallelization is achieved by compiling the linear algebra operations first into an intermediate representation. This language-independent form enables high-level algebraic optimizations. Different optimization strategies are evaluated and the best one is chosen by a cost-based optimizer. The optimized result is then transformed into a suitable format for parallel execution. Gilbert generates execution plans for Apache Spark and Apache Flink, two massively parallel dataflow systems. Distributed matrices are represented by square blocks to guarantee a well-balanced trade-off between data parallelism and data granularity. An exhaustive evaluation indicates that Gilbert is able to process varying amounts of data exceeding the memory of a single computer on clusters of different sizes. Two well-known machine learning (ML) algorithms, namely PageRank and Gaussian non-negative matrix factorization (GNMF), are implemented with Gilbert. The performance of these algorithms is compared to optimized implementations based on Spark and Flink. Even though Gilbert is not as fast as the optimized algorithms, it simplifies the development process significantly due to its high-level programming abstraction.
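To make concrete what "PageRank as a linear algebra program" means, here is a minimal single-machine sketch in plain Python. It is not Gilbert code and does not use Gilbert's MATLAB-like syntax. It assumes the common power-iteration formulation with a damping factor, and stores the link structure as adjacency lists, so each iteration amounts to one sparse matrix-vector product. The function name `pagerank` and its parameters are illustrative.

```python
from typing import List


def pagerank(links: List[List[int]], damping: float = 0.85,
             iterations: int = 50) -> List[float]:
    """Power iteration for PageRank; links[i] lists the pages that page i links to."""
    n = len(links)
    rank = [1.0 / n] * n
    for _ in range(iterations):
        # Teleportation term, identical for every page.
        new_rank = [(1.0 - damping) / n] * n
        for src, targets in enumerate(links):
            if targets:
                # Distribute the damped rank of src evenly over its out-links.
                share = damping * rank[src] / len(targets)
                for dst in targets:
                    new_rank[dst] += share
            else:
                # Dangling page (an assumption of this sketch): spread its
                # rank evenly over all pages so the ranks keep summing to 1.
                share = damping * rank[src] / n
                for dst in range(n):
                    new_rank[dst] += share
        rank = new_rank
    return rank


if __name__ == "__main__":
    # Small example graph: 0 -> 1, 0 -> 2, 1 -> 2, 2 -> 0.
    print(pagerank([[1, 2], [2], [0]]))
```

A system in the spirit of Gilbert would take the matrix form of this loop and decide how to partition and parallelize it, a step this sketch leaves out entirely.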
BlockJoin: Efficient Matrix Partitioning Through Joins. Kunft, Andreas; Katsifodimos, Asterios; Schelter, Sebastian; Rabl, Tilmann; Markl, Volker in Proceedings of the VLDB Endowment (2017). 10(13) 2061-2072.
Linear algebra operations are at the core of many Machine Learning (ML) programs. At the same time, a considerable amount of the effort for solving data analytics problems is spent in data preparation. As a result, end-to-end ML pipelines often consist of (i) relational operators used for joining the input data, (ii) user defined functions used for feature extraction and vectorization, and (iii) linear algebra operators used for model training and cross-validation. Often, these pipelines need to scale out to large datasets. In this case, these pipelines are usually implemented on top of dataflow engines like Hadoop, Spark, or Flink. These dataflow engines implement relational operators on row-partitioned datasets. However, efficient linear algebra operators use block-partitioned matrices. As a result, pipelines combining both kinds of operators require rather expensive changes to the physical representation, in particular re-partitioning steps. In this paper, we investigate the potential of reducing shuffling costs by fusing relational and linear algebra operations into specialized physical operators. We present BlockJoin, a distributed join algorithm which directly produces block-partitioned results. To minimize shuffling costs, BlockJoin applies database techniques known from columnar processing, such as index-joins and late materialization, in the context of parallel dataflow engines. Our experimental evaluation shows speedups up to 6× and the skew resistance of BlockJoin compared to state-of-the-art pipelines implemented in Spark.
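The mismatch between row-partitioned and block-partitioned data can be seen in a small Python sketch. This is not BlockJoin itself; it only shows the naive conversion that such pipelines otherwise pay for. It assumes rows arrive as (row index, list of values) pairs and that blocks are square with a hypothetical side length `block_size`. The helper name `to_blocks` is illustrative.

```python
from collections import defaultdict
from typing import Dict, List, Tuple

Block = Dict[Tuple[int, int], float]


def to_blocks(rows: List[Tuple[int, List[float]]],
              block_size: int) -> Dict[Tuple[int, int], Block]:
    """Regroup row-partitioned data into square blocks.

    Keys of the result are (block_row, block_col); each block maps the
    position inside the block to the cell value.
    """
    blocks: Dict[Tuple[int, int], Block] = defaultdict(dict)
    for i, row in rows:
        for j, value in enumerate(row):
            key = (i // block_size, j // block_size)
            blocks[key][(i % block_size, j % block_size)] = value
    return dict(blocks)


if __name__ == "__main__":
    matrix_rows = [(0, [1.0, 2.0, 3.0]), (1, [4.0, 5.0, 6.0]), (2, [7.0, 8.0, 9.0])]
    for key, block in sorted(to_blocks(matrix_rows, block_size=2).items()):
        print(key, block)
```

Every input row is split across several blocks, so in a distributed setting each row fragment may have to travel to a different worker. That re-partitioning is the shuffling cost the abstract refers to.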