Distributed Machine Learning-but at what COST?. Boden, Christoph; Rabl, Tilmann; Markl, Volker in Machine Learning Systems Workshop at the 2017 Conference on Neural Information Processing Systems (2017).
Big Stream Processing Systems (Dagstuhl Seminar 17441). Rabl, Tilmann; Sakr, Sherif; Hirzel, Martin in Dagstuhl Reports (2017). 7(10) 111–138.
PEEL: A Framework for Benchmarking Distributed Systems and Algorithms. Boden, Christoph; Alexandrov, Alexander; Kunft, Andreas; Rabl, Tilmann; Markl, Volker (2017). 9–24.
During the last decade, a multitude of novel systems for scalable and distributed data processing has been proposed in both academia and industry. While there are published results of experimental evaluations for nearly all systems, it remains a challenge to objectively compare different system’s performance. It is thus imperative to enable and establish benchmarks for these systems. However, even if workloads and data sets or data generators are fixed, orchestrating and executing benchmarks can be a major obstacle. Worse, many systems come with hardware-dependent parameters that have to be tuned and spawn a diverse set of configuration files. This impedes portability and reproducibility of benchmarks. To address these problems and to foster reproducible and portable experiments and benchmarks of distributed data processing systems, we present PEEL, a framework to define, execute, analyze, and share experiments. PEEL enables the transparent specification of benchmarking workloads and system configuration parameters. It orchestrates the systems involved and automatically runs and collects all associated logs of experiments. PEEL currently supports Apache HDFS, Hadoop, Flink, and Spark and can easily be extended to include further systems.
Query Centric Partitioning and Allocation for Partially Replicated Database Systems. Rabl, Tilmann; Jacobsen, Hans-Arno (2017). 315–330.
A key feature of database systems is to provide transparent access to stored data. In distributed database systems, this includes data allocation and fragmentation. Transparent access introduces data dependencies and increases system complexity and inter-process communication. Therefore, many developers are exchanging transparency for better scalability using sharding and similar techniques. However, explicitly managing data distribution and data flow re-quires a deep understanding of the distributed system and the data access, and it reduces the possibilities for optimizations. To address this problem, we present an approach for efficient data allocation that features good scalability while keeping the data distribution transparent. We propose a workload-aware, query-centric, heterogeneity-aware analytical model. We formalize our approach and present an efficient allocation algorithm. The algorithm optimizes the partitioning and data layout for local query execution and balances the workload on homogeneous and heterogeneous systems according to the query history. In the evaluation, we demonstrate that our approach scales well in performance for OLTP- and OLAP-style workloads and reduces storage requirements significantly over replicated systems while guaranteeing configurable availability.
Benchmarking Data Flow Systems for Scalable Machine Learning. Boden, Christoph; Spina, Andrea; Rabl, Tilmann; Markl, Volker (2017). 1–10.
Distributed data flow systems such as Apache Spark or Apache Flink are popular choices for scaling machine learning algorithms in production. Industry applications of large scale machine learning such as click through rate prediction rely on models trained on billions of data points which are both highly sparse and high dimensional. Existing Benchmarks attempt to assess the performance of data flow systems such as Apache Flink, Spark or Hadoop with non-representative workloads such as WordCount, Grep or Sort. They only evaluate scalability with respect to data set size and fail to address the crucial requirement of handling high dimensional data. We introduce a representative set of distributed machine learning algorithms suitable for large scale distributed settings which have close resemblance to industry-relevant applications and provide generalizable insights into system performance. We implement mathematically equivalent versions of these algorithms in Apache Flink and Apache Spark, tune relevant system parameters and run a comprehensive set of experiments to assess their scalability with respect to both: data set size and dimensionality of the data. We evaluate the systems for data up to four billion data points 100 million dimensions. Additionally we compare the performance to single-node implementations to put the scalability results into perspective. Our results indicate that while being able to robustly scale with increasing data set sizes, current state of the art data flow systems are surprisingly inefficient at coping with high dimensional data, which is a crucial requirement for large scale machine learning algorithms.
I²: Interactive Real-Time Visualization for Streaming Data. Traub, Jonas; Steenbergen, Nikolaas; Grulich, Philipp; Rabl, Tilmann; Markl, Volker (2017). 526–529.
Developing scalable real-time data analysis programs is a challenging task. Developers need insights from the data to define meaningful analysis flows, which often makes the development a trial and error process. Data visualization techniques can provide insights to aid the development, but the sheer amount of available data frequently makes it impossible to visualize all data points at the same time. We present I², an interactive development environment that coordinates running cluster applications and corresponding visualizations such that only the currently depicted data points are processed and transferred. To this end, we present an algorithm for the real-time visualization of time series, which is proven to be correct and minimal in terms of transferred data. Moreover, we show how cluster programs can adapt to changed visualization properties at runtime to allow interactive data exploration on data streams.
PROTEUS: Scalable Online Machine Learning for Predictive Analytics and Real-Time Interactive Visualization. Monte, Bonaventura Del; Karimov, Jeyhun; Mahdiraji, Alireza Rezaei; Rabl, Tilmann; Markl, Volker (2017).
Big data analytics is a critical and unavoidable process in any business and industrial environment. Nowadays, companies that do exploit big data’s inner value get more economic revenue than the ones which do not. Once companies have determined their big data strategy, they face another serious problem: in-house designing and building of a scalable system that runs their business intelligence is difficult. The PROTEUS project aims to design, develop, and provide an open ready-to-use big data software architecture which is able to handle extremely large historical data and data streams and supports online machine learning predictive analytics and real-time interactive visualization. The overall evaluation of PROTEUS is carried out using a real industrial scenario.