[1]
Gévay, G.E., Rabl, T., Breß, S., Madai-Tahy, L. and Markl, V. Labyrinth: Compiling Imperative Control Flow to Parallel Dataflows. CoRR abs/1809.06845 (2018).
[2]
Poess, M., Ren, D.Q., Rabl, T. and Jacobsen, H.-A. Methods for Quantifying Energy Consumption in TPC-H. Proceedings of the 2018 ACM/SPEC International Conference on Performance Engineering, ICPE 2018, Berlin, Germany, April 09-13, 2018 (2018), 293–304.
Historically, performance and price-performance of computer systems have been the key purchasing arguments for customers. However, with rising energy costs and increasing power consumption due to the ever-growing demand for compute power (servers, storage, networks), electricity bills have become a significant expense for today’s data centers. In order to measure energy consumption in standardized ways, the Standard Performance Evaluation Corporation (SPEC) has developed a benchmark dedicated to measuring the power consumption of single servers (SPECpower_ssj2008), while the Transaction Processing Performance Council (TPC) and the Storage Performance Council (SPC) have developed general specifications that govern how energy is measured for any of their benchmarks. Energy reporting is optional in TPC and SPC results. While there are close to 600 SPECpower_ssj2008 results, there have been only three TPC and no SPC benchmark results published that report energy consumption. In this paper, we argue that the low number of TPC publications is due to the large setups required in TPC benchmarks and the consequently complicated measurement setup. Running on a typical big data setup, we evaluate two alternative methods to quantify energy consumption during TPC-H’s multi-user runs, namely by taking measurements of on-chip power sensors controlled through the Intelligent Platform Management Interface (IPMI) and by estimating power consumption via the nameplate power consumption method. We compare these latter two methods with power measurements taken from external power meters as required by SPEC and TPC benchmarks.
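As a rough illustration of the two alternatives compared in the paper, the following sketch contrasts a nameplate-style estimate (rated power times node count times runtime) with integrating sampled power readings over a run. The function names and all numbers are hypothetical and only meant to show the arithmetic, not the paper's measurement setup.

```python
# Hypothetical sketch: estimating energy for a benchmark run from nameplate
# ratings vs. integrating sampled power readings. Values are illustrative only.

def nameplate_energy_wh(nameplate_watts_per_node, num_nodes, run_hours, derating=1.0):
    """Upper-bound style estimate: rated power x node count x runtime."""
    return nameplate_watts_per_node * num_nodes * run_hours * derating

def metered_energy_wh(power_samples_w, sample_interval_s):
    """Integrate discrete power readings (e.g., from on-chip sensors or an
    external meter) over the run: sum(P_i) * dt, converted to watt-hours."""
    return sum(power_samples_w) * sample_interval_s / 3600.0

# Toy example: a 4-node cluster rated at 750 W per node during a 2-hour run.
estimate = nameplate_energy_wh(750, 4, 2.0)
samples = [420.0, 510.0, 495.0, 480.0]          # watts, read every 30 s (toy data)
measured = metered_energy_wh(samples, 30)
print(f"nameplate estimate: {estimate:.0f} Wh, metered: {measured:.1f} Wh")
```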
[3]
Boden, C., Rabl, T. and Markl, V. The Berlin Big Data Center (BBDC). it - Information Technology 60 (5-6) (2018), 321–326.
The last decade has been characterized by the collection and availability of unprecedented amounts of data due to rapidly decreasing storage costs and the omnipresence of sensors and data-producing global online services. In order to process and analyze this data deluge, novel distributed data processing systems resting on the paradigm of data flow, such as Apache Hadoop, Apache Spark, or Apache Flink, were built and have been scaled to tens of thousands of machines. However, writing efficient implementations of data analysis programs on these systems requires a deep understanding of systems programming, prohibiting large groups of data scientists and analysts from efficiently using this technology. In this article, we present some of the main achievements of the research carried out by the Berlin Big Data Center (BBDC). We introduce the two domain-specific languages Emma and LARA, which are deeply embedded in Scala and enable declarative specification and automatic parallelization of data analysis programs; the PEEL Framework for transparent and reproducible benchmark experiments on distributed data processing systems; and approaches to foster the interpretability of machine learning models. Finally, we provide an overview of the challenges to be addressed in the second phase of the BBDC.
[4]
Karimov, J., Rabl, T. and Markl, V. PolyBench: The First Benchmark for Polystores. Performance Evaluation and Benchmarking for the Era of Artificial Intelligence (2018), 24–41.
Modern business intelligence requires data processing not only across a huge variety of domains but also across different paradigms, such as relational, stream, and graph models. This variety is a challenge for existing systems, which typically support only a single or a few different data models. Polystores were proposed as a solution for this challenge and have received wide attention both in academia and in industry. These are systems that integrate different specialized data processing engines to enable fast processing of a large variety of data models. Yet, there is no standard to assess the performance of polystores. The goal of this work is to develop the first benchmark for polystores. To capture the flexibility of polystores, we focus on high-level features in order to enable execution of our benchmark suite on a large set of polystore solutions.
[5]
Boden, C., Rabl, T., Schelter, S. and Markl, V. Benchmarking Distributed Data Processing Systems for Machine Learning Workloads. Performance Evaluation and Benchmarking for the Era of Artificial Intelligence - 10th TPC Technology Conference, TPCTC 2018, Rio de Janeiro, Brazil, August 27-31, 2018, Revised Selected Papers (2018), 42–57.
[6]
Liu, Y., Guo, S., Hu, S., Rabl, T., Jacobsen, H.-A., Li, J. and Wang, J. Performance Evaluation and Optimization of Multi-Dimensional Indexes in Hive. IEEE Trans. Services Computing 11 (5) (2018), 835–849.
Apache Hive has been widely used for big data processing over large scale clusters by many companies. It provides a declarative query language called HiveQL. The efficiency of filtering out query-irrelevant data from HDFS closely affects the performance of query processing. This is especially true for multi-dimensional, highly selective queries involving few columns, which provide sufficient information to reduce the number of bytes read. Indexing (Compact Index, Aggregate Index, Bitmap Index, DGFIndex, and the index in the ORC file) and columnar storage (RCFile, ORC file, and Parquet) are powerful techniques to achieve this. However, it is not trivial to choose a suitable index and columnar storage format based on data and query features. In this paper, we compare the data filtering performance of the above indexes with different columnar storage formats by conducting comprehensive experiments using uniform and skewed TPC-H data sets and various multi-dimensional queries, and suggest best practices for improving multi-dimensional queries in Hive under different conditions.
[7]
Karimov, J., Rabl, T., Katsifodimos, A., Samarev, R., Heiskanen, H. and Markl, V. Benchmarking Distributed Stream Data Processing Engines. 34th IEEE International Conference on Data Engineering, ICDE 2018, Paris, France, April 16-19, 2018 (2018), 1507–1518.
The need for scalable and efficient stream analysis has led to the development of many open-source streaming data processing systems (SDPSs) with highly diverging capabilities and performance characteristics. While first initiatives try to compare the systems for simple workloads, there is a clear lack of detailed analyses of the systems’ performance characteristics. In this paper, we propose a framework for benchmarking distributed stream processing engines. We use our suite to evaluate the performance of three widely used SDPSs in detail, namely Apache Storm, Apache Spark, and Apache Flink. Our evaluation focuses in particular on measuring the throughput and latency of windowed operations, which are the basic type of operations in stream analytics. For this benchmark, we design workloads based on real-life, industrial use cases inspired by the online gaming industry. The contribution of our work is threefold. First, we give a definition of latency and throughput for stateful operators. Second, we carefully separate the system under test from the driver in order to correctly represent the open-world model of typical stream processing deployments and, therefore, measure system performance under realistic conditions. Third, we build the first benchmarking framework to define and test the sustainable performance of streaming systems. Our detailed evaluation highlights the individual characteristics and use cases of each system.
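The notion of sustainable performance can be illustrated with a small sketch: a driver searches for the highest ingestion rate at which the system under test keeps its event-time lag bounded. This is not the paper's benchmarking suite; the workload hook, the rate bounds, and the lag threshold below are hypothetical placeholders, and the "system" is only simulated so the example runs stand-alone.

```python
# Illustrative sketch (not the paper's driver): find the highest ingestion
# rate a streaming system can sustain without accumulating backlog.

SIMULATED_CAPACITY = 120_000   # events/s the toy "system" can keep up with

def run_workload(events_per_second: int) -> float:
    """Placeholder: return the event-time lag (seconds) observed when driving
    the system at the given rate for a fixed interval. Here we only simulate
    a system whose lag grows once its capacity is exceeded."""
    overload = max(0, events_per_second - SIMULATED_CAPACITY)
    return overload / 1_000.0

def sustainable_throughput(low: int, high: int, max_lag_s: float = 1.0) -> int:
    """Binary-search the largest rate whose lag stays below max_lag_s."""
    best = low
    while low <= high:
        rate = (low + high) // 2
        if run_workload(rate) <= max_lag_s:   # system keeps up at this rate
            best, low = rate, rate + 1
        else:                                  # backlog grows: rate too high
            high = rate - 1
    return best

print(sustainable_throughput(10_000, 1_000_000))   # highest sustainable rate
```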
[8]
Böhm, A. and Rabl, T. eds. Proceedings of the 7th International Workshop on Testing Database Systems, DBTest@SIGMOD 2018, Houston, TX, USA, June 15, 2018. ACM.
[9]
Traub, J., Grulich, P.M., Cuellar, A.R., Breß, S., Katsifodimos, A., Rabl, T. and Markl, V. Scotty: Efficient Window Aggregation for Out-of-Order Stream Processing. 34th IEEE International Conference on Data Engineering, ICDE 2018, Paris, France, April 16-19, 2018 (2018), 1300–1303.
Computing aggregates over windows is at the core of virtually every stream processing job. Typical stream processing applications involve overlapping windows and, therefore, cause redundant computations. Several techniques prevent this redundancy by sharing partial aggregates among windows. However, these techniques do not support out-of-order processing and session windows. Out-of-order processing is a key requirement to deal with delayed tuples in case of source failures such as temporary sensor outages. Session windows are widely used to separate different periods of user activity from each other. In this paper, we present Scotty, a high throughput operator for window discretization and aggregation. Scotty splits streams into non-overlapping slices and computes partial aggregates per slice. These partial aggregates are shared among all concurrent queries with arbitrary combinations of tumbling, sliding, and session windows. Scotty introduces the first slicing technique which (1) enables stream slicing for session windows in addition to tumbling and sliding windows and (2) processes out-of-order tuples efficiently. Our technique is generally applicable to a broad group of dataflow systems which use a unified batch and stream processing model. Our experiments show that we achieve a throughput an order of magnitude higher than alternative state-of-the-art solutions.
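To make the slicing idea concrete, here is a minimal sketch of slice-based aggregation for an associative function (sum): tuples are folded into non-overlapping slices, and a window result is obtained by combining the slices it covers, so a late tuple only touches its own slice. This is an illustration of the general technique, not Scotty's implementation or API; the slice length and the sample events are made up.

```python
from collections import defaultdict

# Slice-based window aggregation sketch: one partial aggregate per
# non-overlapping slice, windows answered by combining covered slices.

SLICE_MS = 1_000   # slice length, chosen for this toy example

def slice_index(event_time_ms: int) -> int:
    return event_time_ms // SLICE_MS

def ingest(partials: dict, event_time_ms: int, value: float) -> None:
    """Fold a tuple into the partial aggregate of its slice; an out-of-order
    tuple simply lands in an earlier slice instead of forcing recomputation."""
    partials[slice_index(event_time_ms)] += value

def window_result(partials: dict, start_ms: int, length_ms: int) -> float:
    """Combine the partial aggregates of all slices covered by the window."""
    first, last = start_ms // SLICE_MS, (start_ms + length_ms - 1) // SLICE_MS
    return sum(partials[s] for s in range(first, last + 1))

partials = defaultdict(float)
for t, v in [(100, 1.0), (2_500, 2.0), (1_200, 3.0), (900, 4.0)]:  # last two arrive late
    ingest(partials, t, v)
print(window_result(partials, 0, 3_000))   # one 3-second window -> 10.0
```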
[10]
Abedjan, Z., Breß, S., Markl, V., Rabl, T. and Soto, J. Data Management Systems Research at TU Berlin. SIGMOD Record 47 (4) (2018), 23–28.
Data management systems research at TU Berlin is spearheaded by the Database Systems and Information Management (DIMA) Group, the Big Data Management (BigDaMa) Group, as well as the affiliated Intelligent Analytics for Massive Data (IAM) Research Group at the German Research Center for Artificial Intelligence (DFKI). Jointly, our research activities encompass a wide variety of database topics, including benchmarking, data integration, modern hardware, and scalable data processing. As of Fall 2018, the team comprises three university professors, thirteen senior and postdoc researchers, twenty PhD students, and several research assistants. Among our notable accomplishments is the DFG-funded Stratosphere Research Unit, which laid the groundwork for what would later become Apache Flink. DIMA has also been leading the Berlin Big Data Center, one of only two BMBF-funded Big Data Competence Centers in Germany, since 2014. In addition, DIMA is co-directing the Berlin Center for Machine Learning, one of four BMBF-funded Machine Learning Competence Centers in Germany.
[11]
Grulich, P.M., Saitenmacher, R., Traub, J., Breß, S., Rabl, T. and Markl, V. Scalable Detection of Concept Drifts on Data Streams with Parallel Adaptive Windowing. Proceedings of the 21st International Conference on Extending Database Technology, EDBT 2018, Vienna, Austria, March 26-29, 2018 (2018), 477–480.
Machine learning techniques for data stream analysis suffer from concept drifts such as changed user preferences, varying weather conditions, or economic changes. These concept drifts cause wrong predictions and lead to incorrect business decisions. Concept drift detection methods such as adaptive windowing (Adwin) allow for adapting to concept drifts on the fly. In this paper, we examine Adwin in detail and point out its throughput bottlenecks. We then introduce several parallelization alternatives to address these bottlenecks. Our optimizations lead to a speedup of two orders of magnitude over the original Adwin implementation. Thus, we explore parallel adaptive windowing to provide scalable concept drift detection for high-velocity data streams with millions of tuples per second.
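For context, the sketch below shows the core adaptive-windowing idea in a simplified, sequential form: a window of recent values is cut whenever two sub-windows differ by more than a Hoeffding-style threshold, signalling a drift. It is deliberately naive (it rescans all split points) and does not reflect the exponential histograms of the original algorithm or the parallel variants proposed in the paper.

```python
import math
from collections import deque

# Simplified, sequential ADWIN-style drift detector (illustration only).
class SimpleAdwin:
    def __init__(self, delta: float = 0.002):
        self.delta = delta
        self.window = deque()

    def _cut_threshold(self, n0: int, n1: int) -> float:
        # Hoeffding-style bound on the allowed difference of sub-window means.
        m = 1.0 / (1.0 / n0 + 1.0 / n1)
        return math.sqrt((1.0 / (2.0 * m)) * math.log(4.0 / self.delta))

    def add(self, value: float) -> bool:
        """Insert a value; return True if a drift was detected (window shrunk)."""
        self.window.append(value)
        for split in range(1, len(self.window)):
            left = list(self.window)[:split]
            right = list(self.window)[split:]
            mean_gap = abs(sum(left) / len(left) - sum(right) / len(right))
            if mean_gap > self._cut_threshold(len(left), len(right)):
                for _ in range(split):          # drop the outdated prefix
                    self.window.popleft()
                return True
        return False
```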
[12]
Lutz, C., Breß, S., Rabl, T., Zeuch, S. and Markl, V. Efficient k-Means on GPUs. Proceedings of the 14th International Workshop on Data Management on New Hardware, Houston, TX, USA, June 11, 2018 (2018), 1–3.
k-Means is a versatile clustering algorithm widely used in practice. To cluster large data sets, state-of-the-art implementations use GPUs to shorten the data-to-knowledge time. These implementations commonly assign points on a GPU and update centroids on a CPU. We show that this approach has two main drawbacks. First, it separates the two algorithm phases over different processors, which requires an expensive data exchange between devices. Second, even when both phases are computed on the GPU, the same data are read twice per iteration, leading to inefficient use of memory bandwidth. In this paper, we describe a new approach that executes k-means in a single data pass per iteration. We propose a new algorithm to update centroids that allows us to perform both phases efficiently on GPUs. Thereby, we remove data transfers within each iteration. We fuse both phases to eliminate artificial synchronization barriers, and thus compute k-means in a single data pass. Overall, we achieve up to 20× higher throughput compared to the state-of-the-art approach.
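The fused single-pass iteration can be sketched on the CPU with NumPy (the paper's implementation targets GPUs): per-cluster sums and counts are accumulated while points are assigned, so each point is read only once per iteration. The code below is an illustrative reimplementation of the general idea, not the authors' kernel.

```python
import numpy as np

# Fused k-means iteration: assignment and centroid update in one data pass.
def fused_kmeans_iteration(points: np.ndarray, centroids: np.ndarray) -> np.ndarray:
    k, dim = centroids.shape
    sums = np.zeros((k, dim))
    counts = np.zeros(k, dtype=np.int64)
    for p in points:                                   # single pass over the data
        dists = np.linalg.norm(centroids - p, axis=1)  # assignment phase
        c = int(np.argmin(dists))
        sums[c] += p                                   # update phase, fused in
        counts[c] += 1
    new_centroids = centroids.copy()
    nonempty = counts > 0
    new_centroids[nonempty] = sums[nonempty] / counts[nonempty, None]
    return new_centroids

rng = np.random.default_rng(0)
pts = rng.normal(size=(1_000, 2))
cents = pts[rng.choice(len(pts), size=4, replace=False)]
for _ in range(10):
    cents = fused_kmeans_iteration(pts, cents)
```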
[13]
Kunft, A., Stadler, L., Bonetta, D., Basca, C., Meiners, J., Breß, S., Rabl, T., Fumero, J.J. and Markl, V. ScootR: Scaling R Dataframes on Dataflow Systems. Proceedings of the ACM Symposium on Cloud Computing, SoCC 2018, Carlsbad, CA, USA, October 11-13, 2018 (2018), 288–300.
To cope with today’s large scale of data, parallel dataflow engines such as Hadoop, and more recently Spark and Flink, have been proposed. They offer scalability and performance, but require data scientists to develop analysis pipelines in unfamiliar programming languages and abstractions. To overcome this hurdle, dataflow engines have introduced some forms of multi-language integration, e.g., for Python and R. However, this results in data exchange between the dataflow engine and the integrated language runtime, which requires inter-process communication and causes high runtime overheads. In this paper, we present ScootR, a novel approach to execute R in dataflow systems. ScootR tightly integrates the dataflow and R language runtimes by using the Truffle framework and the Graal compiler. As a result, ScootR executes R scripts directly in the Flink data processing engine, without serialization and inter-process communication. Our experimental study reveals that ScootR outperforms state-of-the-art systems by up to an order of magnitude.
[14]
Breß, S., Köcher, B., Funke, H., Zeuch, S., Rabl, T. and Markl, V. Generating Custom Code for Efficient Query Execution on Heterogeneous Processors. VLDB J. 27 (6) (2018), 797–822.
Processor manufacturers build increasingly specialized processors to mitigate the effects of the power wall in order to deliver improved performance. Currently, database engines have to be manually optimized for each processor, which is a costly and error-prone process. In this paper, we propose concepts to adapt to and to exploit the performance enhancements of modern processors automatically. Our core idea is to create processor-specific code variants and to learn a well-performing code variant for each processor. These code variants leverage various parallelization strategies and apply both generic and processor-specific code transformations. Our experimental results show that the performance of code variants may diverge by up to two orders of magnitude. In order to achieve peak performance, we generate custom code for each processor. We show that our approach finds an efficient custom code variant for multi-core CPUs, GPUs, and MICs.
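The variant-learning idea can be illustrated with a toy sketch: candidate implementations are micro-benchmarked once on the target processor and the fastest is kept for subsequent use. In the paper the variants are generated, processor-specific code for CPUs, GPUs, and MICs; here plain Python callables merely stand in for them, and all names are hypothetical.

```python
import time

# Toy illustration of learning a well-performing variant per processor:
# time each candidate once on a sample input and cache the winner.

def variant_loop(data):        # stand-in for e.g. a single-threaded code variant
    return sum(x * x for x in data)

def variant_builtin(data):     # stand-in for e.g. a vectorized/parallel variant
    return sum(map(lambda x: x * x, data))

def learn_best_variant(variants, sample):
    timings = {}
    for name, fn in variants.items():
        start = time.perf_counter()
        fn(sample)
        timings[name] = time.perf_counter() - start
    return min(timings, key=timings.get)

variants = {"loop": variant_loop, "builtin": variant_builtin}
best = learn_best_variant(variants, list(range(100_000)))
print("selected variant:", best)
```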
[15]
Sakr, S., Rabl, T., Hirzel, M., Carbone, P. and Strohbach, M. Dagstuhl Seminar on Big Stream Processing. SIGMOD Record 47 (3) (2018), 36–39.
Stream processing can generate insights from big data in real time as it is being produced. This paper reports findings from a 2017 seminar on big stream processing, focusing on applications, systems, and languages.
[16]
Lutz, C., Breß, S., Rabl, T., Zeuch, S. and Markl, V. Efficient and Scalable k-Means on GPUs. Datenbank-Spektrum 18 (3) (2018), 157–169.
k-Means is a versatile clustering algorithm widely used in practice. To cluster large data sets, state-of-the-art implementations use GPUs to shorten the data-to-knowledge time. These implementations commonly assign points on a GPU and update centroids on a CPU. We identify two main shortcomings of this approach. First, it requires expensive data exchange between processors when switching between the two processing steps, point assignment and centroid update. Second, even when processing both steps of k-means on the same processor, points still need to be read two times within an iteration, leading to inefficient use of memory bandwidth. In this paper, we present a novel approach for centroid update that allows us to efficiently process both phases of k-means on GPUs. We fuse point assignment and centroid update to execute one iteration with a single pass over the points. Our evaluation shows that our k-means approach scales to very large data sets. Overall, we achieve up to 20× higher throughput compared to the state-of-the-art approach.
[17]
Poess, M., Nambiar, R., Kulkarni, K., Narasimhadevara, C., Rabl, T. and Jacobsen, H.-A. Analysis of TPCx-IoT: The First Industry Standard Benchmark for IoT Gateway Systems. 34th IEEE International Conference on Data Engineering, ICDE 2018, Paris, France, April 16-19, 2018 (2018), 1519–1530.
By 2020 it is estimated that 20 billion devices will be connected to the Internet. While the initial hype around this Internet of Things (IoT) stems from consumer use cases, the number of devices and data from enterprise use cases is significant in terms of market share. With companies being challenged to choose the right digital infrastructure from different providers, there is a pressing need to objectively measure the hardware, operating system, data storage, and data management systems that can ingest, persist, and process the massive amounts of data arriving from sensors (edge devices). The Transaction Processing Performance Council (TPC) recently released the first industry standard benchmark for measuring the performance of gateway systems, TPCx-IoT. In this paper, we provide a detailed description of TPCx-IoT, discuss the design decisions behind key elements of this benchmark, and experimentally analyze how TPCx-IoT measures the performance of IoT gateway systems.