Big Data Benchmarking - 5th International Workshop, WBDB 2014, Potsdam, Germany, August 5-6, 2014, Revised Selected Papers. Rabl, Tilmann; Sachs, Kai; Poess, Meikel; Baru, Chaitanya K.; Jacobsen, Hans-Arno (Eds.) (2015). Lecture Notes in Computer Science, Vol. 8991. Springer.
Big Data Benchmark Compendium. Ivanov, Todor; Rabl, Tilmann; Poess, Meikel; Queralt, Anna; Poelman, John; Poggi, Nicolás; Buell, Jeffrey (2015). 135–155.
The field of Big Data and related technologies is rapidly evolving. Consequently, many benchmarks are emerging, driven by academia and industry alike. Because these benchmarks emphasize different aspects of Big Data and, in many cases, cover different technical platforms and use cases, it is extremely difficult to keep up with the pace of benchmark creation. Moreover, with the combination of large data volumes, heterogeneous data formats, and changing processing velocities, it becomes complex to specify an architecture that best suits all application requirements, which makes the investigation and standardization of such systems very difficult. The traditional way of specifying a standardized benchmark with pre-defined workloads, which has been in use for years for transaction and analytical processing systems, is therefore not trivial to employ for Big Data systems. This document provides a summary of existing benchmarks and those under development, gives a side-by-side comparison of their characteristics, and discusses their pros and cons. The goal is to understand the current state of Big Data benchmarking and to guide practitioners in their approaches and use cases.
Enhancing Data Generation in TPCx-HS with a Non-Uniform Random Distribution. Nambiar, Raghunath; Rabl, Tilmann; Kulkarni, Karthik; Frank, Michael (2015). 94–129.
Developed by the Transaction Processing Performance Council, the TPC Express Benchmark HS (TPCx-HS) is the industry's first standard for benchmarking big data systems. It is designed to provide an objective measure of hardware, operating systems, and commercial Apache Hadoop File System API-compatible software distributions, and to provide the industry with verifiable performance, price-performance, and availability metrics [1][2]. It can be used to compare a broad range of system topologies and implementation methodologies of big data systems in a technically rigorous, directly comparable, and vendor-neutral manner. The modeled application is simple, and the results are highly relevant to hardware and software dealing with Big Data systems in general. The data generation is derived from TeraGen [3], which draws data from a uniform distribution. In this paper, the authors propose a normal (Gaussian) distribution, which may be more representative of real-life datasets. The modified TeraGen and the complete changes required to the TPCx-HS kit are included as part of this paper.
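To make the proposed change concrete, the following minimal Scala sketch contrasts uniform, TeraGen-style key generation with a Gaussian alternative. The key range and the mapping of the Gaussian sample into the key space are illustrative assumptions, not the actual TeraGen or TPCx-HS kit code.

    import java.util.Random

    // Minimal sketch: uniform vs. Gaussian record-key generation,
    // illustrating the distribution change proposed for TPCx-HS.
    // Key range and mapping are illustrative assumptions only.
    object SkewedKeyGen {
      val KeySpace = 1L << 32            // illustrative key range
      val rng      = new Random(42)

      // TeraGen-style: every key is equally likely.
      def uniformKey(): Long =
        (rng.nextDouble() * KeySpace).toLong

      // Proposed: keys cluster around the middle of the key space,
      // as values in real datasets often cluster around a mean.
      def gaussianKey(): Long = {
        val z = rng.nextGaussian()       // mean 0, std dev 1
        val x = 0.5 + z / 8.0            // center at 0.5, ~99.99% lands in (0,1)
        val clamped = math.min(math.max(x, 0.0), 1.0 - 1e-12)
        (clamped * KeySpace).toLong
      }

      def main(args: Array[String]): Unit =
        (1 to 5).foreach(_ => println(s"${uniformKey()}  ${gaussianKey()}"))
    }

The standard-normal sample is rescaled so that virtually all keys fall inside the key space, producing keys that cluster around a mean rather than being spread evenly.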
Just can’t get enough: Synthesizing Big Data. Rabl, Tilmann; Danisch, Manuel; Frank, Michael; Schindler, Sebastian; Jacobsen, Hans-Arno (2015). 1457–1462.
With rapidly decreasing prices for storage and storage systems, ever larger data sets become economical. While only a few years ago only successful transactions were recorded in sales systems, today every user interaction is stored for ever deeper analysis and richer user modeling. This has led to the development of big data systems, which offer high scalability and novel forms of analysis. Due to the rapid development and ever-increasing variety of the big data landscape, there is a pressing need for tools for testing and benchmarking. Vendors have few options to showcase the performance of their systems other than trivial workloads such as TeraSort or WordCount. Since customers' real data is typically subject to privacy regulations and can rarely be used, simplistic proofs of concept have to suffice, leaving both customers and vendors unclear about the target use-case performance. As a solution, we present an automatic approach to data synthesis from existing data sources. Our system enables the fully automatic generation of large amounts of complex, realistic, synthetic data.
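The actual system extracts rich statistics and generates data in parallel; the core idea, however (learn distributions from source data, then sample synthetic rows from them), can be sketched in a few lines of Scala. All names are illustrative, and the sketch treats columns as independent, which the real approach does not.

    import scala.util.Random

    // Sketch of distribution-based data synthesis: learn per-column
    // empirical distributions from real rows, then sample new rows.
    // Illustrative only; columns are treated as independent here.
    object TinySynth {
      type Row = Seq[String]

      // Cumulative distribution per column, learned from sample data.
      def learn(rows: Seq[Row]): Seq[Seq[(String, Double)]] =
        rows.transpose.map { col =>
          val n = col.size.toDouble
          val freqs = col.groupBy(identity).map { case (v, vs) => (v, vs.size / n) }
          freqs.toSeq.scanLeft(("", 0.0)) { case ((_, acc), (v, p)) => (v, acc + p) }.tail
        }

      // Draw one synthetic row by inverse-CDF sampling each column.
      def sample(cdfs: Seq[Seq[(String, Double)]], rng: Random): Row =
        cdfs.map { cdf =>
          val r = rng.nextDouble()
          cdf.find(_._2 >= r).getOrElse(cdf.last)._1
        }

      def main(args: Array[String]): Unit = {
        val real = Seq(Seq("DE", "books"), Seq("DE", "music"), Seq("US", "books"))
        val cdfs = learn(real)
        val rng  = new Random(7)
        (1 to 3).foreach(_ => println(sample(cdfs, rng).mkString(",")))
      }
    }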
The Vision of BigBench 2.0. Rabl, Tilmann; Frank, Michael; Danisch, Manuel; Jacobsen, Hans-Arno; Gowda, Bhaskar (2015). 1–4.
Data is one of the most important resources of modern enterprises. Better analytics allow for a better understanding of customer requirements and market dynamics. The more data is collected, the more information can be extracted. However, information value extraction is limited by data processing speeds. Due to fast technological advances in big data management, there is an abundance of big data systems. This leaves users with the dilemma of choosing a system that features good end-to-end performance for their use case. To get a good understanding of the actual performance of a system, realistic application-level workloads are required. To this end, we have developed BigBench, an application-level benchmark focused solely on big data analytics. In this paper, we present the vision of BigBench 2.0, a suite of benchmarks for all major aspects of big data processing in common business use cases. Unlike other efforts, BigBench 2.0 will have a completely consistent and integrated model and workload, which will allow realistic end-to-end benchmarking of big data systems.
Die Apache Flink Plattform zur parallelen Analyse von Datenströmen und Stapeldaten [The Apache Flink Platform for Parallel Analysis of Data Streams and Batch Data]. Traub, Jonas; Rabl, Tilmann; Hueske, Fabian; Rohrmann, Till; Markl, Volker (2015). 403–408.
The amount of data available for analysis is growing rapidly, driven by falling prices for storage solutions and the tapping of new data sources. Because classical database systems cannot be parallelized sufficiently, they often can no longer process today's data volumes, which makes dedicated programs for parallel data analysis necessary. Developing such programs for compute clusters is a complex challenge even for experienced systems programmers. Frameworks like Apache Hadoop MapReduce are scalable but, compared to SQL, hard to program. The open-source platform Apache Flink closes the gap between conventional database systems and big data analysis frameworks. The Apache Software Foundation top-level project is built on a fault-tolerant runtime for data stream processing, which handles data distribution and communication in the cluster. Various interfaces allow the implementation of data analysis pipelines for a wide range of use cases. The platform is continuously developed further by an active community and is at once a product and the foundation of many research projects in the area of databases and information management.
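To give a flavor of the interfaces the paper describes, here is the canonical batch WordCount in Flink's Scala DataSet API; a minimal sketch of the programming model, not code from the paper itself.

    import org.apache.flink.api.scala._

    // Minimal Flink batch job in the Scala DataSet API, illustrating
    // the concise, SQL-like style the paper contrasts with raw
    // MapReduce programming.
    object WordCount {
      def main(args: Array[String]): Unit = {
        val env = ExecutionEnvironment.getExecutionEnvironment

        val text = env.fromElements(
          "big data benchmarking", "big data systems")

        val counts = text
          .flatMap(_.toLowerCase.split("\\s+"))  // tokenize each line
          .map((_, 1))                           // (word, 1) pairs
          .groupBy(0)                            // group by word
          .sum(1)                                // sum the counts

        counts.print()                           // triggers execution
      }
    }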
DualTable: A Hybrid Storage Model for Update Optimization in Hive. Hu, Songlin; Liu, Wantao; Rabl, Tilmann; Huang, Shuo; Liang, Ying; Xiao, Zheng; Jacobsen, Hans-Arno; Pei, Xubin; Wang, Jiye (2015). 1340–1351.
Hive is the most mature and prevalent data warehouse tool providing a SQL-like interface in the Hadoop ecosystem. It is successfully used in many Internet companies and shows its value for big data processing in traditional industries. However, enterprise big data processing systems, as in Smart Grid applications, usually require complicated business logic and involve many data manipulation operations such as updates and deletes, for which Hive cannot offer sufficient support while preserving high query performance. Hive using the Hadoop Distributed File System (HDFS) for storage cannot implement data manipulation efficiently, and Hive on HBase suffers from poor query performance even though it supports faster data manipulation. There is an effort to support update operations in Hive, tracked as issue HIVE-5317, but it has not been completed in Hive's latest version. Since this ACID-compliant extension adopts the same data storage format on HDFS, the update performance problem remains unsolved. In this paper, we propose a hybrid storage model called DualTable, which combines the efficient streaming reads of HDFS with the random write capability of HBase. Hive on DualTable provides better data manipulation support while preserving query performance. Experiments on a TPC-H data set and on a real smart grid data set show that Hive on DualTable is up to 10 times faster than Hive when executing update and delete operations.
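The core design can be read as a read/write dispatch: bulk reads stream the immutable HDFS base data and overlay newer versions from HBase, while updates and deletes go to HBase only. The Scala sketch below illustrates that dispatch; all interfaces are hypothetical and greatly simplified relative to the paper's system.

    // Hypothetical sketch of the DualTable idea: HDFS holds an
    // immutable base copy that is fast to scan; HBase holds updated
    // and deleted rows that are fast to mutate. Reads merge the two,
    // preferring HBase. Newly inserted HBase-only rows are omitted
    // for brevity; interfaces are illustrative, not the actual code.
    final case class Tombstone() // marks a deleted row

    class DualTable(
        hdfsScan: () => Iterator[(Long, String)],            // streaming base scan
        hbaseGet: Long => Option[Either[Tombstone, String]], // per-key delta lookup
        hbasePut: (Long, Either[Tombstone, String]) => Unit) {

      // UPDATE: random write into HBase only; the HDFS base is untouched.
      def update(key: Long, row: String): Unit = hbasePut(key, Right(row))

      // DELETE: write a tombstone instead of rewriting HDFS files.
      def delete(key: Long): Unit = hbasePut(key, Left(Tombstone()))

      // SCAN: stream the HDFS base and overlay per-key deltas from HBase.
      def scan(): Iterator[(Long, String)] =
        hdfsScan().flatMap { case (key, baseRow) =>
          hbaseGet(key) match {
            case Some(Left(_))       => None                 // deleted
            case Some(Right(newRow)) => Some(key -> newRow)  // updated
            case None                => Some(key -> baseRow) // unchanged
          }
        }
    }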
High Performance Stream Queries in Scala. Song, Dantong; Zhang, Kaiwen; Rabl, Tilmann; Menon, Prashanth; Jacobsen, Hans-Arno (2015). 322–323.
Traffic monitoring is an important stream processing application that is highly dynamic and requires aggregation of spatially collocated data. Inspired by this, the DEBS 2015 Grand Challenge uses publicly available taxi trip data to compute, online, the most frequent routes and the most profitable areas. We describe our solution to the DEBS 2015 Grand Challenge, which processes events at a latency of 10 ms and a throughput of 114,000 events per second.
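The first challenge query (most frequent routes) reduces to maintaining per-route counts over a sliding 30-minute window of trips. Below is a single-threaded Scala sketch assuming events arrive in timestamp order; the route encoding, data types, and the full top-10 recomputation per event are simplifications, and the submitted solution is far more heavily optimized for latency and parallelism.

    import scala.collection.mutable

    // Single-threaded sketch of the "most frequent routes" query:
    // per-route counts over a sliding 30-minute event-time window.
    object FrequentRoutes {
      case class Trip(dropoffTimeMs: Long, route: String) // e.g. "cellA->cellB"

      val WindowMs = 30 * 60 * 1000L
      private val counts  = mutable.Map.empty[String, Long].withDefaultValue(0L)
      private val pending = mutable.Queue.empty[Trip]

      // Called once per incoming event, in timestamp order.
      def onTrip(t: Trip): Seq[(String, Long)] = {
        // Expire trips that fell out of the 30-minute window.
        while (pending.nonEmpty &&
               pending.head.dropoffTimeMs <= t.dropoffTimeMs - WindowMs) {
          val old = pending.dequeue()
          counts(old.route) -= 1
          if (counts(old.route) == 0) counts.remove(old.route)
        }
        pending.enqueue(t)
        counts(t.route) += 1
        // Top-10 most frequent routes in the current window.
        counts.toSeq.sortBy(-_._2).take(10)
      }
    }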