We try to keep an up-to-date list of all our publications. If you are interested in a PDF that we have not yet uploaded, feel free to send us an email to request a copy. All recent publications are listed below; for older ones, please click the appropriate year.
“Big data” has become a major force of innovation across enterprises of all sizes. New platforms with increasingly more features for managing big datasets are being announced almost weekly. Yet, there is currently a lack of any means of comparability among such platforms. While the performance of traditional database systems is well understood and measured by long-established institutions such as the Transaction Processing Performance Council (TPC), there is neither a clear definition of the performance of big data systems nor a generally agreed-upon metric for comparing these systems. In this article, we describe a community-based effort for defining a big data benchmark. Over the past year, a Big Data Benchmarking Community has formed to fill this void. The effort focuses on defining an end-to-end application-layer benchmark for measuring the performance of big data applications, with the ability to easily adapt the benchmark specification to evolving challenges in the big data space. This article describes the efforts undertaken thus far toward the definition of a BigData Top100 List. While highlighting the major technical as well as organizational challenges, we also solicit community input into this process through this article.
Variations of the Star Schema Benchmark to Test the Effects of Data Skew on Query Performance. Rabl, Tilmann; Poess, Meikel; Jacobsen, Hans-Arno; O'Neil, Patrick E.; O'Neil, Elizabeth J. (2013). 361-372.
The Star Schema Benchmark (SSB) has been widely used to evaluate the performance of database management systems when executing star schema queries. SSB, based on the well-known industry-standard benchmark TPC-H, shares some of its drawbacks, most notably its uniform data distributions. Today's systems rely heavily on sophisticated cost-based query optimizers to generate the most efficient query execution plans. A benchmark that evaluates an optimizer's capability to generate optimal execution plans under all circumstances must provide the rich data set details on which optimizers rely (uniform and non-uniform distributions, data sparsity, etc.). This is also true for other database system parts, such as indices and operators, and ultimately holds for an end-to-end benchmark as well. SSB's data generator, based on TPC-H's dbgen, is not easy to adapt to different data distributions, as its metadata and actual data generation implementations are not separated. In this paper, we motivate the need for a new revision of SSB that includes non-uniform data distributions. We list the specific modifications required to implement non-uniform data sets in SSB and demonstrate how to implement these modifications in the Parallel Data Generator Framework to generate both the data and query sets.
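To make the idea concrete, here is a minimal sketch of what separating distribution metadata from value generation can look like. The interface and class names are illustrative simplifications, not PDGF's actual API, and the Zipf sampler merely stands in for whatever skewed distribution an SSB variation prescribes.

```java
import java.util.Random;

// Hypothetical simplification: distribution metadata lives behind an
// interface, so a column can switch from uniform to skewed data without
// touching the value-generation code.
interface Distribution {
    int nextIndex(Random rng, int cardinality);
}

// Uniform distribution, as produced by SSB's and TPC-H's dbgen.
class UniformDistribution implements Distribution {
    public int nextIndex(Random rng, int cardinality) {
        return rng.nextInt(cardinality);
    }
}

// Zipf-like skew: low indices are drawn far more often, approximating
// the non-uniform column distributions the SSB variations call for.
class ZipfDistribution implements Distribution {
    private final double skew;
    ZipfDistribution(double skew) { this.skew = skew; }
    public int nextIndex(Random rng, int cardinality) {
        // Inverse-CDF sampling over unnormalized Zipf weights 1/rank^skew.
        double total = 0;
        for (int i = 1; i <= cardinality; i++) total += 1.0 / Math.pow(i, skew);
        double u = rng.nextDouble() * total, acc = 0;
        for (int i = 1; i <= cardinality; i++) {
            acc += 1.0 / Math.pow(i, skew);
            if (acc >= u) return i - 1;
        }
        return cardinality - 1;
    }
}

public class SkewedColumnDemo {
    public static void main(String[] args) {
        String[] regions = {"AMERICA", "ASIA", "EUROPE", "AFRICA", "MIDDLE EAST"};
        Distribution dist = new ZipfDistribution(1.5); // swap in UniformDistribution to compare
        Random rng = new Random(42);                   // fixed seed for repeatable data
        for (int i = 0; i < 10; i++) {
            System.out.println(regions[dist.nextIndex(rng, regions.length)]);
        }
    }
}
```

Because the distribution is just metadata handed to the generator, regenerating a skewed variant of a table means changing one configuration entry rather than rewriting the generator itself.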
A BigBench Implementation in the Hadoop Ecosystem. Chowdhury, Badrul; Rabl, Tilmann; Saadatpanah, Pooya; Du, Jiang; Jacobsen, Hans-Arno (2013). 3-18.
BigBench is the first proposal for an end-to-end big data analytics benchmark. It features a rich query set with complex, realistic queries. BigBench was developed based on the decision support benchmark TPC-DS. The first proof-of-concept implementation was built for the Teradata Aster parallel database system, with the queries formulated in the proprietary SQL-MR query language. To test other systems, the queries have to be translated. In this paper, an alternative implementation of BigBench for the Hadoop ecosystem is presented. All 30 queries of BigBench were realized using Apache Hive, Apache Hadoop, Apache Mahout, and NLTK. We present the different design choices we made and show a proof-of-concept evaluation.
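As a rough illustration of how queries translated to HiveQL can be driven against a Hadoop cluster, the following sketch submits a single query through Hive's standard JDBC interface. The host, table name, and query text are placeholders for the BigBench schema, not the paper's actual implementation.

```java
import java.sql.*;

// Minimal sketch: run one HiveQL query over JDBC and print the result.
// The table and query are hypothetical stand-ins, e.g., the top products
// by number of submitted online reviews.
public class HiveQueryDemo {
    public static void main(String[] args) throws Exception {
        Class.forName("org.apache.hive.jdbc.HiveDriver");
        try (Connection conn = DriverManager.getConnection(
                 "jdbc:hive2://localhost:10000/default", "", "");
             Statement stmt = conn.createStatement();
             ResultSet rs = stmt.executeQuery(
                 "SELECT product_id, COUNT(*) AS num_reviews " +
                 "FROM product_reviews GROUP BY product_id " +
                 "ORDER BY num_reviews DESC LIMIT 10")) {
            while (rs.next()) {
                System.out.println(rs.getString(1) + "\t" + rs.getLong(2));
            }
        }
    }
}
```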
BigBench: Towards an Industry Standard Benchmark for Big Data Analytics. Ghazal, Ahmad; Rabl, Tilmann; Hu, Minqing; Raab, Francois; Poess, Meikel; Crolotte, Alain; Jacobsen, Hans-Arno (2013). 1197-1208.
There is tremendous interest in big data from academia, industry, and a large user base. Several commercial and open-source providers have released a variety of products to support big data storage and processing. As these products mature, there is a need to evaluate and compare their performance. In this paper, we present BigBench, an end-to-end big data benchmark proposal. The underlying business model of BigBench is a product retailer. The proposal covers a data model and a synthetic data generator that address the variety, velocity, and volume aspects of big data systems containing structured, semi-structured, and unstructured data. The structured part of the BigBench data model is adopted from the TPC-DS benchmark and enriched with semi-structured and unstructured data components. The semi-structured part captures registered and guest user clicks on the retailer's website. The unstructured data captures product reviews submitted online. The data generator designed for BigBench provides scalable volumes of raw data based on a scale factor. The BigBench workload is designed around a set of queries against the data model. From a business perspective, the queries cover the different categories of big data analytics proposed by McKinsey. From a technical perspective, the queries are designed to span three different dimensions based on data sources, query processing types, and analytic techniques. We illustrate the feasibility of BigBench by implementing it on the Teradata Aster Database. The test includes generating and loading a 200 GB BigBench data set and testing the workload by executing the BigBench queries (written using Teradata Aster SQL-MR) and reporting their response times.
Rapid Development of Data Generators Using Meta Generators in PDGF. Rabl, Tilmann; Poess, Meikel; Danisch, Manuel; Jacobsen, Hans-Arno (2013). 1-6.
Generating data sets for the performance testing of database systems on a particular hardware configuration and application domain is a very time-consuming and tedious process: time consuming because of the large amount of data that needs to be generated, and tedious because new data generators might need to be developed or existing ones adjusted. The difficulty in generating this data is amplified by constant advances in hardware and software that allow the testing of ever larger and more complicated systems. In this paper, we present an approach for rapidly developing customized data generators. Our approach, which is based on the Parallel Data Generator Framework (PDGF), deploys a new concept of so-called meta generators. Meta generators extend the concept of column-based generators in PDGF. Deploying meta generators in PDGF significantly reduces the development effort of customized data generators, facilitates their debugging, and eases their maintenance.
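The composition idea behind meta generators can be sketched in a few lines. The interfaces below are hypothetical simplifications rather than PDGF's real classes: a meta generator wraps an existing column generator and transforms its output, so a customized column is built by composition instead of new generator code.

```java
import java.util.Random;
import java.util.function.Function;

// Simplified stand-in for a column-based generator.
interface FieldGenerator<T> {
    T nextValue(Random rng);
}

// A meta generator wraps an underlying generator with a transformation,
// producing a new column generator without modifying the base one.
class MetaGenerator<T, R> implements FieldGenerator<R> {
    private final FieldGenerator<T> base;
    private final Function<T, R> transform;
    MetaGenerator(FieldGenerator<T> base, Function<T, R> transform) {
        this.base = base;
        this.transform = transform;
    }
    public R nextValue(Random rng) {
        return transform.apply(base.nextValue(rng));
    }
}

public class MetaGeneratorDemo {
    public static void main(String[] args) {
        Random rng = new Random(7);
        // Base generator: random customer ids.
        FieldGenerator<Integer> ids = r -> r.nextInt(1000);
        // Meta generator: format ids as padded keys, reusing the base as-is.
        FieldGenerator<String> keys =
            new MetaGenerator<>(ids, id -> String.format("CUST-%05d", id));
        for (int i = 0; i < 5; i++) System.out.println(keys.nextValue(rng));
    }
}
```

Because the wrapped generator is untouched, debugging narrows to the transformation alone, which is the maintenance benefit the paper highlights.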
Application performance monitoring (APM) is shifting towards capturing and analyzing every event that arises in an enterprise infrastructure. Current APM systems, for example, make it possible to monitor enterprise applications at the granularity of tracing each method invocation (i.e., an event). Naturally, there is great interest in monitoring these events in real time to react to system and application failures, and in storing the captured information for an extended period of time to enable detailed system analysis, data analytics, and future auditing of trends in the historic data. However, the high insertion rates (up to millions of events per second) and the purposely limited resources dedicated to APM (typically only 1-2% of the overall system resources) are the key challenges for applying current data management solutions in this context. Emerging distributed key-value stores, often positioned to operate at this scale, induce additional storage overhead when dealing with relatively small data points (e.g., method invocation events) inserted at a rate of millions per second. Thus, they are not a promising solution for such an important class of workloads given APM's highly constrained resource budget. In this paper, to address these shortcomings, we present the Multi-layered, Adaptive, Distributed Event Store (MADES): a massively distributed store for collecting, querying, and storing event data at a rate of millions of events per second.
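A common ingredient of stores built for this regime, sketched below, is batching many small events before each write so that per-record overhead is amortized. The event shape, batch size, and sink are illustrative assumptions, not MADES internals.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.concurrent.BlockingQueue;
import java.util.concurrent.LinkedBlockingQueue;

// Sketch: a writer thread drains small monitoring events into batches
// and issues one write per batch instead of one per event.
public class BatchingIngestDemo {
    record Event(long timestamp, String method, long durationNanos) {}

    public static void main(String[] args) throws InterruptedException {
        BlockingQueue<Event> queue = new LinkedBlockingQueue<>();
        int batchSize = 1000;

        Thread writer = new Thread(() -> {
            List<Event> batch = new ArrayList<>(batchSize);
            try {
                while (true) {
                    batch.add(queue.take());             // block for the first event
                    queue.drainTo(batch, batchSize - 1); // then drain what is ready
                    flush(batch);                        // one write per batch
                    batch.clear();
                }
            } catch (InterruptedException e) { /* shut down */ }
        });
        writer.start();

        // Simulated method-invocation events arriving at a high rate.
        for (int i = 0; i < 10_000; i++) {
            queue.put(new Event(System.nanoTime(), "Service.call", 1200));
        }
        Thread.sleep(100);
        writer.interrupt();
    }

    static void flush(List<Event> batch) {
        // Placeholder for an append to the store's write path.
        System.out.println("flushed batch of " + batch.size() + " events");
    }
}
```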
This paper presents the design and implementation of a custom-built event processing engine called BlueBay, developed for live monitoring of soccer games. We experimentally evaluated our system using a real workload and report on its performance. Our results indicate that BlueBay achieves a throughput of up to 790k events per second, processing the game's input sensor stream about 60 times faster than real time. In addition to our custom implementation, we also investigated the applicability of off-the-shelf general-purpose event processing engines to address the soccer monitoring problem. This effort resulted in two additional, fully functional implementations based on Esper and Storm.
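To give a flavor of the per-event computations such an engine performs, here is a small sliding-window sketch that estimates a player's current speed from position events. The event layout, window length, and units are assumptions for illustration, not BlueBay's actual operator pipeline.

```java
import java.util.ArrayDeque;
import java.util.Deque;

// Sketch: keep a one-second window of position events and report the
// straight-line displacement over that window as an average speed.
public class SpeedWindowDemo {
    record Position(long timestampNanos, double x, double y) {}

    private final Deque<Position> window = new ArrayDeque<>();
    private final long windowNanos = 1_000_000_000L; // 1-second sliding window

    // Returns average speed in meters per second over the current window.
    double onEvent(Position p) {
        window.addLast(p);
        // Evict events that have fallen out of the window.
        while (window.peekFirst().timestampNanos < p.timestampNanos - windowNanos) {
            window.removeFirst();
        }
        Position first = window.peekFirst();
        double dist = Math.hypot(p.x - first.x, p.y - first.y);
        double dt = (p.timestampNanos - first.timestampNanos) / 1e9;
        return dt > 0 ? dist / dt : 0.0;
    }

    public static void main(String[] args) {
        SpeedWindowDemo demo = new SpeedWindowDemo();
        long t = 0;
        for (int i = 0; i < 5; i++) {
            t += 200_000_000L; // 200 ms between sensor readings
            System.out.printf("speed: %.2f m/s%n",
                demo.onEvent(new Position(t, i * 1.0, 0.0)));
        }
    }
}
```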