Efficient Update Data Generation for DBMS Benchmarks. Frank, Michael; Poess, Meikel; Rabl, Tilmann (2012). 169–180.
Industry standard benchmarks have proven crucial to the innovation and productivity of the computing industry. They are important for the fair and standardized assessment of performance across different vendors, different system versions from the same vendor, and different architectures. Good benchmarks are even meant to drive industry and technology forward. However, once all reasonable advances have been made using a particular benchmark, even good benchmarks become obsolete. This is why standard consortia periodically overhaul their existing benchmarks or develop new ones. An extremely time- and resource-consuming task in the creation of new benchmarks is the development of benchmark generators, especially because benchmarks tend to become more and more complex. The first version of the Parallel Data Generation Framework (PDGF), a generic data generator, was capable of generating data for the initial load of arbitrary relational schemas. It was, however, not able to generate data for the actual workload, i.e., input data for transactions (insert, delete, and update), incremental loads, etc., mainly because it did not understand the notion of updates. Updates are data changes that occur over time, e.g., a customer changes address, switches jobs, gets married, or has children. Many benchmarks need to reflect these changes during their workloads. In this paper we present PDGF Version 2, which contains extensions enabling the generation of update data.
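PDGF itself is a Java framework and its internals are not reproduced here; the Python sketch below only illustrates the core idea the abstract alludes to: when every field value is derived deterministically from a seed, an updated value can be recomputed on demand by adding a version (time) dimension, without storing any generated data. All names and parameters are illustrative.

```python
import hashlib
import random

def field_value(global_seed: int, table: str, row: int, column: str, version: int) -> str:
    """Deterministically derive the value of (table, row, column) at a given
    update version. Seeding a PRNG from a hash of these coordinates makes
    every value reproducible, so no generated state must be kept."""
    key = f"{global_seed}:{table}:{row}:{column}:{version}".encode()
    seed = int.from_bytes(hashlib.sha256(key).digest()[:8], "big")
    rng = random.Random(seed)
    streets = ["Main St", "Oak Ave", "Hill Rd", "Lake Dr"]  # illustrative value domain
    return f"{rng.randint(1, 999)} {rng.choice(streets)}"

# Initial load uses version 0; an update transaction for the same customer
# simply requests the next version of the field -- no state between calls.
print(field_value(42, "customer", 1337, "address", 0))  # value at load time
print(field_value(42, "customer", 1337, "address", 1))  # value after one update
```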
BigBench Specification V0.1 - BigBench: An Industry Standard Benchmark for Big Data Analytics. Rabl, Tilmann; Ghazal, Ahmad; Hu, Minqing; Crolotte, Alain; Raab, Francois; Poess, Meikel; Jacobsen, Hans-Arno (2012). 164–201.
In this article, we present the specification of BigBench, an end-to-end big data benchmark proposal. BigBench models a retail product supplier. The benchmark proposal covers a data model and a set of big-data-specific queries. BigBench's synthetic data generator addresses the variety, velocity, and volume aspects of big data workloads. The structured part of the BigBench data model is adopted from the TPC-DS benchmark. In addition, the structured schema is enriched with semi-structured and unstructured data components that are common in a retail product supplier environment. This specification contains the full query set as well as the data model.
Processing Big Events with Showers and Streams. Doblander, Christoph; Rabl, Tilmann; Jacobsen, Hans-Arno (2012). 60–71.
Emerging use cases from the areas of cloud computing, smart power grids, and business process management require a set of capabilities not met by traditional event processing systems. We identify three such use cases to illustrate the capabilities required from systems that are able to process what we refer to as Big Events, that is, Big Data in motion, and we analyze the characteristics of the events involved. Based on this analysis, we specify requirements regarding the event schema, event query language, historic event processing needs, event timing, and result accuracy. We refer to the constellation of state changes in a given system that exhibits these characteristics as an event shower, i.e., the collective of these events, analogous to the notion of an event stream in event stream processing. We call systems that offer capabilities for meeting these requirements event shower processing systems, in contrast to traditional event (stream) processing systems. The use cases we picked demonstrate that additional value can be captured by having shower processing systems in place. The benefits lie in new possibilities to gain additional insights, increase observability, and further exert control, as well as in opportunities for optimization in the given applications.
Big Data Generation. Rabl, Tilmann; Jacobsen, Hans-Arno (2012). 20–27.
Big data challenges are end-to-end problems. When handling big data, it usually has to be preprocessed, moved, loaded, processed, and stored many times. This has led to the creation of big data pipelines. Current benchmarks related to big data focus only on isolated aspects of this pipeline, usually the processing, storage, and loading aspects. To date, no benchmark has been presented that covers the end-to-end aspects of big data systems. In this paper, we discuss the necessity of ETL-like tasks in big data benchmarking and propose the Parallel Data Generation Framework (PDGF) for generating the corresponding data. PDGF is a generic data generator that was implemented at the University of Passau and is currently adopted in TPC benchmarks.
Setting the Direction for Big Data Benchmark Standards. Baru, Chaitanya K.; Bhandarkar, Milind A.; Nambiar, Raghunath Othayoth; Poess, Meikel; Rabl, Tilmann (2012). 197–208.
Solving manufacturing equipment monitoring through efficient complex event processing: DEBS grand challenge. Rabl, Tilmann; Zhang, Kaiwen; Sadoghi, Mohammad; Pandey, Navneet Kumar; Nigam, Aakash; Wang, Chen; Jacobsen, Hans-Arno (2012). 335–340.
Solving Big Data Challenges for Enterprise Application Performance Management. Rabl, Tilmann; Sadoghi, Mohammad; Jacobsen, Hans-Arno; Gómez-Villamor, Sergio; Muntés-Mulero, Victor; Mankowskii, Serge in PVLDB (2012). 5(12) 1724–1735.
As the complexity of enterprise systems increases, the need for monitoring and analyzing such systems also grows. A number of companies have built sophisticated monitoring tools that go far beyond simple resource utilization reports. For example, based on instrumentation and specialized APIs, it is now possible to monitor single method invocations and trace individual transactions across geographically distributed systems. This high level of detail enables more precise forms of analysis and prediction but comes at the price of high data rates (i.e., big data). To maximize the benefit of data monitoring, the data has to be stored for an extended period of time for later analysis. This new wave of big data analytics imposes new challenges, especially for application performance monitoring systems. The monitoring data has to be stored in a system that can sustain the high data rates and at the same time enable an up-to-date view of the underlying infrastructure. With the advent of modern key-value stores, a variety of data storage systems have emerged that are built with a focus on scalability and the high data rates predominant in this monitoring use case. In this work, we present our experience and a comprehensive performance evaluation of six modern (open-source) data stores in the context of application performance monitoring, as part of a CA Technologies initiative. We evaluated these systems with data and workloads that can be found in application performance monitoring, as well as in online advertisement, power monitoring, and many other use cases. We present our insights not only as performance results but also as lessons learned from our experience with the setup and configuration complexity of these data stores in an industry setting.
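The six data stores and the exact workloads are detailed in the paper; the following sketch merely illustrates the shape of such a write-heavy monitoring workload, measured against a minimal key-value interface. The `MetricStore` protocol, the in-memory stand-in, and all parameters are hypothetical; real targets would sit behind their own client libraries.

```python
import time
from typing import Protocol

class MetricStore(Protocol):
    """Minimal key-value interface standing in for a real data store."""
    def put(self, key: str, value: dict) -> None: ...

class InMemoryStore:
    """Trivial dict-backed stand-in used only to make the sketch runnable."""
    def __init__(self) -> None:
        self.data: dict[str, dict] = {}
    def put(self, key: str, value: dict) -> None:
        self.data[key] = value

def ingest_benchmark(store: MetricStore, agents: int, samples: int) -> float:
    """Insert one monitoring record per (agent, timestamp) pair and return
    the sustained insert throughput in records per second."""
    start = time.perf_counter()
    for agent in range(agents):
        for ts in range(samples):
            store.put(f"agent{agent}:{ts}", {"ts": ts, "resp_ms": 1.2, "calls": 7})
    elapsed = time.perf_counter() - start
    return agents * samples / elapsed

print(f"{ingest_benchmark(InMemoryStore(), agents=100, samples=1000):,.0f} inserts/s")
```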
Landmark-Assisted Location and Tracking in Outdoor Mobile Network. Anisetti, Marco; Ardagna, Claudio Agostino; Bellandi, Valerio; Damiani, Ernesto; Döller, Mario; Stegmaier, Florian; Rabl, Tilmann; Kosch, Harald; Brunie, Lionel in Multimedia Tools Appl. (2012). 59(1) 89–111.
Technical enhancements of mobile technologies and the integration of multiple sensors, like accelerometers and cameras, within mobile devices are paving the way for high-quality and accurate geolocation solutions based on the information acquired by multimodal sensors and on data collected and managed by GSM/3G networks. In this paper, we present a technique that provides geolocation and mobility prediction for mobile devices, mixing the location information acquired through the GSM/3G infrastructure with landmark matching enabled by the camera integrated into the device. We first present our geolocation approach, based on an advanced Time-Forwarding algorithm and on a database correlation technique over Received Signal Strength Indication (RSSI) data. Then, we integrate it with a landmark recognition infrastructure to enhance our algorithm in areas with poor signal and low geolocation accuracy. The radio-signal-based location is thus improved by integrating the information obtainable via the landmark recognition infrastructure directly into the geolocation algorithm. Finally, the performance of the geolocation algorithm is carefully validated through extensive experimentation carried out on real data collected from the mobile network antennas of a complex urban environment.
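The paper's Time-Forwarding algorithm and landmark integration are beyond a short sketch, but the standard database correlation technique over RSSI that the approach builds on can be illustrated: a measured signal-strength vector is matched against a database of fingerprints with known positions. The fingerprint database, coordinates, and cell names below are entirely illustrative, not data from the paper.

```python
import math

# Fingerprint database: known position -> RSSI (dBm) per visible antenna.
FINGERPRINTS = {
    (48.573, 13.457): {"cellA": -61, "cellB": -78, "cellC": -90},
    (48.574, 13.459): {"cellA": -70, "cellB": -65, "cellC": -88},
    (48.576, 13.455): {"cellA": -85, "cellB": -72, "cellC": -60},
}

def locate(measurement: dict[str, int]) -> tuple[float, float]:
    """Return the fingerprint position whose RSSI vector is closest to the
    measurement (Euclidean distance in signal space, missing cells skipped)."""
    def distance(fp: dict[str, int]) -> float:
        shared = measurement.keys() & fp.keys()
        return math.sqrt(sum((measurement[c] - fp[c]) ** 2 for c in shared))
    return min(FINGERPRINTS, key=lambda pos: distance(FINGERPRINTS[pos]))

print(locate({"cellA": -63, "cellB": -76, "cellC": -91}))  # matches the first fingerprint
```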