A PDGF Implementation for TPC-H. Poess, Meikel; Rabl, Tilmann; Frank, Michael; Danisch, Manuel (2011). 196–212.
With 182 benchmark results from 20 hardware vendors, TPC-H has established itself as the industry-standard benchmark for measuring the performance of decision support systems. TPC-H was released twelve years ago by the Transaction Processing Performance Council (TPC) and is based on an earlier decision support benchmark, TPC-D, which was released in 1994. TPC-H inherited TPC-D's data and query generators, DBgen and Qgen. As systems have evolved over time, maintaining these tools has become a major burden for the TPC: DBgen and Qgen need to be ported to new hardware architectures and adapted as databases have grown to multiple terabytes. In this paper we demonstrate how the Parallel Data Generation Framework (PDGF), a generic data generator developed at the University of Passau for massively parallel data generation, can be adapted for TPC-H.
Parallel Data Generation for Performance Analysis of Large, Complex RDBMS. Rabl, Tilmann; Poess, Meikel (2011). 5.
The exponential growth in the amount of data retained by today's systems is fostered by the recent paradigm shift towards cloud computing and the vast deployment of data-hungry applications such as social media sites. At the same time, systems are capturing more sophisticated data. Running realistic benchmarks to test the performance and robustness of these applications is becoming increasingly difficult because of the amount of data that needs to be generated, the number of systems that must generate it, and the complex structure of the data. These three aspects are intrinsically connected: whenever large amounts of data are needed, the generation process must be highly parallel, in many cases across systems; and since the structure of the data is becoming more and more complex, parallel generation is extremely challenging. Over the years there have been many papers on data generators, but no comprehensive overview of the requirements of today's data generators covering the most complex problems to be solved. In this paper we present such an overview by analyzing these requirements and either explaining how the problems have been solved in existing data generators or showing why they have not been solved yet.
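To make the connection between complex data structure and parallel generation concrete, here is a minimal Python sketch (illustrative only; the seeding scheme, function names, and table names are hypothetical and not taken from any of the generators surveyed in the paper). If every field value is a pure function of a global seed and the tuple's coordinates, a worker can resolve a foreign-key reference by simply recomputing the referenced tuple instead of asking the node that generated it, which removes inter-node communication from the generation process.

    import hashlib

    GLOBAL_SEED = 42  # assumption: one seed agreed upon by all generating nodes

    def field_value(table: str, row: int, column: str) -> int:
        # Deterministically derive a value from (seed, table, row, column).
        key = f"{GLOBAL_SEED}:{table}:{row}:{column}".encode()
        return int.from_bytes(hashlib.sha256(key).digest()[:8], "big")

    def customer_name(row: int) -> str:
        return f"customer_{field_value('customer', row, 'name') % 100000}"

    def order_tuple(row: int, n_customers: int) -> tuple:
        # Foreign key: choose a customer row deterministically, then recompute
        # the referenced name locally instead of looking it up on another node.
        cust_row = field_value('orders', row, 'cust_fk') % n_customers
        return (row, cust_row, customer_name(cust_row))

    # Two workers that both need order 7 derive the identical tuple:
    print(order_tuple(7, 1000))  # worker A
    print(order_tuple(7, 1000))  # worker B, no communication required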
A Protocol for Disaster Data Evacuation. Rabl, Tilmann; Stegmaier, Florian; Döller, Mario; Vang, The Thong (2011). 448–449.
Data is the basis of the modern information society. However, recent natural catastrophes have shown that it is not possible to definitively secure a data storage location. Even if the storage location itself is not destroyed, access to it may quickly become impossible due to broken connections or power failures. Such disasters rarely strike without any warning: floods allow hours or days of warning time, tsunamis usually leave only minutes for reaction, and earthquakes only seconds. In such situations, timely evacuation of important data is the key challenge. Consequently, the focus lies on minimizing the time needed to move all data away from the storage location, whereas the actual time of arrival remains less (but still) important. This demonstration presents the dynamic fast send protocol (DFSP), a new bulk data transfer protocol. It employs striping to dynamically selected intermediate nodes in order to minimize sending time and to utilize the sender's resources to a high extent.
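As a rough illustration of the striping idea (a sketch under assumptions; DFSP's actual wire protocol, node selection, and flow control are not reproduced here, and send_stripe is a hypothetical placeholder), the sender cuts its data into stripes and pushes them to several intermediate nodes concurrently, so that the endangered storage location is drained as fast as the sender's aggregate uplink allows:

    from concurrent.futures import ThreadPoolExecutor

    STRIPE_SIZE = 4 * 1024 * 1024  # assumption: fixed 4 MiB stripes

    def send_stripe(node: str, stripe_id: int, payload: bytes) -> None:
        # Placeholder for the actual network transfer to an intermediate node.
        print(f"stripe {stripe_id} ({len(payload)} bytes) -> {node}")

    def evacuate(data: bytes, intermediates: list[str]) -> None:
        # Stripe the data round-robin across the intermediates and send in
        # parallel; minimizing *sending* time matters more than when the
        # stripes finally reach their ultimate destination.
        stripes = [data[i:i + STRIPE_SIZE] for i in range(0, len(data), STRIPE_SIZE)]
        with ThreadPoolExecutor(max_workers=len(intermediates)) as pool:
            for i, stripe in enumerate(stripes):
                pool.submit(send_stripe, intermediates[i % len(intermediates)], i, stripe)

    evacuate(b"x" * (10 * STRIPE_SIZE), ["node-a", "node-b", "node-c"])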
Demonstration of the Parallel Data Generation Framework. Rabl, Tilmann; Sergieh, Hatem Mousselly; Frank, Michael; Kosch, Harald (2011). 730–733.
In many academic and industrial applications, data volumes are breaking through the petabyte barrier. This confronts database research with new tasks and research fields. Petabytes of data are usually stored in large clusters or clouds. Even though clouds have become very popular in recent years, there is still little work on benchmarking applications in clouds. In this contribution we present a data generator that was designed for generating data in clouds. The generator's architecture is laid out for easy extensibility and configurability. Its most important property is fully parallel processing, which allows an optimal speedup on an arbitrary number of compute nodes. The demonstration covers both the creation of a schema and data generation with different degrees of parallelism. To enable interested users to define their own databases, the framework is also available online.
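A simplified view of how fully parallel generation can achieve this speedup (a sketch assuming PDGF-style deterministic value computation; the seeding scheme shown is illustrative, not PDGF's actual one): each node derives its own row range from its node ID alone, and every row is produced by a PRNG seeded from the row number, so no coordination is needed during generation and the concatenated output is independent of the number of nodes.

    import random

    def row_range(total_rows: int, node_id: int, node_count: int) -> range:
        # Static partitioning: node i owns one contiguous slice of the table.
        chunk = total_rows // node_count
        start = node_id * chunk
        end = total_rows if node_id == node_count - 1 else start + chunk
        return range(start, end)

    def generate_row(seed: int, row: int) -> tuple:
        # The row is a function of (seed, row) only, never of earlier rows,
        # so any node can produce any row without talking to the others.
        rng = random.Random(f"{seed}:{row}")
        return (row, rng.randint(18, 99), rng.choice(["DE", "FR", "US"]))

    # Three "nodes" generating a 10-row table; concatenating their output
    # yields exactly the table a single node would have produced.
    for node in range(3):
        print([generate_row(4711, r) for r in row_range(10, node, 3)])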
Efficiency in Cluster Database Systems - Dynamic and Workload-Aware Scaling and Allocation. Rabl, Tilmann (2011). PhD dissertation.
Database systems have been vital to all forms of data processing for a long time. In recent years, the amount of processed data has been growing dramatically, even in small projects. Nevertheless, database management systems tend to be static in terms of size and performance, which makes scaling a difficult and expensive task. Because of performance and especially cost advantages, more and more installed systems have a shared-nothing cluster architecture, and due to the massive parallelism of the hardware, programming paradigms from high-performance computing are being carried over into data processing. Database research struggles to keep up with this trend. A key feature of traditional database systems is to provide transparent access to the stored data. This introduces data dependencies and increases system complexity and inter-process communication. Therefore, many developers are trading this feature for better scalability. However, explicitly managing the data distribution and data flow requires a deep understanding of the distributed system and reduces the possibilities for automatic and autonomic optimization. In this thesis we present an approach for database system scaling and allocation that features good scalability while keeping the data distribution transparent.

The first part of this thesis analyzes the challenges and opportunities for self-scaling database management systems in cluster environments. Scalability is a major concern of Internet-based applications: access peaks that overload the application are a financial risk, so systems are usually configured to be able to process peaks at any given moment. As a result, server systems often have a very low utilization. In distributed systems, efficiency can be increased by adapting the number of nodes to the current workload. We propose a processing model and an architecture that allow efficient self-scaling of cluster database systems.

In the second part we consider different allocation approaches. To increase efficiency we present a workload-aware, query-centric model. The approach is formalized, and optimal as well as heuristic algorithms are presented. The algorithms optimize the data distribution for local query execution and balance the workload according to the query history. We present different query classification schemes for different forms of partitioning. The approach is evaluated for OLTP- and OLAP-style workloads and shown to scale well for both fields of application.

The third part of the thesis considers benchmarks for large, adaptive systems. First, we present a data generator for cloud-sized applications. Due to its architecture, the data generator can easily be extended and configured. A key feature is the high degree of parallelism that makes linear speedup for arbitrary numbers of nodes possible. To simulate systems with user interaction, we have analyzed a productive online e-learning management system. Based on our findings, we present a model for workload generation that considers the temporal dependency of user interaction.
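As a toy illustration of the query-centric allocation idea in the second part (a hypothetical greedy heuristic, not one of the dissertation's algorithms; the query history and node names are made up): given a history recording which partitions are accessed together and how often, assign heavily used partitions first and co-locate partitions with high co-access affinity, subject to a per-node capacity bound that keeps the load balanced.

    import math
    from collections import defaultdict

    # Hypothetical query history: (frequency, set of partitions touched together).
    history = [(50, {"A", "B"}), (30, {"B", "C"}), (20, {"D"}), (10, {"A", "C"})]
    nodes = {"node1": set(), "node2": set()}

    # Total access frequency per partition.
    weight = defaultdict(int)
    for freq, parts in history:
        for p in parts:
            weight[p] += freq

    capacity = math.ceil(len(weight) / len(nodes))  # balance the partition count

    def affinity(partition, placed):
        # How often `partition` is co-accessed with partitions already on a
        # node; placing it there lets those queries execute locally.
        return sum(freq for freq, parts in history
                   if partition in parts and parts & placed)

    # Greedy: hottest partitions first, onto the node with the best affinity.
    for p in sorted(weight, key=weight.get, reverse=True):
        candidates = [n for n in nodes if len(nodes[n]) < capacity]
        best = max(candidates, key=lambda n: (affinity(p, nodes[n]), -len(nodes[n])))
        nodes[best].add(p)

    print(nodes)  # the frequently co-accessed pair {A, B} lands on one node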