Advancing Big Data Benchmarks - Proceedings of the 2013 Workshop Series on Big Data Benchmarking, WBDB.cn, Xi’an, China, July 16-17, 2013, and WBDB.us, San José, CA, USA, October 9-10, 2013, Revised Selected Papers. Rabl, Tilmann; Jacobsen, Hans-Arno; Nambiar, Raghunath; Poess, Meikel; Bhandarkar, Milind A.; Baru, Chaitanya K. in Lecture Notes in Computer Science (2014). (Vol. 8585) Springer.
Specifying Big Data Benchmarks - First Workshop, WBDB 2012, San Jose, CA, USA, May 8-9, 2012, and Second Workshop, WBDB 2012, Pune, India, December 17-18, 2012, Revised Selected Papers. Rabl, Tilmann; Poess, Meikel; Baru, Chaitanya K.; Jacobsen, Hans-Arno in Lecture Notes in Computer Science (2014). (Vol. 8163) Springer.
Towards a Complete BigBench Implementation. Rabl, Tilmann; Frank, Michael; Danisch, Manuel; Gowda, Bhaskar; Jacobsen, Hans-Arno (2014). 3–11.
BigBench was the first proposal for an end-to-end big data analytics benchmark. It features a set of 30 realistic queries based on real big data use cases. It was fully specified and completely implemented on the Hadoop stack. In this paper, we present updates on our development of a complete implementation on the Hadoop ecosystem. We focus on the changes we have made to the data set, the scaling, the refresh process, and the metric.
Discussion of BigBench: A Proposed Industry Standard Performance Benchmark for Big Data. Baru, Chaitanya K.; Bhandarkar, Milind A.; Curino, Carlo; Danisch, Manuel; Frank, Michael; Gowda, Bhaskar; Jacobsen, Hans-Arno; Jie, Huang; Kumar, Dileep; Nambiar, Raghunath Othayoth; Poess, Meikel; Raab, Francois; Rabl, Tilmann; Ravi, Nishkam; Sachs, Kai; Sen, Saptak; Yi, Lan; Youn, Choonhan (2014). 44–63.
PSBench: A Benchmark for Content- and Topic-Based Publish/Subscribe Systems. Zhang, Kaiwen; Rabl, Tilmann; Sun, Yi Ping; Kumar, Rushab; Zen, Nayeem; Jacobsen, Hans-Arno (2014). 17–18.
CaSSanDra: An SSD Boosted Key-Value Store. Menon, Prashanth; Rabl, Tilmann; Sadoghi, Mohammad; Jacobsen, Hans-Arno (2014). 1162–1167.
With the ever-growing size and complexity of enterprise systems, there is a pressing need for more detailed application performance management. Due to the high data rates, traditional database technology cannot sustain the required performance. Alternatives are the more lightweight and, thus, more performant key-value stores. However, these systems tend to sacrifice read performance in order to obtain the desired write throughput by avoiding random disk access in favor of fast sequential accesses. With the advent of SSDs, built upon the philosophy of no moving parts, the boundary between sequential and random access is becoming blurred. This provides a unique opportunity to extend the storage memory hierarchy in key-value stores using SSDs. In this paper, we extensively evaluate the benefits of using SSDs in commercialized key-value stores. In particular, we investigate the performance of hybrid SSD-HDD systems and demonstrate the benefits of our SSD caching and our novel dynamic schema model.
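To illustrate the general idea of a hybrid SSD-HDD key-value store, the following minimal sketch places a bounded, SSD-like cache tier in front of an HDD-resident store; reads try the cache first and fall back to the slower tier. The class name, the LRU eviction policy, and the write-through strategy are assumptions made for this example and do not reproduce the CaSSanDra design described in the paper.

# Illustrative sketch only: a hypothetical two-tier store that serves reads
# from a bounded SSD-resident cache before falling back to the HDD store.
# Names, the LRU policy, and write-through are assumptions for illustration.
from collections import OrderedDict

class HybridKeyValueStore:
    def __init__(self, hdd_store, ssd_capacity=1024):
        self.hdd = hdd_store              # large HDD tier (any dict-like store)
        self.ssd = OrderedDict()          # stands in for an SSD-resident cache
        self.capacity = ssd_capacity

    def put(self, key, value):
        self.hdd[key] = value             # write-through to the HDD tier
        self._cache(key, value)           # keep hot data on the "SSD"

    def get(self, key):
        if key in self.ssd:               # fast path: random read on the SSD tier
            self.ssd.move_to_end(key)
            return self.ssd[key]
        value = self.hdd[key]             # slow path: read from the HDD tier
        self._cache(key, value)
        return value

    def _cache(self, key, value):
        self.ssd[key] = value
        self.ssd.move_to_end(key)
        if len(self.ssd) > self.capacity: # evict the least recently used entry
            self.ssd.popitem(last=False)

# Usage: wrap any dict-like HDD-backed store.
store = HybridKeyValueStore(hdd_store={}, ssd_capacity=2)
store.put("a", 1); store.put("b", 2); store.put("c", 3)
print(store.get("c"))                     # served from the cache tier

A write-through policy keeps the example simple; an actual hybrid store would also have to decide which objects are hot enough to merit SSD space and how to persist the cache across restarts.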
Optimizing Key-Value Stores for Hybrid Storage Architectures. Menon, Prashanth; Rabl, Tilmann; Sadoghi, Mohammad; Jacobsen, Hans-Arno (2014). 355–358.
Materialized Views in Cassandra. Rabl, Tilmann; Jacobsen, Hans-Arno (2014). 351–354.
Many web companies deal with enormous data sizes and request rates beyond the capabilities of traditional database systems. This has led to the development of modern Big Data Platforms (BDPs). BDPs handle large amounts of data and activity through massively distributed infrastructures. To achieve performance and availability at Internet scale, BDPs restrict querying capability and provide weaker consistency guarantees than traditional ACID transactions. The reduced functionality found in key-value stores is sufficient for many web applications. An important requirement of many big data systems is an online view of the current status of the data and activity. Typical big data systems such as key-value stores only allow key-based access. In order to enable more complex querying mechanisms while satisfying the necessary latencies, materialized views are employed. The efficiency of maintaining these views is a key factor in the usability of the system. Expensive operations such as full table scans are impractical for small, frequent modifications on Internet-scale data sets. In this paper, we present an efficient implementation of materialized views in key-value stores that enables complex query processing and is tailored for efficient maintenance.
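The following minimal sketch illustrates the kind of incremental view maintenance the abstract refers to: a secondary, view-like structure is updated on every write, so a query over a non-key attribute can be answered without a full table scan. The table, attribute, and class names are hypothetical and do not reflect the Cassandra implementation presented in the paper.

# Illustrative sketch only: incremental maintenance of a simple materialized
# view over a key-value store. The view indexes users by country so that a
# non-key attribute can be queried without scanning the base table.
class KeyValueStoreWithView:
    def __init__(self):
        self.base = {}                      # primary table: user_id -> record
        self.view_by_country = {}           # materialized view: country -> {user_id}

    def put(self, user_id, record):
        old = self.base.get(user_id)
        if old is not None:                 # remove the stale view entry on update
            self.view_by_country[old["country"]].discard(user_id)
        self.base[user_id] = record
        self.view_by_country.setdefault(record["country"], set()).add(user_id)

    def users_in(self, country):
        # Answered from the view: no scan over the base table is needed.
        return {uid: self.base[uid] for uid in self.view_by_country.get(country, set())}

kv = KeyValueStoreWithView()
kv.put("u1", {"name": "Ada", "country": "CA"})
kv.put("u2", {"name": "Bob", "country": "DE"})
print(kv.users_in("CA"))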
DGFIndex for Smart Grid: Enhancing Hive with a Cost-Effective Multidimensional Range Index. Liu, Yue; Hu, Songlin; Rabl, Tilmann; Liu, Wantao; Jacobsen, Hans-Arno; Wu, Kaifeng; Chen, Jian; Li, Jintao in PVLDB (2014). 7(13) 1496–1507.
In Smart Grid applications, as the number of deployed electric smart meters increases, massive amounts of valuable meter data are generated and collected every day. To enable reliable data collection and fast business decisions, high-throughput storage and high-performance analysis of massive meter data become crucial for grid companies. Given the efficiency, fault tolerance, and price-performance of Hadoop and Hive, these systems are frequently deployed as the underlying platform for big data processing. However, in real business use cases, data analysis applications typically involve multidimensional range queries (MDRQ) as well as batch reading and statistics on the meter data. While Hive performs well for complex batch reading and analysis, it lacks efficient indexing techniques for MDRQ. In this paper, we propose DGFIndex, an index structure for Hive that efficiently supports MDRQ on massive meter data. DGFIndex divides the data space into cubes using the grid file technique. Unlike the existing indexes in Hive, which store all combinations of multiple dimensions, DGFIndex only stores information about the cubes. This leads to a smaller index size and faster query processing. Furthermore, by pre-computing user-defined aggregations for each cube, DGFIndex only needs to access the boundary region of a query for aggregation queries. Our comprehensive experiments show that DGFIndex saves significant disk space compared with the existing indexes in Hive, and that query performance with DGFIndex is 2-63 times faster than with existing Hive indexes, 2-94 times faster than HadoopDB, and 2-75 times faster than scanning the whole table, across different query selectivities.
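The following sketch illustrates the grid-file idea behind a DGFIndex-style structure on a tiny in-memory data set rather than on Hive/HDFS: records are binned into fixed-size cubes, each cube keeps a pre-computed aggregate, and a range-aggregation query uses the pre-aggregates for cubes fully covered by the query while scanning raw records only in boundary cubes. The cell size, the two-dimensional setting, and all names are assumptions made for illustration.

# Illustrative sketch only: grid-file partitioning with per-cube pre-aggregates.
from collections import defaultdict

CELL = 10  # grid cell width per dimension (chosen arbitrarily for the example)

# (x, y) coordinates with a measured value, standing in for meter readings
points = [((3, 4), 5.0), ((12, 7), 2.0), ((25, 31), 7.5), ((14, 18), 1.0)]

cube_sum = defaultdict(float)    # pre-computed aggregate per cube
cube_rows = defaultdict(list)    # raw records per cube, scanned only on the boundary
for (x, y), v in points:
    cube = (x // CELL, y // CELL)
    cube_sum[cube] += v
    cube_rows[cube].append(((x, y), v))

def range_sum(x_lo, x_hi, y_lo, y_hi):
    total = 0.0
    for (cx, cy), s in cube_sum.items():
        lo = (cx * CELL, cy * CELL)
        hi = ((cx + 1) * CELL - 1, (cy + 1) * CELL - 1)
        if hi[0] < x_lo or lo[0] > x_hi or hi[1] < y_lo or lo[1] > y_hi:
            continue                                   # cube outside the query range
        if x_lo <= lo[0] and hi[0] <= x_hi and y_lo <= lo[1] and hi[1] <= y_hi:
            total += s                                 # fully covered: use the pre-aggregate
        else:
            for (x, y), v in cube_rows[(cx, cy)]:      # boundary cube: scan raw rows
                if x_lo <= x <= x_hi and y_lo <= y <= y_hi:
                    total += v
    return total

print(range_sum(0, 19, 0, 19))  # 5.0 + 2.0 + 1.0 = 8.0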
TPC-DI: The First Industry Benchmark for Data Integration. Poess, Meikel; Rabl, Tilmann; Jacobsen, Hans-Arno; Caufield, Brian in PVLDB (2014). 7(13) 1367–1378.
Historically, the process of synchronizing a decision support system with data from operational systems has been referred to as Extract, Transform, Load (ETL), and the tools supporting this process have been referred to as ETL tools. Recently, ETL has been replaced by the more comprehensive term data integration (DI). DI describes the process of extracting and combining data from a variety of data source formats, transforming that data into a unified data model representation, and loading it into a data store. This is done in the context of a variety of scenarios, such as data acquisition for business intelligence, analytics and data warehousing, but also synchronization of data between operational applications, data migrations and conversions, master data management, enterprise data sharing, and delivery of data services in a service-oriented architecture context, amongst others. With these scenarios relying on up-to-date information, it is critical to implement a highly performing, scalable, and easy-to-maintain data integration system. This is especially important as the complexity, variety, and volume of data are constantly increasing and the performance of data integration systems is becoming ever more critical. Despite the significance of a highly performing DI system, there has been no industry standard for measuring and comparing DI performance. The TPC, acknowledging this void, has released TPC-DI, an innovative benchmark for data integration. This paper motivates the reasons behind its development, describes its main characteristics, including the workload, run rules, and metric, and explains key decisions.
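As a toy illustration of the extract-transform-load flow described above, the following sketch reads records from a single hypothetical CSV source, maps them into a unified representation, and loads them into an in-memory SQLite store; the source format, field names, and target table are assumptions and do not represent the TPC-DI schema, workload, or metric.

# Illustrative sketch only: the extract-transform-load flow reduced to a toy pipeline.
import csv, io, sqlite3

raw_customers = "id,name,joined\n1,Ada Lovelace,2013-05-08\n2,Grace Hopper,2013-12-17\n"

def extract(text):
    # Extract: parse records from one of possibly many source formats.
    return list(csv.DictReader(io.StringIO(text)))

def transform(rows):
    # Transform: map source records into a unified target representation.
    return [(int(r["id"]), r["name"].upper(), r["joined"]) for r in rows]

def load(records, conn):
    # Load: write the unified records into the decision-support data store.
    conn.execute("CREATE TABLE IF NOT EXISTS customer (id INTEGER, name TEXT, joined TEXT)")
    conn.executemany("INSERT INTO customer VALUES (?, ?, ?)", records)
    conn.commit()

conn = sqlite3.connect(":memory:")
load(transform(extract(raw_customers)), conn)
print(conn.execute("SELECT * FROM customer").fetchall())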