Hasso-Plattner-Institut
Prof. Dr. Tilmann Rabl
 

HDES - A Dynamic Stream Processing Engine

Students: Nico Duldhardt, Torben Meyer, Marvin Thiele and Anton von Weltzien
Supervisors: Tilmann Rabl, Lawrence Benson

HDES (High-performance Dynamic Event Streaming) is a novel stream processing engine (SPE) that is built with first-class support for ad-hoc queries. Through ad-hoc queries, users can dynamically add and remove queries during runtime.
While previous approaches, such as AStream [1] and AJoin [2], added ad-hoc functionality on top of common open-source SPEs, HDES aims to make dynamicity a first-class citizen. This change in perspective allows us to optimize this use-case upfront and improves the performance of ad-hoc and multi-query usage scenarios.

Today, most commonly used SPEs, such as Apache Flink [3], Apache Storm [4] and Apache Spark Streaming [5], are optimized for running a fixed set of queries for a long period of time. While these tools can handle the most common workloads, they lack adaptability.
When a user wants to add a query, the state-of-the-art SPEs have to halt, recompile all queries and resubmit the job. While this process is happening, no new insights can be derived from the incoming data, and potentially important computations are delayed.
Dynamic SPEs would allow the user to create and submit a query without stopping the currently running queries. One can imagine a team of data-scientist who want to quickly verify a few hypotheses on incoming data streams, submitting several short-lived queries to gain new insights.

However, dynamic SPEs have a specific challenge: Running a large number of queries at the same time quickly exhausts the computational resources. Computation-intensive join queries cause a large part of this problem. To tackle this issue, we build on and extend the previous work of AStream and AJoin, while also utilizing the benefits of designing the HDES stream processing engine with ad-hoc query execution in mind.
HDES shares computation between queries that use the same data-sources and join keys which are crucial to achieving good performance. HDES also uses an execution plan which can be dynamically expanded and altered at runtime to enable ad-hoc queries.
In our benchmark, we have shown that HDES compares favourably to Apache Flink when executing multiple queries at once. Our benchmarks also show that HDES is capable of sustaining ad-hoc query workloads with complex queries inspired by the Nexmark benchmark.

More in-depth information on HDES can be obtained from our report [6] and presentation [7]. The code for HDES will be made availble soon.

References

[1] Jeyhun Karimov, Tilmann Rabl, Volker Markl: AStream: Ad-hoc Shared Stream Processing. SIGMOD 2019. https://jeyhunkarimov.github.io/assets/publications/karimov-astream-ad-hoc-shared-stream-processing.pdf

[2] Karimov, J., Rabl, T., Markl, V.: AJoin: Ad-hoc Stream Joins at Scale.Proceedings of the VLDB Endowment (2020). https://github.com/jeyhunkarimov/jeyhunkarimov.github.io/blob/master/assets/publications/karimov-ajoin-ad-hoc-stream-joins-at-scale.pdf

[3] Apache Flink, https://flink.apache.org/

[4] Apache Storm, https://storm.apache.org/

[5] Apache Spark, https://spark.apache.org/

[6] Report

[7] Slides

 

Project Video

...click to watch

Dynamic Stream Processing

Master Project, Winter 2019/2020

Original Project Description

The digital revolution leads to ever increasing amounts of data and a massively increased pace of data generation. In many use cases, archival of the data and later processing is either impossible or uneconomic due to the speed and amount of the data and the quick loss in value of data analysis over time. This has led to the development of stream processing engines (SPE), which can analysis large amounts of data in motion. This leads to two major challenges, the handling of time and potentially endless streams.

Current systems, such as Apache Spark Streaming or Apache Flink, handle these two challenges but work under the strong assumption that an analysis job is running very long and in isolation. This has led to an execution model that is very static regarding queries. Preliminary work [1] has explored the option to break this assumption and deal with streams of query additions and removals. Considering a standardized query structure, this has led to orders of magnitude improvements in throughput comparing to state of the art SPEs.

Project Outline

Goal of this master project is to build a prototype of a stream processing engine that has a concept of dynamic query deployment and removal. Unlike previous work [1], which is built on top of the SPE Apache Flink [2], in this project a standalone prototype will be built, with a clear focus on dynamicity.

The prototype should be able to process simple stream processing queries and streams. The set of query operators to be supported will be retrieved from benchmarks such as Nexmark [2], LinearRoad [3], or TPCx-IoT [4]. These need to be extended to cover the dynamic nature of the setups targeted. Based on the query set, the workload and operators can be defined. To support efficient processing under dynamic addition and removal of queries, online optimizations are required. This means that the deployed query graph can be modified, extended and parallelized. The scope of optimizations will be defined to support the initial workload.

In this project, students will learn the inner workings of stream processors and data management systems in general. It is targeting students interested in acquiring skills in data management, stream processing, data flows, and low-level systems programming.

General information and an introduction on stream processing can be found in the O’Reilly blog posts by Tyler Akidau [5,6] and the stream processing book [7].

Grading

Courses applicable: ITSE (Masterprojekt), DE (Data Engineering Lab)

Graded activity:

  • Implementation / group work
  • Final report (8 pages, double-column, ACM-art 9pt conference format)
  • Final presentation (20 min)

Contact

Tilmann Rabl & Lawrence Benson

Literature

[1] Jeyhun Karimov, Tilmann Rabl, Volker Markl: AStream: Ad-hoc Shared Stream Processing. SIGMOD 2019. https://jeyhunkarimov.github.io/assets/publications/karimov-astream-ad-hoc-shared-stream-processing.pdf

[2] Pete Tucker, Kristin Tufte, Vassilis Papadimos, and David Maier: NEXMark–A Benchmark for Queries over Data Streams (DRAFT). Technical report, OGI School of Science & Engineering at OHSU, 2008.

[3] Arvind Arasu et al.: Linear Road: A Stream Data Management Benchmark - www.cs.brandeis.edu/~linearroad/

[4] TPC Express Benchmark IoT (TPCx-IoT) - http://www.tpc.org/tpc_documents_current_versions/pdf/tpcx-iot_v1.0.3.pdf  

[5] Tyler Akidau: Streaming 101. https://www.oreilly.com/ideas/the-world-beyond-batch-streaming-101

[6] Tyler Akidau: Streaming 102. https://www.oreilly.com/ideas/the-world-beyond-batch-streaming-102

[7] Tyler Akidau, Slava Chernyak, Reuven Lax: Streaming Systems. O’Reilly. streamingsystems.net