Prof. Dr. Tilmann Rabl

Stream Processing

Tilmann Rabl, HPI


Whenever data is generated faster than traditional database systems can process it, or requires real-time updates during processing, stream processing can help. At its core, stream processing focuses on the processing of unbounded data sets, meaning data that is in theory ever-growing and generated continuously. Since such a data stream has no end, stream processing requires a mechanism, called windowing, to define when a partition of the data stream is ready for processing. How to efficiently compute these windows and how to optimize ad hoc queries on data streams are two of the current research concerns in the field of stream processing.

Introduction of speaker

Prof. Dr. Tilmann Rabl has been involved in database research since 2007 and received his Ph.D. from the University of Passau in 2011. After publishing his doctoral thesis on the subject of scalability and data allocation in cluster databases, he continued his work as a postdoctoral researcher at the Middleware Systems Research Group at the University of Toronto. In 2015, he joined the Database Systems and Information Management group at Technische Universität Berlin as a senior researcher and visiting professor and assumed the position of Vice Director of the Intelligent Analytics for Massive Data group at the German Research Center for Artificial Intelligence. Since 2019, he has held the chair for Data Engineering Systems at the Digital Engineering Faculty of the University of Potsdam and the Hasso Plattner Institute. In addition, he was appointed ombudsperson at the latter and is the co-founder of the startup bankmark, which focuses on the generation of test data and benchmarking for data processing systems.


Written by: Katharina Hasenlust, Daniel Juehling, Jonathan Haas

Data Streams in the Wild

In several areas of life, data-intensive computation systems require additional tools to handle the sheer amount of incoming information. From physics research to self-driving cars to the inner workings of large online retailers, seemingly never-ending data streams need to be analyzed as they arrive. These circumstances demand time-sensitive processing that allows for incremental granularity, pre-aggregation and filtering, as well as near real-time query access. By extension, resolving data streams in time is essential for performance monitoring and adequate load balancing, so that spikes in data volume can be handled promptly in dynamic and data-laden environments.


A High-Level Overview of Stream Processing

Stream processing is a set of technologies that address the challenges described above. Its operators are designed to process and analyze conceptually unlimited flows of information - often described as unbounded data sets - and produce a continuous result stream.  


Notions of time in stream processing

In stream processing, the relationship with the data source may be conceptualized as a push model, wherein the stream processor is subscribed to a flow of incoming information. In such time-resolved data analysis, a need arises to distinguish several notions of time related to the recorded information. Depending on the nature of the analysis, one might be interested in event time, ingestion time, or processing time, which represent the time the data was produced, the system time when the data was received, and the system time when the data was processed, respectively. For forms of processing that do not require any temporal resolution, a fourth category exists: time-agnostic processing.

Image Description: Conceptual depiction of a stream processing job. Figure taken from Rabl, Tilmann: Lecture Series on Database Research: Introduction & Stream Processing. 2021.

Architecture of a stream processing job

A stream processor typically consists of a number of operators that collectively handle the incoming information. Source operators accept the data stream and enrich the sources’ records with control events at set intervals, ensuring the coherence and validity of the stream. Further processing can be conducted either in a stateful or stateless fashion, depending on the particular purpose. A filter operator, for instance, is stateless and time-agnostic. Conversely, aggregation operators are stateful and require a means of determining when to output the aggregated result.

This is done by defining so-called windows on a stream. In essence, a window tells the stream processing engine how to divide a stream and when a collection of events is ready for processing. There are different types of windows; some of the more common ones are:

  • Tumbling Windows (defined by the window length)
  • Sliding Windows (defined by the window length and slide)
  • Session Windows (defined by a “gap” of inactivity: the window is closed after no events have been registered for the specified duration)
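The window types above differ in how an event’s timestamp maps to window instances. A minimal sketch (the function names and the choice of integer window indices are illustrative, not taken from any particular engine):

```python
def tumbling_window(timestamp, length):
    """Return the index of the single tumbling window containing the timestamp.

    Window i covers [i * length, (i + 1) * length).
    """
    return int(timestamp // length)


def sliding_windows(timestamp, length, slide):
    """Return the indices of all sliding windows containing the timestamp.

    Window i covers [i * slide, i * slide + length), so one event can
    belong to several overlapping windows.
    """
    last = int(timestamp // slide)  # latest window starting at or before the event
    first = max(0, int((timestamp - length) // slide) + 1)
    return list(range(first, last + 1))


# Second 25 falls into tumbling window 2 ([20, 30)) for 10-second windows.
assert tumbling_window(25, 10) == 2
# With 10-second windows sliding every 5 seconds, second 12 lies in
# windows [5, 15) and [10, 20), i.e. indices 1 and 2.
assert sliding_windows(12, 10, 5) == [1, 2]
```

Session windows, in contrast, cannot be computed from a single timestamp: the assignment depends on the gaps between consecutive events.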

In addition, the different concepts of time as mentioned above are also relevant when defining a window.

  • Processing time windows are rather simple: The system measures the passing of time and will close the window once the time as defined by the window length has elapsed. Thus, the system decides the stream partitioning but will disregard any time information in the stream by doing so.
  • In Counting Windows, as the name suggests, the system counts events until a set number is reached and then starts processing them. They are therefore similar to processing time windows: the system itself decides the partitioning.
  • Event Time windows are based on the time information from the stream (when the event occurred at the source). These are more complicated to implement since, due to network delays, streams can be unordered with regard to event time. As a consequence, the system needs a notion of when it can be certain that no more events belonging to the window will arrive. This can be achieved by defining an upper bound after which arriving events are disregarded.
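The upper bound for event time windows can be sketched as follows: the operator tracks the highest event time seen so far and closes a window once that value exceeds the window end plus an allowed lateness. All names here, including `watermark`, are illustrative:

```python
class EventTimeWindows:
    """Tumbling event-time windows with a fixed bound on out-of-orderness."""

    def __init__(self, length, lateness):
        self.length = length        # window length in event time
        self.lateness = lateness    # upper bound on out-of-orderness
        self.buffers = {}           # window index -> buffered events
        self.watermark = 0          # highest event time seen so far

    def on_event(self, event_time, value):
        """Buffer one event; return (index, events) for every window that
        is now guaranteed complete. Events arriving after their window's
        bound has passed are disregarded."""
        self.watermark = max(self.watermark, event_time)
        idx = event_time // self.length
        if (idx + 1) * self.length + self.lateness > self.watermark:
            self.buffers.setdefault(idx, []).append(value)
        closed = [i for i in self.buffers
                  if (i + 1) * self.length + self.lateness <= self.watermark]
        return [(i, self.buffers.pop(i)) for i in sorted(closed)]


w = EventTimeWindows(length=10, lateness=5)
assert w.on_event(3, "a") == []              # window [0, 10) still open
assert w.on_event(16, "c") == [(0, ["a"])]   # 16 >= 10 + 5: window [0, 10) closes
assert w.on_event(2, "late") == []           # past the bound, disregarded
```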


Once a window is closed, the events belonging to it can be processed. This is done by stream operators, which can aggregate the events using common functions such as min, max, average, sum, or count, or join them with events occurring in a different stream but in the same time window.
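A windowed stream join, as described above, matches the events of two streams that fell into the same window. A minimal sketch on one closed window (names are hypothetical), pairing events with equal keys:

```python
def window_join(left, right, key):
    """Join the events of two streams from the same closed window,
    pairing events with equal keys (a simple hash join)."""
    index = {}
    for l in left:
        index.setdefault(key(l), []).append(l)
    return [(l, r) for r in right for l in index.get(key(r), [])]


# Clicks and purchases from the same time window, joined on the user id.
clicks = [{"user": 1, "page": "a"}, {"user": 2, "page": "b"}]
buys = [{"user": 1, "item": "x"}]
pairs = window_join(clicks, buys, key=lambda e: e["user"])
assert pairs == [({"user": 1, "page": "a"}, {"user": 1, "item": "x"})]
```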


Research Topics in Stream Processing

Efficient Window Aggregation

One way of optimizing stream processing is by minimizing redundant computation and data replication when aggregating overlapping windows.

In many real-world scenarios, the window length is quite large compared to the slide. For example, one might be interested in health statistics of a server aggregated over the last hour with a one-second update time. This can be implemented as one-hour sliding windows with a slide of one second. Two adjacent windows will then have a high overlap of 59 minutes and 59 seconds, thus sharing most of the data. In a naive implementation, this leads to many copies of the data and lots of redundant computation. Here the idea of slices comes into play. A slice is a part of the stream that does not contain any window borders and thus can be reused between overlapping windows. In the case of sliding windows the slice has the length of the slide. For each slice, a partial aggregate is computed, which is reused for all windows in which the slice occurs. While this idea is straightforward for sliding windows, it can also be implemented for other window types, such as session windows.
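The slicing idea for sliding windows can be sketched as follows (assuming event times start at 0 and the window length is a multiple of the slide; names are illustrative): each slice holds one partial aggregate, and a window combines the partials of the slices it spans, so no input value is aggregated twice.

```python
def sliced_sliding_sums(events, length, slide):
    """events: list of (event_time, value) pairs.
    Returns {window_start: sum} for sliding-window sums computed via slices."""
    # 1) One partial aggregate per slice of length `slide`.
    slices = {}
    for t, v in events:
        i = t // slide
        slices[i] = slices.get(i, 0) + v

    # 2) Each window of length `length` reuses the partials of the
    #    `length // slide` slices it covers.
    per_window = length // slide
    results = {}
    for first in range(0, max(slices) + 1):
        results[first * slide] = sum(slices.get(i, 0)
                                     for i in range(first, first + per_window))
    return results


# Slices of 5s hold partial sums [0,5) -> 5, [5,10) -> 5, [10,15) -> 4,
# which overlapping windows then reuse.
events = [(1, 2), (4, 3), (7, 5), (11, 4)]
sums = sliced_sliding_sums(events, length=10, slide=5)
assert sums[0] == 10   # window [0, 10) = slices 0 + 1
assert sums[5] == 9    # window [5, 15) = slices 1 + 2
```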

Implementing efficient window aggregation in the stream processing engine Apache Flink resulted in orders of magnitude higher throughput, depending on the number of concurrent windows, than an implementation without it.


Image Description: Efficient Window Aggregation with Stream Slices. Figure taken from Rabl, Tilmann: Lecture Series on Database Research: Introduction & Stream Processing. 2021. Original figure from: J. Traub et al.: Efficient Window Aggregation with General Stream Slicing. EDBT 2019. Best Paper. J. Traub et al.: Scotty: General and Efficient Open-Source Window Aggregation for Stream Processing Systems. TODS 37. 2020.

Distributed Stream Aggregation

The idea of slices is also beneficial in the case of distributed data stream sources. The main idea here is to compute partial aggregates close to the source instead of at the consumer, reducing network traffic and improving compute resource utilization in the process. This, of course, requires compute resources to be available between the source and the consumer.

For example, in many IoT networks sensors emit data streams to local servers, which then pass these raw data streams on to a data center. In this setup, the raw data streams can produce a lot of network traffic while the compute resources of the local servers remain unused. To improve this, one can compute partial aggregates at the local server for each sensor and only transfer the partial aggregates to the data center. This reduces the network traffic, leverages the local servers’ compute resources, and thus reduces the load on the data center.
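The IoT setup above can be sketched as two stages (function names and the choice of a (sum, count) partial are illustrative): local servers reduce each window of raw readings to a small partial aggregate, and the data center only merges those partials.

```python
def local_partial(readings):
    """Runs near the source: reduce one window of raw sensor readings
    to a (sum, count) pair - two numbers instead of the whole stream."""
    return (sum(readings), len(readings))


def merge_at_datacenter(partials):
    """Runs at the consumer: combine per-server partials into the
    global average without ever seeing the raw readings."""
    total = sum(s for s, _ in partials)
    count = sum(c for _, c in partials)
    return total / count


server_a = local_partial([20.0, 22.0, 21.0])   # 3 raw readings stay local
server_b = local_partial([19.0, 23.0])
assert merge_at_datacenter([server_a, server_b]) == 21.0
```

Note that this works because sum and count decompose into partials; a plain average of per-server averages would weight the servers incorrectly.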


Ad Hoc Stream Processing


Ad hoc queries on data streams in multi-user environments have been greatly neglected in the past - even though they enable various use cases (e.g., data analysis dashboards from which customizable queries can be started spontaneously). Common stream processing engines are instead designed for long-running queries that are known at compile time. Thus, ad hoc queries are served using stopgap solutions that lead to redundant computation and data copies.

Against this background, Rabl et al. introduced the stream processing framework “AStream”, which facilitates time- and memory-efficient processing of ad hoc queries. The system attaches a bit set (query set) to every streamed tuple, encoding the tuple’s relevance for the current queries. A changelog is used to keep the query sets up to date; this allows queries to be deleted or newly created at runtime. AStream also performs window slicing. The shared join operator is another component of AStream, whose purpose is to avoid redundant computation in join operations. It joins slices one by one, so that the emerging results can be combined and used for multiple queries. The shared aggregation operator implements a similarly incremental approach for aggregation and thus also fosters the smart reuse of computations. AStream is based on Apache Flink but can also be integrated into other stream processing frameworks.
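The query-set idea can be illustrated with a small sketch (the predicates, names, and changelog handling here are hypothetical simplifications of the actual system): each tuple is tagged with one bit per active query, and changelog operations create or delete queries at runtime.

```python
active = {}                            # query id -> filter predicate


def create_query(qid, predicate):      # changelog entry: query added at runtime
    active[qid] = predicate


def delete_query(qid):                 # changelog entry: query removed at runtime
    del active[qid]


def shared_selection(value):
    """Attach a query set: bit i is set iff the tuple is relevant for query i."""
    query_set = 0
    for qid, predicate in active.items():
        if predicate(value):
            query_set |= 1 << qid
    return (value, query_set)


create_query(0, lambda v: v > 10)
create_query(1, lambda v: v % 2 == 0)
assert shared_selection(12) == (12, 0b11)   # relevant for both queries
assert shared_selection(7) == (7, 0b00)
delete_query(0)
assert shared_selection(12) == (12, 0b10)   # only query 1 remains
```

Downstream, shared operators then process each tuple once and use its query set to route results to the right queries.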

Experiments showed that AStream is able to achieve a throughput of up to 70 million tuples per second (in a scenario creating 100 queries per second, up to 1,000 active queries). Apache Flink without AStream was not able to manage even much smaller ad hoc workloads. As AStream does not have to deploy a new streaming topology for every query, it also shows a low deployment latency. For single queries, Apache Flink has a slightly better data throughput - but AStream comes very close to it.

Image Description: Architecture Overview: AStream roughly consists of a) the Shared Session module generating changelogs, b) the Shared selection operator appending and updating the tuples’ query sets, c) Shared join/aggregation operators, and d) the Router, which passes tuples on to output channels or downstream operators. Figure taken from Karimov, Jeyhun, Markl, Volker and Rabl, Tilmann: AStream: Ad-hoc Shared Stream Processing. 2019.