Hasso-Plattner-Institut
Prof. Dr. Tilmann Rabl
 

About the Speaker

Matthias Weidlich is a full professor at the Department of Computer Science at Humboldt-Universität zu Berlin (HU Berlin), Germany, where he holds the Chair on Databases and Information Systems. Before joining HU Berlin, he held positions at Imperial College London and at the Technion - Israel Institute of Technology. He holds a PhD in Computer Science from the Hasso-Plattner-Institute, University of Potsdam. His research focuses on data-driven process analysis, event stream processing, and exploratory data analysis. He serves as Co-Editor in Chief for the Information Systems journal and is a member of the steering committees of the BPM conference series and the ACM DEBS conference series.

About the Talk

Complex event processing emerged as a computational paradigm to detect patterns in event streams based on the continuous evaluation of event queries. Once such queries are evaluated in a network of event sources, efficient query evaluation may be achieved through the distributed evaluation of queries. In this talk, we present some of our recent results on achieving such distribution with the model of MuSE graphs as well as optimizations that rely on push-pull-communication.

Efficient Complex Event Processing

Summary written by Bengin Özdil and Zero Janetzki

Basics

Events

The term "event" has different meanings across scientific and engineering contexts. In probability theory, an event is a set of outcomes to which a probability is assigned. In systems and middleware, an event can be a change of state or a significant occurrence. Events vary in granularity, from simple actions such as a method call to complex ones such as a new employee being hired, an email being received, or a new stock price becoming available.

Prof. Weidlich argues that events are conceptually instantaneous and represent a change of state of a system, i.e. the state before the event differs from the state after it, even though other disciplines such as machine learning might not agree that events are instantaneous. He also distinguishes between the abstract event and its representation, such as an event object or an event notification. These two terms are often used interchangeably, although event objects cannot travel, whereas notifications can.

CEP

Figure 1: Overview of the composition of CEP systems. source: Matthias Weidlich

The purpose of complex event processing (CEP) is to consider many simple events together and detect patterns in them.

Sensors and information systems generate event notifications that form an event stream (see Figure 1). Event streams are conceptually infinite and usually ordered, which affects how they are processed. Events are often categorized by type, each type with its own semantics and a data structure represented as key-value pairs. The patterns to detect in an event stream are specified in an event query language, typically for a specified time window.
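To make these notions concrete, the following is a minimal Python sketch of typed events with a key-value payload and of a sliding time window over an ordered stream. The names `Event` and `sliding_windows` and the bike-sharing values are illustrative assumptions, not part of the talk.

```python
from collections import deque
from dataclasses import dataclass, field
from typing import Any, Deque, Dict, Iterable, Iterator, List


@dataclass
class Event:
    """One event notification: a type, a timestamp, and a key-value payload."""
    type: str
    timestamp: float
    payload: Dict[str, Any] = field(default_factory=dict)


def sliding_windows(stream: Iterable[Event], width: float) -> Iterator[List[Event]]:
    """For each arriving event, yield the events of the last `width` time units.

    Assumes the stream is ordered by timestamp.
    """
    window: Deque[Event] = deque()
    for event in stream:
        window.append(event)
        # Evict events that have fallen out of the time window.
        while window[0].timestamp <= event.timestamp - width:
            window.popleft()
        yield list(window)


# Hypothetical bike-sharing events.
stream = [
    Event("unlock", 0.0, {"bike": 7}),
    Event("unlock", 4.0, {"bike": 9}),
    Event("return", 11.0, {"bike": 7}),
]
windows = list(sliding_windows(stream, width=10.0))
print([[e.type for e in w] for w in windows])
```

Because the stream is conceptually infinite, the sketch is written as a generator that only keeps the events of the current window in memory.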

CEP has various applications; the talk introduced urban transportation and sensor-based automation in particular. In bike-sharing systems, for example, events could be the unlocking of a bike or the congestion of bikes in a particular area. In sensor-based automation, where Prof. Weidlich's research group has a partnership with a company, local events could be tasks performed by robots or simply a change in a robot's location.

Considering these examples, some properties of events and their processing can be identified:

  • Most events are of no interest. Almost everything that can be predicted and is normal does not add value, while abnormalities and deviations are more informative.
  • Recent events are often of more interest than those that have taken place before. 
  • The ability to respond to changing situations adds value (e.g. in networks when a node fails).

Queries

Figure 2: Illustrative example for nested query structures. source: Matthias Weidlich

For later optimization steps, it is important to understand that queries are often nested, as illustrated in Figure 2 using a sequence operator and an AND operator. The structure of such a query can also be represented as a tree, with leaf nodes representing event types and non-leaf nodes representing operators.
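The tree view of a nested query can be sketched in a few lines of Python. The example uses the robot query SEQ(AND(C,L),F) that appears later in this summary; the class and function names are assumptions chosen for illustration.

```python
from dataclasses import dataclass
from typing import List, Tuple, Union


@dataclass(frozen=True)
class EventType:
    """Leaf node: a primitive event type such as C, L, or F."""
    name: str


@dataclass(frozen=True)
class Operator:
    """Non-leaf node: an operator such as SEQ or AND applied to sub-queries."""
    kind: str
    children: Tuple["Query", ...]


Query = Union[EventType, Operator]


def render(query: Query) -> str:
    """Print the tree in the usual nested notation."""
    if isinstance(query, EventType):
        return query.name
    return f"{query.kind}({', '.join(render(c) for c in query.children)})"


def leaves(query: Query) -> List[str]:
    """Event types at the leaves, from left to right."""
    if isinstance(query, EventType):
        return [query.name]
    return [name for child in query.children for name in leaves(child)]


def subtrees(query: Query) -> List[str]:
    """All operator subtrees, i.e. the query itself and every nested sub-query."""
    if isinstance(query, EventType):
        return []
    nested = [s for child in query.children for s in subtrees(child)]
    return [render(query)] + nested


query = Operator("SEQ", (Operator("AND", (EventType("C"), EventType("L"))), EventType("F")))
print(render(query), leaves(query), subtrees(query))
```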

There are a number of event processing languages that can express these and other queries, for example in traditional database systems from Oracle, IBM, and Microsoft, or in frameworks for distributed data processing such as Flink or Azure Stream Analytics. Unfortunately, these languages are hardly standardized. Against this background, the MATCH_RECOGNIZE operator is highlighted as a key development for pattern matching within SQL queries. It allows for complex pattern matching, including specifying time windows, partitioning data, and defining patterns using a syntax similar to regular expressions, as shown in Figure 3.

Figure 3: Example for the MATCH_RECOGNIZE operator. source: Matthias Weidlich
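The idea of partitioning a stream and then matching a regular-expression-like pattern per partition can be imitated in plain Python. The sketch below is only an analogy to MATCH_RECOGNIZE, not SQL and not the query from Figure 3: the events, the partition key, and the pattern `C+LF` are made up for illustration, and the time window is omitted for brevity.

```python
import re
from collections import defaultdict
from typing import Dict, List, Pattern, Tuple

# Hypothetical events: (partition key, timestamp, one-letter event type).
events = [
    ("robot1", 1, "C"), ("robot2", 1, "L"), ("robot1", 2, "C"),
    ("robot1", 3, "L"), ("robot2", 4, "F"), ("robot1", 5, "F"),
]

# One or more C events, followed by an L event, followed by an F event.
PATTERN = re.compile(r"C+LF")


def match_per_partition(
    events: List[Tuple[str, int, str]], pattern: Pattern[str]
) -> Dict[str, List[Tuple[int, int]]]:
    """Return (first timestamp, last timestamp) of every match, per partition."""
    by_key: Dict[str, List[Tuple[int, str]]] = defaultdict(list)
    for key, ts, etype in sorted(events, key=lambda e: e[1]):  # order by time
        by_key[key].append((ts, etype))

    matches: Dict[str, List[Tuple[int, int]]] = {}
    for key, seq in by_key.items():
        types = "".join(etype for _, etype in seq)  # e.g. "CCLF"
        matches[key] = [
            (seq[m.start()][0], seq[m.end() - 1][0]) for m in pattern.finditer(types)
        ]
    return matches


result = match_per_partition(events, PATTERN)
print(result)
```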

Distributed event processing

Each component in a system, such as a transport robot in a factory, not only generates events but also processes events, both its own and those from other nearby sources that are relevant to it. This setup facilitates in-network evaluation of event queries, with each node processing its own data and data from other sources. The goal is to minimize communication overhead, especially in scenarios where most data is uninteresting and only specific, rare events are significant.

Current research and benefits of efficiently distributed CEP

We now turn to Prof. Weidlich's research on efficient distributed CEP and the MuSE graphs developed by his group. The research specifically aims at a model for distributed evaluation, as well as an orthogonal optimization that does not optimize the distribution of computation but rather the communication between components when processing specific queries.

Figure 4: Example of robots R1, R2 and R3 with different sensors (C, L, F). Exclamation marks indicate that C and L occur much more frequently than F. source: Matthias Weidlich

Consider an example of three robots (R1, R2, R3) whose sensors detect three different types of events: obstacle detected by camera (C), obstacle detected by lidar (L), and floor is clear to be used (F) (see Figure 4). In this example, we assume that every robot can communicate with every other robot. The event types differ in their frequency of occurrence, with C and L events occurring much more frequently than F events (a ratio of up to 1000:1). This skew is an important prerequisite for optimizing the network traffic.

Now suppose we want to evaluate the query shown in Figure 5 on the network from Figure 4, using each of the following approaches.

Figure 5: Example query for the network forms. source: Matthias Weidlich

Centralized Plan

Figure 6: Centralized Plan Example. source: Matthias Weidlich

The Centralized Plan is the basic approach: all events are forwarded to one robot, which acts as the central node for event processing. Here we choose R2. That means all events from R1 and R3 are forwarded to R2, and the whole example query [SEQ(AND(C,L),F)] is evaluated at R2.

  • Pro: Existing CEP techniques can be used
  • Con: Almost all events generated must be sent over the network
  • Con: One site needs to process all events

Classic Operator Placement

Figure 7: Classic Operator Placement Example. source: Matthias Weidlich

This approach addresses the limitations of the Centralized Plan by splitting the query into subqueries and evaluating them at different nodes in order to reduce the communication overhead in the network. In the example, C events from R1 and L events from R3 are sent to R2, which evaluates the subquery [AND(C,L)]. Only the partial results of this subquery then need to be sent to R1. Depending on the selectivity, this is usually much less data than sending all events through the network.

  • Pro: Leverage event sources as event processors
  • Con: Limited by operator hierarchy
  • Con: Single node placement

This approach still results in expensive event transfers over the network, especially of the frequent C and L events. Because the query and each subquery are placed on exactly one node, the approach remains quite expensive in terms of network traffic. In addition, splitting a query relies on the operator hierarchy: subqueries are just subtrees of the query tree, which limits how far the network traffic can be optimized.
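A rough Python sketch of the traffic caused by such a placement is given below. The placement follows the description above ([AND(C,L)] at R2, the full query at R1); the event rates, the sensor assignment, and the match rate of the subquery are assumptions for illustration only.

```python
from typing import Dict, List, TypedDict


class Placement(TypedDict):
    node: str          # the single node that evaluates this operator
    inputs: List[str]  # event types or names of other operators


# Assumed events per time unit, per robot and event type.
RATES: Dict[str, Dict[str, int]] = {
    "R1": {"C": 1000, "F": 1},
    "R2": {"C": 1000, "L": 1000},
    "R3": {"L": 1000, "F": 1},
}

PLAN: Dict[str, Placement] = {
    "AND(C,L)": {"node": "R2", "inputs": ["C", "L"]},
    "SEQ(AND(C,L),F)": {"node": "R1", "inputs": ["AND(C,L)", "F"]},
}

# Assumed number of AND(C,L) matches per time unit (depends on selectivity).
OUTPUT_RATES: Dict[str, int] = {"AND(C,L)": 50}


def placement_traffic(
    rates: Dict[str, Dict[str, int]],
    plan: Dict[str, Placement],
    output_rates: Dict[str, int],
) -> int:
    """Items per time unit that cross the network under a single-node placement."""
    total = 0
    for op in plan.values():
        for inp in op["inputs"]:
            if inp in plan:
                # Intermediate result: shipped only if produced on another node.
                if plan[inp]["node"] != op["node"]:
                    total += output_rates[inp]
            else:
                # Primitive events: every other node that generates them sends them.
                total += sum(
                    per_type.get(inp, 0)
                    for node, per_type in rates.items()
                    if node != op["node"]
                )
    return total


traffic = placement_traffic(RATES, PLAN, OUTPUT_RATES)
print(traffic)
```

Under these assumed numbers, 2000 of the 2051 transferred items are raw C and L events, which illustrates why the frequent event types dominate the cost of this approach.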

Multi-Sink Evaluation (MuSE)

Prof. Weidlich and his team worked on the two disadvantages of Operator Placement and came up with their MuSE approach to optimize event distribution even further.

  • Pro: Leverage event sources as event processors
  • Pro: Arbitrary query projections
  • Pro: Multi-sink placements

In their approach, they distribute the queries and subqueries as shown in Figure 8 to the different robots.

Figure 8: MuSE Example. source: Matthias Weidlich

As a result, matches of the query and its subqueries can be detected on different robots rather than on a single node, since it does not matter on which node a result is obtained. Furthermore, the approach can consider subqueries that are not direct subtrees of the original query, which allows a more efficient use of network resources. The data that needs to be sent to the nodes is reduced to the results of the subquery [SEQ(C,F)] and the F events.

Compared with the Centralized Plan and Operator Placement approaches, the results show significant improvements, especially in scenarios with high event skew, where sensors produce different events at very different rates.

MuSE Graphs

Figure 9: MuSE Graph for example set up in Figure 8. source: Matthias Weidlich

MuSE graphs can be used to describe the evaluation plan. As illustrated in Figure 9, leaf nodes represent the generation of a certain type of event at a specific node, and the other nodes represent the evaluation of queries. The model also assigns costs to communication, distinguishing between data sent over the network, which has a cost of 1, and data processed locally, which has a cost of 0.
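The cost model can be sketched in Python as a graph whose vertices are pairs of a query (or event type) and a node, and whose edges carry a rate. The graph below is a hypothetical plan in the spirit of the example, not a transcription of Figure 9: it assumes that [SEQ(C,F)] is evaluated at the robots that generate C (R1, R2), that the full query is evaluated at the robots that generate L (R2, R3), and that all rates are as listed.

```python
from typing import List, Tuple

Vertex = Tuple[str, str]           # (query or event type, node)
Edge = Tuple[Vertex, Vertex, int]  # (source, target, items per time unit)

Q = "SEQ(AND(C,L),F)"
P = "SEQ(C,F)"
M = 5  # assumed number of SEQ(C,F) matches per time unit and node

# Hypothetical plan; all rates are assumptions.
EDGES: List[Edge] = [
    # SEQ(C,F) at R1 and R2: local C events, F events from R1 and R3.
    (("C", "R1"), (P, "R1"), 1000),
    (("F", "R1"), (P, "R1"), 1),
    (("F", "R3"), (P, "R1"), 1),
    (("C", "R2"), (P, "R2"), 1000),
    (("F", "R1"), (P, "R2"), 1),
    (("F", "R3"), (P, "R2"), 1),
    # Full query at R2 and R3 (two sinks): local L events, SEQ(C,F) matches.
    (("L", "R2"), (Q, "R2"), 1000),
    ((P, "R1"), (Q, "R2"), M),
    ((P, "R2"), (Q, "R2"), M),
    (("L", "R3"), (Q, "R3"), 1000),
    ((P, "R1"), (Q, "R3"), M),
    ((P, "R2"), (Q, "R3"), M),
]


def edge_cost(edge: Edge) -> int:
    """Cost 1 per item sent over the network, cost 0 per item handled locally."""
    (_, src_node), (_, dst_node), rate = edge
    return 0 if src_node == dst_node else rate


def plan_cost(edges: List[Edge]) -> int:
    return sum(edge_cost(edge) for edge in edges)


print(plan_cost(EDGES))
```

In this assumed plan the frequent C and L events never leave the robot that generated them, so only the rare F events and the subquery matches contribute to the cost.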

Orthogonal optimization

 Consider an example with three components, of which each generates different types of events (A, B, C). 

In this setup, query evaluation is not only about detecting events in a certain order, but also about defining "pull sets", i.e. the order in which different nodes request intermediate results from each other. Events are held and cached at the source and sent only on request, instead of continuously pushing all events that may be needed. A request is based not only on the time window but also on the predicates defined over the payload of the events, resulting in a more selective and efficient data transfer. The main challenge lies in the exponential growth of the number of projections and combinations of subqueries to consider, which makes this optimization task NP-hard. Nevertheless, strategies including pruning methods and dynamic programming have been developed to make the evaluation of reasonably large settings feasible.
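The pull-based communication can be illustrated with a short Python sketch. It assumes a query such as SEQ(A, B) in which A events are frequent and B events are rare: the source of A caches its events, and the node that observes a B event requests only those A events that fall into the time window and satisfy a payload predicate. The class `CachingSource`, the `zone` attribute, and all values are hypothetical.

```python
from dataclasses import dataclass, field
from typing import Any, Callable, Dict, List


@dataclass
class Event:
    type: str
    timestamp: int
    payload: Dict[str, Any]


@dataclass
class CachingSource:
    """Holds its events locally and ships them only when they are requested."""
    buffer: List[Event] = field(default_factory=list)
    sent: int = 0  # number of events that actually crossed the network

    def emit(self, event: Event) -> None:
        self.buffer.append(event)  # cached at the source, nothing is pushed

    def pull(self, start: int, end: int, predicate: Callable[[Event], bool]) -> List[Event]:
        """Answer a request restricted by a time window and a payload predicate."""
        hits = [e for e in self.buffer if start <= e.timestamp < end and predicate(e)]
        self.sent += len(hits)
        return hits


# The frequent A events stay cached at their source.
source_a = CachingSource()
for t in range(100):
    source_a.emit(Event("A", t, {"zone": t % 5}))

# A rare B event arrives elsewhere and triggers a selective pull.
b = Event("B", 50, {"zone": 3})
WINDOW = 10
pulled = source_a.pull(
    start=b.timestamp - WINDOW,
    end=b.timestamp,
    predicate=lambda e: e.payload["zone"] == b.payload["zone"],
)
print([e.timestamp for e in pulled], source_a.sent)
```

With a push-based scheme all 100 A events would have been transferred; here only the two events that can still contribute to a match are sent.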

Conclusion

In his lecture on "Efficient Complex Event Processing", Professor Weidlich offers a comprehensive examination of CEP. He highlights possible improvements in systems such as sensor-equipped transport robots and also outlines further applications such as urban transportation.

His lecture covers various aspects of events, from their conception as instantaneous changes of state to their practical implications in real-world systems. He emphasizes the potential of CEP to detect complex patterns in large data streams while reducing the amount of data transmitted. The key contributions of his group's research lie in the methodology for data transmission and analysis within networks of interconnected components, namely MuSE and the combination of push- and pull-based communication. These techniques address the limitations of traditional CEP techniques by introducing a more sophisticated, distributed evaluation model. Multi-sink placements and arbitrary query projections reduce network traffic, and pull-based communication makes the data transfer more selective.

The work of Professor Weidlich and his team outlines a significant improvement in event processing that promises to redefine the operational efficiency of complex and interconnected systems in areas such as robotics, urban transportation, or more generally, in sensor-based automation solutions.

References

[1] Lecture: Efficient Complex Event Processing by Dr. Matthias Weidlich