Hasso-Plattner-Institut
Prof. Dr. Tilmann Rabl
 

Introduction to Apache Flink

Arvid Heise, Ververica GmbH, Berlin

Abstract

As data processing becomes more real-time, stream processing is becoming more important. Apache Flink is a distributed, stateful stream processor and powers some of the world's largest streaming applications at Alibaba, Netflix, Uber, and many other companies. It features exactly-once state consistency, sophisticated event-time support, high-throughput and low-latency processing, and APIs at different levels of abstraction (Java, Scala, SQL). In this talk, Arvid Heise will give an introduction to Apache Flink, demonstrate the distinguishing features, and discuss the use cases it solves. He'll talk about Flink's community and how it is evolving Flink into a truly unified stream and batch processor.

Biography

Arvid studied IT Systems Engineering at HPI and received his PhD under Felix Naumann on data cleansing and integration operators for Stratosphere, the predecessor of Apache Flink. He then worked for four years on building data platforms and data pipelines at Bayer, Flixbus, and GfK with Kafka, Spark, and Flink. He recently joined Ververica to work on the Flink runtime.

A recording of the presentation is available on Tele-Task.

Apache Flink Logo [5]

Summary

written by Adrian Jost, Jan Behrens, and Christian Warmuth

Imagine you and your fictitious company have a large amount of data, with even more data being generated daily. For the survival of the company in a highly competitive environment, it is essential to always have the latest aggregated information in order to make business-critical decisions in the shortest possible time. A common approach so far has been to define periodically triggered ETL (extract, transform, load) jobs, e.g. started once a month. The problem is that you then have to wait a whole month (or whatever interval the company has chosen) without getting information about the latest developments in the data. In addition, subsequent analyses that run on the results of these ETL jobs must of course wait as well, which in real companies often means that deadlines cannot be met, as the speaker, Arvid Heise, pointed out. In the following, the content of Arvid Heise's talk is summarized, showing how such problems can be avoided with Apache Flink. [1]

What is Apache Flink?

Apache Flink is a distributed data processing system designed for real-time event processing as well as ETL batch processing. Its goal is to provide a unified API for batch and stream processing to transition smoothly from your backlog to real-time data processing. Flink frames these cases as bounded and unbounded data. An unbounded data stream has a defined start but no defined end point; it must be processed continuously, because it is not possible to wait for all input data to arrive. Bounded data, in contrast, is a data stream with defined start and end points, and the processing of bounded streams is also referred to as batch processing. [6] Flink supports dozens of data sources and integrates with many environments and systems (such as YARN, HDFS, Kafka, Kinesis, …). Thanks to its architecture, it scales very well and provides in-memory-like performance. The largest deployment Arvid knew of, at Netflix, runs on tens of thousands of cores and manages tens of terabytes of state.
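
The distinction is visible directly in the DataStream API: the program structure is the same for both cases, and only the source determines whether the stream is bounded. The following is a minimal sketch in Java; the file path, host, and port are placeholders.

    import org.apache.flink.streaming.api.datastream.DataStream;
    import org.apache.flink.streaming.api.environment.StreamExecutionEnvironment;

    public class BoundedVsUnbounded {
        public static void main(String[] args) throws Exception {
            StreamExecutionEnvironment env = StreamExecutionEnvironment.getExecutionEnvironment();

            // Bounded: a file has a defined start and end, so this input finishes.
            DataStream<String> bounded = env.readTextFile("/data/events.log");

            // Unbounded: a socket keeps producing records, so the job runs continuously.
            DataStream<String> unbounded = env.socketTextStream("localhost", 9999);

            bounded.print();
            unbounded.print();
            env.execute("bounded-vs-unbounded");
        }
    }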

"Apache Flink is a framework and distributed processing engine for stateful computations over unbounded and bounded data streams. Flink has been designed to run in all common cluster environments, perform computations at in-memory speed and at any scale." [7]

This performance and scalability is achieved through clever state management, which we will discuss later. Since in many cases all required state is held in main memory, the system might seem vulnerable to data loss. However, Flink has a mechanism to guarantee exactly-once state consistency despite failures, which will also be covered later.

Users

Apache Flink is used by many well-known companies, including but not limited to eBay, Yelp, Uber, Comcast, and Alibaba. Alibaba recognized the benefits of Apache Flink to such an extent that it recently acquired Ververica to intensify its commitment to stream processing and push the development even further. [8] Arvid also pointed out that most of these companies use Apache Flink for real-time event processing while still using Apache Spark for batch processing.

Use Cases

Use Case: Event-Driven Applications

Event-driven applications ingest events from one or more streams and react to incoming events by triggering computations, state updates, or external actions. Traditionally, these kinds of jobs are separated into a compute tier and a storage tier, where the application reads from and persists its state to a remote database. Stream processors such as Flink improve on this design by co-locating the state with the application. Persistent storage is then only used for fault recovery, which improves application speed in terms of both throughput and latency. [9]
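
Co-located state can be sketched with Flink's keyed state primitives. In the following hedged Java example (all names are illustrative), a per-key event count lives in Flink-managed local state rather than in a remote database; the state is checkpointed for fault recovery.

    import org.apache.flink.api.common.functions.RichFlatMapFunction;
    import org.apache.flink.api.common.state.ValueState;
    import org.apache.flink.api.common.state.ValueStateDescriptor;
    import org.apache.flink.configuration.Configuration;
    import org.apache.flink.util.Collector;

    // Counts events per key; used as stream.keyBy(...).flatMap(new EventCounter()).
    public class EventCounter extends RichFlatMapFunction<String, Long> {
        private transient ValueState<Long> count; // local, Flink-managed state

        @Override
        public void open(Configuration parameters) {
            count = getRuntimeContext().getState(
                new ValueStateDescriptor<>("count", Long.class));
        }

        @Override
        public void flatMap(String event, Collector<Long> out) throws Exception {
            Long current = count.value();
            long updated = (current == null ? 0L : current) + 1;
            count.update(updated); // no round trip to an external database
            out.collect(updated);
        }
    }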

Use Case: Batch & Stream Analytics

Analytical tasks pursue the goal of generating insights from raw data. Traditionally, such tasks are performed as batch queries on bounded datasets of recorded events. To include new data in the analysis, the entire analysis has to be repeated on the new dataset. For this reason, continuous analysis with streaming analytics has evolved: events are processed in real time, and the resulting state can be used by other applications, such as a monitoring dashboard. Apache Flink supports both query types by providing an interface with unified semantics for batch and streaming queries.
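
With Flink SQL, for instance, the same statement can serve both modes. The sketch below is hedged: the table and column names are invented, and the "Clicks" table is assumed to have been registered against some source beforehand. In streaming mode it runs as a continuous query whose counts update as events arrive; in batch mode the identical statement runs once over the bounded input.

    import org.apache.flink.table.api.EnvironmentSettings;
    import org.apache.flink.table.api.Table;
    import org.apache.flink.table.api.TableEnvironment;

    public class ContinuousQueryExample {
        public static void main(String[] args) {
            TableEnvironment tEnv = TableEnvironment.create(
                EnvironmentSettings.newInstance().inStreamingMode().build());

            // Continuous query: per-user counts are updated with every new click.
            // Assumes a table "Clicks" (columns user_id, url) was registered before.
            Table clicksPerUser = tEnv.sqlQuery(
                "SELECT user_id, COUNT(url) AS cnt FROM Clicks GROUP BY user_id");
        }
    }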

Use Case: ETL and Data Pipelining

It is common for data to be converted and moved between storage systems. Traditionally, extract-transform-load (ETL) jobs are built for this purpose and periodically triggered by an external tool. Flink instead provides compact and easy-to-implement data pipelines that fulfill the same task but process the data continuously. This makes the results visible sooner and the pipelines usable in more versatile ways.

Stream Processing

A Flink program comprises streams and transformations. A stream is conceptually a (potentially endless) flow of data. A transformation, in this context, is a process that takes one or more streams as input and generates one or more output streams as a result. During execution, a Flink program is mapped to a streaming dataflow. A streaming dataflow consists of one or more sources (input), computations/operations, as well as one or more sinks (output) at the end. The dataflows form arbitrary directed acyclic graphs (DAGs) and can be executed in parallel. [10]
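
The classic word-count program illustrates this mapping. In the hedged Java sketch below (host and port are placeholders), the socket source, the flatMap and sum transformations, and the print sink correspond directly to the source, operator, and sink nodes of the resulting dataflow graph.

    import org.apache.flink.api.common.typeinfo.Types;
    import org.apache.flink.api.java.tuple.Tuple2;
    import org.apache.flink.streaming.api.environment.StreamExecutionEnvironment;
    import org.apache.flink.util.Collector;

    public class WordCountDataflow {
        public static void main(String[] args) throws Exception {
            StreamExecutionEnvironment env = StreamExecutionEnvironment.getExecutionEnvironment();

            env.socketTextStream("localhost", 9999)                    // source
               .flatMap((String line, Collector<Tuple2<String, Integer>> out) -> {
                   for (String word : line.split("\\s+")) {            // transformation
                       out.collect(Tuple2.of(word, 1));
                   }
               })
               .returns(Types.TUPLE(Types.STRING, Types.INT))
               .keyBy(t -> t.f0)
               .sum(1)                                                 // stateful aggregation
               .print();                                               // sink

            env.execute("word-count-dataflow");
        }
    }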

State Management using Checkpoints

Many of the operations used in stream processing are stateful: since the streaming data arrives over time and not all of it can be kept, operations must remember records or temporary results. In order not to lose this state in case of a failure, Flink maintains the state locally per task (in-memory or on disk) and takes periodic, asynchronous, incremental snapshots of it. A checkpoint represents a consistent snapshot of the state of all tasks; it is created by all tasks copying their state once they have reached the same position in the input (the “checkpoint barrier”). With this approach, Flink implements exactly-once state consistency: in the event of a failure, the system resumes from the last checkpoint and replays the input processed since then through the pipeline, so that each piece of information is reflected in the state exactly once.
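
Checkpointing is configured on the execution environment. A hedged configuration sketch follows; the interval and checkpoint directory are placeholders, and the RocksDB state backend is just one of several available options.

    import org.apache.flink.contrib.streaming.state.RocksDBStateBackend;
    import org.apache.flink.streaming.api.CheckpointingMode;
    import org.apache.flink.streaming.api.environment.StreamExecutionEnvironment;

    public class CheckpointConfigExample {
        public static void main(String[] args) throws Exception {
            StreamExecutionEnvironment env = StreamExecutionEnvironment.getExecutionEnvironment();

            // Draw a consistent snapshot of all task state every 10 seconds.
            env.enableCheckpointing(10_000L);

            // Align checkpoint barriers so each record counts exactly once.
            env.getCheckpointConfig().setCheckpointingMode(CheckpointingMode.EXACTLY_ONCE);

            // Keep state local in RocksDB and snapshot it incrementally and
            // asynchronously to a durable (placeholder) checkpoint directory.
            env.setStateBackend(new RocksDBStateBackend("hdfs:///flink/checkpoints", true));
        }
    }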

In contrast to checkpoints, Flink also offers savepoints. These are intended for planned application upgrades, for suspending and resuming applications, and for migrating to upgraded infrastructure, as they are much more extensive and take longer to create and to restore. [11]

Differentiation between Event-Time and Processing-Time

Depending on the application and its purpose, there are differences between event time and processing time. Processing time refers to the wall-clock time at which a record is processed, which can lead to non-deterministic results, since records might arrive out of order. Processing time is also not applicable when replaying recorded data, but it can be useful for approximate, low-latency results.

In event-time processing, records are processed based on an inherent timestamp, and the stream carries “watermarks”: a watermark asserts that no record with an earlier timestamp is still expected, so downstream operators know when they may finalize a result. Results are consequently deterministic, since records are processed in the order of their creation; the price is the additional time spent waiting for records that arrive out of order. Whether to use event time or processing time depends on the application and its accuracy requirements concerning the order in which records are processed.
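
In the Java API of that era, event time and watermarks can be enabled roughly as follows. This is a hedged sketch: the SensorReading event type and the five-second out-of-orderness bound are invented for illustration.

    import org.apache.flink.streaming.api.TimeCharacteristic;
    import org.apache.flink.streaming.api.datastream.DataStream;
    import org.apache.flink.streaming.api.environment.StreamExecutionEnvironment;
    import org.apache.flink.streaming.api.functions.timestamps.BoundedOutOfOrdernessTimestampExtractor;
    import org.apache.flink.streaming.api.windowing.time.Time;

    public class EventTimeExample {
        // Hypothetical event that carries its creation time (the "event time").
        public static class SensorReading {
            public long timestampMillis;
            public double value;
        }

        public static DataStream<SensorReading> withEventTime(
                StreamExecutionEnvironment env, DataStream<SensorReading> readings) {
            env.setStreamTimeCharacteristic(TimeCharacteristic.EventTime);
            return readings.assignTimestampsAndWatermarks(
                // The watermark trails the highest timestamp seen so far by 5 seconds,
                // i.e. the pipeline waits up to 5 seconds for out-of-order records.
                new BoundedOutOfOrdernessTimestampExtractor<SensorReading>(Time.seconds(5)) {
                    @Override
                    public long extractTimestamp(SensorReading reading) {
                        return reading.timestampMillis;
                    }
                });
        }
    }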

Different API-Levels within Apache Flink [12]

Available APIs

To use the functionality offered by Apache Flink, application programming interfaces (APIs) are required. Flink offers three API levels that differ in their conciseness and expressiveness and are therefore best suited for different tasks.

The first-level APIs are SQL and the Table API. They are intended for high-level data analytics expressed in relational algebra. They offer a unified interface for both streaming data and data at rest (your backlog). However, this API level is limited, as it is not possible to define custom aggregates or to access state or time triggers.
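
The Table API expresses the same relational operations in a fluent, language-embedded style. A hedged sketch follows; the Orders table and its columns are invented, it is assumed to have been registered beforehand, and the expression syntax shown is the one of newer Flink releases.

    import static org.apache.flink.table.api.Expressions.$;

    import org.apache.flink.table.api.EnvironmentSettings;
    import org.apache.flink.table.api.Table;
    import org.apache.flink.table.api.TableEnvironment;

    public class TableApiExample {
        public static void main(String[] args) {
            TableEnvironment tEnv = TableEnvironment.create(
                EnvironmentSettings.newInstance().inStreamingMode().build());

            // Assumes a table "Orders" (columns: user_id, amount) was registered before.
            Table revenuePerUser = tEnv.from("Orders")
                .groupBy($("user_id"))
                .select($("user_id"), $("amount").sum().as("revenue"));
        }
    }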

The second level is the DataStream API, which can be used for both stream and batch data processing. Programs are composed as dataflows, and the data processing logic is implemented via custom user functions.
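
Beyond record-at-a-time functions like the word-count example above, the DataStream API also offers built-in windowing. The following hedged fragment (the (key, value) stream and the 10-second window size are illustrative) sums values per key over tumbling processing-time windows:

    import org.apache.flink.api.java.tuple.Tuple2;
    import org.apache.flink.streaming.api.datastream.DataStream;
    import org.apache.flink.streaming.api.windowing.assigners.TumblingProcessingTimeWindows;
    import org.apache.flink.streaming.api.windowing.time.Time;

    public class WindowedSum {
        // Sums the per-key values of a (key, value) stream in 10-second windows.
        public static DataStream<Tuple2<String, Integer>> sumPerKey(
                DataStream<Tuple2<String, Integer>> input) {
            return input
                .keyBy(t -> t.f0)
                .window(TumblingProcessingTimeWindows.of(Time.seconds(10)))
                .sum(1);
        }
    }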

So-called “ProcessFunctions” represent the third level of APIs. Their primary use is in stateful, event-driven applications. As such, they expose access to state and time and are embedded into applications via the DataStream API. They offer powerful and useful functionality, such as saving events and intermediate results into state and registering timers.
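
A hedged sketch of a KeyedProcessFunction that combines both capabilities, state and timers, follows; the one-minute inactivity timeout and all names are invented. It emits an alert for any key that has received no events for a minute.

    import org.apache.flink.api.common.state.ValueState;
    import org.apache.flink.api.common.state.ValueStateDescriptor;
    import org.apache.flink.configuration.Configuration;
    import org.apache.flink.streaming.api.functions.KeyedProcessFunction;
    import org.apache.flink.util.Collector;

    // Used as stream.keyBy(...).process(new InactivityAlert()).
    public class InactivityAlert extends KeyedProcessFunction<String, String, String> {
        private transient ValueState<Long> lastTimer; // remembers the pending timer

        @Override
        public void open(Configuration parameters) {
            lastTimer = getRuntimeContext().getState(
                new ValueStateDescriptor<>("lastTimer", Long.class));
        }

        @Override
        public void processElement(String event, Context ctx, Collector<String> out)
                throws Exception {
            // Replace the previously registered timer with a fresh one.
            Long previous = lastTimer.value();
            if (previous != null) {
                ctx.timerService().deleteProcessingTimeTimer(previous);
            }
            long timeout = ctx.timerService().currentProcessingTime() + 60_000L;
            ctx.timerService().registerProcessingTimeTimer(timeout);
            lastTimer.update(timeout);
        }

        @Override
        public void onTimer(long timestamp, OnTimerContext ctx, Collector<String> out) {
            // Fires only if no newer event deleted this timer in the meantime.
            out.collect("No events for key " + ctx.getCurrentKey() + " for 60 seconds");
        }
    }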

Apache Flink Deployment Options [15]

Deployment and Interfaces

As mentioned at the beginning, Flink can be deployed in various ways and on multiple platforms. The classical approach is to run Apache Flink in a Hadoop environment. In this setup, Flink runs on Apache Hadoop YARN, a cluster resource management framework, in a single large cluster, possibly alongside other YARN applications. [14] However, this is not the recommended way for new users, as it makes the application difficult to scale.

The recommended way to run Flink is a Kubernetes library deployment. In this case, Flink is just a library of the main application, which makes the application much easier to deploy and scale.

To communicate and interact with other software systems, Flink supports a series of connection options. Flink offers connectors for Kafka, Kinesis, and Pulsar to consume event logs as input. In addition, Flink supports many file systems, such as S3, HDFS, MapR-FS, and more. The data can be encoded in different formats, such as JSON, Avro, or CSV. Flink also integrates with databases through standard interfaces such as JDBC, and with Hive, and there are connectors to several key-value stores such as Cassandra and Redis. Thanks to Flink's open-source character, the community has developed many other connectors. [16]
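
As one concrete example, the Kafka connector plugs into the DataStream API as a regular source. A hedged sketch, where the broker address, consumer group, and topic name are placeholders:

    import java.util.Properties;

    import org.apache.flink.api.common.serialization.SimpleStringSchema;
    import org.apache.flink.streaming.api.datastream.DataStream;
    import org.apache.flink.streaming.api.environment.StreamExecutionEnvironment;
    import org.apache.flink.streaming.connectors.kafka.FlinkKafkaConsumer;

    public class KafkaSourceExample {
        public static void main(String[] args) throws Exception {
            StreamExecutionEnvironment env = StreamExecutionEnvironment.getExecutionEnvironment();

            Properties props = new Properties();
            props.setProperty("bootstrap.servers", "localhost:9092"); // placeholder broker
            props.setProperty("group.id", "flink-demo");

            // Reads the (hypothetical) "events" topic as an unbounded stream.
            DataStream<String> events = env.addSource(
                new FlinkKafkaConsumer<>("events", new SimpleStringSchema(), props));

            events.print();
            env.execute("kafka-source-example");
        }
    }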

Growing open source community for Apache Flink [17]

Development Lifecycle and Open Source

Flink is a top-level Apache open-source project with many contributors. Contributions are not only source code, but also bug reports and documentation improvements. You are therefore especially invited to open issues and share your thoughts with the growing community.

Apache Flink uses semantic versioning for its releases, and a new minor version is normally released every three to four months. These releases receive support beyond the release of the next minor version. Typically, there are multiple patch versions for a minor release, and those are published more frequently, whenever needed.

Outlook & Summary

Finally, we will take a concise look at the future development and the roadmap of Apache Flink. Currently, there are two processing engines: one for batch and one for stream processing. As we all know, maintaining and developing one software system is complicated; maintaining two is even more so. There are therefore plans to unify the batch and stream processing engines into one “true” streaming engine. This also includes porting the DataSet API into the DataStream API as so-called “bounded streams”. The Table API is going to be extended with Python support, which will be an important factor for more machine learning and data exploration use cases for Apache Flink. The planned notebook support should further increase adoption in the machine learning area, as notebooks are a common tool there. Other plans include support for unaligned checkpoints, which would make checkpointing more robust and also improve performance; this is especially helpful for faster failure recovery. [18]

In summary, Arvid gave a very good overview in his talk of what Apache Flink is: a stream processing engine. The talk also covered several use cases and potential users (companies that have a lot of continuously generated data to be processed in real time while still having regular batch jobs). He also went deeper into the concept of time in stream processing and into consistency and recovery with checkpoints and savepoints. Arvid showed possible interfaces and deployment options and gave us a short insight into the roadmap of Apache Flink, which mainly focuses on machine learning support and on unifying the two separate engines for stream and batch processing into one single engine.

Thank you Arvid for giving all these interesting insights.


Bibliography:

* All sources were last checked for availability on 15.12.2019, 13:30.