Hasso-Plattner-Institut
Prof. Dr. Tilmann Rabl
 

Mosaics in Big Data, 4th January 2022

Volker Markl, TU Berlin

Summarized by Isabell Zorr, Florian Papsdorf, Vincent Rahn
Lecture Series on Database Research Winter Term 2021/2022

About the Speaker

Prof. Dr. Volker Markl studied computer science at the Technical University of Munich. In 1995, he earned his Diploma with a thesis on exception handling in programming languages, and in 1999, he received his Ph.D. in the area of multidimensional indexing under the supervision of Prof. Rudolf Bayer.[1]

His main research interests lie at the intersection of distributed systems, scalable data processing, and machine learning. Markl worked as a research group leader at the Bavarian Research Center for Knowledge-Based Systems as well as a project leader at the IBM Almaden Research Center in San Jose. He was the director of the Berlin Big Data Center and the co-director of the Berlin Machine Learning Center. Both were merged into the Berlin Institute for the Foundations of Learning and Data (BIFOLD), which he has directed since then.[2] He holds the chair of Database Systems and Information Management at the Technical University of Berlin and co-chairs the Technological Enablers and Data Science Interdisciplinary Working Group of the platform “Lernende Systeme”. Furthermore, he is Chief Scientist and Head of the Intelligent Analytics for Massive Data Research Group at the German Research Center for Artificial Intelligence in Berlin.[3]

In 2014, Markl was elected one of Germany’s leading “Digital Minds” by the German Informatics Society. He serves on the academic scientific advisory council of the Alexander von Humboldt Institute for Internet and Society and on the scientific advisory board of Software AG. Since 2021, he has also been an elected member of the Berlin-Brandenburg Academy of Sciences and Humanities.[4] Over the years, he has received numerous scientific and paper awards.[5]

From 2010 until 2019, Markl led the Stratosphere project, out of which Apache Flink was developed. His current active research includes, among others, the NebulaStream platform and AGORA.

Introduction

We are currently experiencing a wave of innovation in Artificial Intelligence and Data Science. The combination of Machine Learning technologies and Data Management is the key driver of this wave.

On the one hand, Data Management combined with Machine Learning has led to a paradigm shift in science. Scientific evidence was first produced through empirical observation and theoretical reasoning, later through computational methods, and nowadays data-intensive approaches are a regular part of scientific processes. Measurements and simulations generate huge amounts of data, which are stored in databases and files. Data Management methods prepare the data so that statistical analysis for knowledge production can be conducted. For instance, the BigEarthNet dataset[6] provides a large-scale collection of labeled satellite images that can be used to predict further developments of particular landscapes.

On the other hand, data became a factor of production, since business processes are built around it. In contrast to traditional production resources, data is not used up by consumption. Moreover, the harvesting or usage of data often generates new data. In Industry 4.0 settings, production plants are equipped with sensors, and based on the produced data, the underlying business processes can be improved. For instance, one could analyze the sensor data of production lines to predict the quality of the produced goods in real time, react to quality issues early, and thereby reduce costs.

Current Research Goal In Data Management

Data Management combined with Machine Learning enables new innovations. However, the Data Science process is complex and time-consuming. Two kinds of latency contribute to this.

On the one hand, we notice human latency. It is complicated to describe the solution to a given problem in a machine-interpretable form. By designing better languages that focus on what should be computed rather than how it should be computed, one can simplify this step. Besides significantly shorter development time, more concise code, and often a more performant execution, declarative code allows higher runtime independence due to automatic query optimization: the query optimizer compiles the declarative query for the respective target infrastructure. Thus, developers do not have to consider whether they write code for a cluster or for a single machine. It goes without saying that expert developers are still able to hand-write highly optimized imperative code that surpasses the imperative code auto-generated by a query optimizer.
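To make this concrete, consider a minimal sketch in Apache Flink’s classic DataSet API (the data and class name are our own illustration, not from the lecture; the DataSet API has since been superseded by newer Flink APIs). The program only states what to compute; where and how it runs is decided by the system:

```java
import org.apache.flink.api.java.DataSet;
import org.apache.flink.api.java.ExecutionEnvironment;
import org.apache.flink.api.java.tuple.Tuple2;

public class DeclarativeSketch {
    public static void main(String[] args) throws Exception {
        // The environment abstracts the target infrastructure: the same
        // program runs locally or on a cluster without code changes.
        ExecutionEnvironment env = ExecutionEnvironment.getExecutionEnvironment();

        DataSet<Tuple2<String, Integer>> sales = env.fromElements(
                Tuple2.of("berlin", 3), Tuple2.of("munich", 5), Tuple2.of("berlin", 2));

        // Declarative: we state WHAT to compute (a sum per city); the
        // optimizer decides HOW (e.g., hashing, sorting, data shipping).
        sales.groupBy(0).sum(1).print();
    }
}
```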

On the other hand, we notice technical latency. It describes the technical limitations on how fast programs can be executed. With the help of caching, distributing work to multiple nodes, or more efficient algorithms, we can reduce technical latency and speed up processing. Ideally, computations proceed in real time as much as possible.

To put it in a nutshell, Prof. Dr. Volker Markl defines the current goal in Data Management research as “scaling the process and its operationalization with respect to human and technical latency”.

In his lecture “Mosaics in Big Data”[7], Prof. Markl highlights four main research topics. First, he distinguishes between optimistic and pessimistic fault tolerance approaches and explains why optimistic approaches can lead to performance improvements in Big Data analysis. Second, he illustrates with several examples how Apache Flink addresses the current goals of Data Management research, both by simplifying the writing of data analytics programs and by automatically optimizing the written programs to achieve good performance. Third, he introduces requirements and concrete challenges for Big Data systems in the world of IoT. Finally, he closes his talk by motivating the need for ecosystems that simplify Data Management, Data Mining, and Data Analysis with AI technologies, comparing existing open data sources and their underlying platforms, and presenting AGORA as a new open-source ecosystem.

Data Infrastructures

All Roads Lead to Rome: Optimistic Fault Tolerance in Distributed Systems

Errors and faults are common in large-scale, distributed Data Management systems, and comprehensively differentiating fault tolerance approaches is not straightforward. However, a metaphor of two Roman generals helps.

Both generals start with their legions in Germany, which is populated by barbarians. The generals have to get to Rome as fast as possible, but they will have to fight off any barbarians in their way. One general, called Pessimus, has his legions build a camp for the night every day; they advance on the road as fast as possible, fight off any attacking barbarians, and then retreat to the camp they built previously. The other general, called Optimus, does not let his legions build camps. They simply follow the road they are currently on. If barbarians attack, they chase them off, roam the area until they find a Roman road again, and follow that road until they are attacked by barbarians once more.

How will Optimus ever get to Rome if he only follows random roads? The answer is simple: all roads lead to Rome. But which of the two generals will be faster? This question is not as easily answered, since it depends on the number of barbarian attacks and thus on the number of detours Optimus has to take.

The two generals can be seen as two fault tolerance approaches. Pessimus represents the pessimistic recovery approach: intermediate states are always persisted, and in case of a failure, the operation restarts from such a checkpoint. However, all nodes have to write these checkpoints, which generates high overhead even if no failures ever occur. Optimus represents the optimistic recovery approach: in case of a failure, fixed-point algorithms jump to a compensation state instead. Jumping to an arbitrary state could mean losing data, so the fixed-point algorithm relies on a compensation function to prevent such losses: to construct the compensation state, it has to invent new values that fit the rest of the data. For instance, the PageRank algorithm imposes one constraint that the compensation function must restore: all values, invented and given, have to add up to one across all machines working on the problem.
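The following plain-Java sketch illustrates such a compensation step for PageRank (a hypothetical helper of ours, not the actual Stratosphere implementation): ranks lost in a failure are re-initialized uniformly, and the surviving ranks are rescaled so that the sum-to-one invariant holds again before the fixed-point iteration continues.

```java
import java.util.HashMap;
import java.util.List;
import java.util.Map;

public class PageRankCompensation {

    /**
     * Restores a consistent state after losing some partitions, without
     * reading any checkpoint: lost vertices get the uniform rank 1/n,
     * and surviving ranks are rescaled so that all ranks sum to one.
     */
    static Map<Long, Double> compensate(Map<Long, Double> survivingRanks,
                                        List<Long> lostVertices,
                                        long totalVertices) {
        Map<Long, Double> restored = new HashMap<>();

        // The surviving ranks must carry exactly the probability mass
        // that is not assigned to the re-initialized lost vertices.
        double targetMass = 1.0 - (double) lostVertices.size() / totalVertices;
        double currentMass = survivingRanks.values().stream()
                .mapToDouble(Double::doubleValue).sum();
        survivingRanks.forEach((vertex, rank) ->
                restored.put(vertex, rank * targetMass / currentMass));

        // "Invent" ranks for the lost vertices: uniform 1/n.
        for (Long vertex : lostVertices) {
            restored.put(vertex, 1.0 / totalVertices);
        }
        return restored; // invariant: all values sum to one again
    }
}
```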

In conclusion, using optimistic recovery with a compensation function, Data Mining will always find its way to Rome and often surpasses the performance of pessimistic approaches. A concrete comparison of the different recovery strategies can be seen in Figure 1.

Figure 1: Comparison of failure-free, pessimistic, and optimistic recovery.[8]

Automatic Optimization of Complex Analysis Algorithms for Big Data

Apache Hadoop, a famous implementation of the MapReduce paradigm, provides schema-on-read functionality, flexible handling of user-defined functions, as well as far better scalability than traditional database systems offer. For instance, Apache Hadoop implements proper mechanisms for fault tolerance and long-running operations to reach the targeted scalability. However, a database engineer misses relational algebra, the declarative programming paradigm, automatic query optimization, and out-of-core robustness.

For that reason, the Stratosphere project[9] was launched. It aimed to combine the advantages of the worlds of general-purpose programming and database technologies. Besides adding advanced data flows, some concepts of both worlds were reconsidered and improved. Apache Hadoop supports batch processing and thereby allows high throughput, but such (mini-)batching concepts do not minimize latency; native streaming functionality provides both low latency and high throughput. And even though database systems know constructs like recursion, their query optimizers do not deal well with them, while Machine Learning technologies require iterations for model building.

Based on the Stratosphere project, Apache Flink[10] was born. Apache Flink is officially defined as an “open-source stream processing framework for distributed, high-performing, always-available, and accurate data streaming applications”[11]. Over the last decade, the number of contributors all over the world has grown, and various renowned papers[12],[13],[14] describe further improvements. Nevertheless, some aspects of Apache Flink are worth illustrating in a bit more detail.

Apache Flink’s Key Concepts For Reducing Human Latency

At first glance, Apache Flink programs strongly resemble programs for database systems due to their declarative structure. Two models underlie them: an algebraic model and a cost model, the latter being used for automatic query optimization. The algebraic model consists of second-order and third-order functions. Apache Hadoop already provides two second-order functions, called map and reduce. A map function in Apache Hadoop represents a parallelization pattern that applies a user-defined function to every single record of a single input. Apache Flink refines this concept with the cross function, which is defined analogously to Apache Hadoop’s map function but allows multiple inputs. Apache Flink refines the concept of the reduce function in the same way: in Apache Hadoop, a reduce function represents a parallelization pattern that applies a user-defined function to a group of records of one input, and Apache Flink’s refinement, called co-group, again allows multiple inputs. Due to these refinements, developers can express join logic without mixing query processing and data analysis logic, as Apache Hadoop requires. Thus, Apache Flink code becomes more concise and readable, and thereby Apache Flink reduces human latency. Besides the described second-order functions, Apache Flink adds third-order functions for fixed-point iterations.
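As a small illustration, a co-group in Flink’s classic DataSet API might look as follows (schema and data are our own): the user-defined function contains only analysis logic, while the grouping and the choice of join strategy are left to the system.

```java
import org.apache.flink.api.common.functions.CoGroupFunction;
import org.apache.flink.api.java.DataSet;
import org.apache.flink.api.java.ExecutionEnvironment;
import org.apache.flink.api.java.tuple.Tuple2;
import org.apache.flink.util.Collector;

public class CoGroupSketch {
    public static void main(String[] args) throws Exception {
        ExecutionEnvironment env = ExecutionEnvironment.getExecutionEnvironment();

        DataSet<Tuple2<Integer, String>> customers = env.fromElements(
                Tuple2.of(1, "Ada"), Tuple2.of(2, "Grace"));
        DataSet<Tuple2<Integer, Double>> orders = env.fromElements(
                Tuple2.of(1, 19.99), Tuple2.of(1, 5.49));

        // co-group: both inputs are grouped by key; each pair of groups is
        // handed to the user-defined function. The execution strategy
        // (hashing, sorting, broadcasting) is chosen by the optimizer.
        customers.coGroup(orders).where(0).equalTo(0)
            .with(new CoGroupFunction<Tuple2<Integer, String>,
                                      Tuple2<Integer, Double>, String>() {
                @Override
                public void coGroup(Iterable<Tuple2<Integer, String>> cs,
                                    Iterable<Tuple2<Integer, Double>> os,
                                    Collector<String> out) {
                    double total = 0.0;
                    for (Tuple2<Integer, Double> o : os) total += o.f1;
                    for (Tuple2<Integer, String> c : cs) out.collect(c.f1 + ": " + total);
                }
            })
            .print();
    }
}
```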

Apache Flink’s Key Concepts For Reducing Technical Latency

Based on the declarative character of Apache Flink programs, further improvements for reducing technical latency are possible. A program results in a data flow graph consisting of higher-order functions. From this graph, it is possible to derive optimization rules such as the push-down of projections and filters, or early and lazy grouping, as we already know them from database systems.

In Big Data systems like Apache Spark, iterations are implemented by looping: an external program executes the loop, and the intermediate results are stored after each completion of the loop body. Apache Flink instead provides two optimizations through its third-order functions for loops.

First, Apache Flink implements iterations in the data flow itself and thereby adds a direct feedback loop to the looping function. The operation starts with a partial solution, the operators in the loop body are applied to it, and then the system checks whether the fixed point has been reached. If it has, the loop is done; if not, the partial solution is replaced by the new one and the cycle starts again. Because the query optimizer is able to push work out of the loop, to cache loop-invariant data, or to maintain state as an index, the processing of loops becomes more efficient.
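In Flink’s classic DataSet API, this mechanism is exposed as a bulk iteration; a minimal sketch (our own toy loop body and iteration count) could look like this:

```java
import org.apache.flink.api.common.functions.MapFunction;
import org.apache.flink.api.java.DataSet;
import org.apache.flink.api.java.ExecutionEnvironment;
import org.apache.flink.api.java.operators.IterativeDataSet;

public class BulkIterationSketch {
    public static void main(String[] args) throws Exception {
        ExecutionEnvironment env = ExecutionEnvironment.getExecutionEnvironment();

        // Start with a partial solution and run at most 10 rounds.
        IterativeDataSet<Double> loop = env.fromElements(1.0).iterate(10);

        // Loop body: applied to the partial solution in every round. Because
        // the loop lives inside the data flow, the optimizer can cache
        // loop-invariant data instead of re-launching a job per round.
        DataSet<Double> body = loop.map(new MapFunction<Double, Double>() {
            @Override
            public Double map(Double x) { return x / 2.0; }
        });

        // Feed the result back as the next partial solution; a second
        // closeWith argument could supply a convergence criterion.
        loop.closeWith(body).print();
    }
}
```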

Second, Apache Flink implements delta iterations, whose application requires sparse computational dependencies. Each iteration then considers and evolves only a part of the whole data. After each iteration, the delta that still has to be considered shrinks, and thus a performance improvement is achieved compared to processing the entire data in each iteration. The sketch below illustrates the idea.
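A delta iteration separates the full state (the solution set) from the shrinking delta (the workset). Roughly, under the same assumptions as the sketches above (toy data and loop body of our own):

```java
import org.apache.flink.api.common.functions.FilterFunction;
import org.apache.flink.api.common.functions.MapFunction;
import org.apache.flink.api.java.DataSet;
import org.apache.flink.api.java.ExecutionEnvironment;
import org.apache.flink.api.java.operators.DeltaIteration;
import org.apache.flink.api.java.tuple.Tuple2;

public class DeltaIterationSketch {
    public static void main(String[] args) throws Exception {
        ExecutionEnvironment env = ExecutionEnvironment.getExecutionEnvironment();

        DataSet<Tuple2<Long, Double>> state = env.fromElements(
                Tuple2.of(1L, 1.0), Tuple2.of(2L, 2.0));

        // Solution set = full state, workset = part still being worked on;
        // field 0 is the key used to update the solution set.
        DeltaIteration<Tuple2<Long, Double>, Tuple2<Long, Double>> loop =
                state.iterateDelta(state, 10, 0);

        // Toy loop body: halve the values; a real program would join the
        // workset with edges, messages, etc.
        DataSet<Tuple2<Long, Double>> delta = loop.getWorkset()
            .map(new MapFunction<Tuple2<Long, Double>, Tuple2<Long, Double>>() {
                @Override
                public Tuple2<Long, Double> map(Tuple2<Long, Double> t) {
                    return Tuple2.of(t.f0, t.f1 / 2.0);
                }
            })
            // Elements below a threshold drop out, so the delta shrinks.
            .filter(new FilterFunction<Tuple2<Long, Double>>() {
                @Override
                public boolean filter(Tuple2<Long, Double> t) { return t.f1 > 0.01; }
            });

        // The delta updates the solution set and becomes the next workset;
        // the iteration ends when the workset is empty (or after 10 rounds).
        loop.closeWith(delta, delta).print();
    }
}
```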

Due to such optimizations in handling iterations, Apache Flink attains far better performance on iterative workloads than Apache Spark or Apache Hadoop.

In summary, Apache Flink’s success can be measured along various dimensions. A declarative code structure and well-thought-out patterns help developers write more concise code, which reduces human latency. The automatic query optimization, as well as the described implementation improvements in handling iterations, measurably reduces technical latency. Finally, various companies, like Alibaba Group, Bouygues, or AWS, use Apache Flink in their daily operations.

Analysis of Heterogeneous and Distributed Data Streams in the IoT

The Internet of Things (IoT) brings new challenges to the analysis of data streams. IoT applications can comprise millions of data sources that are highly distributed and highly dynamic. Furthermore, these systems may consist of heterogeneous hardware components and data formats.

Apache Flink, as presented in the previous section, and other existing Data Management approaches assume a homogeneous compute cluster onto which executions can be mapped. So far, two types of Data Management systems have existed. Systems of the first group, such as Apache Flink and Apache Spark, are based on the so-called cloud paradigm and require the data to be collected centrally before further processing. Systems of the second group, such as Frontier, are based on the fog computing paradigm; they only scale well inside the “fog” and do not make use of all capabilities of modern cloud infrastructures. Both architectural concepts lead to a bottleneck for IoT applications with requirements as specified above.[15]

The newly arising question is: how can a system be designed to offer reliable and efficient Data Stream Management for the IoT? Several concrete challenges exist:

  1. For a particular data source, where should the data be processed so that network traffic is minimized and security, as well as privacy, are maximized?
  2. How to generate code for different hardware devices and sensors to enable the usage of heterogeneous hardware components?
  3. Which powerful intermediate representation allows optimizations of such heterogeneous data analysis programs for different data processing workloads?
  4. How to handle error detection and intervention decentralized to deal with the spontaneous, potentially unreliable connectivity of IoT infrastructures?
  5. How to adapt the distributed processing topology dynamically so that long-running systems are optimized on the fly?

 

To tackle these challenges, the NebulaStream platform was created. It supports:

  1. data sharing among multiple streaming queries, on stream aggregations, as well as acquisitional query processing and on-demand scheduling of data transmissions,
  2. distributed query compilation for different query, hardware, and data characteristics,
  3. diverse data and programming models, a centralized management view, as well as a unified intermediate representation that performs optimizations across traditional boundaries,
  4. different failure recovery approaches on all layers of its topology,
  5. and constant evolution under continuous operation.[16]

These features aim to reduce human and technical latency. The NebulaStream platform is a joint research project led by the IoT Lab at the Berlin Institute for the Foundations of Learning and Data and is still actively developed.[17]

 

Methods and Technologies for a Single Data Infrastructure for AI Applications

With data being a factor of production nowadays, industries would stop working if this factor were not available. Thus, a strong and reliable infrastructure is needed to keep data highly available. Over the years, methods and technologies have been developed to make this possible and even to enable AI applications to use these infrastructures. This resulted in Data Management systems that can be used for storing vast amounts of data and for Machine Learning. Examples of such Data Management systems are the Sloan Digital Sky Survey[18] and IBM Analytics[19] (formerly Many Eyes). The Sloan Digital Sky Survey takes astronomical data, stores it, and makes it accessible to the public, who can then run queries on new astronomical findings. This is possible because the data analysis itself is separated from the field of astronomy. Thus, citizens are able to play with the data and find new connections, which can even help the people working in the field. IBM Analytics works similarly: it was developed by IBM and hosts data sets that people can use, visualize, share insights about, and build upon in their own algorithms and programs.

All of this forms a Data Analytics ecosystem, in which interdependencies, cross-platform availability, and algorithms with handy tools create a processing infrastructure that simplifies Data Management and Data Mining. Such an ecosystem should be an open and protected space for developers, users, data providers, and services alike. Although an ecosystem like this could be set up around a market, making it open source and freely available removes one major problem the Data Mining world is facing: many services are built, but there are not enough developers to maintain them all efficiently; only the open-source community can do so.

One such open-source ecosystem is AGORA[20], a so-called “One Data Management for AI-Applications”, which is the focus of current research activities. It provides building blocks for a Data Mining infrastructure. AGORA makes multi-platform, compliant, distributed, and trustworthy data processing possible and also supports the debugging of data analysis programs. For instance, uses in healthcare and the digital humanities are conceivable. However, there are a few difficulties AGORA and many similar platforms are facing: first, how to ensure compliant and/or trustworthy query processing; second, how to design the pricing of information marketplaces, since data is a factor of production and a new form of currency. Once these problems are solved, such ecosystems will be the go-to infrastructure for AI applications.

 

Summary

Prof. Markl presented data as a new factor of production for the sciences, humanities, and industry. To make the best use of Data Management processes, the main goal is to reduce human and technical latency. The reduction of human latency can be achieved by providing intuitively usable systems, while the reduction of technical latency can be achieved by smart distribution of work and by the development of more efficient algorithms. Such systems enable real-time applications for data analysts.

Over the years, research in the field of Data Management has built important technological foundations. Prof. Markl summarizes an extensive evolution of Big Data platforms, starting with Data Warehouses in the first generation, continuing with systems such as Apache Hadoop and Apache Spark, and concluding with Apache Flink in the fourth generation.

Systems such as Apache Flink already provide automatic optimization of algorithms, parallelization, and hardware adaptation, as well as methods for the distributed analysis of large datasets. As the next steps in the progression of Big Data platforms, methods for the analysis of massive data streams from the IoT as well as Data Management infrastructures for collaborative AI innovation must be developed. The current work of Prof. Markl and his colleagues, e.g. on NebulaStream and AGORA, illustrates the active research efforts trying to solve these challenges. Still, Data Management research is an interdisciplinary field within and outside computer science, and thus many more challenges will follow for future work.


[1] European Big Data Value Forum: “Volker Markl.” Last accessed on 14.01.2022. https://2020.european-big-data-value-forum.eu/keynotes/volker-markl/.

[2] ibid.

[3] Database Systems and Information Management Group: “Prof. Dr. rer. nat. Volker Markl.”
Last accessed on 16.01.2022. https://www.dima.tu-berlin.de/menue/staff/volker_markl/.

[4] ibid.

[5] BIFOLD: “Prof. Dr. Volker Markl.” Last accessed on 16.01.2022. https://bifold.berlin/de/person/prof-dr-rer-nat-volker-markl/.

[6] https://mlhub.earth/data/bigearthnet_v1.

[7] Markl, Volker: “Mosaics in Big Data” Lecture Series on Database Research, 04.01.2022. https://www.tele-task.de/lecture/video/8942/.

[8] Markl, Volker: “Mosaics in Big Data” Lecture Series on Database Research, 04.01.2022, slides p. 47.

[9] http://stratosphere.eu.

[10] More information on Apache Flink can be found here: https://flink.apache.org.

[11] Database Systems and Information Management Group: “Dissemination of Research Results. Flink / Stratosphere”. Last accessed on 16.01.2022. https://www.dima.tu-berlin.de/menue/research/dissemination_of_research_results/.

[12] Ji, Hangxu, et al. “Multi-Job Merging Framework and Scheduling Optimization for Apache Flink.” Database Systems for Advanced Applications, Springer, vol. 12681, 2021.

[13] HoseinyFarahabady, M. Reza, et al. “Q-Flink: A QoS-Aware Controller for Apache Flink.” 19th International Symposium on Network Computing and Applications (NCA), IEEE, 2020.

[14] Sahal, Radhya, et al. “Big Data Multi-Query Optimisation with Apache Flink.” International Journal of Web Engineering and Technology, vol. 13, 2018.

[15] Zeuch, Steffen, et al. “The NebulaStream Platform for Data and Application Management in the Internet of Things.” 10th Conference on Innovative Data Systems Research (CIDR), 2020.

[16] ibid.

[17] More resources on the NebulaStream platform can be found here: https://nebula.stream.

[18] More on the Sloan Digital Sky Survey can be found here: https://www.sdss.org/.

[19] More on IBM Analytics can be found here: https://www.ibm.com/analytics.

[20] https://agoradata.com/.