Hasso-Plattner-Institut
Prof. Dr. Tilmann Rabl
 

What is FONDA?

FONDA – Foundations of Workflows for Large-Scale Scientific Data Analysis

FONDA investigates methods for increasing productivity in the development, execution, and maintenance of Data Analysis Workflows for large scientific data sets. We approach the underlying research questions from a fundamental perspective, aiming to find new abstractions, models, and algorithms that can eventually form the basis of a new class of future infrastructures for Data Analysis Workflows.

Essentially all scientific disciplines are generating an ever-increasing amount of data. To derive scientific discoveries, these data sets are analyzed by complex data analysis workflows (DAWs), which are series of discrete analysis programs arranged in (often non-linear) pipelines. Because they usually deal with very large data sets, DAWs must be executed on distributed and/or parallel computational infrastructures. Traditionally, DAWs are optimized for speed, which leads to solutions that are hard to reproduce and share and that are tightly bound to exactly one type of input. However, as summarized in the report of a recent NSF/DOE workshop that brought together the workflow and HPC communities, “… human productivity arguably still is the most expensive resource, trumping power, performance, and other factors …” [DOE15].

The proposed CRC FONDA – “Foundations of workflows for large-scale scientific data analysis” – will take up this observation and investigate methods for increasing productivity in the development, execution, and maintenance of DAWs for large scientific data sets. Our long-term goal is to develop methods and tools that achieve substantial reductions in development time and development cost of DAWs. We will approach these questions from a fundamental perspective, i.e., we aim at finding new abstractions, models, and algorithms that can eventually form the basis of a new class of future DAW infrastructures. Toward these goals, FONDA in its first phase will focus on three critical properties of DAWs and of DAW engines, namely portability, adaptability, and dependability (PAD). We want to investigate answers to questions such as: How can we build DAWs and DAW engines that enable portability of analysis across different infrastructures? How must DAWs be designed to adapt to changing input data or slightly changing requirements? How can we build dependable DAW systems that are aware of and control their own limitations and preconditions?

DAWs are bridges between two worlds: First, the specific scientific discipline using a DAW, and, second, Computer Science, which builds the infrastructures necessary for developing and executing DAWs. Developing novel foundations for scientific DAWs thus requires a close interaction between these two worlds. FONDA implements this idea by building on an interdisciplinary group of PIs from Computer Science, Material Science, Geosciences, and the Life Sciences. Through these cooperations, FONDA’s research results will be continuously validated using relevant and current scientific problems from different fields of the natural sciences.

[DOE15] DOE Workshop Report (2015): “The Future of Scientific Workflows – Report of the NGNS/DOE Scientific Workflows Workshop”

 

Research Area B: Abstractions for DAW Execution Infrastructures (B6)

This research area comprises six subprojects that research novel abstractions for the computational infrastructures underlying a DAW engine, which encompass DAW execution engines, schedulers, resource managers, and the configuration of the underlying network. The subprojects in this area focus on improving the PAD compliance of infrastructures and DAW execution engines.

Distributed Run-Time Monitoring and Control of Data Analysis Workflows

Executions of DAWs are driven by specifications of the individual steps of a data-processing pipeline and the data it processes. However, for a multitude of reasons, execution may not function as desired, especially when distributed systems are used, which calls for proactive runtime monitoring of executions to detect and possibly resolve any problems early on. B6 will investigate the foundations of distributed monitoring and of control systems for DAWs, which is a prerequisite for the design of dependable DAWs. Cooperations will be established especially with A1, on the notion of “abnormal” versus “normal” behavior of an execution, and with B3 regarding the monitoring of distributed executions. The subproject will be led by Prof. Grunske, an expert in automated software analysis for error detection, and Prof. Rabl, an expert in efficient distributed streaming engines.

Intrusiveness of Monitoring Systems

Monitoring systems often have to run on the same hardware as the software they observe, so monitoring applications need to be as non-intrusive as possible. In our current research, we investigate the intrusiveness of state-of-the-art Stream Processing Engines (SPEs) such as Flink and Kafka Streams. To this end, we benchmark SPEs with respect to their CPU, memory, and network consumption for the use case of monitoring DAW-like applications.
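
One way to make the notion of intrusiveness concrete is to express it as the relative overhead that monitoring adds to a workload. The following is a minimal Python sketch of that idea, not the benchmark used in the project: it does not involve Flink or Kafka Streams, it only covers CPU time and peak Python heap usage (network consumption is left out), and the names `measure`, `relative_overhead`, `daw_step`, and `daw_step_monitored` are hypothetical and chosen only for illustration.

```python
import time
import tracemalloc
from typing import Callable, Dict, List


def measure(workload: Callable[[], None]) -> Dict[str, float]:
    """Run the workload once; return CPU seconds and peak traced heap bytes."""
    tracemalloc.start()
    cpu_start = time.process_time()
    workload()
    cpu_s = time.process_time() - cpu_start
    _, peak_bytes = tracemalloc.get_traced_memory()
    tracemalloc.stop()
    return {"cpu_s": cpu_s, "peak_bytes": float(peak_bytes)}


def relative_overhead(
    baseline: Dict[str, float], monitored: Dict[str, float]
) -> Dict[str, float]:
    """Percentage increase of each metric over the unmonitored baseline."""
    return {
        key: 100.0 * (monitored[key] - baseline[key]) / baseline[key]
        for key in baseline
        if baseline[key] > 0  # skip metrics with no measurable baseline
    }


def daw_step() -> None:
    """Hypothetical stand-in for one step of a DAW-like application."""
    sum(i * i for i in range(200_000))


def daw_step_monitored() -> None:
    """The same step, with an assumed monitor recording periodic events."""
    events: List[int] = []
    total = 0
    for i in range(200_000):
        total += i * i
        if i % 100 == 0:
            events.append(total)  # the monitoring side effect being measured


if __name__ == "__main__":
    base = measure(daw_step)
    mon = measure(daw_step_monitored)
    for metric, percent in relative_overhead(base, mon).items():
        print(f"{metric}: {percent:+.1f}% relative to baseline")
```

A single run like this is noisy. A real benchmark would repeat the measurement, run the monitor as a separate process on the same machine, and record network traffic as well.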