Hasso-Plattner-Institut
Prof. Dr. Felix Naumann
 

Distributed Computing

In this research area, we investigate computationally complex problems and how they can be solved in distributed environments. Complex problems are ubiquitous in many data engineering areas, such as data profiling, data cleaning, and data integration. They are also pervasive in data analytics, machine learning, and database systems in general. Most existing solutions for complex data-centric problems lack robustness, efficiency, and elasticity, flaws that we think can be overcome with distributed computing.

Head of Research Area

Dr. Thorsten Papenbrock

Professor at the University of Marburg

Phone: +49 6421 28-25475
Office: Building H04, Room 04C22
Philipps-Universität Marburg
Hans-Meerwein-Straße 6, 35032 Marburg

Researchers

Sebastian Schmidl

Phone: +49 331 5509 4977
Fax: +49 331 5509 287
Office: Building F, F-2.04

Former Members

Phillip Wenig

Phone: +49 331 5509 237
Fax: +49 331 5509 - 237
Office: Building F, F-2.04

Student Assistants

We have open positions! Please contact us!

Past student assistants

  • Oct 2020 - Sept 2021 Yannik Schröder

Research Mission

Until the turn of the century, computer systems became steadily faster without any particular effort, simply because the hardware they ran on increased its clock speed with every new release. But this free lunch is over! Today's CPUs stall at around 3 GHz, and software developers need to break new ground to make their products faster. The most popular approach is to design software with parallelization and distributed computing in mind, because the number of computing elements (transistors, cores, CPUs, GPUs, cluster nodes, etc.) in modern computer systems still increases constantly.

Big Data analytics and engineering are both multi-million-dollar markets that grow constantly. The ability to control and use data is the most valuable capability of today's computer systems. Because data volumes grow so rapidly, and with them the complexity of the questions they should answer, both data engineering, the process of shaping and transforming data, and data analytics, the process of extracting information from data, become increasingly difficult. In particular, neither of these data-centric computer science disciplines can hope for faster hardware to solve its performance problems: they need to embrace new software trends that let their performance scale with the still-increasing number of processing elements.

This general paradigm shift in software development, however, introduces various challenges that must be solved to develop an algorithm or system that executes efficiently on multiple, possibly independent and heterogeneous computing elements. Some of these challenges involve the following questions:

How can the distributed algorithm or system …

  • utilize all available resources in an optimal way?
  • deal with the increased error susceptibility of a parallel/distributed system?
  • support elasticity, i.e., sets of computing resources that change at runtime?
  • control its resource consumption in terms of overall memory and CPU usage?
  • ensure reliable state and data storage?
  • start and terminate in a clean and secure way?
  • be debugged, monitored, and profiled?

Certain frameworks for parallel/distributed programming, such as Spark, Flink, and Storm, already address some of these questions, but they enforce a certain programming model that does not fit all computationally complex tasks. Other distributed computing paradigms, such as message passing and actor programming (see, for instance, Akka, Orleans, or Erlang), leave these questions to the programmer, but they also offer much more flexibility for algorithmic designs.

In this research area, we investigate various data engineering and data analytics domains to identify and then solve their computationally complex problems via scalable and elastic approaches. We investigate general challenges for writing distributed systems, but also try to solve use-case-specific computational tasks that have no trivial distributed solutions yet.

Current Projects

  • Efficient Subsequence Anomaly Detection On Time Series Data (in cooperation with Rolls-Royce)
  • Distributed Machine Learning
    • Data Gossip
    • HYPEX: Scalable hyperparameter optimization in time series anomaly detection

Past Projects

  • Distributed data profiling
    • Distributed Duplicate Detection on Streaming Data
    • DISTOD: Efficient Distributed Discovery of Bidirectional Order Dependencies
    • Distributed Unique Column Combination Discovery
    • Reactive Inclusion Dependency Discovery
    • Inclusion Dependency Discovery on Streaming Data
  • A2DB: A Reactive Database for Theta-Joins

Teaching

We offer lectures and seminars on the above topics (see teaching for current and past events), as well as Bachelor and Master projects and Master's thesis topics. For a list of currently open thesis topics, have a look here.