From UCCs to keys
The efficient discovery of unique column combinations (UCCs) is a well-known and much-researched problem. Each UCC is a candidate key for the relation at hand. The task of this thesis is to extract, from the often very large set of UCCs, those that in fact represent a key, i.e., those that a database administrator would choose. This task is highly relevant to real-world profiling tools, as it makes profiling results actionable. One approach is the use of heuristics (size of the UCC, substrings of column names, etc.); another might be to choose a set of features and train a machine learning model on relations with known keys.
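To illustrate the heuristic approach, the following is a minimal sketch of ranking UCCs by key-likeness. The feature set (UCC size, identifier-like substrings in column names, position in the schema) and all weights are illustrative assumptions, not part of the thesis:

```python
# Hypothetical heuristic ranking of UCCs; feature names and weights are
# illustrative only.
KEY_HINTS = ("id", "key", "code", "nr")

def score_ucc(columns, positions):
    """Score a UCC (tuple of column names); higher means more key-like."""
    score = 2.0 / len(columns)                  # prefer small UCCs
    for name in columns:
        if any(h in name.lower() for h in KEY_HINTS):
            score += 1.0                        # name suggests an identifier
    # prefer columns that appear early in the schema
    score += 1.0 / (1 + min(positions[c] for c in columns))
    return score

schema = ["customer_id", "first_name", "last_name", "birth_date", "email"]
pos = {c: i for i, c in enumerate(schema)}
uccs = [("customer_id",), ("email",), ("first_name", "last_name", "birth_date")]
ranked = sorted(uccs, key=lambda u: score_ucc(u, pos), reverse=True)
print(ranked[0])  # the UCC proposed as key: ('customer_id',)
```

A learned model would replace the hand-tuned weights with ones fitted on relations whose true keys are known, using the same kinds of features as input.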
For more information please contact Prof. Felix Naumann or Thorsten Papenbrock.
Optimizing iterative cross-platform programs
Today’s data processing landscape encompasses a vast number of data processing platforms, each with its own capabilities and performance characteristics. Picking and orchestrating the best combination of platforms for a given data processing task is not only difficult from an engineering perspective; it is also impossible to do statically, as parameters such as the size of the input data or the set of available platforms change. Rheem, a tool developed at HPI and the Qatar Computing Research Institute, frees developers from exactly that burden. Given a data processing plan, it automatically chooses a suitable combination of platforms and executes the plan accordingly.
In contrast to many other processing systems, Rheem considers DAG-shaped query plans with loop operators that are connected via feedback edges, thereby also allowing cyclic data flows. This important feature enables Rheem to support applications with iterations, such as machine learning and graph analytics. Efficiently executing iterative data flows is crucial for extracting knowledge from big data in a timely manner. As of now, however, Rheem optimizes and executes loops in a static fashion: once it has made a decision on how to execute a loop, it cannot change that decision on-the-fly across iterations. This can lead to an inefficient execution of iterative programs, as their behavior can change from one iteration to another. For example, the amount of data to process can shrink or grow significantly after a certain number of iterations.
The proposed thesis aims at removing this shortcoming by letting Rheem adapt how it executes loops across iterations. Achieving this requires addressing several challenges. First, techniques for efficient data movement among processing platforms and for fast migration of the current state of iterative programs must be devised. Second, it is crucial to predict, among other things, how the size of the dataset changes from one iteration to the next. Third, checkpoints must be injected into iterative programs that allow switching processing platforms on-the-fly.
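The intended behavior can be sketched as follows. This is not Rheem's actual API; the cost model, the platform names, and the size threshold are invented for illustration. The loop re-evaluates its platform choice at a checkpoint before each iteration instead of fixing it once up front:

```python
# Illustrative sketch of adaptive loop execution (not Rheem's API):
# re-pick the platform per iteration based on the observed data size.

def pick_platform(size, threshold=1_000_000):
    # Assumed cost model: a distributed engine only pays off for large inputs.
    return "spark" if size >= threshold else "java-streams"

sizes = [5_000_000, 2_000_000, 800_000, 320_000]  # shrinking working set
current = None
for i, size in enumerate(sizes, 1):
    chosen = pick_platform(size)
    if chosen != current:
        # Checkpoint: migrate the program state to the newly chosen platform.
        print(f"iteration {i}: checkpoint, migrate state to {chosen}")
        current = chosen
    print(f"iteration {i}: process {size:>9} tuples on {current}")
```

In this sketch the execution migrates from the distributed platform to a single-machine one once the working set has shrunk below the threshold, which is exactly the kind of mid-loop switch that a static optimizer cannot perform.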
For more information, contact Sebastian Kruse. Additionally, Rheem's source code is hosted on GitHub.