Prof. Dr. Felix Naumann

Open theses

The information systems group is always looking for good master's students to write master's theses. The theses can be in one of the following broad areas:

  • Duplicate Detection
  • Data Profiling
  • Linked Data
  • Data Mining
  • Recommender Systems
  • Information Retrieval
  • Natural Language Processing

Please note that the list below is only a small sample of possible thesis topics and ideas. Please contact us to discuss further, to find new topics, or to suggest a topic of your own.

For more information about writing a master's thesis in our group, please see here.

Information Systems

From UCCs to keys

The efficient discovery of unique column combinations (UCCs) is a well-known and much-researched problem. Each UCC is a possible key for the relation at hand. The task of this thesis is to extract from the often very large set of UCCs those that in fact represent a key, i.e., those that a database administrator would choose. This task is highly relevant to real-world profiling tools, as it makes profiling results actionable. One approach is the use of heuristics (size of the UCC, substrings of column names, etc.); another might be to choose a set of features and train a machine learner using relations with known keys.
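As a rough illustration of the heuristic approach, the following Python sketch ranks UCCs by their size and by key-like substrings in their column names. The hint list, the equal weights, and all names (`UCC`, `key_score`, `rank_uccs`) are assumptions made for this example, not part of an existing tool.

```python
from dataclasses import dataclass

# Hypothetical name fragments that often appear in key columns (an assumption).
KEY_NAME_HINTS = ("id", "key", "code")


@dataclass(frozen=True)
class UCC:
    columns: tuple  # names of the columns forming the unique column combination


def key_score(ucc: UCC) -> float:
    """Heuristic score in [0, 1]; higher means more key-like. Weights are illustrative."""
    # Smaller UCCs are preferred: one column scores 1.0, two columns 0.5, and so on.
    size_score = 1.0 / len(ucc.columns)
    # Fraction of columns whose name contains a key-like substring.
    hinted = sum(
        any(hint in col.lower() for hint in KEY_NAME_HINTS) for col in ucc.columns
    )
    name_score = hinted / len(ucc.columns)
    return 0.5 * size_score + 0.5 * name_score


def rank_uccs(uccs):
    """Return the UCCs sorted from most to least key-like."""
    return sorted(uccs, key=key_score, reverse=True)


if __name__ == "__main__":
    candidates = [
        UCC(("first_name", "last_name", "birthdate")),
        UCC(("customer_id",)),
        UCC(("email",)),
    ]
    for ucc in rank_uccs(candidates):
        print(ucc.columns, round(key_score(ucc), 3))
```

Substring matching of this kind is noisy, which is one reason a learned model trained on relations with known keys might be preferable.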

For more information please contact Prof. Felix Naumann or Thorsten Papenbrock.

Optimizing iterative cross-platform programs

Today’s data processing landscape encompasses a vast number of data processing platforms, each with its own capabilities and performance characteristics. Picking and orchestrating the best combination of platforms for a given data processing task is not only difficult from an engineering perspective; it is also impossible to do statically, because parameters such as the size of the input data or the set of available platforms change. Rheem, a tool developed at HPI and the Qatar Computing Research Institute, frees developers from exactly that burden. Given a data processing plan, it automatically chooses a suitable combination of platforms and executes the plan accordingly.

In contrast to many other processing systems, Rheem considers DAG-shaped query plans with loop operators that are connected by feedback edges, thereby also establishing cyclic data flows. This important feature enables Rheem to support applications with iterations, such as machine learning and graph analytics applications. In fact, executing iterative data flows efficiently is important for extracting knowledge from big data in a timely manner. As of now, Rheem optimizes and executes loops in a static fashion. That is, once it has decided how to execute a loop, it cannot change that decision on the fly across iterations. This can lead to an inefficient execution of iterative programs, as their behavior can change from one iteration to another. For example, the amount of data to process can significantly shrink or grow after a certain number of iterations.

The proposed thesis aims at removing this shortcoming by letting Rheem adapt how it executes loops across iterations. To achieve this, several challenges need to be addressed. First, it requires devising techniques for efficient data movement among processing platforms and fast migration of the current state of iterative programs. Second, it is crucial to predict, among other things, how the size of the dataset to be processed changes from one iteration to another. Third, it is necessary to inject checkpoints into iterative programs that allow for changing processing platforms on the fly.
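To make the idea concrete, here is a toy Python sketch of re-deciding the platform at a checkpoint before every iteration. It does not use Rheem's API; the linear cost model, the migration cost, the greedy strategy, and all names (`Platform`, `plan_loop`) are hypothetical simplifications.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Platform:
    name: str
    overhead: float         # fixed cost per iteration (hypothetical units)
    cost_per_record: float  # variable cost per processed record


def iteration_cost(platform: Platform, records: int) -> float:
    """Assumed linear cost model: fixed overhead plus per-record cost."""
    return platform.overhead + platform.cost_per_record * records


def plan_loop(platforms, predicted_sizes, migration_cost_per_record):
    """Greedily pick the cheapest platform for each iteration.

    `predicted_sizes` holds the predicted number of records per iteration.
    Switching platforms additionally costs the migration of the loop state.
    """
    plan = []
    current = None
    for records in predicted_sizes:
        def total(p):
            # No migration cost for the first iteration or when staying put.
            switching = current is not None and p != current
            move = migration_cost_per_record * records if switching else 0.0
            return iteration_cost(p, records) + move

        current = min(platforms, key=total)
        plan.append(current.name)
    return plan


if __name__ == "__main__":
    platforms = [
        Platform("single-node", overhead=1.0, cost_per_record=0.01),
        Platform("cluster", overhead=50.0, cost_per_record=0.001),
    ]
    # The data shrinks sharply after two iterations.
    print(plan_loop(platforms, [100_000, 100_000, 1_000, 100], 0.001))
```

With these made-up numbers, the plan starts on the cluster and moves to the single node once the data has shrunk. A greedy per-iteration choice is not necessarily optimal over the whole loop; it only illustrates why size prediction, cheap state migration, and checkpoints all matter.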

For more information, contact Sebastian Kruse. Additionally, Rheem's source code is hosted on GitHub.

Web Science

Predictions on MOOCs using Textual Data

Massive Open Online Courses (MOOCs) are gaining popularity, allowing people of all ages and professions to attend online courses. Participants can view videos on demand and study the course material on their own time. Typically, MOOCs span multiple weeks, with weekly tests and a graded exam at the end. Participants discuss problems and questions in forums where other participants and the teaching team are also active. In this context, there are multiple interesting research fields that can be explored with the help of text mining techniques:
First, drop-out prediction: Given a set of events a user has created, predict whether the user will continue participating in the course. A master's thesis would aim to evaluate how the addition of forum data (e.g., posted questions, number of up-votes received, ...) can improve the prediction of drop-outs, and would also predict the weekly assignment and final exam grades of users based on their forum activity.
Second, user behaviour and forum activity can be leveraged to predict which forum threads a user would read next or in which threads their expertise could prove helpful. In a master's thesis, different prediction models would be implemented and evaluated using real-life data.
Third, forum data could be analysed in order to inform a user asking a question that a duplicate thread already exists, or that another thread is topically very similar and might therefore already contain the answer they are looking for. The main focus of a master's thesis on this topic would be the implementation and evaluation of different machine learning techniques and possibly a study of the effect such a system has on the users of the platform.
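
One simple baseline for the duplicate-thread scenario is TF-IDF weighting with cosine similarity. The Python sketch below is only a starting point under that assumption; the tokenizer, the smoothed IDF variant, and the function names are illustrative choices, and a thesis would compare such a baseline against other techniques.

```python
import math
import re
from collections import Counter


def tokenize(text):
    """Lowercase the text and split it into alphanumeric tokens."""
    return re.findall(r"[a-z0-9]+", text.lower())


def tfidf_vectors(documents):
    """Return one sparse TF-IDF vector (term -> weight dict) per document."""
    tokenized = [Counter(tokenize(doc)) for doc in documents]
    n = len(documents)
    # Document frequency: in how many documents each term occurs.
    doc_freq = Counter(term for counts in tokenized for term in counts)
    # Smoothed IDF, so terms occurring in every document keep a small weight.
    idf = {t: math.log((1 + n) / (1 + df)) + 1.0 for t, df in doc_freq.items()}
    return [{t: c * idf[t] for t, c in counts.items()} for counts in tokenized]


def cosine(a, b):
    """Cosine similarity of two sparse vectors; 0.0 if either is empty."""
    dot = sum(w * b.get(t, 0.0) for t, w in a.items())
    norm_a = math.sqrt(sum(w * w for w in a.values()))
    norm_b = math.sqrt(sum(w * w for w in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def most_similar_thread(question, threads):
    """Return (index, similarity) of the existing thread closest to the question."""
    vectors = tfidf_vectors(list(threads) + [question])
    query = vectors[-1]
    scores = [cosine(query, v) for v in vectors[:-1]]
    best = max(range(len(scores)), key=scores.__getitem__)
    return best, scores[best]


if __name__ == "__main__":
    threads = [
        "How do I submit the weekly assignment?",
        "Video for week 3 does not play",
        "When is the final exam graded?",
    ]
    print(most_similar_thread("Where can I submit my assignment for this week?", threads))
```

In practice, a similarity threshold would decide whether the best match is shown to the user at all; choosing that threshold is part of the evaluation.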

For more information please contact Dr. Ralf Krestel or Maximilian Jenders.


Automatic Classification of Twitter Conversations

Social media contain large amounts of textual data, albeit in unstructured form. A platform such as Twitter hosts political discussions, product reviews and recommendations, news, arguments about social developments, and so on, but also inconsequential everyday chatter, spam, advertising, and trivial bot-generated information (e.g., radio schedules, water levels). While the first set of tweets and conversations offers a potentially interesting data basis, which is increasingly becoming the foundation of various computational-linguistic as well as sociological, political, and applied studies, the second set is more of a nuisance that (at least for most purposes) has to be deliberately filtered out. It is often not clear in advance which kinds of communication and interaction take place on a platform and in what proportions, or what kind of message an individual tweet represents. For many tasks it is also important whether a tweet is embedded in a “conversation”, which arises from the reply-to relation between tweets (and thus between users) and can form a rather complex tree structure.
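
The tree structure induced by reply-to links can be reconstructed with a few lines of Python. This is a minimal sketch: the input format of `(tweet_id, reply_to_id)` pairs and the function names are assumptions for illustration and do not correspond to any particular Twitter API.

```python
from collections import defaultdict


def build_conversations(tweets):
    """Group tweets into reply trees.

    `tweets` is an iterable of (tweet_id, reply_to_id) pairs; reply_to_id is
    None for tweets that are not replies (a hypothetical input format).
    Returns (roots, children): the root ids and a parent -> [child ids] mapping.
    """
    tweets = list(tweets)
    known = {tweet_id for tweet_id, _ in tweets}
    children = defaultdict(list)
    roots = []
    for tweet_id, reply_to in tweets:
        if reply_to is None or reply_to not in known:
            # Replies to tweets missing from the data are treated as roots.
            roots.append(tweet_id)
        else:
            children[reply_to].append(tweet_id)
    return roots, dict(children)


def depth(tweet_id, children):
    """Number of tweets on the longest reply chain starting at tweet_id."""
    replies = children.get(tweet_id, [])
    return 1 + max((depth(c, children) for c in replies), default=0)


def size(tweet_id, children):
    """Total number of tweets in the conversation rooted at tweet_id."""
    return 1 + sum(size(c, children) for c in children.get(tweet_id, []))


if __name__ == "__main__":
    data = [(1, None), (2, 1), (3, 1), (4, 2), (5, None), (6, 99)]
    roots, children = build_conversations(data)
    for root in roots:
        print(root, "size:", size(root, children), "depth:", depth(root, children))
```

Structural features such as conversation size and depth could then serve, alongside textual features, as input to a classifier.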
Further details
Contact: Tatjana Scheffler, Ph.D., tatjana.scheffler(at)uni-potsdam.de (Computational Linguistics)
Supervisor at HPI: Felix Naumann