Hasso-Plattner-Institut
Prof. Dr. Felix Naumann
  
 

Open theses

The information systems group is always looking for good master's students to write their theses with us. A thesis can be in one of the following broad areas:

  • Duplicate Detection
  • Data Profiling
  • Linked Data
  • Text Mining
  • Recommender Systems
  • Information Retrieval
  • Natural Language Processing

Please note that the list below is only a small sample of possible thesis topics and ideas. Please contact us to discuss further, to find new topics, or to suggest a topic of your own.

For more information about writing a master's thesis in our group, please see here.


Information Systems

From UCCs to keys

The efficient discovery of unique column combinations (UCCs) is a well-known and much-researched problem. Each UCC is a possible key for the relation at hand. The task of this thesis is to extract from the often very large set of UCCs those that in fact represent a key, i.e., those that a database administrator would choose. This task is of utmost relevance to real-world profiling tools, as it makes profiling results actionable. One approach is the use of heuristics (size of the UCC, substrings of column names, etc.); another might be to choose a set of features and train a machine learning model on relations with known keys.
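The heuristic approach mentioned above could look roughly like the following sketch. It is only an illustration: the `Ucc` class, the hint tokens in `KEY_HINTS`, and the equal weighting of the two features are assumptions made for this example, not part of any existing profiling tool.

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class Ucc:
    """A unique column combination, given by its column names (hypothetical type)."""
    columns: tuple

# Assumed name tokens that often appear in key columns.
KEY_HINTS = {"id", "key", "nr", "no", "code"}

def key_score(ucc):
    """Score a UCC; a higher score means a more plausible key (illustrative weights)."""
    # Feature 1: smaller UCCs are preferred (1.0 for a single column).
    size_feature = 1.0 / len(ucc.columns)
    # Feature 2: share of columns whose name contains a key-like token.
    hinted = sum(
        any(token in KEY_HINTS for token in column.lower().split("_"))
        for column in ucc.columns
    )
    name_feature = hinted / len(ucc.columns)
    return 0.5 * size_feature + 0.5 * name_feature

def rank_uccs(uccs):
    """Return the UCCs sorted from most to least plausible key."""
    return sorted(uccs, key=key_score, reverse=True)

candidates = [
    Ucc(("first_name", "last_name", "birthdate")),
    Ucc(("customer_id",)),
    Ucc(("email",)),
]
ranked = rank_uccs(candidates)
```

The same two features could also serve as input to a trained classifier instead of being combined with fixed weights.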

For more information please contact Prof. Felix Naumann or Thorsten Papenbrock.

Optimizing iterative cross-platform programs

Today’s data processing landscape encompasses a vast number of data processing platforms, each with its own capabilities and performance characteristics. Picking and orchestrating the best combination of platforms for a given data processing task is not only difficult from an engineering perspective; it is also impossible to do so statically, because parameters such as the size of the input data or the available platforms change. Rheem, a tool developed at HPI and the Qatar Computing Research Institute, frees developers from exactly that burden. Given a data processing plan, it automatically chooses a suitable combination of platforms and executes the plan accordingly.

In contrast to many other processing systems, Rheem considers DAG-shaped query plans with loop operators that are connected by feedback edges, thereby also allowing cyclic data flows. This important feature enables Rheem to support applications with iterations, such as machine learning and graph analytics applications. Executing iterative data flows efficiently is important for extracting knowledge from big data in a timely manner. As of now, Rheem optimizes and executes loops in a static fashion. That is, once it has decided how to execute a loop, it cannot change that decision on the fly across iterations. This can lead to an inefficient execution of iterative programs, as their behavior can change from one iteration to another. For example, the amount of data to process can shrink or grow significantly after a certain number of iterations.

The proposed thesis aims at removing this shortcoming by letting Rheem adapt how it executes loops across iterations. To achieve this, several challenges need to be addressed. First, it requires devising techniques for efficient data movement among processing platforms and fast migration of the current state of iterative programs. Second, it is crucial to predict, among other things, how the size of the dataset to be processed changes from one iteration to another. Third, it is necessary to inject checkpoints into iterative programs that allow for changing processing platforms on the fly.
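To make the idea of re-deciding at checkpoints more concrete, here is a minimal sketch of a greedy per-iteration platform choice. It does not use Rheem's API; the platform names, the cost models in `COST_MODELS`, and the `MIGRATION_COST` constant are all invented for this illustration, and the predicted sizes are assumed to be given.

```python
# Hypothetical cost models: estimated seconds to process n records per platform.
COST_MODELS = {
    "single-node": lambda n: 0.001 * n,      # no start-up cost, slow per record
    "cluster": lambda n: 5.0 + 0.0001 * n,   # fixed overhead, fast per record
}
MIGRATION_COST = 2.0  # assumed seconds to move the loop state to another platform

def choose_platform(current, predicted_size):
    """Pick the cheapest platform for the next iteration, charging for migration."""
    def total_cost(platform):
        cost = COST_MODELS[platform](predicted_size)
        if current is not None and platform != current:
            cost += MIGRATION_COST
        return cost
    return min(COST_MODELS, key=total_cost)

def plan_loop(predicted_sizes):
    """Return the platform chosen at the checkpoint before each iteration."""
    current = None
    plan = []
    for size in predicted_sizes:
        current = choose_platform(current, size)
        plan.append(current)
    return plan

# Input that shrinks over the iterations, as in the example from the text.
plan = plan_loop([100_000, 50_000, 1_000, 100])
```

With these assumed numbers, the plan starts on the cluster and switches to the single node once the input has shrunk enough to outweigh the migration cost. A greedy choice like this looks only one iteration ahead; a real optimizer would likely consider the whole remaining loop.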

For more information, contact Sebastian Kruse. Additionally, Rheem's source code is hosted on GitHub.

Web Science

Combining Hierarchical and Labeled Topic Modeling for Patent Classification

Patent documents are traditionally classified using a complex hierarchy of categories. The goal of this Master's thesis is to automatically assign these category labels for new patent applications. This should be achieved by developing a model that takes the labels of the training data into account as well as the hierarchical structure of the categories. There already exist topic models dealing with labels (Labeled-LDA) and topic models dealing with hierarchical topic structure (HLDA). We want to combine both aspects.

This is part of the Topic modeling research project. For more information please contact Dr. Ralf Krestel.

Detecting Offensive Comments in a German Online Newspaper

Hate speech and inappropriate comments are a huge problem for social media platforms and online news media. Today, reader participation is possible on all major German news sites. This means that the news providers have to ensure that all comments posted on their sites adhere to certain standards and follow the rules. In practice, this means that thousands of comments have to be checked manually, which is very expensive. We aim to automate the filtering so that only those comments have to be checked for which the system is in doubt. Especially interesting is the adaptation of the learner over time to account for the daily change in topics and opinions. Methods such as word2vec should be adapted to the classification of comments and to the dynamic nature of the target domain.
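The idea of checking only the comments where the system is in doubt can be sketched as a simple triage on classifier confidence. The thresholds `low` and `high` and the function names are assumptions for this example; the probabilities would come from whatever classifier the thesis develops.

```python
def route_comment(p_offensive, low=0.2, high=0.8):
    """Route a comment based on the predicted probability that it is offensive.

    The thresholds are illustrative; in practice they would be tuned
    against the cost of manual review and of wrong automatic decisions.
    """
    if p_offensive >= high:
        return "reject"   # confidently offensive: filter automatically
    if p_offensive <= low:
        return "accept"   # confidently harmless: publish automatically
    return "review"       # system is in doubt: send to a human moderator

def triage(scored_comments):
    """Group (comment_id, probability) pairs into accept, reject, and review queues."""
    queues = {"accept": [], "reject": [], "review": []}
    for comment_id, p_offensive in scored_comments:
        queues[route_comment(p_offensive)].append(comment_id)
    return queues

queues = triage([("c1", 0.05), ("c2", 0.95), ("c3", 0.5), ("c4", 0.2)])
```

Only the comments in the `review` queue would need a manual check.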

This is part of the News analysis project. For more information please contact Dr. Ralf Krestel.

Analyzing Temporal Dependencies between Industry and Science Publications

Patent documents and scientific articles often talk about the same problems but use different wording. By jointly analyzing both genres, we want to investigate the dynamics in different areas: Which domains are driven by research and which by application? Does a new invention first appear in patents or in research papers? In this Master's thesis we will investigate various techniques, such as topic models and word embeddings, to identify similar topics in the two genres. As a second step, we will analyze the temporal behavior of these topics in order to predict future trends in science and/or industry.

This is part of the Cross-collection analysis project. For more information please contact Dr. Ralf Krestel.

Classifying Business E-mails

Large companies often lose track of the topics discussed in e-mails with customers or between employees. This makes it hard to identify experts for a topic in the company. Furthermore, it is difficult in retrospect to identify the persons responsible or to trace decision processes within larger projects. In this Master's thesis we will use the Enron e-mail dataset to develop a classifier for business communication. The goal is to use state-of-the-art text mining methods to group e-mails into (predefined?) categories.

This is part of the Analyzing business communication project. For more information please contact Dr. Ralf Krestel.

 

Automatic Classification of Twitter Conversations

Social media contain large amounts of textual data, which, however, are available only in unstructured form. A platform such as Twitter hosts political discussions, product reviews and recommendations, news, arguments about social developments, etc., but also trivial everyday chatter, spam, advertising, and trivial bot-generated information (e.g., radio schedules, water levels). While the first set of tweets and conversations offers a potentially interesting data basis, which is increasingly becoming the foundation of various studies in computational linguistics as well as in sociology, political science, and applied fields, the second set is rather a source of noise that (at least for most purposes) must be filtered out deliberately. It is often not clear in advance which kinds of communication and interaction take place on a platform and in what proportions, or which kind of message an individual tweet represents. For many tasks it is also important whether the tweet is embedded in a "conversation", which arises from the reply-to relation between tweets (and thus between users) and can form a fairly complex tree structure.
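The tree structure formed by reply-to links can be illustrated with a small sketch. The input format, pairs of a tweet id and the id of the tweet it replies to, is an assumption for this example and not the format of any particular Twitter API.

```python
from collections import defaultdict

def build_conversations(tweets):
    """Build reply trees from (tweet_id, reply_to_id or None) pairs.

    Returns the list of conversation roots and a map from tweet id to replies.
    A tweet whose parent is not in the dataset is treated as a root.
    """
    tweets = list(tweets)
    known_ids = {tweet_id for tweet_id, _ in tweets}
    children = defaultdict(list)
    roots = []
    for tweet_id, reply_to in tweets:
        if reply_to is None or reply_to not in known_ids:
            roots.append(tweet_id)
        else:
            children[reply_to].append(tweet_id)
    return roots, children

def depth(tweet_id, children):
    """Number of tweets on the longest reply chain starting at tweet_id."""
    replies = children.get(tweet_id, [])
    return 1 + max((depth(reply, children) for reply in replies), default=0)

# Tweet 6 replies to a tweet (99) that is missing from the sample.
sample = [(1, None), (2, 1), (3, 1), (4, 2), (5, None), (6, 99)]
roots, children = build_conversations(sample)
```

Features such as the depth or the branching of a conversation could then be used alongside the text of the tweets for classification.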
Further details
Contact: Tatjana Scheffler, Ph.D., tatjana.scheffler(at)uni-potsdam.de (Computational Linguistics)
Supervisor at HPI: Felix Naumann