Prof. Dr. Felix Naumann

Open theses

The information systems group is always looking for good master students to write master's theses. The theses can be in one of the following broad areas:

  • Duplicate Detection
  • Data Profiling
  • Linked Data
  • Text Mining
  • Recommender Systems
  • Information Retrieval
  • Natural Language Processing

Please note that the list below is only a small sample of possible thesis topics and ideas. Please contact us to discuss further, to find new topics, or to suggest a topic of your own.

For more information about writing a master's thesis in our group, please see here.

Information Systems

Data Exploration with Profiling Results

With data profiling, we can efficiently detect numerous statistics and dependencies within a given dataset. Such profiling results help to explore the data, its content, and its inner logic. The amount of discovered metadata is, however, often so overwhelmingly large that interesting patterns and relevant statements are impossible to see. For this reason, this master's thesis aims to investigate visual and analytical methods that bring these insights to light. One core task for these methods is to separate random results from semantically meaningful ones - a classification task that could be well suited for machine learning algorithms. Another aspect of data exploration is to find patterns in the metadata, such as cliques, chains, hubs, and authorities, that help to assess the relevance and the connection of schema attributes. The overall goal of data exploration is to extract as many insights about the data as possible from its metadata.

For more information please contact Prof. Felix Naumann or Thorsten Papenbrock.

Progressive Data Profiling

Data profiling is often a very time-consuming task. The discovery of complex dependencies in particular, such as functional dependencies (FDs) and inclusion dependencies (INDs), can take many hours or even days. Inspired by the Pareto principle, which says that you can usually get 80% of your tasks done in 20% of your time, this master's thesis is about developing a progressive solution for the discovery of FDs/INDs. More specifically, the algorithm should discover as many dependencies as possible in a given amount of time. Such a progressive algorithm would help data scientists to explore their data before expensive exhaustive profiling is conducted. It should also be able to deliver some results for datasets that cannot be fully profiled due to time or budget restrictions.
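
To make the idea of a time budget concrete, the following is a minimal sketch and not the algorithm the thesis is expected to produce. It assumes a naive candidate enumeration (smallest left-hand sides first) and a simple hash-based validity check; the function names `holds` and `progressive_fds` and the toy table are made up for illustration.

```python
import time
from itertools import combinations


def holds(rows, lhs, rhs):
    """Return True if the columns in lhs functionally determine column rhs."""
    seen = {}
    for row in rows:
        key = tuple(row[a] for a in lhs)
        # setdefault stores the first rhs value seen for this key
        if seen.setdefault(key, row[rhs]) != row[rhs]:
            return False
    return True


def progressive_fds(rows, attributes, budget_seconds):
    """Yield minimal FDs (lhs, rhs), small left-hand sides first, until the budget is spent."""
    deadline = time.monotonic() + budget_seconds
    found = []
    for size in range(1, len(attributes)):
        for lhs in combinations(attributes, size):
            for rhs in attributes:
                if rhs in lhs:
                    continue
                # Skip candidates already implied by a smaller FD that was found.
                if any(r == rhs and set(l) <= set(lhs) for l, r in found):
                    continue
                if time.monotonic() >= deadline:
                    return  # budget exhausted: stop with what we have
                if holds(rows, lhs, rhs):
                    found.append((lhs, rhs))
                    yield lhs, rhs


# Hypothetical toy table
rows = [
    {"zip": "14482", "city": "Potsdam", "name": "A"},
    {"zip": "14482", "city": "Potsdam", "name": "B"},
    {"zip": "10115", "city": "Berlin", "name": "A"},
]
for lhs, rhs in progressive_fds(rows, ["zip", "city", "name"], budget_seconds=1.0):
    print(lhs, "->", rhs)
```

Because the function is a generator, a caller sees each dependency as soon as it is validated. A real progressive algorithm would additionally have to decide which candidates are most promising to check first, which is the actual research question.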

For more information please contact Prof. Felix Naumann or Thorsten Papenbrock.

Metric Functional Dependencies and Matching Dependencies

Metric functional dependencies (MFDs) and matching dependencies (MDs) are both functional dependencies that incorporate some form of distance or similarity between values. Values therefore need not be strictly equal to satisfy the dependency; it is enough that they are close in some sense. In GEO data, for instance, the same location might be represented with slightly varying coordinates due to measurement inaccuracy, but it is always the same location. MFDs consider such cases with metrics that capture and evaluate distances - in this case, distances between GEO coordinates. MDs, in contrast, measure distances between values with similarity functions that also quantify the distance between non-numerical data. For instance, the edit distance between "Caribbean" and "Carribean" is two, that is, two edits are necessary to turn one string into the other. MFDs and MDs are very important for tasks such as data cleaning or data integration, but they are also very hard to discover. The goal of this master's thesis is to develop an algorithm that automatically discovers all MFDs/MDs of a given relational dataset.
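
The string example can be checked with a short sketch. It assumes the Levenshtein edit distance as the similarity function and a simplified, MFD-style check (exact match on the left-hand side, bounded distance on the right-hand side); the name `relaxed_fd_holds`, the threshold, and the sample rows are illustrative assumptions, not part of any existing discovery algorithm.

```python
def levenshtein(a, b):
    """Minimum number of single-character insertions, deletions, and substitutions turning a into b."""
    previous = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        current = [i]
        for j, cb in enumerate(b, start=1):
            current.append(min(
                previous[j] + 1,               # delete ca
                current[j - 1] + 1,            # insert cb
                previous[j - 1] + (ca != cb),  # substitute, or match at no cost
            ))
        previous = current
    return previous[-1]


def relaxed_fd_holds(rows, lhs, rhs, distance, threshold):
    """Check that rows agreeing on lhs have rhs values within threshold of each other."""
    groups = {}
    for row in rows:
        groups.setdefault(row[lhs], []).append(row[rhs])
    return all(
        distance(x, y) <= threshold
        for values in groups.values()
        for i, x in enumerate(values)
        for y in values[i + 1:]
    )


# Hypothetical sample data
rows = [
    {"code": "CB", "region": "Caribbean"},
    {"code": "CB", "region": "Carribean"},
    {"code": "EU", "region": "Europe"},
]
print(levenshtein("Caribbean", "Carribean"))                      # 2
print(relaxed_fd_holds(rows, "code", "region", levenshtein, 2))  # True
print(relaxed_fd_holds(rows, "code", "region", levenshtein, 1))  # False
```

Validating one given candidate is the easy part. The difficulty addressed by the thesis lies in searching the space of attribute combinations, distance functions, and thresholds.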

For more information please contact Prof. Felix Naumann or Thorsten Papenbrock.

Relationship Extraction for Lazy Data Scientists

Machine learning and deep learning are increasingly seen as silver bullets for many difficult knowledge mining tasks. Stanford’s InfoLab has proposed and provides the novel Snorkel platform (http://hazyresearch.github.io/snorkel/), which alleviates many of the difficult preparatory tasks involved in applying learning methods to real-world problems. In particular, it significantly reduces the painful work of creating training data by reducing the task to the creation of a few simple tagging functions in lieu of manually tagging actual data.

We propose to apply this technique to the notoriously difficult problems of named entity recognition (NER) and relationship extraction (RE). One particular goal is to find out how few and how simple the tagging functions can be while still producing useful results.
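
To illustrate what "a few simple tagging functions" might look like, here is a minimal sketch in plain Python. It deliberately does not use the Snorkel API; the three functions, their regular expressions, the label constants, and the example sentences are all hypothetical. The votes are combined by a simple majority, whereas Snorkel itself models the accuracy of each function, so this shows only the general idea.

```python
import re
from collections import Counter

ABSTAIN, NEGATIVE, POSITIVE = -1, 0, 1


def lf_works_for(sentence):
    """Vote POSITIVE for an employment relation on typical employment verbs."""
    return POSITIVE if re.search(r"\b(works for|employed by|joined)\b", sentence) else ABSTAIN


def lf_visited(sentence):
    """Vote NEGATIVE on verbs that link a person and a company without employment."""
    return NEGATIVE if re.search(r"\b(visited|criticized)\b", sentence) else ABSTAIN


def lf_title(sentence):
    """Vote POSITIVE when a job title is followed by 'of' or 'at'."""
    return POSITIVE if re.search(r"\b(CEO|engineer|professor) (of|at)\b", sentence) else ABSTAIN


LABELING_FUNCTIONS = [lf_works_for, lf_visited, lf_title]


def weak_label(sentence):
    """Combine the non-abstaining votes by majority; abstain on ties or when no function fires."""
    votes = [lf(sentence) for lf in LABELING_FUNCTIONS]
    counts = Counter(v for v in votes if v != ABSTAIN)
    if not counts:
        return ABSTAIN
    ranked = counts.most_common()
    if len(ranked) > 1 and ranked[0][1] == ranked[1][1]:
        return ABSTAIN  # tie between labels
    return ranked[0][0]


for s in [
    "Alice works for Acme as an engineer at the Berlin office.",
    "Bob visited Acme last week.",
    "Carol likes tea.",
]:
    print(weak_label(s), s)
```

The weak labels produced this way would then serve as training data for a downstream NER or RE model, in place of manually tagged sentences.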

For more information please contact Prof. Felix Naumann or Michael Loster.

Web Science

Combining Hierarchical and Labeled Topic Modeling for Patent Classification

Patent documents are traditionally classified using a complex hierarchy of categories. The goal of this Master's thesis is to automatically assign these category labels for new patent applications. This should be achieved by developing a model that takes the labels of the training data into account as well as the hierarchical structure of the categories. There already exist topic models dealing with labels (Labeled-LDA) and topic models dealing with hierarchical topic structure (HLDA). We want to combine both aspects.

This is part of the Topic modeling research project. For more information please contact Dr. Ralf Krestel.

Analyzing Temporal Dependencies between Industry and Science Publications

Patent documents and scientific articles often talk about the same problems but use different wording. By jointly analyzing both genres we want to investigate the dynamics in different areas: Which domains are driven by research and which by application? For a new invention, do patents or research papers appear first? In this Master's thesis we will investigate various techniques, such as topic models and word embeddings, to identify similar topics in the two genres. As a second step, we will analyze the temporal behavior of these topics in order to predict future trends in science and/or industry.

This is part of the Cross-collection analysis project. For more information please contact Dr. Ralf Krestel.

Classifying Business E-mails

Large companies often lose track of the topics discussed in e-mails with customers or between employees. This makes it hard to identify experts for a topic in the company. Furthermore, it is difficult in retrospect to identify the persons responsible or to trace decision processes within larger projects. In this Master's thesis we will use the Enron e-mail dataset to develop a classifier for business communication. The goal is to use state-of-the-art text mining methods to group e-mails into (predefined?) categories.

This is part of the Analyzing business communication project. For more information please contact Dr. Ralf Krestel.


Automatic Classification of Twitter Conversations

Social media contain large amounts of textual data, which, however, are available only in unstructured form. A platform such as Twitter hosts political discussions, product reviews and recommendations, news, arguments about social developments, and so on, but also trivial everyday chatter, spam, advertising, and trivial bot-generated information (e.g., radio schedules, water levels). The first set of tweets and conversations offers a potentially interesting data basis, which is increasingly becoming the foundation of various studies in computational linguistics as well as in sociology, political science, and applied fields. The second set is rather a source of noise that (at least for most purposes) has to be filtered out selectively. It is often not clear in advance what share each kind of communication and interaction has on a platform, or what kind of message an individual tweet is. For many tasks it also matters whether a tweet is embedded in a "conversation", as created by the reply-to relation between tweets (and thus between users), which can form a quite complex tree structure.
Further details
Contact: Tatjana Scheffler, Ph.D., tatjana.scheffler(at)uni-potsdam.de (Computational Linguistics)
Supervisor at HPI: Felix Naumann