Prof. Dr. Felix Naumann

Open theses

The information systems group is always looking for good master's students to write master's theses. The theses can be in one of the following broad areas:

  • Duplicate Detection
  • Data Profiling
  • Linked Data
  • Text Mining
  • Recommender Systems
  • Information Retrieval
  • Natural Language Processing

Please note that the list below is only a small sample of possible thesis topics and ideas. Please contact us to discuss further, to find new topics, or to suggest a topic of your own.

For more information about writing a master's thesis in our group, please see here.

Information Systems

Data Exploration with Profiling Results

With data profiling, we can efficiently detect numerous statistics and dependencies within a given dataset. Such profiling results help to explore the data, its content, and its inner logic. The amount of discovered metadata is, however, often so overwhelmingly large that interesting patterns and relevant statements are impossible to see. For this reason, this master's thesis aims to investigate visual and analytical methods that bring these insights to light. One core task for these methods is to separate random results from semantically meaningful ones, a classification task that could be well suited for machine learning algorithms. Another aspect of data exploration is to find patterns in the metadata, such as cliques, chains, hubs, and authorities, that help to assess the relevance and the connection of schema attributes. The overall goal of data exploration is to extract as many insights about the data as possible from its metadata.

For more information please contact Prof. Felix Naumann or Thorsten Papenbrock.

Progressive Data Profiling

Data profiling is often a very time-consuming task. In particular, the discovery of complex dependencies, such as functional dependencies (FDs) and inclusion dependencies (INDs), can take many hours or even days. Inspired by the Pareto principle, which says that you can usually get 80% of your tasks done in 20% of your time, this master's thesis is about developing a progressive solution for the discovery of FDs/INDs. More specifically, the algorithm should discover as many dependencies as possible in a given amount of time. Such a progressive algorithm would help data scientists to explore their data before expensive exhaustive profiling is conducted. It should also be able to deliver some results for datasets that cannot be fully profiled due to time or budget restrictions.
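To make the idea of "as many dependencies as possible in a given amount of time" concrete, here is a minimal sketch in Python. It assumes a naive level-wise enumeration of FD candidates (smallest left-hand sides first) and simply stops when a time budget runs out. The names `holds`, `progressive_fds`, and `budget_seconds` are hypothetical and chosen for illustration only. This is not the algorithm the thesis is meant to develop, and it ignores FDs with an empty left-hand side.

```python
import time
from itertools import combinations


def holds(rows, lhs, rhs):
    """Return True if the functional dependency lhs -> rhs holds in rows."""
    seen = {}
    for row in rows:
        key = tuple(row[a] for a in lhs)
        # The first row with this key fixes the expected rhs value.
        if seen.setdefault(key, row[rhs]) != row[rhs]:
            return False
    return True


def progressive_fds(rows, attributes, budget_seconds):
    """Yield minimal FDs, smallest left-hand sides first, until time runs out."""
    deadline = time.monotonic() + budget_seconds
    found = []  # (lhs, rhs) pairs discovered so far
    for size in range(1, len(attributes)):
        for lhs in combinations(attributes, size):
            for rhs in attributes:
                if rhs in lhs:
                    continue
                if time.monotonic() > deadline:
                    return  # budget exhausted: keep what was yielded so far
                # Skip non-minimal candidates: a subset of lhs already determines rhs.
                if any(r == rhs and set(l) <= set(lhs) for l, r in found):
                    continue
                if holds(rows, lhs, rhs):
                    found.append((lhs, rhs))
                    yield lhs, rhs


# Toy data, invented for this example.
rows = [
    {"zip": "14482", "city": "Potsdam", "name": "A"},
    {"zip": "14482", "city": "Potsdam", "name": "B"},
    {"zip": "10115", "city": "Berlin", "name": "C"},
]
for lhs, rhs in progressive_fds(rows, ["zip", "city", "name"], budget_seconds=1.0):
    print(", ".join(lhs), "->", rhs)
```

Because the function is a generator, a caller can consume results as they arrive and stop at any point. A real progressive algorithm would additionally have to decide which candidates are most promising to check first.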

For more information please contact Prof. Felix Naumann or Thorsten Papenbrock.

Relationship Extraction for Lazy Data Scientists

Machine learning and deep learning are increasingly seen as silver bullets for many difficult knowledge mining tasks. Stanford’s InfoLab has proposed and provides the novel Snorkel platform (http://hazyresearch.github.io/snorkel/), which alleviates many of the difficult preparatory tasks involved in applying learning methods to real-world problems. In particular, it significantly reduces the painful work of creating training data by reducing the task to the creation of a few simple tagging functions in lieu of manually tagging actual data.

We propose to apply this technique to the notoriously difficult problems of named entity recognition (NER) and relationship extraction (RE). One particular goal is to find out how few and how simple the tagging functions can be while still producing useful results.
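The following sketch illustrates what "a few simple tagging functions" could look like for a toy relationship extraction task. It is written in plain Python and deliberately does not use Snorkel's actual API. The label constants, the three functions, and the example sentences are all hypothetical. Majority voting is used here as a simple stand-in for combining the votes and should not be read as how the platform itself aggregates them.

```python
from collections import Counter

# Hypothetical label values for a toy "company acquires company" relation.
ABSTAIN, NO_RELATION, ACQUISITION = -1, 0, 1


def lf_keyword_acquire(sentence):
    """Vote ACQUISITION if an acquisition verb appears."""
    words = [w.strip(".,;:!?") for w in sentence.lower().split()]
    hit = any(w in ("acquired", "acquires", "bought") for w in words)
    return ACQUISITION if hit else ABSTAIN


def lf_negation(sentence):
    """Vote NO_RELATION if the sentence contains the word 'not'."""
    return NO_RELATION if " not " in f" {sentence.lower()} " else ABSTAIN


def lf_competitor(sentence):
    """Vote NO_RELATION if the sentence talks about competitors."""
    return NO_RELATION if "competitor" in sentence.lower() else ABSTAIN


LABELING_FUNCTIONS = [lf_keyword_acquire, lf_negation, lf_competitor]


def weak_label(sentence):
    """Combine non-abstaining votes by majority; abstain on ties or no votes."""
    votes = [lf(sentence) for lf in LABELING_FUNCTIONS]
    counts = Counter(v for v in votes if v != ABSTAIN)
    if not counts:
        return ABSTAIN
    ranked = counts.most_common(2)
    if len(ranked) == 2 and ranked[0][1] == ranked[1][1]:
        return ABSTAIN  # conflicting votes of equal weight
    return ranked[0][0]


print(weak_label("Acme acquired Widget Corp in 2015."))
print(weak_label("Acme has not acquired Widget Corp."))
```

The second sentence shows why the number and quality of the functions matter: two functions disagree, and with so few votes the conflict cannot be resolved.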

For more information please contact Prof. Felix Naumann or Michael Loster.

Automatic Text Segmentation for Annual Reports

Annual reports are an important source of information for banks' risk assessment. Many employees of banks, insurance companies, and service providers still spend their valuable time reading these reports. A digital understanding of the documents can help to present users directly with the passages that are relevant to them. All annual reports must contain the following components: balance sheet (Bilanz, with assets and liabilities), income statement (Gewinn- und Verlustrechnung), statement of changes in equity (Eigenkapitalveränderungsrechnung), cash flow statement (Kapitalflussrechnung), notes (Anhang), and, for capital-market-oriented companies, segment reporting (Segmentberichterstattung).

The challenge lies in the heterogeneity of the data. Reports may be available as PDF files, on paper, or in a digital format created from paper. In the latter case, OCR errors play a non-negligible role. The documents may also be written in different languages. The main focus, however, is on German documents.

The goal of the master's thesis is to segment annual reports automatically and to extract a table of contents:

  1. Segmentation: Locate the fixed components: management report (Lagebericht), balance sheet (Bilanz), income statement (GuV), statement of changes in equity (Eigenkapitalveränderungsrechnung), cash flow statement (Kapitalflussrechnung), and notes (Anhang).
  2. Structuring: Create a table of contents for the document based on font sizes, bold print, position, and other meta-information.

The choice of techniques is open and may include, for example, screen scraping to recognize the page layout, similarity measures to detect OCR errors, or machine learning to learn recurring substructures. Sample data can be provided both as PDF files and as pre-structured XML files.
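As one illustration of a similarity measure that tolerates OCR errors, the sketch below matches text lines against the German section titles named above using `difflib.SequenceMatcher` from the Python standard library. The title list, the threshold of 0.8, and the function names `best_section` and `segment` are assumptions made for this example. A real solution would also use layout features such as font size and position.

```python
from difflib import SequenceMatcher

# German section titles, since the focus is on German documents.
SECTION_TITLES = [
    "Lagebericht",
    "Bilanz",
    "Gewinn- und Verlustrechnung",
    "Eigenkapitalveränderungsrechnung",
    "Kapitalflussrechnung",
    "Anhang",
]


def best_section(line, threshold=0.8):
    """Return the section title most similar to line, or None below threshold."""
    candidate = line.strip().lower()
    best_title, best_score = None, 0.0
    for title in SECTION_TITLES:
        score = SequenceMatcher(None, candidate, title.lower()).ratio()
        if score > best_score:
            best_title, best_score = title, score
    return best_title if best_score >= threshold else None


def segment(lines):
    """Map each detected section title to the index of the line where it starts."""
    starts = {}
    for index, line in enumerate(lines):
        title = best_section(line)
        if title is not None and title not in starts:
            starts[title] = index
    return starts


# Invented sample lines; "1" stands in for an OCR misreading of "l".
lines = ["Geschäftsbericht 2015", "Bi1anz", "Aktiva", "Kapita1flussrechnung", "Anhang"]
print(segment(lines))
```

The threshold trades recall against false positives: a lower value tolerates more OCR noise but is more likely to mistake ordinary lines for headings.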

The master's thesis is offered in cooperation with PPA: Since 2000, PPA has established itself as a service provider for the capture of balance sheet data. Its customers include the largest and most successful financial service providers in Germany and Switzerland, which benefit from reliably captured data for financial ratings in line with current regulatory requirements. More than 100 qualified PPA employees capture around 150,000 financial statements per year at the sites in Darmstadt and Zurich. PPA is owner-managed, independent, and works to the highest quality standards.

If you are interested, please contact felix.naumann@hpi.de

Wikipedia Table Layout Detection and Standardization

Web tables are an extensive source of information. On Wikipedia, for instance, tables serve as a concise way of presenting data on various subjects. However, those tables are designed to be human-readable and hence pose a number of challenges to algorithmic interpretation.
In our current project, we want to explore changes in web tables. In order to distinguish between data changes and layout changes in those tables, it is essential to classify the table layouts and transform the data into a standardized format. The aim of this thesis is to classify table layout types according to a well-defined taxonomy, segment tables if necessary, and transform the tables' content into a standardized format suitable for a relational database. In contrast to related work, we do not want to consider single snapshots of tables, but to use each table's whole history as an additional input.


  • Establish a taxonomy of table layouts (with the help of related work)
  • Create or find a gold standard (ground truth)
  • Design, implement, and evaluate a web table classification
    • With the help of a table's history
  • Transform web tables into a standardized relational format
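The last step can be illustrated for one simple layout type. The sketch below assumes a table that has already been classified as a plain horizontal listing (one header row, first column naming the entity) and turns it into (entity, attribute, value) tuples that could be loaded into a relational table. The function name `unpivot` and the sample data are hypothetical. Other layout types in the taxonomy would need their own transformations.

```python
def unpivot(header, rows):
    """Turn a wide table into (entity, attribute, value) tuples.

    Assumes the first column identifies the entity and every other
    column is an attribute; empty cells are skipped.
    """
    _, *attribute_names = header
    records = []
    for row in rows:
        entity, *values = row
        for attribute, value in zip(attribute_names, values):
            if value not in ("", None):
                records.append((entity, attribute, value))
    return records


# Invented sample table.
header = ["Team", "Wins", "Losses"]
rows = [["A", "3", "1"], ["B", "2", ""]]
for record in unpivot(header, rows):
    print(record)
```

Applying the same transformation to every revision of a table yields comparable tuples, so that a changed value shows up as a data change rather than a layout change.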

For more information please contact Prof. Felix Naumann, Tobias Bleifuß or Leon Bornemann.

Web Science Group

Business Communication Analysis

Today's business communication is almost unimaginable without emails. They document discussions and decisions or summarise face-to-face meetings in the form of unstructured text or attachments and thus hold a significant amount of information about a business. In very exceptional cases, for example when investigating a known case of fraud, specialists examine inboxes and attached files of involved personnel to determine the extent of the situation. However, the sheer quantity of data is unmanageable without some guidance by an exploration tool. In this project, we develop and evaluate methods and combine them in a novel exploration tool. This work touches the fields of text mining, text summarisation, document classification, topic modelling, named entity extraction, entity linking, and relationship extraction, as well as social network and graph analysis. We work together with our industry partner from the financial sector to put our prototypes in the hands of auditors for real-world feedback.

For more detailed ideas see this page or contact Tim Repke.

Natural Language Processing for Patent Retrieval

Granted patents form an extensive knowledge base for information retrieval, which is an interesting research field for academia and industry. Domain-specific terminology in particular is challenging for state-of-the-art approaches. Therefore, this master's thesis focuses on document representations that are able to capture a patent's topics. These representations are the basis for a patent retrieval algorithm.

In this master's thesis, you will jointly mine both the topical and the spatial aspects of a dataset of 5 million patents in order to improve current retrieval models. For example, the inventor's address can be geocoded to the actual geolocation, so that regional patterns can be found. Besides regional patterns, you will analyse how patent topics change over time. To this end, you will deal with topic modeling, document embedding, and geocoding.

For more information please contact Julian Risch.

Combining Hierarchical and Labeled Topic Modeling for Patent Classification

Patent documents are traditionally classified using a complex hierarchy of categories. The goal of this Master's thesis is to automatically assign these category labels for new patent applications. This should be achieved by developing a model that takes the labels of the training data into account as well as the hierarchical structure of the categories. There already exist topic models dealing with labels (Labeled-LDA) and topic models dealing with hierarchical topic structure (HLDA). We want to combine both aspects.

This is part of the Topic modeling research project. For more information please contact Dr. Ralf Krestel.

Analyzing Temporal Dependencies between Industry and Science Publications

Patent documents and scientific articles often talk about the same problems but use different wording. By jointly analyzing both genres, we want to investigate the dynamics in different areas: Which domains are driven by research and which by application? For a new invention, do patents or research papers appear first? In this master's thesis, we will investigate various techniques, such as topic models and word embeddings, to identify similar topics in the two genres. As a second step, we will analyze the temporal behavior of these topics in order to predict future trends in science and/or industry.

This is part of the Cross-collection analysis project. For more information please contact Dr. Ralf Krestel.

Automatic Classification of Twitter Conversations

Social media contain large amounts of textual data, which, however, are available only in unstructured form. A platform such as Twitter hosts political discussions, product reviews and recommendations, news, arguments about social developments, and so on, but also trivial everyday chatter, spam, advertising, and trivial bot-generated information (e.g., radio programs, water levels). While the first set of tweets and conversations offers a potentially interesting data basis, which is increasingly becoming the foundation of various computational-linguistic as well as sociological, political, and applied studies, the second set is rather a nuisance that (at least for most purposes) must be deliberately filtered out. It is often not clear in advance what proportions of which kinds of communication and interaction take place on a platform, or what kind of message an individual tweet represents. For many tasks it is also important whether the tweet is embedded in a "conversation", which arises through the reply-to relation between tweets (and thus between users) and which can form a rather complex tree structure.
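The tree structure that arises from the reply-to relation can be reconstructed with a few lines of Python. The sketch below assumes each tweet is a dictionary with the hypothetical fields `id` and `reply_to` (`None` for a tweet that starts a conversation). These names are illustrative and do not reflect any particular Twitter data format.

```python
from collections import defaultdict


def build_conversations(tweets):
    """Group tweets into reply trees.

    Returns the list of root tweet ids and a mapping from each tweet id
    to the ids of its direct replies. A reply whose parent is missing
    from the data is treated as a root of its own.
    """
    known_ids = {t["id"] for t in tweets}
    children = defaultdict(list)
    roots = []
    for t in tweets:
        parent = t["reply_to"]
        if parent is None or parent not in known_ids:
            roots.append(t["id"])
        else:
            children[parent].append(t["id"])
    return roots, dict(children)


def depth(node, children):
    """Number of tweets on the longest reply chain starting at node."""
    return 1 + max((depth(c, children) for c in children.get(node, [])), default=0)


# Invented sample data; tweet 5 replies to a tweet that was not collected.
tweets = [
    {"id": 1, "reply_to": None},
    {"id": 2, "reply_to": 1},
    {"id": 3, "reply_to": 1},
    {"id": 4, "reply_to": 3},
    {"id": 5, "reply_to": 99},
]
roots, children = build_conversations(tweets)
print(roots, children, depth(1, children))
```

Simple structural features such as depth or the number of participants could then serve as input for classifying whole conversations rather than single tweets.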

Further details
Contact: Tatjana Scheffler, Ph.D., tatjana.scheffler(at)uni-potsdam.de (Computational Linguistics)
Supervisor at HPI: Felix Naumann