Prof. Dr. Felix Naumann

Open theses

The information systems group is always looking for good master's students to write their theses with us. The theses can be in one of the following broad areas:

  • Duplicate Detection
  • Data Profiling
  • Linked Data
  • Text Mining
  • Recommender Systems
  • Information Retrieval
  • Natural Language Processing

Please note that the list below is only a small sample of possible thesis topics and ideas. Please contact us to discuss further, to find new topics, or to suggest a topic of your own.

For more information about writing a master's thesis in our group, please see here.

Database Group

Wikipedia Table Layout Detection and Standardization

Web tables are an extensive source of information. On Wikipedia, for instance, tables serve as a concise way of presenting data on various subjects. However, those tables are designed to be human-readable and hence pose a number of challenges to algorithmic interpretation.
In our current project, we want to explore changes in web tables. In order to distinguish between data changes and layout changes in those tables, it is essential to classify the table layouts and transform the data into a standardized format. The aim of this thesis is to classify table layout types according to a well-defined taxonomy, segment tables if necessary, and transform the tables' content into a standardized, relational format. In contrast to related work, we do not want to consider only singular snapshots of tables; rather, the whole history of a table shall serve as an additional input.


  • Establish a taxonomy of table layouts (with the help of related work)
  • Create / find a gold standard (ground truth)
  • Design, implement, and evaluate a web table classification

    • With the help of a table's history

  • Transform web tables to a standardized relational format
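As a first impression of the final task above, the transformation step can be sketched for the simplest layout type (a hypothetical example with made-up data; real Wikipedia tables require layout classification and segmentation first):

```python
# Minimal sketch: transform a simple horizontal web table (first row =
# header) into relational tuples. This only covers the very last step of
# the pipeline, for the easiest layout type.

def table_to_relation(rows):
    """Interpret the first row as attribute names and the rest as tuples."""
    header, *body = rows
    return [dict(zip(header, row)) for row in body]

wiki_table = [
    ["City", "Country", "Population"],
    ["Berlin", "Germany", "3645000"],
    ["Potsdam", "Germany", "178000"],
]

relation = table_to_relation(wiki_table)
# relation[0] == {"City": "Berlin", "Country": "Germany", "Population": "3645000"}
```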

For more information please contact Prof. Felix Naumann, Tobias Bleifuß or Leon Bornemann.

Discovering Temporal Inclusion Dependencies in Changing Webtables

An inclusion dependency between two attribute sets X1,...,Xn and Y1,...,Yn of two relations R and S is defined as follows: for each tuple of values appearing in columns X1,...,Xn of relation R, there must be an equivalent tuple in columns Y1,...,Yn of relation S. In short, this is denoted as R[X1,...,Xn] ⊆ S[Y1,...,Yn].
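The definition can be sketched directly in code (a minimal illustration with made-up data, not a discovery algorithm):

```python
# Check whether the inclusion dependency R[X] ⊆ S[Y] holds, where X and Y
# are equally long lists of column indices into the tuples of R and S.

def ind_holds(r_tuples, x_cols, s_tuples, y_cols):
    projected_s = {tuple(t[i] for i in y_cols) for t in s_tuples}
    return all(tuple(t[i] for i in x_cols) in projected_s for t in r_tuples)

# R(city, country): every country value must appear as a name in S
R = [("Berlin", "Germany"), ("Paris", "France")]
# S(name, continent)
S = [("Germany", "Europe"), ("France", "Europe"), ("Japan", "Asia")]

print(ind_holds(R, [1], S, [0]))  # R[country] ⊆ S[name] holds: True
print(ind_holds(S, [0], R, [1]))  # the reverse does not hold: False
```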

Inclusion dependencies are an important topic in the research area of data profiling, which aims to extract new metadata for tables. In particular, inclusion dependencies are a prerequisite for foreign keys and thus are a starting point to detect meaningful relationships between tables. Existing work has suggested algorithms for the discovery of inclusion dependencies in databases [1] and webtables [2].

While the discovery of inclusion dependencies has been researched for static databases or webtables, the discovery of inclusion dependencies in historical data remains an open problem. In particular, it should be discussed how the definition of inclusion dependencies for database snapshots (as explained in the first paragraph) should be generalized to table histories and whether the development of the database over time can be used to reduce the search space.

Tasks in this thesis include:

  • Define temporal inclusion dependencies as candidates for foreign keys between two dynamic tables.

  • Design of a scalable, distributed algorithm for the discovery of inclusion dependencies and implementation of the algorithm in Apache Spark

  • Creation of a gold standard on webtables

  • Evaluation on real-world datasets, such as all relational tables in the History of Wikipedia (the dataset is already available)
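One conceivable generalization, purely as an illustration and not the definition to be developed in this thesis, is to require the snapshot IND to hold at every point in the tables' history:

```python
# Candidate notion of a temporal IND (an assumption for illustration):
# the snapshot inclusion dependency must hold in every version.

def snapshot_ind(r_tuples, x_cols, s_tuples, y_cols):
    projected_s = {tuple(t[i] for i in y_cols) for t in s_tuples}
    return all(tuple(t[i] for i in x_cols) in projected_s for t in r_tuples)

def temporal_ind(r_history, x_cols, s_history, y_cols):
    """r_history / s_history: time-aligned lists of table snapshots."""
    return all(snapshot_ind(r, x_cols, s, y_cols)
               for r, s in zip(r_history, s_history))

# Two snapshots of each table; the IND R[0] ⊆ S[0] holds in both.
R_hist = [[("Berlin",)], [("Berlin",), ("Paris",)]]
S_hist = [[("Berlin",), ("Rome",)], [("Berlin",), ("Paris",)]]
print(temporal_ind(R_hist, [0], S_hist, [0]))  # True
```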

For more information please contact Prof. Felix Naumann, Tobias Bleifuß or Leon Bornemann.

[1] Papenbrock, Thorsten, et al. "Divide & conquer-based inclusion dependency discovery." Proceedings of the VLDB Endowment 8.7 (2015): 774-785.

[2] Tschirschnitz, Fabian, Thorsten Papenbrock, and Felix Naumann. "Detecting inclusion dependencies on very many tables." ACM Transactions on Database Systems (TODS) 42.3 (2017): 18.

Further related work on adapting dependencies to temporal databases:

[3] Jensen, Christian S., Richard T. Snodgrass, and Michael D. Soo. "Extending existing dependency theory to temporal databases." IEEE Transactions on Knowledge and Data Engineering 8.4 (1996): 563-582.

[4] Artale, Alessandro, and Enrico Franconi. "Temporal ER modeling with description logics." International Conference on Conceptual Modeling. Springer, Berlin, Heidelberg, 1999.

Design and Implementation of a Data Change Query Language

Data change all the time. This undeniable fact motivated the development of database management systems in the first place. While they are good at recording this change, and while much technology has emerged to retrieve and analyze the data, there is as yet little research that explores and seeks to understand the nature of such change, i.e., the behavior of a dataset over time. A large research undertaking in our group is dedicated to developing models, methods, and systems to systematically explore and analyze such changes [3].

The goal of this thesis is to develop a query language for changes in data. One of its key requirements is to support complex queries that involve (temporal) patterns. For example, we want to answer queries like:

  • “For what percentage of cities that underwent a change of government between 1990 and 2000, was there a reversal of that change within the next decade?”

  • “Which countries recorded population growth for at least 10 years and then shrank by at least the same amount within 5 years?”

  • “How often do employees get salary increases compared to the average number of salary increases of all people who have ever been their supervisors? And how large are these increases?”

The task is not only to design the language and demonstrate its usefulness, but also to develop a prototypical implementation. It could draw inspiration from related temporal query languages such as TSQL2 [1] or T-SPARQL [2], but should be distinguished from them by focusing on change rather than on temporal data. The new language should center on state transitions rather than on states at certain times in the past, which is why these state changes should be the atomic units of its result sets.
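To give a flavor of what such a language would operate on, the following sketch treats individual state transitions as atomic units and evaluates a reversal pattern similar to the first example query (all names and data are illustrative, not part of the actual project):

```python
# Each change is an atomic record (entity, property, old value, new value,
# year); a "reversal" is a later change that undoes an earlier one.
from collections import namedtuple

Change = namedtuple("Change", "entity prop old new year")

def reversals(changes, within_years):
    """Pairs of changes where the second undoes the first within the window."""
    out = []
    for c1 in changes:
        for c2 in changes:
            if (c1.entity == c2.entity and c1.prop == c2.prop
                    and c2.old == c1.new and c2.new == c1.old
                    and 0 < c2.year - c1.year <= within_years):
                out.append((c1, c2))
    return out

log = [
    Change("CityA", "government", "X", "Y", 1994),
    Change("CityA", "government", "Y", "X", 2001),
    Change("CityB", "government", "X", "Y", 1996),
]
print(len(reversals(log, 10)))  # 1: only CityA reverses within a decade
```

A declarative query language would express such patterns concisely instead of requiring hand-written loops.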

[1] Snodgrass, Richard T., ed. The TSQL2 temporal query language. Vol. 330. Springer Science & Business Media, 2012.

[2] Fabio Grandi. T-SPARQL: A TSQL2-like Temporal Query Language for RDF. In ADBIS (Local Proceedings), pp. 21-30. 2010.

[3] https://hpi.de/naumann/projects/data-profiling-and-analytics/data-change-exploration.html

For more information please contact Prof. Felix Naumann or Tobias Bleifuß.

Relationship Extraction for Lazy Data Scientists

Machine learning and deep learning are increasingly seen as silver bullets for many difficult knowledge mining tasks. Stanford’s InfoLab has proposed and provides the novel Snorkel platform (http://hazyresearch.github.io/snorkel/), which alleviates many of the difficult preparatory tasks involved in applying learning methods to real-world problems. In particular, it significantly reduces the painful work of creating training data by reducing the task to writing a few simple labeling functions in lieu of manually labeling the data.

We propose to apply this technique to the notoriously difficult problems of named entity recognition (NER) and relationship extraction (RE). One particular goal is to find out how few and how simple the labeling functions can be.
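The flavor of the approach can be illustrated in plain Python; this is not Snorkel's actual API, merely a hypothetical sketch of labeling functions voting on relationship candidates:

```python
# Each labeling function votes POSITIVE / NEGATIVE / ABSTAIN on a candidate
# (person, organization) "works-for" pair, given the sentence it occurs in.
ABSTAIN, NEGATIVE, POSITIVE = -1, 0, 1

def lf_employment_keyword(sentence, person, org):
    keywords = ("works at", "employee of")
    return POSITIVE if any(k in sentence for k in keywords) else ABSTAIN

def lf_left_company(sentence, person, org):
    return NEGATIVE if "left" in sentence or "quit" in sentence else ABSTAIN

def majority_vote(sentence, person, org, lfs):
    """Combine the noisy votes; Snorkel instead learns LF accuracies."""
    votes = [lf(sentence, person, org) for lf in lfs]
    pos, neg = votes.count(POSITIVE), votes.count(NEGATIVE)
    if pos == neg:
        return ABSTAIN
    return POSITIVE if pos > neg else NEGATIVE

lfs = [lf_employment_keyword, lf_left_company]
print(majority_vote("Alice works at Acme.", "Alice", "Acme", lfs))  # 1
```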

For more information please contact Prof. Felix Naumann or Michael Loster.

Web Science Group

Business Communication Analysis

Today's business communication is almost unimaginable without emails. They document discussions and decisions or summarise face-to-face meetings in the form of unstructured text or attachments, and thus hold a significant amount of information about a business. In very exceptional cases, for example when investigating a known case of fraud, specialists examine the inboxes and attached files of involved personnel to determine the extent of the situation. However, the sheer quantity of data is unmanageable without some guidance by an exploration tool. In this project, we develop and evaluate methods and combine them in a novel exploration tool. This work touches the fields of text mining, text summarisation, document classification, topic modelling, named entity extraction, entity linking, relationship extraction, as well as social network and graph analysis. We work together with our industry partner from the financial sector to put our prototypes in the hands of auditors for real-world feedback.

For more detailed ideas see this page or contact Tim Repke.

Natural Language Processing for Patent Retrieval

Granted patents form an extensive knowledge base for information retrieval, an interesting research field for both academia and industry. Domain-specific terminology is especially challenging for state-of-the-art approaches. Therefore, this master's thesis focuses on document representations that are able to capture a patent's topics. These representations are the basis for a patent retrieval algorithm.

In this master's thesis, you will jointly mine the topical and the spatial aspects of a dataset of 5 million patents in order to improve current retrieval models. For example, an inventor's address can be geocoded to the actual geolocation so that regional patterns can be found. Besides regional patterns, you will analyse patent topics with regard to changes over time. To this end, you will deal with topic modeling, document embeddings, and geocoding.
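As a toy illustration of similarity-based retrieval, not the document representations to be developed in this thesis, patents can be ranked by cosine similarity of bag-of-words vectors (the data is invented):

```python
# Represent each patent abstract as a bag-of-words vector and rank by
# cosine similarity to the query. Topic models or embeddings would
# replace these sparse count vectors in a real system.
import math
from collections import Counter

def vec(text):
    return Counter(text.lower().split())

def cosine(a, b):
    dot = sum(a[w] * b[w] for w in a)
    norm = (math.sqrt(sum(v * v for v in a.values()))
            * math.sqrt(sum(v * v for v in b.values())))
    return dot / norm if norm else 0.0

patents = {
    "P1": "method for battery charging of electric vehicles",
    "P2": "pharmaceutical composition for treating headaches",
}

query = "electric vehicle battery"
ranked = sorted(patents, key=lambda p: cosine(vec(query), vec(patents[p])),
                reverse=True)
print(ranked[0])  # P1
```

Note that "vehicle" does not match "vehicles" here; capturing such terminology variation is exactly where topic-aware representations are expected to help.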

For more information please contact Julian Risch.

Combining Hierarchical and Labeled Topic Modeling for Patent Classification

Patent documents are traditionally classified using a complex hierarchy of categories. The goal of this Master's thesis is to automatically assign these category labels for new patent applications. This should be achieved by developing a model that takes the labels of the training data into account as well as the hierarchical structure of the categories. There already exist topic models dealing with labels (Labeled-LDA) and topic models dealing with hierarchical topic structure (HLDA). We want to combine both aspects.

This is part of the Topic modeling research project. For more information please contact Dr. Ralf Krestel.

Analyzing Temporal Dependencies between Industry and Science Publications

Patent documents and scientific articles often address the same problems, but using different wording. By jointly analyzing both genres, we want to investigate the dynamics in different areas: Which domains are driven by research and which by application? For a new invention, do patents or research papers appear first? In this Master's thesis, we will investigate various techniques, such as topic models and word embeddings, to identify similar topics in the two genres, and, as a second step, analyze the temporal behavior of these topics in order to predict future trends in science and/or industry.

This is part of the Cross-collection analysis project. For more information please contact Dr. Ralf Krestel.