Prof. Dr. Felix Naumann

Open theses

The information systems group is always looking for good master students to write master's theses. The theses can be in one of the following broad areas:

  • Duplicate Detection
  • Data Profiling
  • Linked Data
  • Text Mining
  • Recommender Systems
  • Information Retrieval
  • Natural Language Processing

Please note that the list below is only a small sample of possible thesis topics and ideas. Please contact us to discuss further, to find new topics, or to suggest a topic of your own.

For more information about writing a master's theses in our group, please see here.

Information Systems

From UCCs to keys

The efficient discovery of unique column combinations (UCCs) is a well-known and much researched problem. Each UCC is a possible key for the relation at hand. The task of this thesis is to extract from the often very large set of UCCs those that in fact represent a key, i.e., that a database administrator would choose. This task is of utmost relevance to real-world profiling tools, as it make profiling results actionable. One approach is the use of heuristics (size of UCC, substrings of columns names, etc.), another might be to choose a set of features and train a machine learner using relations with known keys. 

For more information please contact Prof. Felix Naumann or Thorsten Papenbrock.

Web Science

Combining Hierarchical and Labeled Topic Modeling for Patent Classification

Patent documents are traditionally classified using a complex hierarchy of categories. The goal of this Master's thesis is to automatically assign these category labels for new patent applications. This should be achieved by developing a model that takes the labels of the training data into account as well as the hierarchical structure of the categories. There already exist topic models dealing with labels (Labeled-LDA) and topic models dealing with hierarchical topic structure (HLDA). We want to combine both aspects.

This is part of the Topic modeling research project. For more information please contact Dr. Ralf Krestel.

Analyzing Temporal Dependencies between Industry and Science Publications

Patent documents and scientific articles often talk about the same problems but using different wording. By jointly analyzing both genres we want to investigate the dynamics in different areas: Which domains are driven by research and which by application? Are there first patents for a new invention or research papers? In this Master's thesis we will investigate various techniques such as topic models and word embeddings to identify similar topics in the two genres. And as a second step analyze the temporal behavior of these topics to be able to predict future trends in science and/or industry.

This is part of the Cross-collection analysis project. For more information please contact Dr. Ralf Krestel.

Classifying Business E-mails

Large companies often loose track of the topics discussed in e-mails with customers or between employees. This makes it hard to identify experts for a topic in the company. Further it is difficult in retrospective to identify persons responsible or trace decision processes within larger projects. In this Master's thesis we will use the Enron e-mail dataset to develop a classifier for business communication. The goal is to use state-of-the-art text mining methods to group e-mails into (predefined?) categories.

This is part of the Analyzing business communication project. For more information please contact Dr. Ralf Krestel.


Automatische Klassifizierung von Twitterkonversationen

Soziale Medien enthalten große Mengen von textuellen Daten, die allerdings in unstrukturierter Form vorliegen. So finden sich auf einer Plattform wie Twitter politische Diskussionen, Produktbewertungen und Empfehlungen, Nachrichten, Argumentationen zu sozialen Entwicklungen, usw., aber auch belanglose Alltagsgespräche, Spam, Werbung und triviale bot-generierte Informationen (z.B. Radioprogramm, Pegelstände). Während die erste Menge der Tweets und Konversationen eine potentiell interessante Datenbasis bietet, die zur Zeit mehr und mehr zur Grundlage diverser computerlinguistischer, aber auch soziologischer, politischer und angewandter Studien wird, stellt die zweite Menge eher einen Störfaktor dar, der (jedenfalls für die meisten Zwecke) gezielt ausgesondert werden muss. Es ist oft nicht von vornherein klar, zu welchen Anteilen welche Art von Kommunikation und Interaktion auf einer Plattform stattfindet, und um welcher Art von Nachricht es sich bei einem individuellen Tweet handelt. Für viele Aufgaben ist darüber hinaus bedeutsam, wenn das Tweet in eine „Konversation“ eingebettet ist, wie sie durch die reply-to zwischen Tweets (und damit zwischen Usern) entsteht und die eine durchaus komplexe Baumstruktur ergeben kann.
Weitere Details
Kontakt: Tatjana Scheffler, Ph.D., tatjana.scheffler(at)uni-potsdam.de (Computerlinguistik)
Betreuer am HPI: Felix Naumann