Hasso-Plattner-Institut
Prof. Dr. Felix Naumann

Open theses

The information systems group is always looking for good master's students to write master's theses. A thesis can be in one of the following broad areas:

  • Duplicate Detection
  • Data Profiling
  • Linked Data
  • Text Mining
  • Recommender Systems
  • Information Retrieval
  • Natural Language Processing

Please note that the list below is only a small sample of possible thesis topics and ideas. Contact us to discuss them further, to find new topics, or to suggest a topic of your own.

For more information about writing a master's thesis in our group, please see here.


Database Group

Inferring Regular Expressions for Database Columns

  • A regular expression (regex) is an elegant, standard, and informative description of the content of a database column to support various analytical tasks.
  • Building a high-quality expression is a difficult and tedious process that needs human involvement in most cases.
  • Automatic learning and inference of such expressions is both a relevant and a challenging task.

A regular expression is a formal pattern, built from literal characters and meta-symbols, that describes a set of data values. A large class of data-driven tasks relies on regular expressions, such as information retrieval, data validation, entity extraction, and DNA string matching. Under the umbrella of data profiling, regular expressions that describe the columns of a table are very interesting metadata. This metadata has many applications, such as (1) data format transformation and data preparation, (2) instance-based schema matching, (3) data validation and cleansing by enforcing the most frequent patterns in a column, and (4) data integration from different sources.

An optimal regex should be neither too specific nor too generic. For example, both ^\d{4}$ and ^(19|20)\d{2}$ match a column of years in the 20th or 21st century. But which one is optimal, and for which use case?
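
As a rough illustration of this trade-off, candidate regexes could be compared by the fraction of a column's values they accept, with a penalty for patterns that would accept much else besides. The following is a minimal sketch under that assumption, not the algorithm to be developed in the thesis:

    import re

    def match_ratio(pattern: str, values: list[str]) -> float:
        """Fraction of column values fully matched by the candidate regex."""
        compiled = re.compile(pattern)
        return sum(bool(compiled.fullmatch(v)) for v in values) / len(values)

    column = ["1999", "2003", "2017", "1987"]
    for candidate in [r"\d{4}", r"(19|20)\d{2}"]:
        print(candidate, match_ratio(candidate, column))  # both print 1.0

    # Both candidates cover the column, but only the generic \d{4} also
    # accepts implausible values such as "0042"; a specificity measure
    # is needed to break such ties.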

The main goal of this thesis is to design a scalable, precise, and fully automated algorithm to learn and infer an optimal regular expression for each column of a table. This includes the development of several use cases and corresponding quality measures to determine the quality of the results.

For more information please contact Prof. Felix Naumann or Hazar Harmouch.

Wikipedia Table Layout Detection and Standardization

Web tables are an extensive source of information. For instance, on Wikipedia, tables serve as a concise way of presenting data on various subjects. However, those tables are designed to be human-readable and hence pose a number of challenges to algorithmic interpretation.
In our current project, we want to explore changes in web tables. In order to distinguish between data changes and layout changes in those tables, it is essential to classify the table layouts and transform the data into a standardized format. The aim of this thesis is to classify table layout types according to a well-defined taxonomy, segment tables if necessary, and transform each table's content into a standardized, relational format. In contrast to related work, we do not want to consider singular snapshots of tables; instead, the whole history of a table shall serve as an additional input.

Tasks:

  • Establish a taxonomy of table layouts (with the help of related work)
  • Create or find a gold standard (ground truth)
  • Design, implement, and evaluate a web table layout classifier

    • With the help of a table's history

  • Transform web tables to a standardized relational format
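
Related work on web tables commonly distinguishes, for example, horizontal listings (one entity per row) from attribute-value tables (one entity per table). The toy heuristic below illustrates the kind of structural signal such a classifier could use; it is only an assumed starting point, not the taxonomy to be established in this thesis:

    def looks_like_attribute_value(table: list[list[str]]) -> bool:
        """Toy heuristic: attribute-value tables often have exactly two
        columns and non-repeating labels in the first column."""
        if not table or any(len(row) != 2 for row in table):
            return False
        labels = [row[0] for row in table]
        return len(set(labels)) == len(labels)

    # e.g., an infobox-like table with one attribute-value pair per row
    print(looks_like_attribute_value([["Country", "Germany"], ["Capital", "Berlin"]]))  # True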

For more information please contact Prof. Felix Naumann, Tobias Bleifuß or Leon Bornemann.

Discovering Temporal Inclusion Dependencies in Changing Webtables

An inclusion dependency between two attribute sets X1,...,Xn and Y1,...,Yn of two relations R and S is defined as follows: for each tuple of values appearing in columns X1,...,Xn of relation R, there must be an equal tuple in columns Y1,...,Yn of relation S. In short, this is denoted as R[X1,...,Xn] ⊆ S[Y1,...,Yn].
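
As a naive baseline, a single candidate can be checked with set containment. The sketch below assumes rows represented as dictionaries; it materializes both sides in memory and is far from the scalable discovery algorithms targeted in this thesis:

    def ind_holds(r_rows, s_rows, x_cols, y_cols):
        """Naive check of R[X1,...,Xn] ⊆ S[Y1,...,Yn] on rows given as dicts."""
        lhs = {tuple(row[c] for c in x_cols) for row in r_rows}
        rhs = {tuple(row[c] for c in y_cols) for row in s_rows}
        return lhs <= rhs

    r = [{"city_id": 1}, {"city_id": 2}]
    s = [{"id": 1}, {"id": 2}, {"id": 3}]
    print(ind_holds(r, s, ["city_id"], ["id"]))  # True: every city_id appears as an id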

Inclusion dependencies are an important topic in the research area of data profiling, which aims to extract metadata from tables. In particular, inclusion dependencies are a prerequisite for foreign keys and thus a starting point for detecting meaningful relationships between tables. Existing work has suggested algorithms for the discovery of inclusion dependencies in databases [1] and webtables [2].

While the discovery of inclusion dependencies has been researched for static databases or webtables, the discovery of inclusion dependencies in historical data remains an open problem. In particular, it should be discussed how the definition of inclusion dependencies for database snapshots (as explained in the first paragraph) should be generalized to table histories and whether the development of the database over time can be used to reduce the search space.

Tasks in this thesis include:

  • Define temporal inclusion dependencies as candidates for foreign keys between two dynamic tables.

  • Design a scalable, distributed algorithm for the discovery of temporal inclusion dependencies and implement it in Apache Spark.

  • Create a gold standard on webtables.

  • Evaluate the algorithm on real-world datasets, such as all relational tables in the history of Wikipedia (the dataset is already available).

For more information please contact Prof. Felix Naumann, Tobias Bleifuß or Leon Bornemann.

[1] Papenbrock, Thorsten, et al. "Divide & conquer-based inclusion dependency discovery." Proceedings of the VLDB Endowment 8.7 (2015): 774-785.

[2] Tschirschnitz, Fabian, Thorsten Papenbrock, and Felix Naumann. "Detecting inclusion dependencies on very many tables." ACM Transactions on Database Systems (TODS) 42.3 (2017): 18.

Further related work on adapting dependencies to temporal databases:

[3] Jensen, Christian S., Richard T. Snodgrass, and Michael D. Soo. "Extending existing dependency theory to temporal databases." IEEE Transactions on Knowledge and Data Engineering 8.4 (1996): 563-582.

[4] Artale, Alessandro, and Enrico Franconi. "Temporal ER modeling with description logics." International Conference on Conceptual Modeling. Springer, Berlin, Heidelberg, 1999.

Design and Implementation of a Data Change Query Language

Data change, all the time. This undeniable fact has motivated the development of database management systems in the first place. While they are good at recording this change, and while much technology has emerged to retrieve and analyze the resulting data, there is not yet much research that explores and seeks to understand the nature of such change, i.e., the behavior of a dataset over time. A large research undertaking in our group is dedicated to developing models, methods, and systems to systematically explore and analyze such changes [3].

The goal of this thesis is to develop a query language for changes in data. One of its key requirements is to support complex queries that involve (temporal) patterns. For example, we want to answer queries like:

  • “For what percentage of cities that underwent a change of government between 1990 and 2000, was there a reversal of that change within the next decade?”

  • “Which countries recorded population growth for at least 10 years and then shrank by at least the same amount within 5 years?”

  • “How often do employees get salary increases compared to the average number of salary increases of all people who have ever been their supervisor? And how large are these increases?”

The task is not only to design the language and demonstrate its usefulness, but also to develop a prototypical implementation. It could draw inspiration from related temporal query languages, such as TSQL2 [1] or T-SPARQL [2], but should distinguish itself from them by focusing on change rather than on temporal data. The focus of the new language should be on state transitions, not on states at certain times in the past, which is why these state changes should be the atomic units of its result sets.
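
To make the first example query concrete, the sketch below computes change reversals imperatively, assuming changes are recorded as simple (entity, property, old value, new value, year) tuples; the envisioned language should express such state-transition patterns declaratively:

    from collections import defaultdict

    # Assumed change model: one record per state transition.
    changes = [
        ("Bonn", "government", "A", "B", 1994),
        ("Bonn", "government", "B", "A", 1999),  # reversal within a decade
    ]

    def reversals(changes, within_years=10):
        """Find later changes that undo an earlier change of the same property."""
        by_key = defaultdict(list)
        for entity, prop, old, new, year in changes:
            by_key[(entity, prop)].append((year, old, new))
        hits = []
        for key, recs in by_key.items():
            recs.sort()
            for i, (y1, old1, new1) in enumerate(recs):
                for y2, old2, new2 in recs[i + 1:]:
                    if (old2, new2) == (new1, old1) and y2 - y1 <= within_years:
                        hits.append((key, y1, y2))
        return hits

    print(reversals(changes))  # [(('Bonn', 'government'), 1994, 1999)]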

[1] Snodgrass, Richard T., ed. The TSQL2 temporal query language. Vol. 330. Springer Science & Business Media, 2012.

[2] Grandi, Fabio. "T-SPARQL: a TSQL2-like temporal query language for RDF." ADBIS (Local Proceedings) (2010): 21-30.

[3] https://hpi.de/naumann/projects/data-profiling-and-analytics/data-change-exploration.html

For more information please contact Prof. Felix Naumann or Tobias Bleifuß.

Relationship Extraction for Lazy Data Scientists

Machine learning and deep learning are increasingly seen as silver bullets for many difficult knowledge-mining tasks. Stanford's InfoLab has proposed and provides the novel Snorkel platform (http://hazyresearch.github.io/snorkel/), which alleviates many of the difficult preparatory tasks involved in applying learning methods to real-world problems. In particular, it significantly reduces the painful work of creating training data by reducing the task to writing a few simple labeling functions in lieu of manually labeling actual data.

We propose to apply this technique to the notoriously difficult problems of named entity recognition (NER) and relationship extraction (RE). One particular goal is to find out how few and how simple the labeling functions can be.
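
For illustration, a labeling function for a hypothetical "spouse" relation could be as simple as the sketch below; the decorator and the assumed x.text attribute follow the Snorkel 0.9 API, which may differ from the version linked above:

    from snorkel.labeling import labeling_function

    SPOUSE, ABSTAIN = 1, -1

    @labeling_function()
    def lf_married_keyword(x):
        # Weakly label a candidate as expressing a spouse relation
        # if the sentence contains an obvious lexical cue.
        return SPOUSE if "married" in x.text.lower() else ABSTAIN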

For more information please contact Prof. Felix Naumann or Michael Loster.

Duplicate Detection on GPUs

Duplicate detection is a crucial part of data cleansing, as duplicate entries cause a number of issues in data analytics and business operations. A typical process flow includes (1) normalizing the data, (2) retrieving candidates, (3) comparing candidates, (4) classification, and (5) evaluation. Steps (2) and (3) require record-pair comparisons, which use similarity measures such as Levenshtein and Jaro-Winkler. In this thesis, we will implement or approximate such measures in a GPU environment and systematically evaluate the advantages of migrating from the CPU to its graphical equivalent.

A record can be represented as a vector of string or numerical values. Numerical values are more suitable for GPUs, since vector comparisons on a GPU are very fast, orders of magnitude faster than on a CPU. Therefore, we want to examine the benefits of using such vectors, with manually crafted features, in comparison with the string similarity measures.
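
As a starting point, all-pairs comparison of numerical feature vectors can be written once against the NumPy API and, assuming CuPy is installed, moved to the GPU by swapping the import. This is a minimal sketch, not the evaluation setup of the thesis:

    import numpy as np  # assumption: "import cupy as np" runs the same code on a GPU

    def pairwise_cosine(a, b):
        """All-pairs cosine similarity between two record-feature matrices."""
        a = a / np.linalg.norm(a, axis=1, keepdims=True)
        b = b / np.linalg.norm(b, axis=1, keepdims=True)
        return a @ b.T

    records_a = np.random.rand(1000, 32)  # 1,000 records with 32 hand-crafted features
    records_b = np.random.rand(1000, 32)
    similarities = pairwise_cosine(records_a, records_b)  # shape (1000, 1000)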

For more information please contact Prof. Felix Naumann or Ioannis Koumarelas.


Web Science Group

Information Systems of the Future - The Chatbot as Smart Assistant

When problems occur during the operation of complex systems, a specialist must (usually) perform a remote diagnosis and, ideally, have solutions at hand. To determine whether an issue is in fact an error, and to be able to fix it, the maintenance specialist depends on various pieces of information. Some of this information comes from customers or staff on site and must be requested explicitly. In addition, information from internal systems must be gathered and, if necessary, further experts (engineers, mechanics, designers) must be contacted. Many of these steps run via emails and manual searches in databases, documentation, manuals, etc.

The goal of this thesis is to develop a chatbot that serves as a point of contact for error reports, bundles the communication about specific error messages, and, in a first stage of development, provides information from a domain-specific database on request. The provision of annotated visualizations would also be conceivable, for example to highlight individual components in a construction drawing.

For more information please contact Dr. Ralf Krestel.

Business Communication Analysis

Today's business communication is almost unimaginable without emails. They document discussions and decisions or summarize face-to-face meetings in the form of unstructured text or attachments, and thus hold a significant amount of information about a business. In exceptional cases, for example when investigating a known case of fraud, specialists examine the inboxes and attached files of the personnel involved to determine the extent of the situation. However, the sheer quantity of data is unmanageable without some guidance by an exploration tool. In this project, we develop and evaluate methods and combine them in a novel exploration tool. This work touches the fields of text mining, text summarization, document classification, topic modeling, named entity extraction, entity linking, relationship extraction, as well as social network and graph analysis. We work together with our industry partner from the financial sector to put our prototypes in the hands of auditors for real-world feedback.

For more detailed ideas see this page or contact Tim Repke.

Comment Analysis

The comment sections of online newspapers are an important space for engaging in political discussion and exchanging opinions. These discussion forums have to be moderated due to misuse by spammers, haters, trolls, and propaganda campaigns. This moderation process is very expensive, and many online news providers have discontinued their comment sections. With more and more political campaigning, or even agitation, being distributed over the internet, serious and safe platforms to discuss political topics are increasingly important.

In this master's thesis, you will analyze comments, users, and articles to understand the dynamics, the information flow, and the interactions in comment sections. You will work on a subtask of comment analysis, such as detecting inappropriate comments, predicting popular news topics, or recommending information. Our recent seminar Text Mining in Practice covered a selection of possible tasks.

For more information please contact Julian Risch.

Computational Linguistics

If you are more interested in the linguistic side of NLP and text mining, you can have a look at the open thesis topics of our colleagues at the Linguistics Department of the University of Potsdam.

Feel free to talk to us about the possibility of co-supervising a thesis in the context of computational linguistics. For more information please contact Dr. Ralf Krestel.