Our group includes postdocs, PhD students, and student assistants, and is headed by Prof. Felix Naumann. If you are interested in joining our team, please contact Felix Naumann.

For bachelor students we offer German lectures on database systems in addition to paper- or project-oriented seminars. Within a one-year bachelor project, students finalize their studies in cooperation with external partners. For master students we offer courses on information integration, data profiling, and information retrieval, complemented by specialized seminars and master projects. We also advise master theses.

Most of our research is conducted in the context of larger research projects, in collaboration across students, across groups, and across universities. We strive to make most of our datasets and source code available.

Please do not hesitate to reach out to us directly if you cannot find a paper, slides, or other research artifacts.

On this web page you can find some of my ideas for master theses. If you have another interesting topic in mind, I am open to suggestions.

Duplicate Detection on GPUs

Experimental study of similarity measures on CPUs and GPUs

Duplicate detection is a crucial part of data cleansing, because duplicate entries cause a number of issues in data analytics and business operations. The pipeline above is a typical process flow used to tackle this issue. Its second and third steps require record pair comparisons, which rely on similarity measures such as Levenshtein and Jaro-Winkler. In this thesis we will implement or imitate such measures on the GPU and systematically evaluate the advantages of migrating from the CPU to the GPU.
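To make the record pair comparison concrete, here is a minimal CPU-side sketch of a Levenshtein-based similarity in plain Python. It is only an illustration, not the implementation planned for the thesis: the function names, the normalization by the longer string, and the field-wise averaging in record_similarity are all assumptions made for this example.

```python
def levenshtein(a: str, b: str) -> int:
    """Edit distance between a and b, computed with two rolling rows."""
    if len(a) < len(b):
        a, b = b, a  # keep the inner row as short as possible
    previous = list(range(len(b) + 1))
    for i, ch_a in enumerate(a, start=1):
        current = [i]
        for j, ch_b in enumerate(b, start=1):
            cost = 0 if ch_a == ch_b else 1
            current.append(min(
                previous[j] + 1,         # deletion
                current[j - 1] + 1,      # insertion
                previous[j - 1] + cost,  # substitution or match
            ))
        previous = current
    return previous[-1]


def levenshtein_similarity(a: str, b: str) -> float:
    """Map the distance to [0, 1]; normalizing by the longer string is an assumption."""
    longest = max(len(a), len(b))
    if longest == 0:
        return 1.0
    return 1.0 - levenshtein(a, b) / longest


def record_similarity(r1: tuple, r2: tuple) -> float:
    """Hypothetical record comparison: unweighted mean of field similarities."""
    if len(r1) != len(r2):
        raise ValueError("records must have the same number of fields")
    if not r1:
        return 1.0
    scores = [levenshtein_similarity(x, y) for x, y in zip(r1, r2)]
    return sum(scores) / len(scores)
```

A pair of records would count as a duplicate candidate when record_similarity exceeds some threshold. Because each pair is scored independently of all others, this step is a natural candidate for parallel execution.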

A record can be represented as a vector of string or numerical values, as the examples in the tables below show. Numerical values are better suited to GPUs, since vector comparisons on a GPU are very fast, potentially orders of magnitude faster than on a CPU. We therefore want to examine the benefits of using such vectors, built from manually crafted features, in comparison with the similarity measures.
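The vector-based alternative can be sketched in the same way. The features below (letter counts, a digit count, and the string length) and the choice of cosine similarity are assumptions picked purely for illustration; the thesis would have to design and evaluate its own features. The sketch runs on the CPU in plain Python, and on a GPU the pairwise loop would presumably be replaced by a batched matrix operation.

```python
import math
import string

ALPHABET = string.ascii_lowercase


def featurize(value: str) -> list:
    """Hypothetical hand-crafted features: 26 letter counts, digit count, length."""
    text = value.lower()
    counts = [float(text.count(ch)) for ch in ALPHABET]
    digits = float(sum(ch.isdigit() for ch in text))
    return counts + [digits, float(len(text))]


def cosine_similarity(u: list, v: list) -> float:
    """Cosine of the angle between u and v; 0.0 if either vector is all zeros."""
    dot = sum(x * y for x, y in zip(u, v))
    norm_u = math.sqrt(sum(x * x for x in u))
    norm_v = math.sqrt(sum(y * y for y in v))
    if norm_u == 0.0 or norm_v == 0.0:
        return 0.0
    return dot / (norm_u * norm_v)


def pairwise_similarities(vectors: list) -> list:
    """All-pairs similarity matrix; each entry is independent of the others."""
    n = len(vectors)
    return [[cosine_similarity(vectors[i], vectors[j]) for j in range(n)]
            for i in range(n)]
```

Unlike the edit distance, this comparison uses only fixed-length arithmetic on numbers, which is the property that makes it attractive for GPU execution. Whether it detects duplicates as reliably as the string measures is exactly the kind of question the experimental study would need to answer.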

Natural Language Processing for Patent Retrieval

In collaboration with Julian Risch

You can find the thesis specification in Julian's master theses web page.

Chair

Prof. Dr. Felix Naumann

Information Systems

E-Mail: felix.naumann(at)hpi.de

Assistant: Diana Stephan

Office: Campus II, House F, F-2.01
Tel.: +49 (0)331 5509-280
Fax: +49 (0)331 5509-287
E-Mail: office-naumann(at)hpi.de

To visit us, please see these directions.

Project highlights

Metanome: Big Data Profiling

Data Preparation

Janus: Change exploration

KITQAR: AI and Data Quality

News

06.10.2024 | Paper accepted at EDBT 2025

06.09.2024 | Congratulations Dr. Phillip Wenig

06.09.2024 | Congratulations Dr. Mazhar Hameed!

16.07.2024 | Congratulations Dr. Leon Bornemann-Paulus!

23.05.2024 | Paper accepted at NLDB 2024
