For bachelor students we offer lectures on database systems in German, along with paper- or project-oriented seminars. In a one-year bachelor project, students complete their studies in cooperation with external partners. For master students we offer courses on information integration, data profiling, search engines, and information retrieval, complemented by specialized seminars, master projects, and advised master theses.
Most of our research is conducted in the context of larger research projects, in collaboration across students, groups, and universities. We strive to make most of our data sets and source code publicly available.
I am a Ph.D. student at the Information Systems Research Group, and my research started in collaboration with SAP and SAP Concur. Throughout my Ph.D. I have worked in the general areas of Data Cleaning and Data Preparation, with my main focus on Duplicate Detection.
Hasso-Plattner-Institut für Softwaresystemtechnik Prof.-Dr.-Helmert-Straße 2-3 D-14482 Potsdam Office: F-2.05, Campus II
Duplicate Detection (Record Linkage, Entity Resolution etc.), Data Cleaning, Data Preparation
Parallel and Distributed Systems, Big Data Management
Data Mining, Machine Learning, Deep Learning
Cooperation project with SAP and SAP Concur on Vendor Data Cleaning for hotels. Our main task has been to apply Duplicate Detection, i.e., to identify duplicate records and understand their causes. The approaches we followed mainly rely on data preparation and matching dependencies; more information is available in our publications.
Data errors represent a major issue in most application workflows. Before any important task can take place, a certain level of data quality has to be guaranteed by eliminating the various errors that may appear in the data. Typically, most of these errors are fixed with data preparation methods, such as whitespace removal. However, the particular error of duplicate records, where multiple records refer to the same entity, is usually eliminated independently with specialized techniques. Our work is the first to bring these two areas together by applying data preparation operations in a systematic way prior to performing duplicate detection. Our process workflow can be summarized as follows: it begins with the user providing as input a sample of the gold standard, the actual dataset, and optionally some constraints for domain-specific data preparations, such as address normalization. The preparation selection then operates in two consecutive phases. First, to vastly reduce the search space of ineffective data preparations, decisions are made based on the improvement or worsening of pair similarities. Second, using the remaining data preparations, an iterative leave-one-out classification process removes preparations one by one and determines the redundant ones based on the achieved area under the precision-recall curve (AUC-PR). Using this workflow, we manage to improve the results of duplicate detection by up to 19% in AUC-PR.
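To make the two selection phases more concrete, the following Python sketch illustrates the idea under simplifying assumptions; the names (the preparation functions, pair_similarity, evaluate_auc_pr) are placeholders for illustration, not our actual implementation.

def phase1_prune(preparations, labeled_pairs, pair_similarity):
    # Keep only preparations that, applied in isolation, raise the similarity of
    # duplicate pairs or lower the similarity of non-duplicate pairs.
    kept = []
    for prep in preparations:
        gain = 0.0
        for record_a, record_b, is_duplicate in labeled_pairs:
            before = pair_similarity(record_a, record_b)
            after = pair_similarity(prep(record_a), prep(record_b))
            gain += (after - before) if is_duplicate else (before - after)
        if gain > 0:
            kept.append(prep)
    return kept

def phase2_leave_one_out(preparations, evaluate_auc_pr):
    # Iteratively drop preparations whose removal does not lower the AUC-PR.
    selected = list(preparations)
    improved = True
    while improved and len(selected) > 1:
        improved = False
        baseline = evaluate_auc_pr(selected)
        for prep in selected:
            candidate = [p for p in selected if p is not prep]
            if evaluate_auc_pr(candidate) >= baseline:
                selected = candidate  # prep was redundant
                improved = True
                break
    return selected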
Koumarelas, I., Papenbrock, T., Naumann, F.: MDedup: Duplicate Detection with Matching Dependencies. Proceedings of the VLDB Endowment (PVLDB). 13 (2020).
Duplicate detection is an integral part of data cleaning and serves to identify multiple representations of the same real-world entities in (relational) datasets. Existing duplicate detection approaches are effective, but they are also hard to parameterize or require a lot of pre-labeled training data. Both parameterization and pre-labeling are at least domain-specific, if not dataset-specific, which is a problem when a new dataset needs to be cleaned. For this reason, we propose a novel, rule-based, and fully automatic duplicate detection approach that is based on matching dependencies (MDs). Our system uses automatically discovered MDs, various dataset features, and known gold standards to train a model that selects MDs as duplicate detection rules. Once trained, the model can select useful MDs for duplicate detection on any new dataset. To increase the generally low recall of MD-based data cleaning approaches, we propose an additional boosting step. Our experiments show that this approach reaches up to 94% F-measure and 100% precision on our evaluation datasets, which are good numbers considering that the system requires no domain- or dataset-specific configuration.
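As an illustration of how an MD serves as a duplicate detection rule, the following Python sketch applies a hypothetical MD over the attributes name and city; the similarity function and thresholds are illustrative assumptions, not MDs discovered by MDedup.

from difflib import SequenceMatcher

def string_similarity(a, b):
    # Stand-in for measures such as Jaro-Winkler or Levenshtein similarity.
    return SequenceMatcher(None, a or "", b or "").ratio()

# A matching dependency as a set of (attribute, similarity measure, threshold)
# premises: "if name and city are sufficiently similar, the pair is a duplicate".
example_md = [("name", string_similarity, 0.9), ("city", string_similarity, 0.8)]

def md_matches(md, record_a, record_b):
    return all(measure(record_a.get(attr), record_b.get(attr)) >= threshold
               for attr, measure, threshold in md)

def detect_duplicates(records, mds):
    # Naive pairwise application of a set of selected MDs (no blocking).
    duplicates = []
    for i in range(len(records)):
        for j in range(i + 1, len(records)):
            if any(md_matches(md, records[i], records[j]) for md in mds):
                duplicates.append((i, j))
    return duplicates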
Koumarelas, I., Kroschk, A., Mosley, C., Naumann, F.: Experience: Enhancing Address Matching with Geocoding and Similarity Measure Selection. Journal of Data and Information Quality (JDIQ). 10, 8:1--8:16 (2018).
Given a query record, record matching is the problem of finding database records that represent the same real-world object. In the easiest scenario, a database record is completely identical to the query. In most cases, however, problems do arise, for instance as a result of data errors, data integrated from multiple sources, or data received from restrictive form fields. These problems are usually difficult because they require a variety of actions, including field segmentation, decoding of values, and similarity comparisons, each requiring some domain knowledge. In this article, we study the problem of matching records that contain address information, including attributes such as Street-address and City. To facilitate this matching process, we propose a domain-specific procedure to, first, enrich each record with a more complete representation of the address information through geocoding and reverse geocoding and, second, select the best similarity measure for each address attribute, which ultimately helps the classifier achieve the best F-measure. We report on our experience in selecting geocoding services and discovering similarity measures for a concrete but common industry use case.
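The following Python sketch illustrates the per-attribute similarity measure selection step in a simplified form; the two candidate measures and the best-F-measure selection criterion are assumptions for illustration, not the exact measures evaluated in the article.

from difflib import SequenceMatcher

def ratio_similarity(a, b):
    return SequenceMatcher(None, a, b).ratio()

def token_jaccard(a, b):
    tokens_a, tokens_b = set(a.lower().split()), set(b.lower().split())
    union = tokens_a | tokens_b
    return len(tokens_a & tokens_b) / len(union) if union else 1.0

CANDIDATE_MEASURES = {"ratio": ratio_similarity, "jaccard": token_jaccard}

def best_f_measure(scores, labels):
    # Best F1 over all thresholds induced by the observed similarity scores.
    best = 0.0
    for threshold in sorted(set(scores)):
        tp = sum(1 for s, l in zip(scores, labels) if s >= threshold and l)
        fp = sum(1 for s, l in zip(scores, labels) if s >= threshold and not l)
        fn = sum(1 for s, l in zip(scores, labels) if s < threshold and l)
        if tp:
            precision, recall = tp / (tp + fp), tp / (tp + fn)
            best = max(best, 2 * precision * recall / (precision + recall))
    return best

def select_measure_per_attribute(labeled_pairs, attributes):
    # labeled_pairs: (record_a, record_b, is_duplicate) with records as dicts.
    selection = {}
    for attr in attributes:
        labels = [dup for _, _, dup in labeled_pairs]
        scores_by_measure = {
            name: best_f_measure([measure(a.get(attr, ""), b.get(attr, ""))
                                  for a, b, _ in labeled_pairs], labels)
            for name, measure in CANDIDATE_MEASURES.items()}
        selection[attr] = max(scores_by_measure, key=scores_by_measure.get)
    return selection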
Pietrangelo, A., Simonini, G., Bergamaschi, S., Naumann, F., Koumarelas, I.: Towards Progressive Search-driven Entity Resolution. Italian Symposium on Advanced Database Systems (SEBD) (2018).
Keyword-search systems for databases aim to answer a user query composed of a few terms with a ranked list of records. They are powerful and easy-to-use data exploration tools for a wide range of contexts. For instance, given a product database gathered by scraping e-commerce websites, these systems enable even non-technical users to explore the item set (e.g., to check whether it contains certain products, or to discover the price of an item). However, if the database contains dirty records (i.e., incomplete and duplicated records), a preprocessing step to clean the data is required. One fundamental data cleaning step is Entity Resolution, i.e., the task of identifying and fusing together all the records that refer to the same real-world entity. This task is typically executed on the whole data, independently of: (i) the portion of the entities that a user may indicate through keywords, and (ii) the order priority that a user might express through an order-by clause. This paper describes a first step towards solving the problem of progressive search-driven Entity Resolution: resolving all the entities described by a user through a handful of keywords, progressively (according to an order-by clause). We discuss the features of our method, named SearchER, and showcase some examples of keyword queries on two real-world datasets, obtained with a demonstration prototype that we have built.
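The sketch below illustrates the general idea of progressive, search-driven resolution (keyword filtering followed by entity resolution in order-by priority); it is an illustrative simplification, not the actual SearchER implementation, and match and fuse stand in for arbitrary pairwise matching and record fusion functions.

def keyword_filter(records, keywords):
    # Keep only records whose concatenated field values contain every keyword.
    def matches(record):
        text = " ".join(str(value).lower() for value in record.values())
        return all(keyword.lower() in text for keyword in keywords)
    return [record for record in records if matches(record)]

def progressive_resolve(records, keywords, order_key, match, fuse):
    # Resolve only the records selected by the keywords, visiting them in the
    # priority given by order_key; each entity is (re-)emitted as soon as it grows.
    candidates = sorted(keyword_filter(records, keywords),
                        key=lambda record: record[order_key], reverse=True)
    groups = []
    for record in candidates:
        group = next((g for g in groups if any(match(record, m) for m in g)), None)
        if group is None:
            group = []
            groups.append(group)
        group.append(record)
        yield fuse(group)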
Samiei, A., Koumarelas, I., Loster, M., Naumann, F.: Combination of Rule-based and Textual Similarity Approaches to Match Financial Entities. Data Science for Macro-Modeling with Financial and Economic Datasets (DSMM). ACM (2016).
Record linkage is a well-studied problem with many years of publication history. Nevertheless, many challenges remain to be addressed, such as the topic of the FEIII Challenge 2016. Matching financial entities (FEs) is important for many private and governmental organizations. In this paper, we describe the problem of matching such FEs across three datasets: FFIEC, LEI, and SEC.
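The following Python sketch illustrates the general idea of combining a strict rule-based match with a textual-similarity fallback; the chosen fields (name, address, zip) and the threshold are illustrative assumptions, not the configuration used for the FEIII datasets.

from difflib import SequenceMatcher

def normalize(value):
    # Crude normalization: lowercase, drop punctuation, collapse whitespace.
    cleaned = "".join(c for c in value.lower() if c.isalnum() or c.isspace())
    return " ".join(cleaned.split())

def rule_match(a, b):
    # Strict rule: identical normalized names and identical ZIP codes.
    return normalize(a["name"]) == normalize(b["name"]) and a.get("zip") == b.get("zip")

def textual_match(a, b, threshold=0.85):
    # Fallback: high textual similarity of the combined name and address strings.
    similarity = SequenceMatcher(
        None,
        normalize(a["name"] + " " + a.get("address", "")),
        normalize(b["name"] + " " + b.get("address", ""))).ratio()
    return similarity >= threshold

def match_entities(a, b):
    return rule_match(a, b) or textual_match(a, b)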