Hasso-Plattner-Institut
Prof. Dr. Felix Naumann
 

DBStrange: Exploring the Multiverse of Entity Resolution Datasets

Lukas Laskowski, Dr. Fabian Panse, Prof. Dr. Felix Naumann

Benchmarking Entity Resolution

Entity resolution (ER) is an essential step in data cleaning pipelines. It aims to detect and consolidate multiple records that refer to the same real-world entity. This topic has been studied in the literature for more than 50 years, but many challenges remain. One of them is the lack of consolidated benchmarks for evaluating and comparing ER approaches. This lack is exacerbated by the fact that state-of-the-art ER approaches are based on supervised learning methods, which are particularly data-hungry. More specifically: (i) There is no centralized repository of ER benchmarks. Rather, they are fragmented across multiple websites (e.g., [1, 2, 3, 4, 5], to name a few). (ii) Different versions of the same benchmark exist, so comparing multiple approaches requires selecting the correct dataset versions and understanding how these versions were created by transforming the original data.

Project Goal

We will follow an entire research cycle from problem inception and literature research to algorithm development and, finally, to the evaluation and deployment of a publicly available ER dataset repository. Together, we will prepare a research article and submit it to an international conference. Our goal is to bring clarity to the embarrassingly diffuse landscape of ER benchmark datasets, with a special focus on the analysis of different dataset versions. We want to develop algorithms that can (i) cluster ER datasets to identify different versions of the same data and (ii) characterize the differences between multiple dataset versions using frameworks such as Explain-Da-V [5].

We start the project with a literature search phase. Afterwards, we will collect ER dataset versions that are typically used to benchmark ER algorithms. Here we are not starting from scratch, but already have a few dataset versions as a starting point. Then, we will design and develop our algorithms for clustering and characterizing those dataset versions. Thereafter, we will conduct a sensitivity analysis and check whether using different versions of the same dataset has an impact on the reported quality of several state-of-the-art ER approaches, such as Ditto [6]. Finally, we will build our repository and develop a dashboard to help users navigate within it. In summary, our approach consists of the following tasks:

  • Collect available ER benchmark datasets
  • Provide datasets and ground truths in a unified format
  • Store meta-information about the datasets
  • Cluster datasets to identify different versions of the same data (see the sketch after this list)
  • Characterize the differences between multiple dataset versions
  • Measure the impact of different dataset versions (sensitivity analysis)
  • Create a dashboard that allows filtering and querying the dataset repository
  • Clarify rights management (are we allowed to share the datasets?)
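
To make the version-clustering task more concrete, the following is a minimal sketch in Python. It assumes a simplified unified format in which each dataset version is represented as a set of records (hashable tuples) and groups versions greedily by record overlap (Jaccard similarity). The function names, the threshold, and the toy data are illustrative assumptions, not part of the project's actual design.

# Minimal sketch: dataset versions as sets of records in an assumed unified
# format; versions whose record overlap exceeds a threshold are grouped
# into the same cluster. All names and toy data below are illustrative.

def jaccard(a: set, b: set) -> float:
    """Jaccard similarity between two record sets."""
    if not a and not b:
        return 1.0
    return len(a & b) / len(a | b)


def cluster_versions(datasets: dict[str, set], threshold: float = 0.5) -> list[set[str]]:
    """Greedily assign each dataset to the first cluster that already
    contains a sufficiently similar dataset; otherwise open a new cluster."""
    clusters: list[set[str]] = []
    for name, records in datasets.items():
        for cluster in clusters:
            if any(jaccard(records, datasets[other]) >= threshold for other in cluster):
                cluster.add(name)
                break
        else:  # no sufficiently similar cluster found
            clusters.append({name})
    return clusters


if __name__ == "__main__":
    # Toy example: two Abt-Buy variants share most records, DBLP-ACM does not.
    versions = {
        "abt-buy_v1": {("abt", 1), ("abt", 2), ("buy", 7)},
        "abt-buy_v2": {("abt", 1), ("abt", 2), ("buy", 8)},
        "dblp-acm": {("dblp", 3), ("acm", 4)},
    }
    print(cluster_versions(versions, threshold=0.3))
    # e.g. [{'abt-buy_v1', 'abt-buy_v2'}, {'dblp-acm'}] (set ordering may vary)

In the project, record overlap would be only one of several signals; schema changes, ground-truth differences, and other metadata matter as well, and characterizing those differences is where frameworks such as Explain-Da-V [5] come in.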