Explainable Data Matching

Description

Data matching is the process of detecting (and subsequently cleaning) multiple representations of the same real-world object within a given dataset. Typical approaches create a candidate set of record pairs, compute a similarity score for each pair, and classify the pair as a match if that score reaches a given threshold. Such data matching systems and their components can be quite complex, which makes their results difficult to understand. Building upon the data matching benchmark platform Frost and its implementation Snowman (PDF, GitHub), we plan to develop methods to better explain data matching results to developers and domain experts.
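To make the three steps concrete, here is a minimal sketch of such a pipeline in Python. It is an illustration only, not how Frost or Snowman work: the toy records, the blocking on the city attribute, the use of difflib's SequenceMatcher as similarity measure, and the threshold of 0.9 are all assumptions chosen for brevity.

```python
from difflib import SequenceMatcher
from itertools import combinations

# Toy dataset (hypothetical records, for illustration only).
records = [
    {"id": 1, "name": "Jon Smith", "city": "Berlin"},
    {"id": 2, "name": "John Smith", "city": "Berlin"},
    {"id": 3, "name": "Jane Doe", "city": "Potsdam"},
    {"id": 4, "name": "Jane Do", "city": "Potsdam"},
    {"id": 5, "name": "Max Mustermann", "city": "Berlin"},
]


def candidate_pairs(records, key="city"):
    """Step 1: build candidate pairs by blocking on one attribute (assumed: city)."""
    blocks = {}
    for record in records:
        blocks.setdefault(record[key], []).append(record)
    for block in blocks.values():
        # Only records within the same block are compared with each other.
        yield from combinations(block, 2)


def similarity(a, b, attributes=("name", "city")):
    """Step 2: average the string similarity over the compared attributes."""
    scores = [
        SequenceMatcher(None, a[attr].lower(), b[attr].lower()).ratio()
        for attr in attributes
    ]
    return sum(scores) / len(scores)


def match(records, threshold=0.9):
    """Step 3: declare a pair a match if its similarity reaches the threshold."""
    return [
        (a["id"], b["id"])
        for a, b in candidate_pairs(records)
        if similarity(a, b) >= threshold
    ]


matches = match(records)
print(matches)
```

Even in this small sketch, the result depends on three interacting choices (blocking key, similarity measure, threshold), which hints at why the output of a real matching system can be hard to interpret.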

These explanations could be in the form of carefully selected record pairs, a visualization of value similarities, an analysis of dependencies between certain values and misclassification of their records, etc. We will design, implement and test such novel methods, ideally resulting in a submission to a scientific conference.
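As a rough idea of what a text-based "visualization of value similarities" might look like, the following sketch breaks the similarity of one record pair down by attribute and lists the least similar values first. The function names explain_pair and render, the example records, and the choice of difflib's SequenceMatcher are assumptions made for this illustration. They are not part of Snowman and not the methods the seminar will develop.

```python
from difflib import SequenceMatcher


def explain_pair(a, b, attributes):
    """Return (attribute, left value, right value, similarity) rows,
    sorted from least to most similar."""
    rows = []
    for attr in attributes:
        left, right = str(a[attr]), str(b[attr])
        score = SequenceMatcher(None, left.lower(), right.lower()).ratio()
        rows.append((attr, left, right, round(score, 2)))
    return sorted(rows, key=lambda row: row[3])


def render(rows, width=20):
    """Render each row as a simple text bar whose length reflects the similarity."""
    lines = []
    for attr, left, right, score in rows:
        bar = "#" * round(score * width)
        lines.append(f"{attr:<8} {score:4.2f} {bar:<{width}} {left!r} vs {right!r}")
    return "\n".join(lines)


# Hypothetical record pair, for illustration only.
a = {"name": "Jon Smith", "city": "Berlin", "zip": "10115"}
b = {"name": "John Smith", "city": "Berlin", "zip": "14482"}

rows = explain_pair(a, b, ("name", "city", "zip"))
print(render(rows))
```

Sorting by ascending similarity puts the values that most likely caused a non-match decision at the top, which is one simple way to direct a domain expert's attention.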

Time Table

We meet on Tuesdays at 17:00 in F.2-10. The first meeting is open to all. To participate, please send me a binding registration by email by April 29; I will then notify the participants. If more students register than there are slots, I will select participants at random.

Date	Topic
25.04.2022	Introduction to data matching and topic selection
02.05.2022	Kickoff, introductions and scheduling
10.05.2022 (online)	First insights into research avenues
17.05.2022	Brief presentations of related work
24.05.2022	Presentations of solution ideas
31.05.2022	Guest talk: Andrea Baraldi (U Modena) on Landmark Explanations
07.06.2022	Report on team deep-dives
14.06.2022	Student-internal meeting (discuss evaluation methods)
21.06.2022	Intermediate presentations (15min each)
28.06.2022	Status updates
05.07.2022	Status updates
12.07.2022
19.07.2022
26.07.2022	Final Presentations

Final report submission deadline: August 26, 2022

Literature

Felix Naumann, Melanie Herschel: An Introduction to Duplicate Detection. Synthesis Lectures on Data Management, Morgan & Claypool Publishers, 2010
Peter Christen: Data Matching: Concepts and Techniques for Record Linkage, Entity Resolution, and Duplicate Detection. Data-Centric Systems and Applications, Springer, 2012, ISBN 978-3-642-31163-5