Explainable Data Matching

Description

Data matching is the process of detecting (and subsequently cleaning) multiple representations of the same real-world object within a given dataset. Typical approaches create a candidate set of record pairs, determine the similarity of each pair, and then compare that similarity to a threshold to decide whether the pair is a match. Such data matching systems and their components can be quite complex, and their results are difficult to understand. Building upon the data matching benchmark platform Frost and its implementation Snowman (pdf, github), we plan to develop methods that better explain data matching results to developers and domain experts.
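
To make the typical pipeline described above concrete, here is a minimal sketch in Python. It is an illustration only and not how Frost or Snowman work: the toy records, the blocking key (city), the similarity measure (difflib's ratio on the name) and the threshold of 0.85 are all assumptions chosen for the example.

```python
from difflib import SequenceMatcher
from itertools import combinations

# Toy dataset (hypothetical): record id -> attribute values.
records = {
    1: {"name": "John Smith", "city": "Berlin"},
    2: {"name": "Jon Smith", "city": "Berlin"},
    3: {"name": "Maria Garcia", "city": "Potsdam"},
    4: {"name": "John Smyth", "city": "Potsdam"},
}


def candidate_pairs(records):
    """Step 1: build the candidate set by blocking on the city attribute."""
    blocks = {}
    for rid, rec in records.items():
        blocks.setdefault(rec["city"], []).append(rid)
    for ids in blocks.values():
        # Only records within the same block are ever compared.
        yield from combinations(sorted(ids), 2)


def similarity(a, b):
    """Step 2: similarity of two records, here based on the name only."""
    return SequenceMatcher(None, a["name"].lower(), b["name"].lower()).ratio()


def match(records, threshold=0.85):
    """Step 3: keep the candidate pairs whose similarity reaches the threshold."""
    return [
        (i, j)
        for i, j in candidate_pairs(records)
        if similarity(records[i], records[j]) >= threshold
    ]


if __name__ == "__main__":
    print(match(records))
```

In this toy setup, records 1 and 4 never become a candidate pair because they live in different blocks, so their similarity is never computed. Effects like this are one reason why the output of even a small matching pipeline can be hard to interpret.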

These explanations could take the form of carefully selected record pairs, a visualization of value similarities, an analysis of dependencies between certain values and the misclassification of their records, etc. We will design, implement, and test such novel methods, ideally resulting in a submission to a scientific conference.
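
As one possible reading of "a visualization of value similarities", the following Python sketch breaks a record pair down into per-attribute similarities and renders them as text bars. The function names `value_similarities` and `render`, the example records, and the choice of difflib's ratio as the similarity measure are hypothetical and serve only as an illustration; the seminar does not prescribe this approach.

```python
from difflib import SequenceMatcher


def value_similarities(a, b):
    """Return a similarity in [0, 1] for every attribute both records share."""
    shared = sorted(a.keys() & b.keys())
    return {
        attr: SequenceMatcher(
            None, str(a[attr]).lower(), str(b[attr]).lower()
        ).ratio()
        for attr in shared
    }


def render(sims, width=20):
    """Render the per-attribute similarities as simple text bars."""
    lines = []
    for attr, score in sims.items():
        bar = "#" * round(score * width)
        lines.append(f"{attr:<8} {bar:<{width}} {score:.2f}")
    return "\n".join(lines)


if __name__ == "__main__":
    # Hypothetical record pair used only for demonstration.
    left = {"name": "John Smith", "city": "Berlin"}
    right = {"name": "Jon Smith", "city": "Berlin"}
    print(render(value_similarities(left, right)))
```

A breakdown like this shows which attribute values drive the overall similarity of a pair, which can help a domain expert judge whether a match decision is plausible.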

Time Table

We meet Tuesdays at 17:00 in F.2-10. The first meeting is open to all. Please send me a binding registration via email by April 29; I will notify the participants afterwards. If more students register than there are slots, I will select participants at random.

Date Topic

25.04.2022 Introduction to data matching and topic selection
02.05.2022 Kickoff, introductions and scheduling
10.05.2022 (online) First insights into research avenues
17.05.2022 Brief presentations of related work
24.05.2022 Presentations of solution ideas
31.05.2022 Guest talk: Andrea Baraldi (U Modena) on Landmark Explanations
07.06.2022 Report on team deep-dives
14.06.2022 Student-internal meeting (discuss evaluation methods)
21.06.2022 Intermediate presentations (15min each)
28.06.2022 Status updates
05.07.2022 Status updates
12.07.2022  
19.07.2022  
26.07.2022 Final Presentations

Final report submission deadline: August 26, 2022

Literature