Explainable Data Matching
Description
Data matching is the process of detecting (and subsequently cleaning) multiple representations of the same real-world object within a given dataset. Typical approaches create a candidate set of record pairs, determine the similarity of each pair, and then compare that similarity to a threshold to decide whether the pair is a match. Such data matching systems and their components can be quite complex, and their results are difficult to understand. Building upon the data matching benchmark platform Frost and its implementation Snowman (pdf, github), we plan to develop methods to better explain data matching results to developers and domain experts.
These explanations could be in the form of carefully selected record pairs, a visualization of value similarities, an analysis of dependencies between certain values and misclassification of their records, etc. We will design, implement and test such novel methods, ideally resulting in a submission to a scientific conference.
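To make the typical pipeline concrete (candidate pairs, similarity, threshold), here is a minimal Python sketch. It is not how Frost or Snowman work; the toy records, the field names, the averaging of per-field similarities, and the threshold of 0.8 are all assumptions chosen for illustration. The per-field similarities returned alongside each match are one very simple form of the "value similarities" an explanation could show.

```python
from difflib import SequenceMatcher
from itertools import combinations

# Toy records; field names and values are invented for illustration.
RECORDS = [
    {"id": 1, "name": "Jon Smith", "city": "Berlin"},
    {"id": 2, "name": "John Smith", "city": "Berlin"},
    {"id": 3, "name": "Jane Doe", "city": "Potsdam"},
]
FIELDS = ("name", "city")


def candidate_pairs(records):
    """Naive candidate generation: every unordered pair (no blocking)."""
    return combinations(records, 2)


def value_similarities(a, b):
    """Per-field string similarity in [0, 1], case-insensitive."""
    return {
        f: SequenceMatcher(None, a[f].lower(), b[f].lower()).ratio()
        for f in FIELDS
    }


def match(records, threshold=0.8):
    """Return (id_a, id_b, score, per-field similarities) for each pair
    whose average field similarity reaches the threshold."""
    results = []
    for a, b in candidate_pairs(records):
        sims = value_similarities(a, b)
        score = sum(sims.values()) / len(sims)
        if score >= threshold:
            results.append((a["id"], b["id"], score, sims))
    return results


if __name__ == "__main__":
    for id_a, id_b, score, sims in match(RECORDS):
        print(f"records {id_a} and {id_b}: score={score:.2f}, per field={sims}")
```

Real systems usually replace the all-pairs step with blocking or indexing and use learned or weighted similarity functions, which is part of why their decisions are harder to explain than in this sketch.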
Time Table
We meet Tuesdays at 17:00 in F.2-10. The first meeting is open to all. Please send me a binding registration by email by April 29; I will notify the participants afterwards. If there are more registrations than slots, I will select students at random.
| Date | Topic |
| 25.04.2022 | Introduction to data matching and topic selection |
| 02.05.2022 | Kickoff, introductions and scheduling |
| 10.05.2022 (online) | First insights into research avenues |
| 17.05.2022 | Brief presentations of related work |
| 24.05.2022 | Presentations of solution ideas |
| 31.05.2022 | Guest talk: Andrea Baraldi (U Modena) on Landmark Explanations |
| 07.06.2022 | Report on team deep-dives |
| 14.06.2022 | Student-internal meeting (discuss evaluation methods) |
| 21.06.2022 | Intermediate presentations (15 min each) |
| 28.06.2022 | Status updates |
| 05.07.2022 | Status updates |
| 12.07.2022 | |
| 19.07.2022 | |
| 26.07.2022 | Final presentations |
Final report submission deadline: August 26, 2022
Literature
- Felix Naumann, Melanie Herschel: An Introduction to Duplicate Detection. Synthesis Lectures on Data Management, Morgan & Claypool Publishers 2010
- Peter Christen: Data Matching - Concepts and Techniques for Record Linkage, Entity Resolution, and Duplicate Detection. Data-Centric Systems and Applications, Springer 2012, ISBN 978-3-642-31163-5, pp. I-XIX, 1-270