Data matching is the process of detecting (and subsequently cleaning) multiple representations of the same real-world object within a given dataset. Typical approaches create a candidate set of record pairs, compute a similarity score for each pair, and classify a pair as a match if its score exceeds some threshold. Such data matching systems and their components can be quite complex, and understanding their results is difficult. Building upon the data matching benchmark platform Frost and its implementation Snowman, we plan to develop methods to better explain data matching results to developers and domain experts.
These explanations could take the form of carefully selected record pairs, visualizations of value similarities, or analyses of dependencies between certain values and the misclassification of their records. We will design, implement, and test such novel methods, ideally resulting in a submission to a scientific conference.
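
To make the pipeline described above concrete, the following is a minimal sketch in Python, not part of Frost or Snowman. It assumes a toy dataset, uses the standard library's `difflib.SequenceMatcher` as a stand-in similarity measure, and compares all record pairs instead of applying a real candidate generation step such as blocking. The names `RECORDS`, `value_similarity`, `match`, and `misclassified` are hypothetical and chosen only for illustration. The per-attribute similarities kept for each pair are the kind of raw material an explanation could be built from.

```python
from difflib import SequenceMatcher
from itertools import combinations

# Hypothetical toy records; ids and values are made up for illustration.
RECORDS = {
    1: {"name": "John Smith", "city": "Berlin"},
    2: {"name": "Jon Smith", "city": "Berlin"},
    3: {"name": "Mary Jones", "city": "Potsdam"},
}


def value_similarity(a: str, b: str) -> float:
    """Stand-in string similarity in [0, 1]; real systems use other measures."""
    return SequenceMatcher(None, a.lower(), b.lower()).ratio()


def match(records, threshold=0.8):
    """Score every record pair and classify it against the threshold.

    Assumption: the candidate set is simply all pairs, which only works
    for tiny datasets.
    """
    results = []
    for id_a, id_b in combinations(sorted(records), 2):
        left, right = records[id_a], records[id_b]
        # Keep the per-attribute similarities so a decision can be explained.
        per_attr = {attr: value_similarity(left[attr], right[attr]) for attr in left}
        score = sum(per_attr.values()) / len(per_attr)
        results.append((id_a, id_b, score, per_attr, score >= threshold))
    return results


def misclassified(results, true_duplicates):
    """Return pairs whose predicted label disagrees with the ground truth."""
    errors = []
    for id_a, id_b, score, per_attr, is_match in results:
        is_duplicate = (id_a, id_b) in true_duplicates
        if is_match != is_duplicate:
            kind = "false positive" if is_match else "false negative"
            errors.append((kind, id_a, id_b, round(score, 3), per_attr))
    return errors


if __name__ == "__main__":
    # Assumed ground truth: records 1 and 2 describe the same person.
    for error in misclassified(match(RECORDS, threshold=0.99), {(1, 2)}):
        print(error)
```

With the deliberately strict threshold in the usage example, the pair (1, 2) is reported as a false negative together with its per-attribute similarities, which shows that the slightly different name spelling, not the city, pulled the score below the threshold.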