The evaluation of duplicate detection systems typically requires pre-classified results. Such gold standards are often expensive to create (much manual classification is necessary), not representative (too small or too synthetic), or proprietary and thus preclude repeatable experiments (company-internal data). This lament has been uttered in many papers and even more paper reviews.
We propose an annealing standard: a structured set of duplicate detection results, some of which are manually verified and some of which are merely validated by many classifiers. As more classifiers are evaluated against the annealing standard, more results become verified and the validation becomes increasingly confident.
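The idea can be illustrated with a small sketch: record pairs carry either a manually verified label or a merely validated one, and each agreeing classifier raises the confidence of a validated pair. All class and method names here are illustrative assumptions, not the data structures from the paper.

```python
# Hypothetical sketch of an annealing standard: pairs are either manually
# verified (full confidence) or validated by agreeing classifier votes.
class AnnealingStandard:
    def __init__(self):
        # pair -> {"label": bool, "verified": bool, "votes": int}
        self.entries = {}

    def add_verified(self, pair, is_duplicate):
        """Record a manually verified classification for a record pair."""
        self.entries[pair] = {"label": is_duplicate, "verified": True, "votes": 0}

    def add_validated(self, pair, is_duplicate):
        """Record one classifier's vote; agreement with the stored label
        increases the vote count of an unverified pair."""
        entry = self.entries.setdefault(
            pair, {"label": is_duplicate, "verified": False, "votes": 0})
        if not entry["verified"] and entry["label"] == is_duplicate:
            entry["votes"] += 1

    def confidence(self, pair):
        """1.0 for verified pairs; otherwise grows with agreeing votes."""
        entry = self.entries[pair]
        if entry["verified"]:
            return 1.0
        return entry["votes"] / (entry["votes"] + 1)


std = AnnealingStandard()
std.add_verified(("r1", "r2"), True)      # manually checked duplicate
std.add_validated(("r3", "r4"), True)     # first classifier agrees
std.add_validated(("r3", "r4"), True)     # second classifier agrees
print(std.confidence(("r1", "r2")))       # 1.0
```

The particular confidence formula is a placeholder; the point is that repeated agreement over time moves a pair from "validated" toward "verified".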
Read the paper, published in the Journal of Data and Information Quality (JDIQ) in 2014.
We want to evaluate the annealing standard by deduplicating a large real-world data set. The task is to develop a classifier for this data set, so that its results can be used to improve the quality of the annealing standard.
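Such a classifier can be as simple as a similarity threshold on record pairs. The following sketch uses Python's standard-library `difflib`; the `title` field and the threshold value are illustrative assumptions, not requirements of the task.

```python
# Minimal duplicate-pair classifier sketch: two records are classified as
# duplicates if their titles are sufficiently similar.
from difflib import SequenceMatcher

def similarity(a, b):
    """Case-insensitive string similarity in [0, 1]."""
    return SequenceMatcher(None, a.lower(), b.lower()).ratio()

def is_duplicate(rec1, rec2, threshold=0.85):
    """Classify a record pair by comparing an illustrative 'title' field."""
    return similarity(rec1["title"], rec2["title"]) >= threshold

a = {"title": "The Annealing Standard"}
b = {"title": "the annealing standard"}
print(is_duplicate(a, b))  # True
```

A real classifier would combine several fields and a learned decision rule, but its pairwise verdicts would feed into the annealing standard in the same way.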