Duplicates in a dataset are multiple, differing representations of the same real-world object. Detecting them is usually complex. Huge datasets and the online nature of modern systems demand ever more processing power, with multiple nodes participating to handle the incoming data. In this seminar, we want to explore existing techniques for distributed duplicate detection and re-implement, extend, and evaluate them. Popular frameworks such as Apache Spark or Apache Flink will be used by the teams to ease the implementation.
A naive approach could be to replicate all m records to each of the n nodes and then split the m(m-1)/2 candidate pairs uniformly across the nodes, so that every node compares m(m-1)/(2n) pairs.
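For illustration, here is a minimal sketch of this naive strategy using PySpark; the sample records, the number of partitions, and the similar() placeholder are assumptions for demonstration, not part of the seminar material.

```python
# Naive distributed duplicate detection sketch (assumed example data).
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("naive-dedup").getOrCreate()
sc = spark.sparkContext

# Records as (id, value) tuples; replicated to the workers by Spark.
records = sc.parallelize([(0, "Jon Smith"), (1, "jon smith"), (2, "Jane Doe")])

# Build all m*(m-1)/2 unordered candidate pairs and spread them over n partitions.
n = 4
pairs = (records.cartesian(records)
                .filter(lambda p: p[0][0] < p[1][0])  # keep each pair once, drop self-pairs
                .repartition(n))

def similar(a, b):
    # Placeholder similarity test; a real implementation would use,
    # e.g., an edit-distance measure with a threshold.
    return a.lower() == b.lower()

duplicates = pairs.filter(lambda p: similar(p[0][1], p[1][1])).collect()
print(duplicates)  # [((0, 'Jon Smith'), (1, 'jon smith'))]
```

Comparing every pair this way is quadratic in m, which is exactly why the seminar papers propose smarter partitioning and blocking strategies.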
Participating students will be provided with relevant literature on algorithms that tackle this problem with different approaches, or they can propose their own. We are looking for at most 12 students, who will form groups; each group will be assigned one of these papers with the task of implementing and extending it. The evaluation covers both how efficient (i.e., fast) the algorithms are and how effective they are in terms of precision, recall, and f-measure, as sketched below. The final grade does not depend on the rank of the teams but on the ideas, the implementation, the evaluation, and the presentation.
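As a reference for the effectiveness metrics, here is a minimal sketch, assuming both the gold standard and an algorithm's output are represented as sets of record-id pairs; the function name and example sets are illustrative only.

```python
# Precision, recall, and f-measure over detected vs. gold duplicate pairs.
def evaluate(found: set, gold: set):
    tp = len(found & gold)                        # true positives
    precision = tp / len(found) if found else 0.0
    recall = tp / len(gold) if gold else 0.0
    f_measure = (2 * precision * recall / (precision + recall)
                 if precision + recall else 0.0)
    return precision, recall, f_measure

print(evaluate({(1, 2), (3, 4)}, {(1, 2), (5, 6)}))  # (0.5, 0.5, 0.5)
```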