The bib-item deduplication service is powered by the DuDe toolkit, which supports duplicate detection on various types of input data. The DuDe toolkit offers a range of algorithms from the literature, different output formats, and several utility classes, e.g., for gathering statistics or generating transitive closures. The toolkit is developed and maintained by the Information Systems Group at the Hasso-Plattner Institute, Potsdam. Please visit the project's homepage for further details.
The following Java code listing shows how easy it is to use DuDe to detect duplicates. This code snippet is also used in the backend of this service in a slightly different form.
    // ...

    // initializes data source
    // "bibtex" is the source id and bibtexFile represents the uploaded bibtex file
    BibtexSource source = new BibtexSource("bibtex", bibtexFile);
    // specifies the id attribute
    source.addIdAttributes(BibtexSource.KEY_ATTRIBUTE);

    // initializes the algorithm and enables in-memory processing
    // the sorting key defines the sorting order within SNM
    SortingKey sortingKey = new SortingKey(new TextBasedSubkey("title"));
    // the window size defines the search range
    int windowSize = 30;
    SortedNeighborhoodMethod algorithm = new SortedNeighborhoodMethod(sortingKey, windowSize);
    algorithm.enableInMemoryProcessing();
    algorithm.addDataSource(source);

    // instantiates the used similarity function
    SimilarityFunction similarityFunction = new BibtexSimilarityFunction();

    // all duplicates are collected
    List<DuDeObjectPair> result = new ArrayList<DuDeObjectPair>();

    // duplicate detection using a threshold of 0.9
    for (DuDeObjectPair pair : algorithm) {
        if (similarityFunction.getSimilarity(pair) > 0.9) {
            result.add(pair);
        }
    }

    // clean up
    algorithm.cleanUp();

    // ...
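For readers unfamiliar with the Sorted Neighborhood Method (SNM) referenced in the listing, the following is a minimal sketch of the general idea in plain Java: sort the records by a key, then compare each record only with the records that fall into a sliding window after it. It does not use any DuDe classes. The class name `SnmSketch`, the method `candidatePairs`, and the use of bare title strings as records are assumptions made for illustration, and DuDe's own implementation may differ in its details.

```java
import java.util.ArrayList;
import java.util.Comparator;
import java.util.List;

// Hypothetical illustration of the Sorted Neighborhood idea; not DuDe code.
final class SnmSketch {

    /**
     * Sorts the titles case-insensitively and pairs each title with the
     * (windowSize - 1) titles that follow it in the sorted order.
     */
    static List<String[]> candidatePairs(List<String> titles, int windowSize) {
        List<String> sorted = new ArrayList<>(titles);
        sorted.sort(Comparator.comparing(String::toLowerCase));

        List<String[]> pairs = new ArrayList<>();
        for (int i = 0; i < sorted.size(); i++) {
            // The window covers positions i .. i + windowSize - 1.
            int end = Math.min(sorted.size(), i + windowSize);
            for (int j = i + 1; j < end; j++) {
                pairs.add(new String[] {sorted.get(i), sorted.get(j)});
            }
        }
        return pairs;
    }

    public static void main(String[] args) {
        List<String> titles = List.of(
                "Duplicate Detection in Practice",
                "A Survey of Record Linkage",
                "duplicate detection in practise",
                "Query Optimization Basics");
        for (String[] pair : candidatePairs(titles, 2)) {
            System.out.println(pair[0] + "  <->  " + pair[1]);
        }
    }
}
```

A small window keeps the number of comparisons close to linear in the number of records, at the cost of missing duplicates whose sorting keys are far apart. A window at least as large as the input degenerates into comparing all pairs.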
The similarity between two entries in a BibTeX file is calculated as a weighted average of similarity functions, each of which is based on one BibTeX attribute. Two attribute values are considered equal if their similarity is 1.0. Each of the similarity functions is described in the following table:
| Attribute | Condition | Similarity Function | Weight |
| author | | Jaccard Similarity (including recognition of abbreviations) | 2 |
| pages [1, 2] | Levenshtein Distance 0 or 1 | 1 - Normalized Levenshtein Distance | 2 |
| | otherwise | 0.0 | |
| title | | 1 - Normalized Levenshtein Distance | 2 |
| type | types equal | 1.0 | 2 |
| | types are article and inproceedings | 0.5 | |
| | otherwise | 0.0 | |
| year [1] | years equal | 1.0 | 1 |
| | difference of 1 | 0.8 | |
| | otherwise | 0.0 | |
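The rules in the table can be sketched in plain Java as follows. This is an illustrative approximation, not the service's `BibtexSimilarityFunction`: the class and method names are hypothetical, the Levenshtein distance is assumed to be normalized by the length of the longer string, every attribute is assumed to be present in both entries, and the author similarity (Jaccard with abbreviation recognition) is not implemented and is passed in as a precomputed value.

```java
// Hypothetical sketch of the table's rules; not the service's actual code.
final class BibSimilaritySketch {

    /** Standard dynamic-programming Levenshtein edit distance. */
    static int levenshtein(String a, String b) {
        int[] prev = new int[b.length() + 1];
        int[] curr = new int[b.length() + 1];
        for (int j = 0; j <= b.length(); j++) {
            prev[j] = j;
        }
        for (int i = 1; i <= a.length(); i++) {
            curr[0] = i;
            for (int j = 1; j <= b.length(); j++) {
                int cost = a.charAt(i - 1) == b.charAt(j - 1) ? 0 : 1;
                curr[j] = Math.min(Math.min(curr[j - 1] + 1, prev[j] + 1), prev[j - 1] + cost);
            }
            int[] tmp = prev;
            prev = curr;
            curr = tmp;
        }
        return prev[b.length()];
    }

    /** title: 1 - normalized distance (normalizing by the longer length is an assumption). */
    static double titleSimilarity(String a, String b) {
        int maxLen = Math.max(a.length(), b.length());
        return maxLen == 0 ? 1.0 : 1.0 - (double) levenshtein(a, b) / maxLen;
    }

    /** pages: same measure as title, but only if the edit distance is 0 or 1. */
    static double pagesSimilarity(String a, String b) {
        return levenshtein(a, b) <= 1 ? titleSimilarity(a, b) : 0.0;
    }

    /** type: 1.0 if equal, 0.5 for article vs. inproceedings, otherwise 0.0. */
    static double typeSimilarity(String a, String b) {
        if (a.equals(b)) {
            return 1.0;
        }
        boolean mixed = (a.equals("article") && b.equals("inproceedings"))
                || (a.equals("inproceedings") && b.equals("article"));
        return mixed ? 0.5 : 0.0;
    }

    /** year: 1.0 if equal, 0.8 if the years differ by one, otherwise 0.0. */
    static double yearSimilarity(int a, int b) {
        int diff = Math.abs(a - b);
        return diff == 0 ? 1.0 : (diff == 1 ? 0.8 : 0.0);
    }

    /** Weighted average with the weights from the table (2, 2, 2, 2, 1). */
    static double overall(double author, double pages, double title, double type, double year) {
        return (2 * author + 2 * pages + 2 * title + 2 * type + 1 * year) / 9.0;
    }

    public static void main(String[] args) {
        double score = overall(
                1.0, // author similarity, assumed to be precomputed
                pagesSimilarity("10-20", "10-20"),
                titleSimilarity("Duplicate Detection", "Duplicate Detection"),
                typeSimilarity("article", "inproceedings"),
                yearSimilarity(2010, 2011));
        System.out.println(score);
    }
}
```

In the `main` example, the two entries agree on author, pages, and title but differ in type (article vs. inproceedings) and by one year, so the score is (2 + 2 + 2 + 1 + 0.8) / 9 ≈ 0.867. Under the 0.9 threshold used in the listing, this pair would not be reported as a duplicate.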