For bachelor students, we offer German-language lectures on database systems together with paper- or project-oriented seminars. In a one-year bachelor project, students complete their studies in cooperation with external partners. For master students, we offer courses on information integration, data profiling, search engines, and information retrieval, complemented by specialized seminars, master projects, and supervised master theses.
Most of our research is conducted in the context of larger research projects, in collaboration among students, groups, and universities. We strive to make most of our data sets and source code available.
MDedup - Duplicate Detection with Matching Dependencies
Training data and evaluation results
Ioannis Koumarelas, Thorsten Papenbrock, Felix Naumann
Duplicate detection is an integral part of data cleaning and serves to identify multiple representations of the same real-world entities in (relational) datasets. Existing duplicate detection approaches are very effective, but they are also very hard to parameterize or require a lot of pre-labeled training data. Both parameterization and pre-labeling are at least domain-specific, if not dataset-specific. For many duplicate detection algorithms that are based on machine learning, it is also difficult to explain why certain duplicates have been discovered and others have not.
For these reasons, we propose a novel, rule-based, and fully automatic duplicate detection approach that is based on matching dependencies (MDs). Our system uses automatically discovered MDs, various dataset features, and known gold standards to train a machine learning model to select MDs as duplicate detection rules. Once trained, the model can select useful MDs for duplicate detection on any new dataset. To increase the generally low recall of MD-based data cleaning approaches, we propose an additional boosting technique. Our experiments show that this approach reaches up to 80% F-measure and 99% on our evaluation datasets, which are very good numbers considering that the system is configuration-free.
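To make the idea of an MD as a duplicate detection rule more concrete, the following is a minimal sketch and not the mdedup implementation. It assumes that an MD can be represented as a mapping from attribute names to similarity thresholds, and that two records count as duplicates when every listed attribute meets its threshold. The similarity measure (`difflib.SequenceMatcher`), the function names, the records, and the thresholds are all hypothetical choices made for illustration.

```python
from difflib import SequenceMatcher
from itertools import combinations


def similarity(a: str, b: str) -> float:
    """Normalized string similarity in [0, 1] (stand-in measure, an assumption)."""
    return SequenceMatcher(None, a.lower(), b.lower()).ratio()


def md_matches(r1: dict, r2: dict, md: dict) -> bool:
    """True if every attribute named in the MD meets its similarity threshold."""
    return all(similarity(r1[attr], r2[attr]) >= t for attr, t in md.items())


def detect_duplicates(records: list, md: dict) -> set:
    """Return index pairs (i, j) with i < j that the MD marks as duplicates."""
    return {
        (i, j)
        for (i, r1), (j, r2) in combinations(enumerate(records), 2)
        if md_matches(r1, r2, md)
    }


# Hypothetical records and thresholds, for illustration only.
records = [
    {"name": "Grand Hotel Berlin", "city": "Berlin"},
    {"name": "Grand Hotel, Berlin", "city": "Berlin"},
    {"name": "Seaside Inn", "city": "Hamburg"},
]
md = {"name": 0.9, "city": 1.0}
duplicates = detect_duplicates(records, md)
```

Because the rule is just a set of attribute thresholds, it stays readable: one can see directly which attributes and similarity levels caused a pair to be reported.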
Hotels: provided by our industry partner Concur (www.concur.com). Unfortunately, the dataset itself cannot be disclosed due to privacy concerns. However, we do provide the discovered matching dependencies (MDs) and matching dependency combinations (MDCs).
The following matching dependency combinations (MDCs) are produced by the training pipeline, as discussed in the paper. They are used to support the discovery of MDCs in a new dataset (see MDC Prediction in the mdedup project).
Training pipeline: "Selection", "Expansion", "Exploration" (where features are generated for MDCs of Selection and Exploration). The MDCs of the exploration phase are used to train the regression model.
Application pipeline: We also provide the MDCs of the application pipeline, marked with "Prediction", as scored in our experimental evaluation. The MDC with the highest "prediction_score" is selected for the first-cut duplicate detection, for which precision, recall, and F-measure scores are also provided.
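As a reminder of how the reported precision, recall, and F-measure relate to each other, here is a minimal sketch that assumes duplicates are represented as sets of record-ID pairs and compared against a gold standard. The function name `evaluate` and the example pairs are hypothetical and not taken from the provided files.

```python
def evaluate(predicted: set, gold: set) -> tuple:
    """Compute (precision, recall, F1) for predicted vs. gold duplicate pairs."""
    true_positives = len(predicted & gold)
    precision = true_positives / len(predicted) if predicted else 0.0
    recall = true_positives / len(gold) if gold else 0.0
    denominator = precision + recall
    # F1 is the harmonic mean of precision and recall.
    f1 = 2 * precision * recall / denominator if denominator else 0.0
    return precision, recall, f1


# Hypothetical pairs of record IDs, for illustration only.
predicted = {(0, 1), (2, 3), (4, 5)}
gold = {(0, 1), (2, 3), (6, 7), (8, 9)}
precision, recall, f1 = evaluate(predicted, gold)
```

In this example, two of the three predicted pairs are correct and two of the four gold pairs are found, so precision is 2/3 and recall is 1/2.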
MDC Selection: This corresponds to the MDC with the highest F-measure (F1) for phase = "selection".
MDC Prediction: This corresponds to the MDC with the highest prediction_score for phase = "prediction".
Boosting: Both MDC Selection and MDC Prediction with boosting applied, using Support Vector Machines (SVMs).
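The selection rules for MDC Selection and MDC Prediction can be sketched as follows. This is only an illustration under stated assumptions: each MDC is assumed to be available as a record with a "phase" field, a "prediction_score" field as named above, and an F-measure field that is called "f1" here. The "f1" key, the "mdc" labels, the helper `best_mdc`, and all values are hypothetical and may differ from the layout of the provided files.

```python
def best_mdc(rows: list, phase: str, score_key: str):
    """Return the row of the given phase with the highest score, or None."""
    candidates = [row for row in rows if row["phase"] == phase]
    if not candidates:
        return None
    return max(candidates, key=lambda row: row[score_key])


# Hypothetical rows, for illustration only.
rows = [
    {"mdc": "A", "phase": "selection", "f1": 0.71, "prediction_score": None},
    {"mdc": "B", "phase": "selection", "f1": 0.78, "prediction_score": None},
    {"mdc": "C", "phase": "prediction", "f1": 0.74, "prediction_score": 0.69},
    {"mdc": "D", "phase": "prediction", "f1": 0.66, "prediction_score": 0.81},
]

# MDC Selection: highest F-measure among phase = "selection".
mdc_selection = best_mdc(rows, "selection", "f1")
# MDC Prediction: highest prediction_score among phase = "prediction".
mdc_prediction = best_mdc(rows, "prediction", "prediction_score")
```

Note that in this made-up data the MDC chosen by prediction score ("D") is not the one with the best F-measure in its phase ("C"), since at application time the true F-measure is unknown and only the predicted score can guide the choice.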