For current projects in the area of data quality and data cleansing, please see our work on data preparation and data profiling.

Completed projects

In the past, we have built various large and small data integration systems. They are no longer maintained, and many if not most of them are no longer actively running. Please contact Felix Naumann to learn more.

COLT: A few-shot knowledge validation approach using rules
CurEx: A system for extracting, curating, and exploring domain-specific knowledge graphs
DuDe: A duplicate detection framework and suite of algorithms and datasets
Annealing Standard: A system to gradually build a gold standard for classification problems
Aladin: A system to perform almost automatic integration of datasets
BibTex Deduplication: An online service to deduplicate bibliography files
DAQS: Data Quality as a Service
Data Fusion: Technologies for combining duplicates into single consistent records
Detecting Duplicates in XML: A domain-independent algorithm that effectively identifies duplicates in XML documents
Dirty XML Generator: Create XML data with duplicates for evaluation purposes
DogmatiX: A generalized framework for duplicate detection
GovWILD: An integrated set of government data to explore nepotism in politics and economy
HiQIQ: High quality information querying
MAC / Hummer: A system to integrate heterogeneous datasets, including schema matching, deduplication, and data fusion steps
METL: Systematic management of sets of complex ETL processes
MyDBLP: Systematically annotate bibliographic data
Service Integration with Posr/Depot/Faster: Search, maintain and compose data services
Similarity Search: Methods to efficiently find similar records in large databases
SNNDedupe: A neural approach for entity resolution leveraging Siamese neural networks and knowledge transfer
System P: A peer data management system (PDMS) for data integration
Viqtor: Bulk data quality annotations
XClean: Cleaning and deduplicating XML data
XML Duplicate Detection Benchmark: A mechanism for evaluating XML duplicate detection algorithms with the help of several metrics
XML Duplicate Detection Using Sorted Neighborhoods: An extension of the sorted neighborhood method to nested XML elements
XQueryGen: An interactive tool to create complex XQueries
