Hasso-Plattner-Institut
Prof. Dr. Felix Naumann
 

For current projects in the area of data quality and data cleansing, please see our work on data preparation and data profiling.

Completed projects

In the past, we have built various large and small data integration systems. They are no longer maintained, and many, if not most, of them are no longer actively running. Please contact Felix Naumann to learn more.

  • COLT: A few-shot knowledge validation approach using rules 
  • CurEx: A system for extracting, curating, and exploring domain-specific knowledge graphs
  • DuDe: A duplicate detection framework and suite of algorithms and datasets
  • Annealing Standard: A system to gradually build a gold standard for classification problems
  • Aladin: A system to perform almost automatic integration of datasets
  • BibTex Deduplication: An online service to deduplicate bibliography files
  • DAQS: Data Quality as a Service
  • Data Fusion: Technologies for combining duplicates into single, consistent records
  • Detecting Duplicates in XML: A domain-independent algorithm that effectively identifies duplicates in XML documents
  • Dirty XML Generator: Create XML data with duplicates for evaluation purposes
  • DogmatiX: A generalized framework for duplicate detection
  • GovWILD: An integrated set of government data to explore nepotism in politics and the economy
  • HiQIQ: High quality information querying
  • MAC / Hummer: A system to integrate heterogeneous datasets, including schema matching, deduplication, and data fusion steps
  • METL: Systematic management of sets of complex ETL processes
  • MyDBLP: Systematically annotate bibliographic data
  • Service Integration with Posr/Depot/Faster: Search, maintain and compose data services
  • Similarity Search: Methods to efficiently find similar records in large databases
  • SNNDedupe: A neural approach to entity resolution leveraging Siamese neural networks and knowledge transfer
  • System P: A peer data management system (PDMS) for data integration
  • Viqtor: Bulk data quality annotations
  • XClean: Cleaning and deduplicating XML data
  • XML Duplicate Detection Benchmark: A mechanism for evaluating XML duplicate detection algorithms with the help of several metrics
  • XML Duplicate Detection Using Sorted Neighborhoods: An extension of the sorted neighborhood method to nested XML elements
  • XQueryGen: An interactive tool to create complex XQueries