Hasso-Plattner-Institut
Prof. Dr. Felix Naumann
  
 

Repeatability - DCs

This is the repeatability page for our denial constraint (DC) discovery algorithms. The algorithms are provided in the state in which their results were published, so they may not reflect the most recent version of their implementations.

DC Algorithms

The efficient discovery of denial constraints (DCs) in tables is a challenging task. So far, our group has developed two DC discovery algorithms:

  • We have released Hydra, a hybrid DC discovery algorithm, which is available as part of the metanome-algorithms repository.
  • Our most recent addition to the family of DC discovery algorithms is DCFinder, a new approximate DC discovery algorithm, which is also available as part of the metanome-algorithms repository.
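
For readers new to the topic: a denial constraint forbids any pair of tuples from satisfying all predicates of a given conjunction at once. The following is a minimal sketch of what checking a single, already known DC means. It is not taken from Hydra or DCFinder, and the toy table, the column names, and the helper `violations` are hypothetical and chosen only for illustration.

```python
from itertools import permutations

# Hypothetical toy table; not one of the datasets listed on this page.
rows = [
    {"state": "NY", "salary": 50000, "tax_rate": 0.10},
    {"state": "NY", "salary": 60000, "tax_rate": 0.12},
    {"state": "NY", "salary": 70000, "tax_rate": 0.11},
]


def violations(rows, predicates):
    """Return all ordered index pairs (i, j), i != j, for which every predicate holds."""
    return [
        (i, j)
        for i, j in permutations(range(len(rows)), 2)
        if all(p(rows[i], rows[j]) for p in predicates)
    ]


# Example DC: not (t.state == s.state and t.salary > s.salary and t.tax_rate < s.tax_rate),
# i.e. within one state, a higher salary must not come with a lower tax rate.
dc = [
    lambda t, s: t["state"] == s["state"],
    lambda t, s: t["salary"] > s["salary"],
    lambda t, s: t["tax_rate"] < s["tax_rate"],
]

found = violations(rows, dc)

# Fraction of ordered tuple pairs that violate the DC.
total_pairs = len(rows) * (len(rows) - 1)
violation_ratio = len(found) / total_pairs
```

An exact DC holds only if `found` is empty. An approximate DC tolerates a small share of violating pairs, which is sketched here as `violation_ratio` staying below a chosen threshold; the error measure actually used by DCFinder is not specified on this page. This brute-force check compares all tuple pairs and is meant only to convey the semantics, not the discovery strategy of either algorithm.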

The data profiling tool Metanome provides standardized interfaces to facilitate the comparison of different DC discovery methods.

 

Datasets

Our DC algorithms have been extensively tested on the following datasets:

Name        Source                           Columns  Rows       Size
Adult       uci                              15       32,561     3.5 MB
Airport     Airport                          18       55,113     7.3 MB
Flight      bts.gov                          20       500,000    71 MB
Hospital    https://data.medicare.gov/       15       114,919    30.6 MB
Inspection  https://data.cityofchicago.org/  19       170,000    192.6 MB
ncvoter     ncsbe.gov                        22       938,085    191.4 MB
Stock       http://pages.swcp.com/stocks/    7        122,496    5.3 MB
Tax         http://da.qcri.org/              15       100,000    7.5 MB
Tax         Tax generator by Xu Chu          15       1,000,000  73 MB