Hasso-Plattner-Institut
Prof. Dr. Felix Naumann
 

Repeatability - INDs

Inclusion dependencies are a well studied property of relational data sources and mainly serve the detection of foreign key relationships. As an integral part of most data profiling efforts, they further support schema reconstruction, data exploration, and database maintenance. This is the repeatability page for the following research contribution:

Repeatability for INDs discovery Algorithms

This is a repeatability section for IND discovery algorithms. The algorithms are provided in the state their results have been published, but they may not represent the most recent version of their implementations. To get the more up-to-date version of the algorithms, use the binaries provided here.

The following IND discovery algorithms are available for the Metanome data profiling tool:

Besides that, we provide the following algorithms that are not yet integrated with Metanome:

Experimental Evaluation of State-of-the-art INDs Algorithms

This section is dedecated to provide all materials have been used for experimentally comparing ten INDs discovery algorithms. IND discovery is one of the hardest tasks in data profiling: It is NP-hard, as well as one of the first real-world problems proven to be W[3]-complete. This complexity is mainly due to the large and complex search space and the expensive candidate checks required to find maximum Inds. Current Ind discovery algorithms still have specific shortcomings when running on different datasets. We evaluated and compared the following ten well-known algorithms for discovering Inds:

 unary IND Algorithms

N-ary IND Algorithms

Bell and Brockhausen

Mind

DeMarchi

ZigZag

Spider and SPIDER Bruteforce

Find2

S-indd

Mind2

Binder

Binder

SINDYFAIDA
FAIDA 
MANY 
S-INDD++ 

The code for all this algorithm and for repeating the experiments can be found in GitHub and Metanome algorithms repository. The evaluation framework and run instructions are also available.  (S-INDD++ code is available on request due to copyright)

Foreign Key Detection

Once discovered, INDs serve to identify foreign key constraints in relational schemata. In our WebDB'09 paper A Machine Learning Approach to Foreign Key Discovery, we proposed machine learning techniques to solve this task. Please find supplementary material below:

  • Algorithm that first implemented the proposed heuristics (zip, Java)
  • Training data and models for the proposed machine learning approach (zip for WEKA)

Datasets

The IND algorithms have been exhaustively tested on datasets of the following sources. The links refer to the original source website. Due to the large size of most datasets, we provide the actual data only upon request:

Name | File Size | Attributes | Unary INDs | N-ary INDs

COMA [1] | 20 KB | 4 | 0 | 0
SCOP|16 MB|22|43 (39)|40 (36)
CENSUS|112 MB|48 (42)|73 (39)|147 (89)
WIKIPEDIA [1]|540 MB|14|2|0
BIOSQL|560 MB|148 (77)|12463 (348)|22 (507)
WIKIRANK [1]|697 MB|35 (29)|321 (99)|339 (103)
LOD [2]|830 MB|41|298 (258)|1361005
ENSEMBL|836 MB|448 (130)|142510 (364)|100
CATH|908 MB|115|62|81
CATH 4.0|16 MB|25|51|81
TESMA [3]|1 GB|128 (114)|1780 (2)|0
PDB|44 GB|2790|800651|?
MusicBrainz|26.8 GB (27GB)|1053 (1054)|45793 (49829)|?
PLISTA [4]|61 GB|140|4877|?
TPC-H [5]|100 GB|61|90|6
TPC-H 1|1 GB|61|96|8
TPC-H 10|10 GB|61|97|11
BTC-2012 FB [6]|8.6 GB|22,236|202,331|?
ghaIND [7]|822MB|36|18|203

[1] Crawled from the Wikipedia knowledge base.

[2] Extracted from linked open data on famous persons; stored in relational format.

[3] Generated using our own db-tesma data generator (binaries for Windows 32 Bit).

[4] Streamed anonymized web log data from Plista.

[5] Generated using the dbgen data generator (binaries for Debian 64 Bit DB2).

[6] Obtained by partitioning the Freebase triples from the BTC 2012 dataset by their predicate.

[7] Generated using this generator.

*Numbers in brackets are the numbers we got when these datasets used for the experimental comparison among IND algorithms with NULL equal semantic