Repeatability - INDs

Inclusion dependencies are a well studied property of relational data sources and mainly serve the detection of foreign key relationships. As an integral part of most data profiling efforts, they further support schema reconstruction, data exploration, and database maintenance. This is the repeatability page for the following research contribution:

Repeatability for INDs discovery Algorithms

This is a repeatability section for IND discovery algorithms. The algorithms are provided in the state their results have been published, but they may not represent the most recent version of their implementations. To get the more up-to-date version of the algorithms, use the binaries provided here.

The following IND discovery algorithms are available for the Metanome data profiling tool:

BINDER - unary and n-ary IND detection
- BINDERDatabase.jar
- BINDERFile.jar
SPIDER - unary IND detection
- SPIDERDatabase.jar
- SPIDERFile.jar
FAIDA - approximate unary and n-ary IND detection
- FAIDAAlgorithm.jar
SINDY, SANDY, ANDY - family of distributed algorithms for unary (SINDY), partial (SANDY), and n-ary (ANDY) inclusion dependency discovery
- Repository
- Releases

Besides that, we provide the following algorithms that are not yet integrated with Metanome:

CINDERELLA - conditional IND detection
- Source code

Experimental Evaluation of State-of-the-art INDs Algorithms

This section is dedecated to provide all materials have been used for experimentally comparing ten INDs discovery algorithms. IND discovery is one of the hardest tasks in data profiling: It is NP-hard, as well as one of the first real-world problems proven to be W[3]-complete. This complexity is mainly due to the large and complex search space and the expensive candidate checks required to find maximum Inds. Current Ind discovery algorithms still have specific shortcomings when running on different datasets. We evaluated and compared the following ten well-known algorithms for discovering Inds:

unary IND Algorithms	N-ary IND Algorithms
Bell and Brockhausen	Mind
DeMarchi	ZigZag
Spider and SPIDER Bruteforce	Find2
S-indd	Mind2
Binder	Binder
SINDY	FAIDA
FAIDA
MANY
S-INDD++

The code for all this algorithm and for repeating the experiments can be found in GitHub and Metanome algorithms repository. The evaluation framework and run instructions are also available. (S-INDD++ code is available on request due to copyright)

Foreign Key Detection

Once discovered, INDs serve to identify foreign key constraints in relational schemata. In our WebDB'09 paper A Machine Learning Approach to Foreign Key Discovery, we proposed machine learning techniques to solve this task. Please find supplementary material below:

Algorithm that first implemented the proposed heuristics (zip, Java)
Training data and models for the proposed machine learning approach (zip for WEKA)

Datasets

The IND algorithms have been exhaustively tested on datasets of the following sources. The links refer to the original source website. Due to the large size of most datasets, we provide the actual data only upon request:

Name | File Size | Attributes | Unary INDs | N-ary INDs

COMA [1] | 20 KB | 4 | 0 | 0

SCOP|16 MB|22|43 (39)|40 (36)

CENSUS|112 MB|48 (42)|73 (39)|147 (89)

WIKIPEDIA [1]|540 MB|14|2|0

BIOSQL|560 MB|148 (77)|12463 (348)|22 (507)

WIKIRANK [1]|697 MB|35 (29)|321 (99)|339 (103)

LOD [2]|830 MB|41|298 (258)|1361005

ENSEMBL|836 MB|448 (130)|142510 (364)|100

CATH|908 MB|115|62|81

CATH 4.0|16 MB|25|51|81

TESMA [3]|1 GB|128 (114)|1780 (2)|0

PDB|44 GB|2790|800651|?

MusicBrainz|26.8 GB (27GB)|1053 (1054)|45793 (49829)|?

PLISTA [4]|61 GB|140|4877|?

TPC-H [5]|100 GB|61|90|6

TPC-H 1|1 GB|61|96|8

TPC-H 10|10 GB|61|97|11

BTC-2012 FB [6]|8.6 GB|22,236|202,331|?

ghaIND [7]|822MB|36|18|203

[1] Crawled from the Wikipedia knowledge base.

[2] Extracted from linked open data on famous persons; stored in relational format.

[3] Generated using our own db-tesma data generator (binaries for Windows 32 Bit).

[4] Streamed anonymized web log data from Plista.

[5] Generated using the dbgen data generator (binaries for Debian 64 Bit DB2).

[6] Obtained by partitioning the Freebase triples from the BTC 2012 dataset by their predicate.

[7] Generated using this generator.

*Numbers in brackets are the numbers we got when these datasets used for the experimental comparison among IND algorithms with NULL equal semantic

Repeatability - INDs

Repeatability for INDs discovery Algorithms

Experimental Evaluation of State-of-the-art INDs Algorithms

Foreign Key Detection

Datasets

Name | File Size | Attributes | Unary INDs | N-ary INDs

Chair

News

06.09.2024 | Congratulations Dr. Phillip Wenig

06.09.2024 | Congratulations Dr. Mazhar Hameed!

16.07.2024 | Congratulations Dr. Leon Bornemann-Paulus!

23.05.2024 | Paper accepted at NLDB 2024

29.04.2024 | Paper accepted at ITISE 2024

Project highlights

People and open positions