Prof. Dr. Felix Naumann

Repeatability - INDs

This is a repeatability page for IND discovery algorithms. The algorithms are provided in the state in which their results were published, so they may not reflect the most recent version of their implementations. For a more up-to-date version of the algorithms, use the binaries provided here.


Inclusion dependencies (INDs) are a well-studied property of relational data sources and mainly serve to detect foreign key relationships. As an integral part of most data profiling efforts, they also support schema reconstruction, data exploration, and database maintenance. The following IND discovery algorithms are available for the Metanome data profiling tool:

Besides that, we provide the following algorithms that are not yet integrated with Metanome:

Foreign Key Detection

Once discovered, INDs serve to identify foreign key constraints in relational schemata. In our WebDB'09 paper "A Machine Learning Approach to Foreign Key Discovery", we proposed machine learning techniques to solve this task. Please find supplementary material below:

  • Algorithm that first implemented the proposed heuristics (zip, Java)
  • Training data and models for the proposed machine learning approach (zip for WEKA)
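
For readers new to the term: a unary IND A ⊆ B holds when every value in column A also occurs in column B, which is why a valid IND is a necessary (but not sufficient) condition for a foreign key from A to B. The following is a minimal sketch of a naive, in-memory check, assuming small columns held as Python lists. It is not one of the algorithms offered on this page; the function name `unary_inds` and the toy column names are hypothetical.

```python
from itertools import permutations


def unary_inds(columns):
    """Return all pairs (dep, ref) where the values of column `dep`
    form a subset of the values of column `ref`.

    `columns` maps a column name to its list of values.
    NULLs (None) are ignored; empty columns are skipped as dependents,
    because they would trivially be included in every other column.
    """
    # Build one set of distinct, non-NULL values per column.
    value_sets = {
        name: {v for v in values if v is not None}
        for name, values in columns.items()
    }
    # Test every ordered pair of distinct columns for set inclusion.
    return [
        (dep, ref)
        for dep, ref in permutations(value_sets, 2)
        if value_sets[dep] and value_sets[dep] <= value_sets[ref]
    ]


# Hypothetical toy data, for illustration only.
columns = {
    "orders.customer_id": [1, 2, 2, 3],
    "customers.id": [1, 2, 3, 4],
    "customers.age": [30, 41, 25, 30],
}
inds = unary_inds(columns)
print(inds)
```

Each returned pair is only a foreign key candidate. Telling real foreign keys apart from accidental inclusions is the classification task that the heuristics and models listed above address.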


The IND algorithms have been extensively tested on datasets from the following sources. The links point to the original source websites. Because most datasets are large, we provide the actual data only upon request:

Dataset           File Size   Attributes   Unary INDs   N-ary INDs
COMA [1]          20 KB       4            0            0
SCOP              16 MB       22           43           40
CENSUS            112 MB      48           73           147
WIKIPEDIA [1]     540 MB      14           2            0
BIOSQL            560 MB      148          12463        22
WIKIRANK [1]      697 MB      35           321          339
LOD [2]           830 MB      41           298          1361005
ENSEMBL           836 MB      448          142510       100
CATH              908 MB      115          62           81
TESMA [3]         1 GB        128          1780         0
PDB               44 GB       2790         800651       ?
MusicBrainz       26.8 GB     1053         45793        ?
PLISTA [4]        61 GB       140          4877         ?
TPC-H [5]         100 GB      61           90           6
BTC-2012 FB [6]   8.6 GB      22,236       202,331      ?

[1] Crawled from the Wikipedia knowledge base.
[2] Extracted from linked open data on famous persons; stored in relational format.
[3] Generated using our own db-tesma data generator (binaries for Windows 32 Bit).
[4] Streamed anonymized web log data from Plista.
[5] Generated using the dbgen data generator (binaries for Debian 64 Bit DB2).
[6] Obtained by partitioning the Freebase triples from the BTC 2012 dataset by their predicate.
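
To illustrate the partitioning described in [6]: grouping triples by predicate yields one two-column (subject, object) table per predicate, which is what makes triple data usable for IND discovery. The following is a minimal sketch under that assumption. It is not the script used to build the dataset, and the function name `partition_by_predicate` and the sample triples are hypothetical.

```python
from collections import defaultdict


def partition_by_predicate(triples):
    """Group (subject, predicate, object) triples into one table per
    predicate. Each table is a list of (subject, object) rows."""
    tables = defaultdict(list)
    for subject, predicate, obj in triples:
        tables[predicate].append((subject, obj))
    return dict(tables)


# Hypothetical sample triples, for illustration only.
triples = [
    ("m.01", "people.person.date_of_birth", "1879-03-14"),
    ("m.01", "people.person.profession", "m.05"),
    ("m.02", "people.person.profession", "m.05"),
]
tables = partition_by_predicate(triples)
for predicate, rows in tables.items():
    print(predicate, rows)
```

Every predicate then contributes two attributes, its subject column and its object column, to the resulting relational dataset.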