Prof. Dr. Felix Naumann

Repeatability - INDs

This is a repeatability page for IND discovery algorithms. The algorithms are provided in the state in which their results were published, so they may not represent the most recent version of their implementations. For more up-to-date versions of the algorithms, use the binaries provided here.


Inclusion dependencies (INDs) are a well-studied property of relational data sources and mainly serve to detect foreign key relationships. As an integral part of most data profiling efforts, they further support schema reconstruction, data exploration, and database maintenance. The following IND discovery algorithms are available for the Metanome data profiling tool:

In addition, we provide the following algorithms that are not yet integrated with Metanome:

Foreign Key Detection

Once discovered, INDs serve to identify foreign key constraints in relational schemata. In our WebDB'09 paper "A Machine Learning Approach to Foreign Key Discovery", we proposed machine learning techniques to solve this task. Please find supplementary material below:

  • Algorithm that first implemented the proposed heuristics (zip, Java)
  • Training data and models for the proposed machine learning approach (zip for WEKA)
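
To make the link between INDs and foreign keys concrete: a unary IND A ⊆ B holds when every value of column A also occurs in column B, which is a necessary (but not sufficient) condition for A to be a foreign key referencing B. The following is a minimal sketch of a naive check, not one of the algorithms provided on this page. The column names are hypothetical, and ignoring NULL values is an assumption borrowed from SQL foreign key semantics.

```python
def discover_unary_inds(columns):
    """Naively test every ordered pair of distinct columns for A ⊆ B.

    `columns` maps a column name to its list of values. NULLs (None) are
    ignored, which is an assumption of this sketch.
    """
    value_sets = {
        name: {value for value in values if value is not None}
        for name, values in columns.items()
    }
    return [
        (dependent, referenced)
        for dependent in columns
        for referenced in columns
        if dependent != referenced
        and value_sets[dependent] <= value_sets[referenced]
    ]


# Hypothetical example data: two columns from two tables.
example_columns = {
    "orders.customer_id": [1, 2, 2, 3, None],
    "customers.id": [1, 2, 3, 4],
}

# Each result pair (A, B) is a foreign key candidate: A may reference B.
print(discover_unary_inds(example_columns))
# → [('orders.customer_id', 'customers.id')]
```

This pairwise comparison is quadratic in the number of columns and keeps all value sets in memory, so it only illustrates the definition. Note also that a column without any non-null values is trivially included in every other column under this sketch.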


The IND algorithms have been exhaustively tested on datasets from the following sources. The links refer to the original source websites. Because most datasets are large, we provide the actual data only upon request:


Dataset           File Size   Attributes   Unary INDs   N-ary INDs
COMA [1]          20 KB       4            0            0
SCOP              16 MB       22           43           40
CENSUS            112 MB      48           73           147
WIKIPEDIA [1]     540 MB      14           2            0
BIOSQL            560 MB      148          12463        22
WIKIRANK [1]      697 MB      35           321          339
LOD [2]           830 MB      41           298          1361005
ENSEMBL           836 MB      448          142510       100
CATH              908 MB      115          62           81
TESMA [3]         1 GB        128          1780         0
PDB               44 GB       2790         800651       ?
MusicBrainz       26.8 GB     1053         45793        ?
PLISTA [4]        61 GB       140          4877         ?
TPC-H [5]         100 GB      61           90           6
BTC-2012 FB [6]   8.6 GB      22,236       202,331      ?

[1] Crawled from the Wikipedia knowledge base.
[2] Extracted from linked open data on famous persons; stored in relational format.
[3] Generated using our own db-tesma data generator (binaries for Windows 32 Bit).
[4] Streamed anonymized web log data from Plista.
[5] Generated using the dbgen data generator (binaries for Debian 64 Bit DB2).
[6] Obtained by partitioning the Freebase triples from the BTC 2012 dataset by their predicate.
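
The partitioning described in note [6] can be sketched as follows. This is only an illustration under stated assumptions: the source does not specify the resulting schema, so the sketch assumes that each predicate yields one two-column table of (subject, object) pairs. The function name and the sample triples are hypothetical.

```python
from collections import defaultdict


def partition_by_predicate(triples):
    """Group (subject, predicate, object) triples into one table per predicate.

    Each table is a list of (subject, object) rows. This layout is an
    assumption of the sketch, not the documented BTC-2012 FB format.
    """
    tables = defaultdict(list)
    for subject, predicate, obj in triples:
        tables[predicate].append((subject, obj))
    return dict(tables)


# Hypothetical sample triples.
sample_triples = [
    ("s1", "p1", "o1"),
    ("s2", "p1", "o2"),
    ("s1", "p2", "o3"),
]

for predicate, rows in partition_by_predicate(sample_triples).items():
    print(predicate, rows)
```

Partitioning this way turns a single triple store into many narrow relations, which would explain the large attribute count reported for this dataset.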