Repeatability - FDs and ODs

This is a repeatability page for functional dependency (FD) and order dependency (OD) discovery algorithms. You can find more repeatability pages (for other types of dependencies) here.

The algorithms are provided in the state their results have been published, but they may not represent the most recent version of their implementations. To get the more up-to-date version of the algorithms, use the binaries provided here.

FD Algorithms

The efficient discovery of functional dependencies in tables is a well-known challenge in database research and has seen several approaches.

An experimental evaluation of seven algorithms

The following seven FD algorithms have been evaluated as part of an experimental evaluation published in PVLDB Vol. 8.

Lattice traversal algorithms:
TANE	jar	code
FUN	jar	code
FD_Mine	jar	code
DFD	jar	code
Difference- and agree-set algorithms:
Dep-Miner	jar	code
FastFDs	jar	code
Dependency induction algorithms:
fdep	jar	code

All seven FD algorithms can be executed with the data profiling tool Metanome. A Metanome build in version 0.0.2 can be downloaded here.

Partial FD algorithms (also known as approximate FDs)

Pyro is a discovery algorithm for approximate (or partial) functional dependencies. A paper on this algorithm has been published at PVLDB Vol. 11.

Pyro

jar

code

Note that Pyro is also capable of discovering partial unique column combinations and that the Pyro repository additionally contains versions of Tane, Ducc/Dfd, and fdep for partial FD/UCC discovery.

FD Algorithms (approximate discovery)

The exact discovery of all functional dependencies in relational tables is a complex task and, due to this complexity, state-of-the-art algorithms are often not able to process datasets of real-world size. When relaxing the discovery requirements, i.e., completeness, minimality, and correctness of the discovered FDs, we can build much more efficient, approximate algorithms. One such algorithm is the following, which has been published at CIKM'16:

AIDFD

jar

code

OD Algorithms

The efficient discovery of order dependencies in tables is related to that of functional dependency discovery. An order dependency states that ordering a table by one set of attributes also orders the table by another set of attributes.

ORDER

jar

code

We also published a distributed discovery algorithm for bidirectional order dependencies DISTOD that you can find here.

Datasets

You can find unlinked datasets here.

All FD algorithms have been exhaustively tested on the following datasets (null = null semantics):

Name	Source	Columns	Rows	Size	FDs	ODs
iris	uci	5	150	5 KB	4
balance-scale	uci	5	625	7 KB	1
chess	uci	7	28.056	519 KB	1
abalone	uci	9	4177	187 KB	137	0
nursery	uci	9	12.960	1,024 KB	1
breast-cancer-wisconsin	uci	11	699	20 KB	46	0
bridges	uci	13	108	6 KB	142	0
echocardiogram	uci	13	132	6 KB	538	12
adult	uci	14	48.842	3,528 KB	78	0
letter	uci	16	20000	695 KB	61	0
ncvoter	alt.ncsbe.gov	19	1000	151 KB	758
ncvoter	ncsbe.gov	22	938.085	230 MB		>90
hepatitis	uci	20	155	8 KB	8250	0
horse	uci	27	300	25 KB	128726	3
fd-reduced-30	dbtesma	30	250.000	69,581 KB	89571
plista	plista	63	1000	568 KB	178152
flight	bts.gov	109	1000	575 KB	982631
flight	bts.gov	20	500.000	71 MB		>1545
uniprot	uniprot.org	223	1000	2,439 KB	unknown
TPC H lineitem	tpc.org	16	2.999.671	368 MB		1

Pyro has been evaluated with the datasets below.

Name	Source	Columns [#]	Tuples [#]	Size	Notes
School results	Africa Open Data	27	14,384	1.8 MB
Adult	uci	15	32,561	3.5 MB
Classification	SCOP	12	70,859	6.8 MB
Reflns	PDB	37	24,769	6.2 MB
Atom sites	PDB	31	32,485	4.9 MB
DB status	PDB	35	29,787	6.3 MB	original file is named PDBX_DATABASE_STATUS.csv
Entity source	PDB	46	26,139	6.3 MB	original file is named ENTITY_SRC_GEN.csv
Bio entry	BIOSQL	9	184,292	24 MB
Voter	ncsbe.gov	19	100,001	11 MB	apparently truncated version
FDR-15	dbtesma	15	250,001	35 MB	generated
FDR-30	dbtesma	30	250,001	68 MB	generated
Atom	PDB	31	160,000	29 MB
Census	CENSUS	42	199,524	107 MB
Wiki image	Wikipedia	12	77,767	98 MB	crawled
Spots	Foursquare	15	973,150	130 MB	exact origin unknown
Struct sheet	PDB	32	973,510	147 MB	file name STRUCT_SHEET_RANGE.csv
Ditag feature	ENSEMBL	13	3,960,124	341 MB

Repeatability - FDs and ODs

FD Algorithms

An experimental evaluation of seven algorithms

Partial FD algorithms (also known as approximate FDs)

FD Algorithms (approximate discovery)

OD Algorithms

Datasets

Chair

News

06.10.2024 | Paper accepted at EDBT 2025

06.09.2024 | Congratulations Dr. Phillip Wenig

06.09.2024 | Congratulations Dr. Mazhar Hameed!

16.07.2024 | Congratulations Dr. Leon Bornemann-Paulus!

23.05.2024 | Paper accepted at NLDB 2024

Project highlights

People and open positions