Hasso-Plattner-Institut
Prof. Dr. Felix Naumann
 

Repeatability - FDs and ODs

This is a repeatability page for functional dependency (FD) and order dependency (OD) discovery algorithms. You can find more repeatability pages (for other types of dependencies) here.

The algorithms are provided in the state their results have been published, but they may not represent the most recent version of their implementations. To get the more up-to-date version of the algorithms, use the binaries provided here.

FD Algorithms

The efficient discovery of functional dependencies in tables is a well-known challenge in database research and has seen several approaches.

FDHits

FDHits is a new FD discovery algorithm that models the problem as an enumeration problem of hitting sets.

A publication on this algorithm is currently under review at PVLDB Vol. 14.

You can find the source code of FDHits here and all datasets that we used for the evaluation here.

An experimental evaluation of seven algorithms

The following seven FD algorithms have been evaluated as part of an experimental evaluation published in PVLDB Vol. 8.

Lattice traversal algorithms:
TANEjarcode
FUNjarcode
FD_Minejarcode
DFDjarcode
Difference- and agree-set algorithms:
Dep-Minerjarcode
FastFDsjarcode
Dependency induction algorithms:
fdepjarcode

All seven FD algorithms can be executed with the data profiling tool Metanome. A Metanome build in version 0.0.2 can be downloaded here.

Partial FD algorithms (also known as approximate FDs)

Pyro is a discovery algorithm for approximate (or partial) functional dependencies. A paper on this algorithm has been published at PVLDB Vol. 11.

Pyrojarcode

Note that Pyro is also capable of discovering partial unique column combinations and that the Pyro repository additionally contains versions of Tane, Ducc/Dfd, and fdep for partial FD/UCC discovery.

FD Algorithms (approximate discovery)

The exact discovery of all functional dependencies in relational tables is a complex task and, due to this complexity, state-of-the-art algorithms are often not able to process datasets of real-world size. When relaxing the discovery requirements, i.e., completeness, minimality, and correctness of the discovered FDs, we can build much more efficient, approximate algorithms. One such algorithm is the following, which has been published at CIKM'16:

AIDFD        jar            code

OD Algorithms

The efficient discovery of order dependencies in tables is related to that of functional dependency discovery. An order dependency states that ordering a table by one set of attributes also orders the table by another set of attributes.

ORDER        jar            code

We also published a distributed discovery algorithm for bidirectional order dependencies DISTOD that you can find here.

Datasets

All FD algorithms have been exhaustively tested on the following datasets (null = null semantics):

NameSourceColumnsRowsSizeFDsODs
irisuci51505 KB4 
balance-scaleuci56257 KB1 
chessuci728.056519 KB1 
abaloneuci94177187 KB1370
nurseryuci912.9601,024 KB1 
breast-cancer-wisconsinuci1169920 KB460
bridgesuci131086 KB1420
echocardiogramuci131326 KB53812
adultuci1448.8423,528 KB780
letteruci1620000695 KB610
ncvoteralt.ncsbe.gov191000151 KB758 
ncvoterncsbe.gov22938.085230 MB >90
hepatitisuci201558 KB82500
horseuci2730025 KB1287263
fd-reduced-30dbtesma30250.00069,581 KB89571 
plistaplista631000568 KB178152 
flightbts.gov1091000575 KB982631 
flightbts.gov20500.00071 MB >1545
uniprotuniprot.org22310002,439 KBunknown 
lineitemtpc.org162.999.671368 MB 1

Pyro has been evaluated with the datasets below.

NameSourceColumns [#]Tuples [#]SizeNotes
School resultsAfrica Open Data2714,3841.8 MB 
Adultuci1532,5613.5 MB 
ClassificationSCOP1270,8596.8 MB 
ReflnsPDB3724,7696.2 MB 
Atom sitesPDB3132,4854.9 MB 
DB statusPDB3529,7876.3 MBoriginal file is named PDBX_DATABASE_STATUS.csv
Entity sourcePDB4626,1396.3 MBoriginal file is named ENTITY_SRC_GEN.csv
Bio entryBIOSQL9184,29224 MB 
Voterncsbe.gov19100,00111 MBapparently truncated version
FDR-15dbtesma15250,00135 MBgenerated
FDR-30dbtesma30250,00168 MBgenerated
AtomPDB31160,00029 MB 
CensusCENSUS42199,524107 MB 
Wiki imageWikipedia1277,76798 MBcrawled
SpotsFoursquare15973,150130 MBexact origin unknown
Struct sheetPDB32973,510147 MBfile name STRUCT_SHEET_RANGE.csv
Ditag featureENSEMBL133,960,124341 MB