Hasso-Plattner-Institut
Prof. Dr. Felix Naumann
  
 

Repeatability - FDs

This is a repeatability page for FD discovery algorithms. The algorithms are provided in the state their results have been published, but they may not represent the most recent version of their implementations. To get the more up-to-date version of the algorithms, use the binaries provided here.

FD Algorithms

The efficient discovery of functional dependencies in tables is a well-known challenge in database research and has seen several approaches. The following FD algorithms represent the current state-of-the-art.

Lattice traversal traversal algorithms:
TANE
jarcode
FUN
jarcode
FD_Mine
jarcode
DFDjarcode
Difference- and agree-set algorithms:
Dep-Miner
jarcode
FastFDs
jarcode
Dependency induction algorithms:
fdepjarcode

All seven FD algorithms can be executed with the data profiling tool Metanome. A Metanome build in version 0.0.2 can be downloaded here.

Partial FD algorithms (also known as approximate FDs)

Pyrojarcode

Note that Pyro is also capable of discovering partial unique column combinations and that the Pyro repository additionally contains versions of Tane, Ducc/Dfd, and fdep for partial FD/UCC discovery.

OD Algorithms

The efficient discovery of order dependencies in tables is related to that of functional dependency discovery. An order dependency states that ordering a table by one set of attributes also orders the table by another set of attributes.

ORDER
        jar            code

Datasets

All FD algorithms have been exhaustively tested on the following datasets:

Name Source Columns Rows Size FDs ODs
iris uci 5 150 5 KB 4  
balance-scale uci 5 625 7 KB 1  
chess uci 7 28.056 519 KB 1  
abalone uci 9 4177 187 KB 137 0
nursery uci 9 12.960 1,024 KB 1  
breast-cancer-wisconsin uci 11 699 20 KB 46 0
bridges uci 13 108 6 KB 142 0
echocardiogram uci 13 132 6 KB 538 12
adult uci 14 48.842 3,528 KB 78 0
letter uci 16 20000 695 KB 61 0
ncvoter alt.ncsbe.gov 19 1000 151 KB 758  
ncvoter ncsbe.gov 22 938.085 230 MB   >90
hepatitis uci 20 155 8 KB 8250 0
horse uci 27 300 25 KB 128726 3
fd-reduced-30 dbtesma 30 250.000 69,581 KB 89571  
plista plista 63 1000 568 KB 178152  
flight bts.gov 109 1000 575 KB 982631  
flight bts.gov 20 500.000 71 MB   >1545
uniprot uniprot.org 223 1000 2,439 KB unknown  
lineitem tpc.org 16 2.999.671 368 MB   1

Pyro has been evaluated with the following datasets. They are available on request (contact Sebastian Kruse).

 

Name Source Columns [#] Tuples [#] Size Notes
School results Africa Open Data 27 14,384 1.8 MB  
Adult uci 15 32,561 3.5 MB  
Classification SCOP 12 70,859 6.8 MB  
Reflns PDB 37 24,769 6.2 MB  
Atom sites PDB 31 32,485 4.9 MB  
DB status PDB 35 29,787 6.3 MB original file is named PDBX_DATABASE_STATUS.csv
Entity source PDB 46 26,139 6.3 MB original file is named ENTITY_SRC_GEN.csv
Bio entry BIOSQL 9 184,292 24 MB  
Voter ncsbe.gov 19 100,001 11 MB apparently truncated version
FDR-15 dbtesma 15 250,001 35 MB generated
FDR-30 dbtesma 30 250,001 68 MB generated
Atom PDB 31 160,000 29 MB  
Census CENSUS 42 199,524 107 MB  
Wiki image Wikipedia 12 77,767 98 MB crawled
Spots Foursquare 15 973,150 130 MB exact origin unknown
Struct sheet PDB 32 973,510 147 MB file name STRUCT_SHEET_RANGE.csv
Ditag feature ENSEMBL 13 3,960,124 341 MB