Repeatability - FDs and ODs

This is a repeatability page for functional dependency (FD) and order dependency (OD) discovery algorithms. You can find more repeatability pages (for other types of dependencies) here.

The algorithms are provided in the state their results have been published, but they may not represent the most recent version of their implementations. To get the more up-to-date version of the algorithms, use the binaries provided here.

FD Algorithms

The efficient discovery of functional dependencies in tables is a well-known challenge in database research and has seen several approaches.

An experimental evaluation of seven algorithms

The following seven FD algorithms have been evaluated as part of an experimental evaluation published in PVLDB Vol. 8.

code

All seven FD algorithms can be executed with the data profiling tool Metanome. A Metanome build in version 0.0.2 can be downloaded here.

Discovering Functional Dependencies through Hitting Set Enumeration

Here you can find the source code of the FDHits algorithm published as part of the Proceedings of the ACM on Management of Data (PACMMOD) Volume 2, Issue 1: Code

Partial FD algorithms (also known as approximate FDs)

Pyro is a discovery algorithm for approximate (or partial) functional dependencies. A paper on this algorithm has been published at PVLDB Vol. 11.

code

Note that Pyro is also capable of discovering partial unique column combinations and that the Pyro repository additionally contains versions of Tane, Ducc/Dfd, and fdep for partial FD/UCC discovery.

Incremental FD Discovery

Functional dependencies (FDs) support various tasks for the man agement of relational data, such as schema normalization, data cleaning, and query optimization. However, while existing FD discovery algorithms regard only static datasets, many real-world datasets are constantly changing – and with them their FDs. Unfortunately, the computational hardness of FD discovery prohibits a continuous re-execution of those existing algorithms with every change of the data.

To this end, we propose DynFD, the first algorithm to dis cover and maintain functional dependencies in dynamic datasets. Whenever the inspected dataset changes, DynFD evolves its FDs rather than recalculating them. For this to work efficiently, we propose indexed data structures along with novel and efficient update operations. Our experiments compare DynFD’s incremen tal mode of operation to the repeated re-execution of existing, static algorithms. They show that DynFD can maintain the FDs of dynamic datasets over an order of magnitude faster than its static counter-parts.

Code: https://github.com/HPI-Information-Systems/dynfd

Paper: DynFD: Functional Dependency Discovery in Dynamic Datasets (EDBT 2019)

FD Algorithms (approximate discovery)

The exact discovery of all functional dependencies in relational tables is a complex task and, due to this complexity, state-of-the-art algorithms are often not able to process datasets of real-world size. When relaxing the discovery requirements, i.e., completeness, minimality, and correctness of the discovered FDs, we can build much more efficient, approximate algorithms. One such algorithm is the following, which has been published at CIKM'16:

AIDFD         jar             code

OD Algorithms

The efficient discovery of order dependencies in tables is related to that of functional dependency discovery. An order dependency states that ordering a table by one set of attributes also orders the table by another set of attributes.

ORDER         jar             code

We also published a distributed discovery algorithm for bidirectional order dependencies DISTOD that you can find here.

Datasets

You can find unlinked datasets here.

All FD algorithms have been exhaustively tested on the following datasets (null = null semantics):

1

Pyro has been evaluated with the datasets below.

ENSEMBL
13 3,960,124 341 MB