Algorithms

Metanome Tool and Profiling Algorithms

The Metanome profiling tool is a framework for various profiling algorithms. It handles both algorithms and datasets as external resources, which is why the tool itself contains no profiling algorithms. Treating algorithms as external resources is a design decision that allows researchers to contribute profiling functionality without changing the tool. This makes Metanome an open profiling platform for both algorithm engineers and data scientists. The following image depicts the architecture of the profiling tool:

Metanome Tool

The newest version of the Metanome profiling tool is always available on GitHub. If you cannot build the sources, you can also download the following (less up-to-date) Metanome binaries:

Metanome Profiling Tool (v0.0.2, v1.0, v1.1, v1.2)

Metanome CLI

Metanome CLI Tool (v1.2)

See here for more information on Metanome CLI.

Metanome Datasets

To load a dataset into the Metanome tool, place it in the folder /WEB-INF/classes/inputData. It then appears in the import list of the frontend. Datasets must be relational and stored in a delimiter-separated format such as CSV or TSV. The separator and quote characters can be set in the frontend when importing each dataset. As an alternative to file imports, a database connection can be specified in the frontend.

We provide some test datasets on our repeatability page.
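Before importing a file, it can help to confirm that it parses as a relation with the separator and quote characters you intend to enter in the frontend. The following is a minimal sketch using Python's standard csv module; it is not Metanome's own parser. The function name read_relation, the sample data, and the assumption that the first row is a header are all hypothetical.

```python
import csv
import io

def read_relation(text, separator=",", quote_char='"'):
    """Parse delimited text into a header and a list of rows (hypothetical helper)."""
    reader = csv.reader(io.StringIO(text), delimiter=separator, quotechar=quote_char)
    rows = list(reader)
    if not rows:
        raise ValueError("input is empty")
    # Assumption: the first row holds the column names.
    header, records = rows[0], rows[1:]
    # A relational input should have the same number of fields in every row.
    for line_no, record in enumerate(records, start=2):
        if len(record) != len(header):
            raise ValueError(
                f"row {line_no} has {len(record)} fields, expected {len(header)}"
            )
    return header, records

# Semicolon-separated sample whose quoted values contain the separator.
sample = "id;name\n1;'Smith; John'\n2;'Doe; Jane'\n"
header, records = read_relation(sample, separator=";", quote_char="'")
print(header)
print(records)
```

If the wrong separator or quote character is chosen, the field counts usually stop matching and the check raises an error, which is the kind of mismatch worth catching before an import.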

Metanome Algorithms

In the context of the Metanome data profiling project, we developed and re-implemented the following profiling algorithms. To run a profiling algorithm, place the corresponding jar file into the folder /WEB-INF/classes/algorithms and register it in the Metanome frontend. If you want to write your own profiling algorithm for the Metanome tool, we recommend starting from this Skeleton Project.

The source code for these algorithms is also available on GitHub.

Unique Column Combination (Key Discovery)

HyUCC (v1.1, v1.2)
DUCC (v0.0.2, v1.0, v1.1, v1.2)
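
To illustrate what these algorithms search for: a unique column combination is a set of columns whose combined values contain no duplicate rows, and it is minimal if no proper subset is also unique. The following brute-force sketch enumerates minimal combinations on a tiny in-memory table. It is not how HyUCC or DUCC work; the function names and sample rows are hypothetical, and the exhaustive search is only feasible for a handful of columns.

```python
from itertools import combinations

def is_unique(rows, columns):
    """Return True if no two rows agree on all of the given column indexes."""
    seen = set()
    for row in rows:
        key = tuple(row[c] for c in columns)
        if key in seen:
            return False
        seen.add(key)
    return True

def minimal_uccs(rows, num_columns):
    """Enumerate minimal unique column combinations by exhaustive search."""
    found = []
    for size in range(1, num_columns + 1):
        for combo in combinations(range(num_columns), size):
            # Supersets of a known UCC are unique but not minimal: skip them.
            if any(set(ucc) <= set(combo) for ucc in found):
                continue
            if is_unique(rows, combo):
                found.append(combo)
    return found

rows = [
    ("Alice", "Berlin", "A1"),
    ("Alice", "Potsdam", "A2"),
    ("Bob", "Berlin", "A3"),
]
print(minimal_uccs(rows, 3))  # [(2,), (0, 1)]
```

Column 2 is unique on its own, and columns 0 and 1 are unique only in combination, so both are reported as key candidates.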

Inclusion Dependency (Foreign-Key Discovery) repeatability page

BINDER (v0.0.2, v1.0, v1.1, v1.2)
BINDER Database (v0.0.2, v1.0, v1.1, v1.2)
SPIDER (v0.0.2, v1.0, v1.1, v1.2)
SPIDER Database (v0.0.2, v1.0, v1.1, v1.2)
MANY (v0.0.2, v1.0, v1.1, v1.2)
FAIDA (v0.0.2, v1.0, v1.1, v1.2) (approximate)
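
As an illustration of the problem these algorithms solve: a unary inclusion dependency A ⊆ B holds when every value of column A also occurs in column B, which makes A a foreign-key candidate referencing B. The sketch below compares all column pairs with plain set containment. It is a naive baseline, not the approach of BINDER, SPIDER, or the other listed algorithms. The function name, the table layout, and the choice to ignore None values are assumptions made for this example.

```python
def unary_inds(tables):
    """Find all unary INDs among the columns of the given tables.

    tables maps a table name to a dict of column name -> list of values.
    Returns a list of (dependent, referenced) pairs of (table, column).
    """
    # Assumption: None stands for a missing value and is ignored.
    value_sets = {
        (table, column): {v for v in values if v is not None}
        for table, columns in tables.items()
        for column, values in columns.items()
    }
    result = []
    for dependent, dependent_values in value_sets.items():
        for referenced, referenced_values in value_sets.items():
            if dependent != referenced and dependent_values <= referenced_values:
                result.append((dependent, referenced))
    return result

tables = {
    "orders": {"customer_id": [1, 2, 2, None]},
    "customers": {"id": [1, 2, 3]},
}
print(unary_inds(tables))
```

Here orders.customer_id is included in customers.id but not the other way around, so only that one direction is reported.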

Functional Dependencies (Normalization) repeatability page

HyFD (v1.1, v1.2)
DFD (v0.0.2, v1.0, v1.1, v1.2)
Tane (v0.0.2, v1.0, v1.1, v1.2)
Fun (v0.0.2, v1.0, v1.1, v1.2)
fdep (v0.0.2, v1.0, v1.1, v1.2)
FastFDs (v0.0.2, v1.0, v1.1, v1.2)
FdMine (v0.0.2, v1.0, v1.1, v1.2)
DepMiner (v0.0.2, v1.0, v1.1, v1.2)
AIDFD (v0.0.2, v1.0, v1.1, v1.2) (approximate)
CFDFinder (v1.1, v1.2) (conditional)
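
For orientation: a functional dependency X -> A holds when any two rows that agree on the columns X also agree on column A. The sketch below validates a single candidate dependency on an in-memory table. It shows only the validation step, not the candidate search that algorithms such as HyFD or Tane perform, and the function name and sample rows are hypothetical.

```python
def fd_holds(rows, lhs, rhs):
    """Check whether the columns in lhs functionally determine column rhs."""
    mapping = {}
    for row in rows:
        key = tuple(row[c] for c in lhs)
        value = row[rhs]
        # The first row with this key fixes the expected rhs value.
        if mapping.setdefault(key, value) != value:
            return False
    return True

rows = [
    ("14482", "Potsdam", "Alice"),
    ("14482", "Potsdam", "Bob"),
    ("10115", "Berlin", "Carol"),
]
print(fd_holds(rows, [0], 1))  # True: column 0 determines column 1
print(fd_holds(rows, [1], 2))  # False: "Potsdam" maps to two different values
```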

Matching Dependencies (Data Cleaning) repeatability page

HyMD (v1.1, v1.2)

Multivalued Dependencies (Normalization)

MvDDetector (v1.1, v1.2)

Order Dependencies (Data Ordering) repeatability page

ORDER (v0.0.2, v1.0, v1.1, v1.2)

Denial Constraints (Data Cleaning) repeatability page

Hydra (v1.1, v1.2)
DCFinder (v1.2)

Complement Dependencies (Data Cleaning)

Cody (v1.2)

Basic Statistics (Data Exploration)

SCDP (v0.0.2, v1.0, v1.1, v1.2)
Tireless (v1.1) (Inferring regular expressions)

Cardinality Estimation (Zeroth-frequency moment of dataset) repeatability page

FM (v1.1, v1.2)
PCSA (v1.1, v1.2)
LC (v1.1, v1.2)
AMS (v1.1, v1.2)
BJKST (v1.1, v1.2)
LogLog (v1.1, v1.2)
SuperLogLog (v1.1, v1.2)
MinCount (v1.1, v1.2)
AKMV (v1.1, v1.2)
HyperLogLog (v1.1, v1.2)
Bloom filter (v1.1, v1.2)
HyperLogLog++ (v1.1, v1.2)
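
The zeroth frequency moment of a dataset is its number of distinct values, and the estimators above approximate it with little memory. As a rough illustration of the idea, the following sketch follows the textbook linear-counting scheme: hash each value to a position in a bitmap, then estimate the cardinality from the fraction of bits that stay unset. It is not the code of the LC jar listed above; the function name, the bitmap size of 4096, and the use of SHA-256 as the hash function are assumptions chosen for this example.

```python
import hashlib
import math

def linear_counting(values, num_bits=4096):
    """Estimate the number of distinct values with a linear-counting bitmap."""
    bitmap = [False] * num_bits
    for value in values:
        # Assumption: SHA-256 over the string form serves as a uniform hash.
        digest = hashlib.sha256(str(value).encode("utf-8")).digest()
        position = int.from_bytes(digest[:8], "big") % num_bits
        bitmap[position] = True
    zero_bits = bitmap.count(False)
    if zero_bits == 0:
        raise ValueError("bitmap is full; use a larger num_bits")
    # Estimate: n is approximately -m * ln(V), with V the fraction of unset bits.
    return -num_bits * math.log(zero_bits / num_bits)

# 10,000 values, of which exactly 1,000 are distinct.
values = [i % 1000 for i in range(10_000)]
estimate = linear_counting(values)
print(round(estimate))
```

Duplicates set the same bit again and therefore do not change the estimate. The bitmap must be sized generously relative to the expected cardinality, because the estimate is undefined once every bit is set.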

Schema Normalization

Normalize (v1.1, v1.2) (Boyce-Codd Normal Form)

Further Profiling Algorithms

The following list points to data profiling algorithms that were also developed in the context of the Metanome project but do not support the Metanome algorithm interface:

Inclusion Dependencies

https://github.com/srfc/raida (distributed)
https://github.com/alpreu/spin (distributed on streaming data)

Functional Dependencies

https://github.com/HPI-Information-Systems/dynfd (on dynamic data)

Bidirectional Order Dependencies

DISTOD (Sources and Releases on GitHub) (distributed, discovers ODs in set-based form) repeatability page

Key and Foreign Key Discovery with HoPF

https://github.com/lanchiang/hopf