Enumeration in Data Profiling

This is a platform for the research project on enumeration problems in data profiling of the HPI Algorithm Engineering group lead by Tobias Friedrich. The research is conducted in close collaboration with our colleagues at the neighboring Information Systems group of Felix Naumann as well as with Kitty Meeks at the University of Glasgow.

Description

Data profiling is the gathering of metadata from databases. In relational data dependencies among different attributes are of particular importance, e.g., unique column combinations (UCCs, a.k.a. candidate keys), functional dependencies, inclusion dependencies, and denial constraints. Usually it is not enough to detect the existence of a single such dependency in a database, instead one is interested in a comprehensive list of all occurrences. This naturally leads to several interesting enumeration problems.

Many discovery algorithms in data profiling use a reduction to the hitting set problem in hypergraphs. In the case of UCCs, we showed that their enumeration is in fact equivalent to the famous transversal hypergraph problem, i.e., computing all minimal hitting sets. While the computational complexity of the transversal hypergraph problem is a major open question, this equivalence opens intriguing new perspectives and in turn facilitates new algorithms for data profiling.

The results of this project are continuously published at the leading scientific conferences, including at IPEC 2016, ALENEX 2019, VLDB 2020, and ESA 2020. Additionally, the late-breaking developments are presented at workshops like the 2018 Dagstuhl Seminar on Algorithmic Enumeration and WEPA 2019.

The project includes research conducted together with our students. Julius Lischeid wrote his bachelor's thesis on Lexicographic Enumeration of Hitting Sets in Hypergraphs and Benjamin Feldmann wrote his master's thesis on Distributed Unique Column Combinations Discovery. Johann Birnick contributed to the development and implementation of the HPIValid algorithm as part of this project.

HPIValid

The most recent algorithm developed in this project is HPIValid, it enumerates all inclusion-wise minimal unique column combinations of a given database. The code of a C++ implementation can be found in this GitLab repository. A ZIP-file with the data used for testing can be downloaded here (3.5GB).

Videos

The following videos highlight some of the results obtained in this project, which were presented at VLDB 2020 as well as ESA 2020.

_{Video credits: HPI/respective speakers}

Publications

The project has resulted in the following scientific publications.

[ 2022 ] [ 2020 ] [ 2019 ] [ 2016 ]

Clean Citation Style 005

{ "authors" : []}

2022 [ nach oben ]

Bläsius, Thomas; Friedrich, Tobias; Lischeid, Julius; Meeks, Kitty; Schirneck, Martin Efficiently Enumerating Hitting Sets of Hypergraphs Arising in Data ProfilingJournal of Computer and System Sciences 2022: 192–213

[ Abstract ] [ BibTeX ] [ URL ] [ DOI ] [ Download ]

Bläsius, Thomas; Friedrich, Tobias; Schirneck, Martin The Complexity of Dependency Detection and Discovery in Relational DatabasesTheoretical Computer Science 2022: 79–96

[ Abstract ] [ BibTeX ] [ URL ] [ DOI ] [ Download ]

2020 [ nach oben ]

Birnick, Johann; Bläsius, Thomas; Friedrich, Tobias; Naumann, Felix; Papenbrock, Thorsten; Schirneck, Martin Hitting Set Enumeration with Partial Information for Unique Column Combination DiscoveryProceedings of the VLDB Endowment 2020: 2270–2283

[ Abstract ] [ BibTeX ] [ URL ] [ DOI ] [ Download ]

Bläsius, Thomas; Friedrich, Tobias; Schirneck, Martin The Minimization of Random HypergraphsEuropean Symposium on Algorithms (ESA) 2020: 21:1–21:15

[ Abstract ] [ BibTeX ] [ URL ] [ DOI ] [ Download ]

@inproceedings{Blaesius20MinimizationESA,
  abstract = {We investigate the maximum-entropy model \(\mathcal{B}_{n,m,p}\) for random \(n\)-vertex, \(m\)-edge multi-hypergraphs with expected edge size \(pn\). We show that the expected size of the minimization \(\min(\mathcal{B}_{n,m,p})\), i.e., the number of inclusion-wise minimal edges of \(\mathcal{B}_{n,m,p}\), undergoes a phase transition with respect to \(m\). If \(m\) is at most \(1/(1-p)^{(1-p)n}\), then \(\mathrm{E}[|\min(\mathcal{B}_{n,m,p})|]\) is of order \(\Theta(m)\), while for \(m \ge 1/(1-p)^{(1-p+\varepsilon)n}\) for any \(\varepsilon > 0\), it is \(\Theta( 2^{(\mathrm{H}(\alpha) + (1-\alpha) \log_2 p) n}/ \sqrt{n})\). Here, \(\mathrm{H}\) denotes the binary entropy function and \(\alpha = - (\log_{1-p} m)/n\). The result implies that the maximum expected number of minimal edges over all \(m\) is \(\Theta((1+p)^n/\sqrt{n})\). Our structural findings have algorithmic implications for minimizing an input hypergraph. This has applications in the profiling of relational databases as well as for the Orthogonal Vectors problem studied in fine-grained complexity. We make several technical contributions that are of independent interest in probability. First, we improve the Chernoff&ndash;Hoeffding theorem on the tail of the binomial distribution. In detail, we show that for a binomial variable \(Y \sim \mathrm{Bin}(n,p)\) and any \(0 < x < p\), it holds that \(\mathrm{P}[Y \le xn] = \Theta( 2^{-\!\mathrm{D}(x \,{\|}\, p) n}/\sqrt{n})\), where \(\mathrm{D}\) is the binary Kullback&ndash;Leibler divergence between Bernoulli distributions. We give explicit upper and lower bounds on the constants hidden in the big-O notation that hold for all \(n\). Secondly, we establish the fact that the probability of a set of cardinality \(i\) being minimal after \(m\) i.i.d. maximum-entropy trials exhibits a sharp threshold behavior at \(i^* = n + \log_{1-p} m\).},
  author = {Bläsius, Thomas and Friedrich, Tobias and Schirneck, Martin},
  booktitle = {European Symposium on Algorithms (ESA)},
  keywords = {esa thomasblaesius tobiasfriedrich martinschirneck year2020},
  pages = {21:1-21:15},
  title = {The Minimization of Random Hypergraphs},
  year = 2020
}

2019 [ nach oben ]

Bläsius, Thomas; Friedrich, Tobias; Lischeid, Julius; Meeks, Kitty; Schirneck, Martin Efficiently Enumerating Hitting Sets of Hypergraphs Arising in Data ProfilingAlgorithm Engineering and Experiments (ALENEX) 2019: 130–143

[ Abstract ] [ BibTeX ] [ URL ] [ DOI ] [ Download ]

2016 [ nach oben ]

Bläsius, Thomas; Friedrich, Tobias; Schirneck, Martin The Parameterized Complexity of Dependency Detection in Relational DatabasesInternational Symposium on Parameterized and Exact Computation (IPEC) 2016: 6:1–6:13

[ Abstract ] [ BibTeX ] [ URL ] [ DOI ] [ Download ]

Additional full versions are available as a preprint on ArXiv.

[ 2019 ]

Clean Citation Style 005

{ "authors" : []}

2019 [ nach oben ]

Bläsius, Thomas; Friedrich, Tobias; Schirneck, Martin The Minimization of Random HypergraphsCoRR 2019

ArXiv preprint

[ Abstract ] [ BibTeX ] [ URL ] [ Download ]

Code

We provide C++ implementation of two UCC enumeration algorithms. Special thanks go to Julius Lischeid for authoring and maintaining the code for the ALENEX paper.

ALENEX 2019: an oracle-based enumeration algorithm, hosted on GitHub.
VLDB 2020: HPIValid, hosted on our GitLab.

Data

We provide the databases used for testing and in some cases the corresponding hypergraphs of minimal difference sets and UCCs.

ALENEX 2019: full testing data including database snippets and hypergraphs - link (ZIP, 44.7MB)
other snippets, the corresponding hypergraphs, new datasets - link (ZIP, 39.1MB)
VLDB 2020: raw databases for testing HPIValid, without the hypergraphs - link (ZIP, 3.5GB)

Additional Resources

This survey article gives an overview of problems and techniques in data profiling.
The slides of the talk at the Dagstuhl seminar on Algorithmic Enumeration can be found here.
The Information Systems group maintains another GitHub repository containing different data profiling tools, including their Metanome suite.