Data profiling is the gathering of metadata from databases. In relational data, dependencies among different attributes are of particular importance, e.g., unique column combinations (UCCs, a.k.a. candidate keys), functional dependencies, inclusion dependencies, and denial constraints. Usually, it is not enough to detect the existence of a single such dependency in a database; instead, one is interested in a comprehensive list of all occurrences. This naturally leads to several interesting enumeration problems.
Many discovery algorithms in data profiling use a reduction to the hitting set problem in hypergraphs. In the case of UCCs, we showed that their enumeration is in fact equivalent to the famous transversal hypergraph problem, i.e., computing all minimal hitting sets. While the computational complexity of the transversal hypergraph problem is a major open question, this equivalence opens intriguing new perspectives and in turn facilitates new algorithms for data profiling.
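To illustrate the connection between UCCs and hitting sets, here is a minimal Python sketch. It assumes the standard construction in which every pair of rows contributes one hyperedge containing the columns where the two rows differ; a column set is then unique exactly when it hits every such edge. The function names (difference_sets, minimal_hitting_sets, minimal_uccs) are hypothetical, and the brute-force enumeration is for illustration only. It is not HPIValid or any other algorithm from this project.

```python
from itertools import combinations


def difference_sets(rows):
    """Build one hyperedge per pair of rows: the column indices where they differ."""
    edges = set()
    for r, s in combinations(rows, 2):
        edges.add(frozenset(i for i, (a, b) in enumerate(zip(r, s)) if a != b))
    return edges


def minimal_hitting_sets(edges, num_columns):
    """Enumerate minimal hitting sets by brute force, smallest candidates first.

    Exponential in the number of columns; meant only to make the reduction concrete.
    """
    found = []
    for size in range(num_columns + 1):
        for candidate in combinations(range(num_columns), size):
            c = frozenset(candidate)
            # A superset of a known hitting set cannot be minimal.
            if any(h <= c for h in found):
                continue
            # c is a hitting set if it intersects every hyperedge.
            if all(c & e for e in edges):
                found.append(c)
    return found


def minimal_uccs(rows):
    """Minimal UCCs of a table are the minimal hitting sets of its difference sets."""
    num_columns = len(rows[0]) if rows else 0
    return minimal_hitting_sets(difference_sets(rows), num_columns)


# Made-up example table with columns 0 = name, 1 = city, 2 = age.
table = [
    ("Alice", "Berlin", 30),
    ("Bob", "Berlin", 25),
    ("Alice", "Potsdam", 25),
]
uccs = minimal_uccs(table)
```

In this example no single column is unique, but every pair of columns is, so the sketch reports three minimal UCCs. If the table contains two identical rows, their difference set is empty and cannot be hit, which matches the fact that such a table has no UCC at all.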
The results of this project are regularly published at leading scientific conferences, including IPEC 2016, ALENEX 2019, VLDB 2020, and ESA 2020. Additionally, late-breaking developments are presented at workshops such as the 2018 Dagstuhl Seminar on Algorithmic Enumeration and WEPA 2019.
The project includes research conducted together with our students. Julius Lischeid wrote his bachelor's thesis on "Lexicographic Enumeration of Hitting Sets in Hypergraphs" and Benjamin Feldmann wrote his master's thesis on "Distributed Unique Column Combinations Discovery". Johann Birnick contributed to the development and implementation of the HPIValid algorithm as part of this project.