Hasso-Plattner-Institut
Prof. Dr. Tobias Friedrich
  
 

Enumeration in Data Profiling

This site provides a platform for the research project on enumeration problems in data profiling of the HPI's Algorithm Engineering group lead by Tobias Friedrich. The research is conducted in close collaboration with our colleagues at the neighboring Information Systems group of Felix Naumann as well as Kitty Meeks at the University of Glasgow.

Description

Data profiling is the gathering of metadata from databases. For relational data, dependencies among attributes are of particular importance, including unique column combinations (UCC, also called candidate keys), functional dependencies and inclusion dependencies. Usually it is not enough to detect the existence of a single such dependency in a database, instead one is interested in a comprehensive list of all (minimal/maximal) dependencies. This leads to several interesting enumeration problems.

In the case of UCCs, we showed that their enumeration is equivalent to the famous transversal hypergraph problem, i.e., computing all hitting sets of a hypergraph. While the computational complexity of the transversal hypergraph problem is a major open question, this equivalence allows for interesting new perspectives and, in turn, new algorithms for data profiling.

This project includes research conducted together with our students. For example, Julius Lischeid wrote his bachelor's thesis on Lexicographic Enumeration of Hitting Sets in Hypergraphs as part of this project.

Publications

So far, the project resulted in the following scientific publications.

[ 2019 ] [ 2016 ]

2019 [ to top ]

  • Efficiently Enumerating H... - Download
    Bläsius, Thomas; Friedrich, Tobias; Lischeid, Julius; Meeks, Kitty; Schirneck, Martin Efficiently Enumerating Hitting Sets of Hypergraphs Arising in Data Profiling. Algorithm Engineering and Experiments (ALENEX) 2019
     

2016 [ to top ]

  • The Parameterized Complex... - Download
    Bläsius, Thomas; Friedrich, Tobias; Schirneck, Martin The Parameterized Complexity of Dependency Detection in Relational Databases. International Symposium on Parameterized and Exact Computation (IPEC) 2016: 6:1-6:13
     

Codebase

A C++ implementation of the UCC enumeration algorithm presented in the paper Efficiently Enumerating Hitting Sets of Hypergraphs Arising in Data Profiling is avaiable via GitHub. Special thanks go to Julius Lischeid for authoring and maintaining the code.

Additional Resources

  • This survey article by Abedjan, Golab, and Naumann gives an overview of problems and techniques in data profiling.
  • Some results of this project were presented at the Dagstuhl seminar on Algorithmic Enumeration in October 2018. The slides of the talk can be found here.
  • The Information Systems group maintains another GitHub repository containing different data profiling tools, including their Metanome suite.