Hasso-Plattner-Institut
Prof. Dr. Tobias Friedrich
  
 

Enumeration in Data Profiling

This is a platform for the research project on enumeration problems in data profiling of the HPI's Algorithm Engineering group lead by Tobias Friedrich. The research is conducted in close collaboration with our colleagues at the neighboring Information Systems group of Felix Naumann as well as with Kitty Meeks at the University of Glasgow.

Description

Data profiling is the gathering of metadata from databases. In relational data dependencies among different attributes are of particular importance, e.g., unique column combinations (UCCs, a.k.a. candidate keys), functional dependencies, and inclusion dependencies. Usually it is not enough to detect the existence of a single such dependency in a database, instead one is interested in a comprehensive list of all occurrences. This naturally leads to several interesting enumeration problems.

In the case of UCCs, we showed that their enumeration is equivalent to the famous transversal hypergraph problem, i.e., computing all hitting sets of a hypergraph. While the computational complexity of the transversal hypergraph problem is a major open question, this equivalence opened intriguing new perspectives and in turn facilitated new algorithms for data profiling.

The results of this project are continuously published at the leading scientific conferences, including at IPEC 2016 and ALENEX 2019. Additionally, the late-breaking developments are presented at workshops like the 2018 Dagstuhl Seminar on Algorithmic Enumeration and WEPA 2019.

The project includes research conducted together with our students. Julius Lischeid wrote his bachelor's thesis on Lexicographic Enumeration of Hitting Sets in Hypergraphs as part of this project. Johann Birnick developed and implemented a sampling-and-restart enumeration algorithm.

Publications

The project has resulted in the following scientific publications.

[ 2019 ] [ 2016 ]

2019 [ to top ]

  • Efficiently Enumerating H... - Download
    Bläsius, Thomas; Friedrich, Tobias; Lischeid, Julius; Meeks, Kitty; Schirneck, Martin Efficiently Enumerating Hitting Sets of Hypergraphs Arising in Data Profiling. Algorithm Engineering and Experiments (ALENEX) 2019: 130-143
     

2016 [ to top ]

  • The Parameterized Complex... - Download
    Bläsius, Thomas; Friedrich, Tobias; Schirneck, Martin The Parameterized Complexity of Dependency Detection in Relational Databases. International Symposium on Parameterized and Exact Computation (IPEC) 2016: 6:1-6:13
     

Code

A C++ implementation of the UCC enumeration algorithm described in the above ALENEX paper is avaiable at GitHub. Special thanks go to Julius Lischeid for authoring and maintaining the code.

Data

Some hypergraphs together with their respective databases can be found here (ZIP, 39.1MB). The data samples with 100,000 records were used for testing in the paper Efficiently Enumerating Hitting Sets of Hypergraphs Arising in Data Profiling.

Additional Resources

People

Supervisor

Algorithm Engineering Group
Hasso Plattner Institute
Potsdam, Germany

Supervisor

Information Systems Group
Hasso Plattner Institute
Potsdam, Germany

Researcher

Algorithm Engineering Group
Hasso Plattner Institute
Potsdam, Germany

Researcher

School of Computing Science
University of Glasgow
Glasgow, United Kingdom

Researcher

Information Systems Group
Hasso Plattner Institute
Potsdam, Germany

Researcher

Computer Science Department
ETH Zurich
Zurich, Switzerland

Benjamin Feldmann

Researcher

Information Systems Group
Hasso Plattner Institute
Potsdam, Germany

Researcher

Department of Computer Science and Technology
University of Cambridge
Cambridge, United Kingdom

Researcher

Algorithm Engineering Group
Hasso Plattner Institute
Potsdam, Germany