We are happy to announce that our PVLDB paper “Hitting Set Enumeration with Partial Information for Unique Column Combination Discovery” has been accepted. The research will be presented at VLDB 2020. More information about the conference is available at www.vldb2020.org.
Authors: Johann Birnick, Thomas Bläsius, Tobias Friedrich, Felix Naumann, Thorsten Papenbrock and Martin Schirneck
A short abstract follows:
Unique column combinations (UCCs) are a fundamental concept in relational databases. They identify entities in the data and support various data management activities. Still, UCCs are usually not explicitly defined and need to be discovered. State-of-the-art data profiling algorithms are able to efficiently discover UCCs in moderately sized datasets, but they tend to fail on large and, in particular, on wide datasets due to run time and memory limitations. In this paper, we introduce HPIValid, a novel UCC discovery algorithm that implements a faster and more resource-saving search strategy. HPIValid models the metadata discovery as a hitting set enumeration problem in hypergraphs. In this way, it combines efficient discovery techniques from data profiling research with the most recent theoretical insights into enumeration algorithms. Our evaluation shows that HPIValid is not only orders of magnitude faster than related work, it also has a much smaller memory footprint.
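To make the link between UCCs and hitting sets concrete, here is a minimal brute-force sketch in Python. It is not HPIValid and makes no attempt at the efficiency the abstract describes; it only illustrates the underlying correspondence: for every pair of rows, collect the set of columns in which the two rows differ (a hyperedge), and a column combination is unique exactly when it intersects every such set. The function names and the toy table are hypothetical and chosen purely for illustration.

```python
from itertools import combinations


def difference_sets(rows):
    """Return, for every pair of rows, the set of column indices where they differ."""
    diffs = set()
    for r, s in combinations(rows, 2):
        diffs.add(frozenset(i for i, (a, b) in enumerate(zip(r, s)) if a != b))
    return diffs


def minimal_hitting_sets(edges, n_cols):
    """Naive enumeration (illustration only): try column subsets by increasing
    size and keep those that hit every edge and contain no smaller solution."""
    found = []
    for k in range(n_cols + 1):
        for cand in combinations(range(n_cols), k):
            c = frozenset(cand)
            if any(h <= c for h in found):
                continue  # a subset is already a solution, so c is not minimal
            if all(c & e for e in edges):
                found.append(c)
    return found


def minimal_uccs(rows):
    """Minimal UCCs are the minimal hitting sets of the difference sets."""
    return minimal_hitting_sets(difference_sets(rows), len(rows[0]))


# Hypothetical toy table with columns 0 = name, 1 = city, 2 = id.
rows = [
    ("Ada", "Berlin", 1),
    ("Ada", "Potsdam", 2),
    ("Bob", "Berlin", 3),
]
print(minimal_uccs(rows))
```

On this toy table, the id column alone is a minimal UCC, and so is the pair (name, city). The sketch inspects all row pairs and up to all column subsets, so it scales quadratically in the number of rows and exponentially in the number of columns, which is the kind of cost that makes wide datasets hard for naive approaches.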