Birnick, Johann; Bläsius, Thomas; Friedrich, Tobias; Naumann, Felix; Papenbrock, Thorsten; Schirneck, Martin Hitting Set Enumeration with Partial Information for Unique Column Combination DiscoveryProceedings of the VLDB Endowment 2020: 2270–2283
Unique column combinations (UCCs) are a fundamental concept in relational databases. They identify entities in the data and support various data management activities. Still, UCCs are usually not explicitly defined and need to be discovered. State-of-the-art data profiling algorithms are able to efficiently discover UCCs in moderately sized datasets, but they tend to fail on large and, in particular, on wide datasets due to run time and memory limitations. In this paper, we introduce HPIValid, a novel UCC discovery algorithm that implements a faster and more resource-saving search strategy. HPIValid models the metadata discovery as a hitting set enumeration problem in hypergraphs. In this way, it combines efficient discovery techniques from data profiling research with the most recent theoretical insights into enumeration algorithms. Our evaluation shows that HPIValid is not only orders of magnitude faster than related work, it also has a much smaller memory footprint.
Bläsius, Thomas; Friedrich, Tobias; Schirneck, Martin The Minimization of Random HypergraphsEuropean Symposium on Algorithms (ESA) 2020: 21:1–21:15
We investigate the maximum-entropy model \(\mathcal{B}_{n,m,p}\) for random \(n\)-vertex, \(m\)-edge multi-hypergraphs with expected edge size \(pn\). We show that the expected size of the minimization \(\min(\mathcal{B}_{n,m,p})\), i.e., the number of inclusion-wise minimal edges of \(\mathcal{B}_{n,m,p}\), undergoes a phase transition with respect to \(m\). If \(m\) is at most \(1/(1-p)^(1-p)n}\), then \(\mathrm{E[|\min(\mathcal{B}_{n,m,p})|]\) is of order \(\Theta(m)\), while for \(m ge 1/(1-p)^(1-p+\varepsilon)n}\) for any \(\varepsilon > 0\), it is \(\Theta( 2^(\mathrm{H(\alpha) + (1-\alpha) \log_2 p) n/ \sqrt{n})\). Here, \(\mathrm{H}\) denotes the binary entropy function and \(alpha = - (\log_{1-p m)/n\). The result implies that the maximum expected number of minimal edges over all \(m\) is \(\Theta((1+p)^n/\sqrt{n})\). Our structural findings have algorithmic implications for minimizing an input hypergraph. This has applications in the profiling of relational databases as well as for the Orthogonal Vectors problem studied in fine-grained complexity. We make several technical contributions that are of independent interest in probability. First, we improve the Chernoff--Hoeffding theorem on the tail of the binomial distribution. In detail, we show that for a binomial variable \(Y sim \mathrm{Bin(n,p)\) and any \(0 < x < p\), it holds that \(\mathrm{P[Y le xn] = \Theta( 2^-\!\mathrm{D(x \,\|}\, p) n}/\sqrt{n})\), where \(\mathrm{D}\) is the binary Kullback--Leibler divergence between Bernoulli distributions. We give explicit upper and lower bounds on the constants hidden in the big-O notation that hold for all \(n\). Secondly, we establish the fact that the probability of a set of cardinality \(i\) being minimal after \(m\) i.i.d. maximum-entropy trials exhibits a sharp threshold behavior at \(i^* = n + \log_{1-p m\).