Kruse, Sebastian; Naumann, Felix
Proceedings of the VLDB Endowment
See abstract for errata
Functional dependencies (FDs) and unique column combinations (UCCs) form a valuable ingredient for many data management tasks, such as data cleaning, schema recovery, and query optimization. Because these dependencies are unknown in most scenarios, their automatic discovery has been well researched. However, existing methods mostly discover only exact dependencies, i.e., those without violations. Realworld dependencies, in contrast, are frequently approximate due to data exceptions, ambiguities, or data errors. This relaxation to approximate dependencies renders their discovery an even harder task than the already challenging exact dependency discovery. To this end, we propose the novel and highly efficient algorithm Pyro to discover both approximate FDs and approximate UCCs. Pyro combines a separate-and-conquer search strategy with sampling-based guidance that quickly detects dependency candidates and verifies them. In our broad experimental evaluation, Pyro outperforms existing discovery algorithms by a factor of up to 33, scales to larger datasets, and at the same time requires the least main memory. --------------------- Errata / Corrigendum for Efficient Discovery of Approximate Dependencies Sebastian Kruse and Felix Naumann Proceedings of the VLDB Endowment 11 (7), 759-772 Readers of the paper have pointed out a few minor errors, which we document here to ease the understanding of the algorithm. Erratum 1) In Section 5.1, the PLI for “Last name” should read 1,4, 3,5. Erratum 2) In Section 5.3, Example 4, the tuple pairs (t1, t3), (t1, t5), and (t2, t3) should yield the agree set sample AS = (, 1), (First_name, Town, 1), (ZIP, 1)). Erratum 3) In Section 5.3, the example AUCC error of the attribute combination A1...An should be 0.0099.