Hasso-Plattner-Institut
Prof. Dr. Felix Naumann
 

15.02.2018

Efficient Discovery of Approximate Dependencies

Sebastian Kruse and Felix Naumann

Our paper "Efficient Discovery of Approximate Dependencies", which introduces the Pyro algorithm, has been accepted for presentation at the 44th International Conference on Very Large Data Bases (VLDB 2018). VLDB 2018 will take place in Rio de Janeiro, Brazil, from August 27th to August 31st, 2018.

Abstract

Functional dependencies (FDs) and unique column combinations (UCCs) form a valuable ingredient for many data management tasks, such as data cleaning, schema recovery, and query optimization. Because these dependencies are unknown in most scenarios, their automatic discovery has been well researched. However, existing methods mostly discover only exact dependencies, i.e., those without violations. Real-world dependencies, in contrast, are frequently approximate due to data exceptions, ambiguities, or data errors. This relaxation to approximate dependencies renders their discovery an even harder task than the already challenging exact dependency discovery.

To this end, we propose the novel and highly efficient algorithm \pyro to discover both approximate FDs and approximate UCCs. \Pyro combines a separate-and-conquer search strategy with sampling-based guidance that quickly detects dependency candidates and verifies them. In our broad experimental evaluation, we observe that Pyro outperforms existing discovery algorithms by a factor of up to 33 and scales to larger datasets. At the same time, it requires the least main memory among the evaluated algorithms.