Ehrlich, Jens; Roick, Mandy; Schulze, Lukas; Zwiener, Jakob; Papenbrock, Thorsten; Naumann, Felix
Proceedings of the conference on Database Systems for Business, Technology, and Web (BTW)
Data profiling is the discipline of examining an unknown dataset for its structure and statistical information. It is a preprocessing step in a wide range of applications, such as data integration, data cleansing, or query optimization. For this reason, many algorithms have been proposed for the discovery of different kinds of metadata. When analyzing a dataset, these profiling algorithms are often applied in sequence, but they do not support one another, for instance, by sharing I/O cost or pruning information. We present the holistic algorithm MUDS, which jointly discovers the three most important metadata: inclusion dependencies, unique column combinations, and functional dependencies. By sharing I/O cost and data structures across the different discovery tasks, MUDS can clearly increase the efficiency of traditional sequential data profiling. The algorithm also introduces novel inter-task pruning rules that build upon different types of metadata, e.g., unique column combinations to infer functional dependencies. We evaluate MUDS in detail and compare it against the sequential execution of state-of-the-art algorithms. A comprehensive evaluation shows that our holistic algorithm outperforms the baseline by up to factor 48 on datasets with favorable pruning conditions.