40th International Conference on Very Large Data Bases (VLDB), Hangzhou, China, 1st - 5th September 2014
Scalable Discovery of Unique Column Combinations
Arvid Heise, Jorge Arnulfo Quiané-Ruiz, Ziawasch Abjedian, Anja Jentzsch, and Felix Naumann
The discovery of all unique (and non-unique) column combinations in a given dataset is at the core of any data profiling effort. The results are useful for a large number of areas of data management, such as anomaly detection, data integration, data modeling, duplicate detection, indexing, and query optimization. However, discovering all unique and non-unique column combinations is an NP-hard problem, which in principle requires to verify an exponential number of column combinations for uniqueness on all data values. Thus, achieving efficiency and scalability in this context is a tremendous challenge by itself.
In this paper, we devise Ducc, a scalable and efficient approach to the problem of finding all unique and non-unique column combinations in big datasets. We first model the problem as a graph coloring problem and analyze the pruning eect of individual combinations. We then present our hybrid column-based pruning technique, which traverses the lattice in a depth-first and random walk
combination. This strategy allows Ducc to typically depend on the solution set size and hence to prune large swaths of the lattice. Ducc also incorporates row-based pruning to run uniqueness checks in just few milliseconds. To achieve even higher scalability, Ducc runs on several CPU cores (scale-up) and compute nodes (scale-out) with a very low overhead. We exhaustively evaluate Ducc using three datasets (two real and one synthetic) with several millions rows and hundreds of attributes. We compare Ducc with related work: Gordian and HCA. The results show that Ducc outperforms Gordian up to 2.8 orders of magnitude and HCA by up to 2.6 orders of magnitude. Finally, a series of scalability experiments shows the efficiency of Ducc to scale up and out.