Heise, Arvid; Quiane-Ruiz, Jorge-Arnulfo; Abedjan, Ziawasch; Jentzsch, Anja; Naumann, Felix
Proceedings of the VLDB Endowment (PVLDB)
Jorge's presentation at VLDB 2014 was awarded the "Excellent Presentation Award".
The discovery of all unique (and non-unique) column combinations in a given dataset is at the core of any data profiling effort. The results are useful for a large number of areas of data management, such as anomaly detection, data integration, data modeling, duplicate detection, indexing, and query optimization. However, discovering all unique and non-unique column combinations is an NP-hard problem, which in principle requires to verify an exponential number of column combinations for uniqueness on all data values. Thus, achieving efficiency and scalability in this context is a tremendous challenge by itself. In this paper, we devise DUCC, a scalable and efficient approach to the problem of finding all unique and non-unique column combinations in big datasets. We first model the problem as a graph coloring problem and analyze the pruning effect of individual combinations. We then present our hybrid column-based pruning technique, which traverses the lattice in a depth-first and random walk combination. This strategy allows DUCC to typically depend on the solution set size and hence to prune large swaths of the lattice. DUCC also incorporates row-based pruning to run uniqueness checks in just few milliseconds. To achieve even higher scalability, DUCC runs on several CPU cores (scale-up) and compute nodes (scale-out) with a very low overhead. We exhaustively evaluate DUCC using three datasets (two real and one synthetic) with several millions rows and hundreds of attributes. We compare DUCC with related work: Gordian and HCA. The results show that DUCC is up to more than 2 orders of magnitude faster than Gordian and HCA (631x faster than Gordian and 398x faster than HCA). Finally, a series of scalability experiments shows the efficiency of DUCC to scale up and out.