28.11.2016

Two papers on data profiling accepted at BTW 2017

Two papers on data profiling have been accepted at the BTW 2017. Both papers describe novel methods to discover different types of dependencies on unknown datasets, thereby improving over the efficiency of prior methods. Please find the details below.

Fast Approximate Discovery of Inclusion Dependencies

Inclusion dependencies (INDs) are relevant to several data management tasks, such as foreign key detection and data integration, and their discovery is a core concern of data profiling. However, n-ary IND discovery is computationally expensive, so that existing algorithms often perform poorly on complex datasets. To this end, we present FAIDA, the first approximate IND discovery algorithm. FAIDA combines probabilistic and exact data structures to approximate the INDs in relational datasets. In fact, FAIDA guarantees to find all INDs and only with a low probability false positives might occur due to the approximation. This little inaccuracy comes in favor of significantly increased performance, though. In our evaluation, we show that FAIDA scales to very large datasets and outperforms the state-of-the-art algorithm by a factor of up to six in terms of runtime without reporting any false positives. This shows that FAIDA strikes a good balance between efficiency and correctness.

A Hybrid Approach for Efficient Unique Column Combination Discovery

Unique column combinations (UCCs) are groups of attributes in relational datasets that contain no value-entry more than once. Hence, they indicate keys and serve data management tasks, such as schema normalization, data integration, and data cleansing. Because the unique column combinations of a particular dataset are usually unknown, UCC discovery algorithms have been proposed to find them. All previous such discovery algorithms are, however, inapplicable to datasets of typical real-world size, e.g., datasets with more than 50 attributes and a million records.
We present the hybrid discovery algorithm HyUCC, which uses the same discovery techniques as the recently proposed functional dependency discovery algorithm HyFD: A hybrid combination of fast approximation techniques and efficient validation techniques. With it, the algorithm discovers all minimal unique column combinations in a given dataset. HyUCC does not only outperform all existing approaches, it also scales to much larger datasets.