Two papers on data profiling accepted at SIGMOD 2016
The two papers "A Hybrid Approach to Functional Dependency Discovery" and "RDFind: Finding Conditional Inclusion Dependencies in Very Large RDF Datasets" have been accepted at the SIGMOD 2016 conference. Both papers have been developed in the context of our data profiling research (Metanome and Stratosphere II) and propose novel algorithms for the efficient discovery of functional dependencies in relational data and conditional inclusion dependencies in RDF data, respectively.
A Hybrid Approach to Functional Dependency Discovery (Abstract): Functional dependencies are structural metadata that can be used for schema normalization, data integration, data cleansing, and many other data management tasks. Despite their importance, the functional dependencies of a specific dataset are usually unknown and almost impossible to discover manually. For this reason, database research has proposed various algorithms for functional dependency discovery. None, however, are able to process datasets of typical real-world size, e.g., datasets with more than 50 attributes and a million records.
We present a hybrid discovery algorithm called HyFD, which combines fast approximation techniques with efficient validation techniques in order to find all minimal functional dependencies in a given dataset. While operating on compact data structures, HyFD not only outperforms all existing approaches, it also scales to much larger datasets.
(Thorsten Papenbrock, Felix Naumann)
RDFind: Finding Conditional Inclusion Dependencies in Very Large RDF Datasets (Abstract): Inclusion dependencies (inds) form an important integrity constraint on relational databases, supporting data management tasks, such as join path discovery and query optimization. Conditional inclusion dependencies (cinds), which define including and included data in terms of conditions, allow to transfer these capabilities to rdf data. However, cind discovery is computationally much more complex than ind discovery and the number of cinds even on small rdf datasets is intractable.
To cope with both problems, we first introduce the notion of pertinent cinds with an adjustable relevance criterion to filter and rank cinds based on their extent and implications among each other. Second, we present RDFind, a distributed system to efficiently discover all pertinent cinds in rdf data. RDFind employs a lazy pruning strategy to drastically reduce the cind search space. Also, its exhaustive parallelization strategy and robust data structures make it highly scalable. In our experimental evaluation, we show that RDFind is up to 419 times faster than the state-of-the-art, while considering a more general class of cinds. Furthermore, it is capable of processing a very large dataset of billions of triples, which was entirely infeasible before.
(Sebastian Kruse, Anja Jentzsch, Thorsten Papenbrock, Zoi Kaoudi, Jorge-Arnulfo Quiane-Ruiz, Felix Naumann)