Research Paper
Abstract. The discovery of all unique (and non-unique) column combinations in an unknown dataset is at the core of any data profiling effort. Unique column combinations resemble candidate keys of a relational dataset. Several research approaches have focused on their efficient discovery in a given, static dataset. However, none of these approaches are suitable for applications on dynamic datasets, such as transactional databases, social networks, and scientific applications. In these cases, data profiling techniques should be able to efficiently discover new uniques and non-uniques (and validate old ones) after tuple inserts or deletes, without re-profiling the entire dataset.
We present the first approach to efficiently discover unique and non-unique constraints on dynamic datasets that is independent of the initial dataset size. In particular, Swan makes use of intelligently chosen indices to minimize access to old data. We perform an exhaustive analysis of Swan and compare it with two state-of-the-art techniques for unique discovery: Gordian and Ducc. The results show that Swan significantly outperforms both, as well as their incremental adaptations. For inserts, Swan is more than 63x faster than Gordian and up to 50x faster than Ducc. For deletes, Swan is more than 15x faster than Gordian and up to 1 order of magnitude faster than Ducc. In fact, the results also show that Swan even improves on the static case by dividing the dataset into a static part and a set of inserts.
Demo Paper
Abstract. Before reaping the benefits of open data to add value to an organizations internal data, such new, external datasets must be analyzed and understood already at the basic level of data types, constraints, value patterns etc. Such data profiling, already difficult for large relational data sources, is even more challenging for RDF datasets, the preferred data model for linked open data.
We present ProLOD++, a novel tool for various profiling and mining tasks to understand and ultimately improve open RDF data. ProLOD++ comprises various traditional data profiling tasks, adapted to the RDF data model. In addition, it features many specific profiling results for open data, such as schema discovery for user-generated attributes, association rule discovery to uncover synonymous predicates, and uniqueness discovery along ontology hierarchies. ProLOD++ is highly efficient, allowing interactive profiling for users interested in exploring the proper- ties and structure of yet unknown datasets.