We are happy to announce that in the past days two papers were accepted to be presented at the SIGMOD conference in 2024. We are really happy. Please find more information below.
Title Paper I: Discovering Functional Dependencies through Hitting Set Enumeration
AUTHORS
Tobias Bleifuss (Hasso Plattner Institute)
Thorsten Papenbrock (University of Marburg)
Thomas Bläsius (Karlsruhe Institute of Technology)
Martin Schirneck (University of Vienna)
Felix Naumann (Hasso Plattner Institute)
ABSTRACT
IFunctional dependencies (FDs) are among the most important integrity constraints in databases. They serve to normalize datasets and thus resolve redundancies, they contribute to query optimization, and they are frequently used to guide data cleaning efforts. Because the FDs of a particular dataset are usually unknown, automatic profiling algorithms are needed to discover them. These algorithms have made considerable advances in the past few years, but they still require a significant amount of time and memory to process datasets of practically relevant sizes. We present FDhits, a novel FD discovery algorithm that finds all valid, minimal FDs in a given relational dataset. FDhits is based on several discovery optimizations that include a hybrid validation approach, effective hitting set enumeration techniques, one-pass candidate validations, and parallelization. Our experiments show that FDhits, even without parallel execution, is on average about 6x-7x times faster compared to state-of-the-art FD discovery algorithms while using significantly less memory. This allows the discovery of all FDs even on datasets that could not be processed by the current state-of-the-art.
Title Paper II: Determining the Largest Overlap between Tables
AUTHORS
Luca Zecchini (University of Modena and Reggio Emilia)
Tobias Bleifuss (Hasso Plattner Institute)
Giovanni Simonini (University of Modena and Reggio Emilia)
Sonia Bergamaschi (University of Modena and Reggio Emilia)
Felix Naumann (Hasso Plattner Institute)
ABSTRACT
Both on the Web and in data lakes, it is possible to detect much redundant data in the form of largely overlapping pairs of tables. In many cases, this overlap is not accidental and provides significant information about the relatedness of the tables. Unfortunately, efficiently quantifying the overlap between two tables is not trivial. In particular, detecting their largest overlap, i.e., their largest common subtable, is a computationally challenging problem. As the information overlap may not occur in contiguous portions of the tables, only the ability to permute columns and rows can reveal it. The detection of the largest overlap can help us in relevant tasks such as the discovery of multiple coexisting versions of the same table, which can present differences in the completeness and correctness of the conveyed information. Automatically detecting these highly similar, matching tables would allow us to guarantee their consistency through data cleaning or change propagation, but also to eliminate redundancy to free up storage space or to save additional work for the editors. We present the first formal definition of this problem, and with it Sloth, our solution to efficiently detect the largest overlap between two tables. We experimentally demonstrate on real-world datasets its efficacy in solving this task, analyzing its performance and showing its impact on multiple use cases.