We are happy to report that in the past days three papers were accepted to be presented at database conferences in 2023. Two papers will appear in the proceedings of the SIGMOD 2023 conference, which will be held in Seattle. One paper will appear in the Proceedings of the VLDB and will be presented at the VLDB 2023 conference in Vancouver.
Title: Discovering Similarity Inclusion Dependencies (SIGMOD 2023)
Authors: Youri Kaminsky (Hasso Plattner Institute), Eduardo Eduardo H. M. Pena (Federal University of Technology
Campo Mourão, Paraná, Brazil), Felix Naumann (Hasso Plattner Institute)
ABSTRACT
Inclusion dependencies (INDs) are a well-known type of data dependency, specifying that the values of one column are contained in those of another column. INDs can be used for various purposes, such as foreign-key candidate selection or join partner discovery. The traditional notion of INDs is based on clean data, where the dependencies hold without exceptions. Unfortunately, data often contain errors, preventing otherwise valid INDs from being discovered. A typical response to this problem is to relax the dependency definition using a similarity measure to account for minor data errors, such as typos or different formatting. While this relaxation is known for functional dependencies, for inclusion dependencies no such relaxation has been defined.
We formally introduce similarity inclusion dependencies, which relax the inclusion by demanding the existence only of sufficiently similar values. Similarity inclusion dependencies can fulfill traditional IND use cases, such as foreign-key candidate discovery, even in the presence of dirty data. We present Sawfish, the first algorithm to discover all similarity inclusion dependencies in a given dataset efficiently. Our algorithm combines approaches for the discovery of traditional INDs and string similarity joins with a novel sliding-window approach and lazy candidate validation. Our experimental evaluation shows that Sawfish can outperform a baseline by a factor of up to 6.5.
Title: Matching Roles from Temporal Data (SIGMOD 2023)
Why Joe Biden is not only President, but also Commander-in-Chief
Authors: Leon Bornemann (Hasso Plattner Institute), Tobias Bleifuß (Hasso Plattner Institute), Dmitri V. Kalashnikov (USA), Fatemeh Nargesian (University of Rochester Rochester, New York, USA), Felix Naumann (Hasso Plattner Institute), Divesh Srivastava (AT&T Chief Data Office)
ABSTRACT
We present role matching, a novel, fine-grained integrity constraint on temporal fact data, i.e., ⟨subject, predicate, object, timestamp⟩-quadruples. A role is a combination of subject and predicate and can be associated with different objects as the real world evolves and the data changes over time. A role matching states that the associated object of two or more roles should always match across time. Once discovered, role matchings can serve as integrity constraints to improve data quality, for instance of structured data in Wikipedia [3]. If violated, role matchings can alert data owners or editors and thus allow them to correct the error. Finding all role matchings is challenging due both to the inherent quadratic complexity of the matching problem and the need to identify true matches based on the possibly short history of the facts observed so far.
To address the first challenge, we introduce several blocking methods both for clean and dirty input data. For the second challenge, the matching step,we showhowthe entity resolution method Ditto [27] can be adapted to achieve satisfactory performance for the role matching task. We evaluate our method on datasets from Wikipedia infoboxes, showing that our blocking approaches can achieve 95% recall, while maintaining a reduction ratio of more than 99.99%, even in the presence of dirty data. In the matching stage, we achieve a macro F1-score of 88% on our datasets, using automatically generated labels.
Title: Fast Algorithms for Denial Constraint Discovery (PVLDB / VLDB 2023)
Authors: Eduardo H. M. Pena (Federal University of Technology
Campo Mourão, Paraná, Brazil), Fabio Porto (LNCC-DEXL
Petropolis, Rio de Janeiro, Brazil), Felix Naumann (Hasso Plattner Institute)
ABSTRACT
Denial constraints (DCs) are an integrity constraint formalism widely used to detect inconsistencies in data. Several algorithms have been devised to discover DCs from data, as manually specifying them is burdensome and, worse yet, error-prone. The existing algorithms follow two basic steps: building an intermediate data structure from records, then enumerating the DCs from that intermediate. However, current algorithms are often inefficient in computing these intermediates. Also, it is still unclear which enumeration algorithm performs best since some of the available algorithms have not yet been compared to each other.
In response, we present a set of new algorithms with improved design choices. We introduce a parallel pipeline for rapidly computing the intermediate using custom data representations, algorithms, and indexes. For DC enumeration, we propose an inverted index, pruning, and parallel search strategies. We present hybrid approaches that integrate our techniques with previous enumeration algorithms, improving their performance in many scenarios. Our experimental study shows that the proposed DC discovery algorithms are consistently much faster (up to an order of magnitude) than the current state-of-the-art.