Role Matching in Temporal Data

Currently submitted to VLDB 2022

Abstract

We present role matchings, a novel, fine-grained integrity constraint on temporal fact data, i.e., ⟨subject, predicate, object, timestamp⟩-quadruples. A role is a combination of subject and predicate and can be associated with different objects as the real world evolves and the data changes over time. A role matching states that the objects associated with two or more different roles should always match at the same timestamps. Once discovered, role matchings can serve as integrity constraints whose violation alerts editors, allowing them to correct the error. We present compatibility-based role matching (CBRM), an algorithm that discovers role matchings in large datasets based on their change histories.

We evaluate our method on datasets from the Socrata open government data portal, as well as Wikipedia infoboxes, showing that our approach can process large datasets of up to 3.5 million roles containing up to 17 million changes. Our approach consistently outperforms baselines, achieving an F-measure that is almost 30 percentage points higher on average.
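To make the definitions in the abstract concrete, the following is a minimal Python sketch of roles and of checking a given role matching for violations. It is not the CBRM algorithm and is not taken from the project's Scala code. The function names, the example facts, and the simplification that roles are only compared at timestamps observed for every role in the matching are all assumptions made for illustration.

```python
from collections import defaultdict


def build_role_histories(facts):
    """Group (subject, predicate, object, timestamp) quadruples by role.

    A role is the pair (subject, predicate). The result maps each role
    to a dict {timestamp: object}. (Hypothetical helper, not project code.)
    """
    histories = defaultdict(dict)
    for subject, predicate, obj, timestamp in facts:
        histories[(subject, predicate)][timestamp] = obj
    return dict(histories)


def find_violations(histories, matching):
    """Return the sorted timestamps at which the roles in `matching` disagree.

    Simplifying assumption: only timestamps observed for every role in
    the matching are compared.
    """
    roles = list(matching)
    shared = set(histories.get(roles[0], {}))
    for role in roles[1:]:
        shared &= set(histories.get(role, {}))
    return sorted(
        t for t in shared
        if len({histories[role][t] for role in roles}) > 1
    )


# Hypothetical example data: two roles whose objects should always agree.
FACTS = [
    ("Team_X", "coach", "Smith", 1),
    ("Team_X", "coach", "Jones", 2),
    ("Stadium_X", "home_coach", "Smith", 1),
    ("Stadium_X", "home_coach", "Smith", 2),  # not updated: stale value
]
MATCHING = [("Team_X", "coach"), ("Stadium_X", "home_coach")]

HISTORIES = build_role_histories(FACTS)
VIOLATIONS = find_violations(HISTORIES, MATCHING)
print(VIOLATIONS)
```

In this toy data the two roles agree at timestamp 1 and disagree at timestamp 2, which is the kind of violation that could be flagged to an editor.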

Datasets

The following datasets will be made available soon. The extracted roles are immediately available for use with the implementation (see below). The raw data is the original data source from which the prepared role sets were extracted.

Extracted Roles from Wikipedia Infoboxes
Extracted Roles from Socrata
Raw results (CSV) for CBRM (needed by the evaluation scripts to generate the plots and print the results):
- Input statistics for the tuning of the evidence-based weighting
- Results for the final evaluation (Figure 10)
Raw data for Wikipedia Infoboxes (see Structured Object Matching Across Web Page Revisions for how this data was obtained)
Raw data for Socrata (available soon)

Code Repositories

The following code repositories are available:

CBRM and baselines (Scala): The complete code of our approach (except for weight function tuning, MDMCP and plot generation)
MDMCP (C++): Our solution to the clique partitioning problem uses an existing evolutionary algorithm called MDMCP, which we forked from the original author's repository to make some small adjustments.
Evaluation (Python 3 Jupyter notebook): Evaluation of results, evidence-based weight function tuning, and plot generation.
