Role Matching in Temporal Data
Currently submitted to VLDB 2022
Abstract
We present role matchings, a novel, fine-grained integrity constraint on temporal fact data, i.e., ⟨subject, predicate, object, timestamp⟩-quadruples. A role is a combination of subject and predicate and can be associated with different objects as the real world evolves and the data changes over time. A role matching states that the objects associated with two or more different roles should always match at the same timestamps. Once discovered, role matchings can serve as integrity constraints: a violation can alert editors and allow them to correct the error. We present compatibility-based role matching (CBRM), an algorithm that discovers role matchings in large datasets based on their change histories.
We evaluate our method on datasets from the Socrata open government data portal, as well as Wikipedia infoboxes, showing that our approach can process large datasets of up to 3.5 million roles containing up to 17 million changes. Our approach consistently outperforms baselines, achieving almost 30 percentage points more F-Measure on average.
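To make the notion of a role matching more concrete, the following is a minimal Python sketch, not the CBRM implementation. It groups quadruples into per-role histories and reports the timestamps at which two roles that are supposed to match disagree. The function names, the dictionary-based history representation, the integer timestamps, and the sample quadruples are all assumptions made up for illustration.

```python
from typing import Dict, Hashable, Iterable, List, Tuple

# Assumed representation: a role is a (subject, predicate) pair and its
# history maps each timestamp to the object observed at that timestamp.
Role = Tuple[str, str]
History = Dict[int, Hashable]


def quadruples_to_histories(
    quads: Iterable[Tuple[str, str, Hashable, int]]
) -> Dict[Role, History]:
    """Group (subject, predicate, object, timestamp) quadruples by role."""
    histories: Dict[Role, History] = {}
    for subject, predicate, obj, timestamp in quads:
        histories.setdefault((subject, predicate), {})[timestamp] = obj
    return histories


def violations(h1: History, h2: History) -> List[int]:
    """Return the shared timestamps at which the two objects differ."""
    return sorted(t for t in h1.keys() & h2.keys() if h1[t] != h2[t])


# Hypothetical data: two roles that should always carry the same object.
quads = [
    ("city_A", "mayor", "Alice", 1),
    ("city_A", "mayor", "Bob", 2),
    ("mayor_of_city_A", "incumbent", "Alice", 1),
    ("mayor_of_city_A", "incumbent", "Alice", 2),  # not updated at timestamp 2
]
histories = quadruples_to_histories(quads)
conflicts = violations(
    histories[("city_A", "mayor")],
    histories[("mayor_of_city_A", "incumbent")],
)
print(conflicts)  # timestamps an editor would be alerted about
```

In this sketch, an empty result means the two roles agree wherever both are observed. A non-empty result lists the timestamps at which the matching is violated.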
Datasets
The following datasets will be made available soon. The extracted roles can be used directly with the implementation (see below). The raw data is the original data source from which the prepared role sets were extracted.
- Extracted Roles from Wikipedia Infoboxes
- Extracted Roles from Socrata
- Raw results (CSV) for CBRM (needed by the evaluation scripts to generate the plots and print the results)
- Raw data for Wikipedia Infoboxes (see Structured Object Matching Across Web Page Revisions for how this data was obtained)
- Raw data for Socrata (available soon)
Code Repositories
The following code repositories are available:
- CBRM and baselines (Scala): The complete code of our approach (except for weight function tuning, MDMCP and plot generation)
- MDMCP (C++): Our solution to the clique partitioning problem uses an existing evolutionary algorithm called MDMCP, which we forked from the original author's repository to make some small adjustments.
- Evaluation of results / evidence-based weight function tuning / generation of plots (Python 3 Jupyter notebook): Weight function tuning and plot generation.
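To illustrate why a clique partitioning step appears in the pipeline, the sketch below builds groups of mutually compatible roles with a simple greedy heuristic. This is a deliberately simplified stand-in: it is neither MDMCP (an evolutionary algorithm) nor the weighted formulation used by CBRM. The compatibility test used here (agreement on all shared timestamps, with at least one shared timestamp) is an assumption for illustration, as are all names and data.

```python
from typing import Dict, Hashable, List, Tuple

Role = Tuple[str, str]
History = Dict[int, Hashable]


def compatible(h1: History, h2: History) -> bool:
    """Assumed notion of compatibility: the histories share at least one
    timestamp and agree on every timestamp they share."""
    shared = h1.keys() & h2.keys()
    return bool(shared) and all(h1[t] == h2[t] for t in shared)


def greedy_clique_partition(histories: Dict[Role, History]) -> List[List[Role]]:
    """Place each role into the first group whose members are all compatible
    with it; open a new group otherwise. Every group is a clique in the
    compatibility graph, but the partition is not guaranteed to be optimal."""
    groups: List[List[Role]] = []
    for role, history in histories.items():
        for group in groups:
            if all(compatible(history, histories[other]) for other in group):
                group.append(role)
                break
        else:
            groups.append([role])
    return groups


# Hypothetical role histories (timestamp -> object).
histories = {
    ("s1", "p"): {1: "x", 2: "y"},
    ("s2", "q"): {1: "x", 2: "y", 3: "z"},
    ("s3", "r"): {1: "x", 2: "w"},
}
groups = greedy_clique_partition(histories)
print(groups)
```

Each resulting group with more than one member is a candidate role matching. A dedicated solver such as MDMCP searches for a better partition than this first-fit heuristic can guarantee.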