Structured Object Matching across Web Page Revisions

This page provides additional information and artifacts for our ICDE2021 paper:

[1] Bleifuß, Tobias, Leon Bornemann, Dmitri V. Kalashnikov, Felix Naumann, and Divesh Srivastava. Structured Object Matching across Web Page Revisions. In IEEE International Conference on Data Engineering (ICDE), pages 1284–1295, 2021.


Abstract

A considerable amount of useful information on the web is (semi-)structured, such as tables and lists. An extensive corpus of prior work addresses the problem of making these human-readable representations interpretable by algorithms. Most of these works focus only on the most recent snapshot of these web objects. However, their evolution over time represents valuable information that has barely been tapped, enabling various applications, including visual change exploration and trust assessment. To realize the full potential of this information, it is critical to match such objects across page revisions.

In this work, we present novel techniques that match tables, infoboxes and lists within a page across page revisions. We are, thus, able to extract the evolution of structured information in various forms from a long series of web page revisions. We evaluate our approach on a representative sample of pages and measure the number of correct matches. Our approach achieves a significant improvement in object matching over baselines and over related work.

Datasets

Here we provide the output datasets of our matching algorithm for infobox, list and table matching.

Infobox matching: Download

List matching: Download

Table matching: Download

Table matching including parsed content: Download

Please cite the paper above to refer to these datasets.

Documentation

All datasets consist of JSON files in the so-called JSON Lines format, i.e., one JSON object per line. Each object looks similar to the following example (shown here across multiple lines for readability):

{
  "similarityFirst": 0.19148936170212766,
  "pageTitle": "Arctic Monkeys",
  "validFrom": "2019-07-13T19:05:09Z",
  "pageID": 1720451,
  "content": [
    "[[Alex Turner (musician)|Alex Turner]] – lead vocals, keyboards and synthesizers, rhythm and occasional lead guitar, piano (2002–present)",
    "[[Matt Helders]] – drums, backing and lead vocals (2002–present)",
    "[[Jamie Cook]] – lead and occasional rhythm guitar (2002–present); occasional keyboards (2018–present); backing vocals (2002–2006)",
    "[[Nick O\u0027Malley]] – bass, backing vocals (2006–present)"
  ],
  "contentHash": -354743718,
  "itemCount": 4,
  "revisionId": 906116756,
  "similarityLast": 1.0,
  "contextType": "UPDATE",
  "headings": "Band members",
  "comment": "Reverted edits by [[Special:Contribs/104.176.172.183|104.176.172.183]] ([[User talk:104.176.172.183|talk]]) to last version by Robvanvee",
  "position": 1,
  "user": {
    "username": "C.Fred",
    "id": 461300
  },
  "contentType": "UNMODIFIED",
  "key": "19239717-0",
  "validTo": "2019-08-24T07:08:05Z"
}
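As a rough illustration of how such a file could be consumed, the following Python sketch reads JSON Lines records and groups them into per-object histories. It rests on assumptions that are not stated on this page: that records sharing the same "key" value are versions of the same matched object, and that "validFrom" orders those versions in time. The function names are hypothetical, and the two sample records are abbreviated, made-up stand-ins rather than actual dataset content.

```python
import io
import json
from collections import defaultdict
from datetime import datetime, timezone


def parse_timestamp(value):
    """Parse a timestamp such as '2019-07-13T19:05:09Z' into an aware datetime."""
    return datetime.strptime(value, "%Y-%m-%dT%H:%M:%SZ").replace(tzinfo=timezone.utc)


def read_versions(lines):
    """Yield one dict per non-empty line of a JSON Lines stream."""
    for line in lines:
        line = line.strip()
        if line:
            yield json.loads(line)


def group_by_key(records):
    """Group records by 'key' and sort each group by 'validFrom'.

    Assumption: records with the same 'key' describe the same matched object.
    """
    groups = defaultdict(list)
    for record in records:
        groups[record["key"]].append(record)
    for versions in groups.values():
        versions.sort(key=lambda r: parse_timestamp(r["validFrom"]))
    return dict(groups)


# Two abbreviated, invented records used only to exercise the functions.
sample = io.StringIO(
    '{"key": "19239717-0", "validFrom": "2019-07-13T19:05:09Z", "itemCount": 4}\n'
    '{"key": "19239717-0", "validFrom": "2019-07-01T00:00:00Z", "itemCount": 3}\n'
)
history = group_by_key(read_versions(sample))
for key, versions in history.items():
    print(key, [v["itemCount"] for v in versions])
```

For a real dataset file, the same functions can be applied to an open file handle instead of the in-memory sample, since iterating over a text file also yields one line at a time.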

Reproducibility

The following steps explain how to reproduce the results reported in the paper.

1. Install the required software (Java 11, R 4.0.2, LaTeX, jq 1.6).
2. Download this archive here and extract it.
3. Create a "downloads" folder in the root folder of the extracted archive.
4. Put all artifacts that you can find here in this "downloads" folder.
5. Execute the following scripts in order:
   1. extract.sh (to extract the gold standard and put everything in place)
   2. getMatchingResults.sh (to run the actual experiments)
   3. runAggregation.sh (to aggregate the results for plotting)
   4. createPlots.sh (to finally create the plots)
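The script sequence above could be automated with a small driver. The Python sketch below is only an illustration under assumptions not confirmed by this page: that the four scripts sit in the root folder of the extracted archive and that they can be invoked with bash. The helper names `missing_files` and `run_steps` are hypothetical.

```python
import subprocess
from pathlib import Path

# Script names as listed in the steps above, in execution order.
STEPS = ["extract.sh", "getMatchingResults.sh", "runAggregation.sh", "createPlots.sh"]


def missing_files(root):
    """Return the required scripts and folders that are not found under root.

    Assumption: scripts and the "downloads" folder live directly in root.
    """
    root = Path(root)
    required = STEPS + ["downloads"]
    return [name for name in required if not (root / name).exists()]


def run_steps(root, run=subprocess.run):
    """Run each script in order from root, stopping at the first failure."""
    for script in STEPS:
        # check=True makes subprocess.run raise CalledProcessError
        # when a script exits with a non-zero status.
        run(["bash", script], cwd=str(root), check=True)
```

A typical use would be to call `missing_files` on the extracted archive folder first and to call `run_steps` only if it returns an empty list. The `run` parameter exists so that the sequence can be dry-run with a stand-in function instead of actually launching the experiments.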
