Hasso-Plattner-Institut
Prof. Dr. Felix Naumann
  
 

Structured Object Matching across Web Page Revisions

This page provides additional information and artifacts for our ICDE2021 paper:

  • Bleifuß, Tobias, Leon Bornemann, Dmitri V. Kalashnikov, Felix Naumann, and Divesh Srivastava. Structured Object Matching across Web Page Revisions. In IEEE International Conference on Data Engineering (ICDE), 2021.
     

Abstract

A considerable amount of useful information on the web is (semi-)structured, such as tables and lists. An extensive corpus of prior work addresses the problem of making these human-readable representations interpretable by algorithms. Most of these works focus only on the most recent snapshot of these web objects. However, their evolution over time represents valuable information that has barely been tapped, enabling various applications, including visual change exploration and trust assessment. To realize the full potential of this information, it is critical to match such objects across page revisions.

In this work, we present novel techniques that match tables, infoboxes and lists within a page across page revisions. We are, thus, able to extract the evolution of structured information in various forms from a long series of web page revisions. We evaluate our approach on a representative sample of pages and measure the number of correct matches. Our approach achieves a significant improvement in object matching over baselines and over related work.

Datasets

Here we provide the output datasets of our matching algorithm for infobox, list and table matching.

Infobox matching: Download

List matching: Download

Table matching: Download

Please cite the paper above to refer to these datasets.

Reproducibility

Here we explain how to reproduce the results reported in the paper.

  1. Install the required software (Java 11, R 4.0.2, LaTeX, jq 1.6)
  2. Download this archive here and extract it
  3. Create a "downloads" folder in the root folder of the extracted archive
  4. Put all artifacts that you can find here in this "downloads" folder
  5. Execute the following scripts in order:
    1. extract.sh (to extract the gold standard and put everything in place)
    2. getMatchingResults.sh (to run the actual experiments)
    3. runAggregation.sh (to aggregate the results for plotting)
    4. createPlots.sh (to finally create the plots)
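
The four scripts in step 5 can also be chained so that a failing step stops the run before later steps consume incomplete output. The following is a minimal sketch, not part of the archive: it assumes the scripts sit in the root folder of the extracted archive and can be started with bash. The function name run_pipeline, the runner parameter, and the check for the "downloads" folder are illustrative choices.

```python
import subprocess
from pathlib import Path

# Script names and their order are taken from step 5 above.
PIPELINE = [
    "extract.sh",             # extract the gold standard, put everything in place
    "getMatchingResults.sh",  # run the actual experiments
    "runAggregation.sh",      # aggregate the results for plotting
    "createPlots.sh",         # create the plots
]


def run_pipeline(root, runner=subprocess.run):
    """Run the scripts in order from `root`, stopping at the first failure.

    Assumption: the scripts live directly in `root` (the extracted archive)
    and are executable with bash. `runner` is injectable for dry runs.
    """
    root = Path(root)
    # Steps 3 and 4 require a populated "downloads" folder; only its
    # existence is checked here, not its contents.
    if not (root / "downloads").is_dir():
        raise FileNotFoundError(f"missing 'downloads' folder in {root}")

    completed = []
    for script in PIPELINE:
        # check=True makes subprocess.run raise CalledProcessError on a
        # non-zero exit code, so later steps are not started.
        runner(["bash", script], cwd=root, check=True)
        completed.append(script)
    return completed
```

Calling run_pipeline with the path of the extracted archive returns the list of scripts that finished. Passing a different runner makes it possible to check the order without executing anything.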