Hasso-Plattner-Institut
Prof. Dr. Felix Naumann
  
 

Structured Object Matching Across Web Page Revisions

A considerable amount of useful information on the web is (semi-)structured, for example in tables or lists. An extensive corpus of prior work addresses the problem of making these human-readable representations interpretable by algorithms. Most of this work considers only a snapshot of these web objects at a single point in time. However, the evolution of these objects over time is valuable information that has barely been tapped, and it enables various applications, including visual change exploration and trust assessment. To realize the full potential of this information, it is critical to match such objects across page revisions.

In this work, we present novel techniques that match tables, infoboxes and lists within a page across page revisions. This allows us to extract the evolution of structured information in various forms from a long series of web page revisions. We evaluate our approach on a representative sample of pages and measure the number of correct matches. Our approach achieves a significant improvement in object matching over baselines and over related work.
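
To make the matching task concrete, the following is a minimal sketch of a naive content-overlap baseline. It is not the technique presented in this work. It assumes that each object (table, infobox or list) has already been extracted from a revision and reduced to a set of cell tokens. The names `jaccard`, `match_objects`, the threshold of 0.5 and the sample revisions are all hypothetical and serve illustration only.

```python
from typing import Dict, List, Set, Tuple


def jaccard(a: Set[str], b: Set[str]) -> float:
    """Jaccard similarity of two token sets (1.0 when both are empty)."""
    if not a and not b:
        return 1.0
    return len(a & b) / len(a | b)


def match_objects(
    prev: Dict[str, Set[str]],
    curr: Dict[str, Set[str]],
    threshold: float = 0.5,  # assumed cut-off, not taken from the paper
) -> List[Tuple[str, str]]:
    """Greedily pair objects of two consecutive revisions by content overlap."""
    # Score every candidate pair; sort by descending similarity, ties by id.
    scored = sorted(
        (
            (jaccard(p_tokens, c_tokens), p_id, c_id)
            for p_id, p_tokens in prev.items()
            for c_id, c_tokens in curr.items()
        ),
        key=lambda t: (-t[0], t[1], t[2]),
    )
    used_prev: Set[str] = set()
    used_curr: Set[str] = set()
    matches: List[Tuple[str, str]] = []
    for score, p_id, c_id in scored:
        if score < threshold:
            break  # all remaining pairs are even less similar
        if p_id in used_prev or c_id in used_curr:
            continue  # each object takes part in at most one match
        used_prev.add(p_id)
        used_curr.add(c_id)
        matches.append((p_id, c_id))
    return matches


# Two hypothetical revisions of one page; each table is a set of cell tokens.
rev_1 = {
    "t1": {"year", "title", "1999", "2001"},
    "t2": {"city", "population"},
}
rev_2 = {
    "a": {"city", "population", "area"},
    "b": {"year", "title", "1999", "2001", "2004"},
}
matches = match_objects(rev_1, rev_2)
print(matches)
```

Objects without a sufficiently similar partner stay unmatched, which models insertions and deletions between revisions. Such a baseline only compares adjacent revisions and ignores position and context, so it is likely to break when an object is heavily edited or temporarily removed.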

Datasets

Here we provide the gold standard as well as the output of our matching algorithm for infobox, list and table matching.

Infobox matching:

  • Identity graph
    • Gold standard (Version 0 ZIP)
    • Output (Version 0 ZIP)

List matching:

  • Identity graph
    • Gold standard (Version 0 ZIP)
    • Output (Version 0 ZIP)

Table matching:

 

Version 1.0 will be available at the time of publication of our corresponding paper.