Web tables constitute a rich and often free source of information. Unfortunately, the information they contain is not always trustworthy or up to date. A table's change history can help assess the trustworthiness and timeliness of its information: Who authored the table? When was it last updated? How much was changed? This analysis can, of course, be performed at the table level, but it is much more valuable at the cell or value level. Only then is there a chance to recognize quality discrepancies within a table.
Tracing the change history by hand is a very tedious task. Tables often change their layout and schema over their lifetime, making it difficult to track which cells are predecessors or successors of each other. In addition, individual cells can be split into multiple cells and, vice versa, multiple cells can be merged into one. The goal of this thesis is to develop an algorithm that automatically constructs the edit history of individual cells in web tables as a graph. Each node in this graph represents one version of one cell at a particular point in time, and each edge represents the relationship between two of these cell versions.
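To make the intended graph structure more concrete, the following is a minimal Java sketch of one possible representation. It is an illustration only and not the Janus implementation: the class name CellHistoryGraph, the fields of CellVersion (table identifier, row, column, timestamp, value), and the methods link, successorsOf, and predecessorsOf are all hypothetical. The sketch also assumes that every edge points from an older cell version to a newer one.

```java
import java.time.Instant;
import java.util.ArrayList;
import java.util.List;

// Hypothetical sketch of a cell-level edit-history graph (not the Janus code).
class CellHistoryGraph {

    // A node: one version of one cell at a particular point in time.
    // All field names are assumptions made for this illustration.
    record CellVersion(String tableId, int row, int column,
                       Instant timestamp, String value) {}

    // A directed edge from an older cell version to a newer one.
    record Edge(CellVersion predecessor, CellVersion successor) {}

    private final List<Edge> edges = new ArrayList<>();

    // Records that 'successor' evolved from 'predecessor'.
    // Assumption: the predecessor must be strictly older than the successor.
    void link(CellVersion predecessor, CellVersion successor) {
        if (!predecessor.timestamp().isBefore(successor.timestamp())) {
            throw new IllegalArgumentException(
                "predecessor must be older than successor");
        }
        edges.add(new Edge(predecessor, successor));
    }

    // All newer versions linked to the given one; more than one means a split.
    List<CellVersion> successorsOf(CellVersion version) {
        return edges.stream()
                .filter(e -> e.predecessor().equals(version))
                .map(Edge::successor)
                .toList();
    }

    // All older versions linked to the given one; more than one means a merge.
    List<CellVersion> predecessorsOf(CellVersion version) {
        return edges.stream()
                .filter(e -> e.successor().equals(version))
                .map(Edge::predecessor)
                .toList();
    }
}
```

In this representation, a cell split shows up as a node with several successors and a cell merge as a node with several predecessors, so both cases are covered without special edge types. A real solution would likely need richer edge information, for example the kind of relationship or a matching confidence.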
In the context of the Janus project [1], we have already developed an algorithm for matching tables over time as well as a baseline algorithm that matches cells over time (with some constraints). The master's thesis can build on this work and, in particular, overcome its limitations. For example, the work can focus on improved semantic interpretation of the tables and on scalability. A small gold standard for cell matching already exists, but part of the task is to extend and possibly adapt it. The previous implementation is written in Java, so knowledge of this programming language is beneficial.
[1] https://www.IANVS.org
For more information, please contact Prof. Felix Naumann or Tobias Bleifuß.