Hasso-Plattner-Institut
Prof. Dr. Felix Naumann
 

15.10.2012

Article accepted at Information Systems Journal (IS)

Cross-lingual Entity Matching and Infobox Alignment in Wikipedia

Daniel Rinser, Dustin Lange, and Felix Naumann

Abstract. Wikipedia has grown to a huge, multi-lingual source of encyclopedic knowledge. Apart from textual content, a large and ever-increasing number of articles feature so-called infoboxes, which provide factual information about the articles' subjects. As the different language versions evolve independently, they provide different information on the same topics. Correspondences between infobox attributes in different language editions can be leveraged for several use cases, such as automatic detection and resolution of inconsistencies in infobox data across language versions, or the automatic augmentation of infoboxes in one language with data from other language versions.

We present an instance-based schema matching technique that exploits information overlap in infoboxes across different language editions. As a prerequisite we present a graph-based approach to identify articles in different languages representing the same real-world entity using (and correcting) the interlanguage links in Wikipedia. To account for the untyped nature of infobox schemas, we present a robust similarity measure that can reliably quantify the similarity of strings with mixed types of data. The qualitative evaluation on the basis of manually labeled attribute correspondences between infoboxes in four of the largest Wikipedia editions demonstrates the effectiveness of the proposed approach.