Hasso-Plattner-Institut
Prof. Dr. Felix Naumann
 

People

Zhe Zuo, Gjergji Kasneci, Toni Gruetze, and Felix Naumann

Overview

BEL (Bagging for Entity Linking) is an entity linking technique that maps textual mentions of named entities in a given English text to canonical representations of those entities (e.g., entity IDs) in a knowledge base (e.g., YAGO). Three main characteristics enable efficient, high-quality linking: (1) it operates on a textual range of relevant terms around each mention, (2) it aggregates decisions from an ensemble of simple classifiers, each of which operates on a randomly sampled subset of terms from that range, and (3) it follows a local reasoning strategy by exploiting previous decisions whenever possible.

Entity Linking Strategy

In a preprocessing step, a named entity recognizer (the Stanford Parser) derives mentions from a given document. Then, for each mention m, a list of promising candidates is derived from Wikipedia by computing the probability P(e|m) that a Wikipedia article e is referred to by m. BEL uses the local context of a mention by operating on a textual range of relevant terms surrounding the mention. Multiple subsets are generated by randomly drawing terms from this range via bootstrapping. For each candidate entity, a statistical language model is applied to each random subset to calculate a contextual similarity score, which yields a ranking of the candidates based on the context captured by that subset. Each ranking classifier combines the contextual similarity score with the probability of the candidate being referred to by the mention in question; the combined score yields the final ranking of that classifier. If the majority of the ranking classifiers have the same candidate as their top-ranked entity, the mention is linked to that candidate. Otherwise, we consider the corresponding entity to be absent from the knowledge base.
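The voting procedure described above can be illustrated with a minimal sketch. It is not the authors' implementation: the function names, the unigram language model with a smoothing floor, the number of classifiers, and the way the two scores are combined (log prior plus log-likelihood) are all assumptions made for illustration, since the text does not specify them.

```python
import math
import random
from collections import Counter


def context_score(terms, entity_lm, floor=1e-6):
    """Log-likelihood of the sampled terms under a unigram entity model.

    `entity_lm` maps term -> probability; unseen terms get `floor`
    (a crude smoothing choice assumed here, not taken from BEL).
    """
    return sum(math.log(entity_lm.get(t, floor)) for t in terms)


def bel_link(context_terms, candidates, n_classifiers=25, sample_size=None, seed=0):
    """Link one mention by majority vote over bootstrap-sampled contexts.

    `candidates` maps entity id -> (prior P(e|m) > 0, unigram language model).
    Returns the winning entity id, or None if no candidate is top-ranked
    by a majority of the classifiers (entity treated as not in the KB).
    """
    if not context_terms or not candidates:
        return None
    rng = random.Random(seed)
    k = sample_size or len(context_terms)
    votes = Counter()
    for _ in range(n_classifiers):
        # Bootstrap: draw terms from the relevant range with replacement.
        sample = rng.choices(context_terms, k=k)
        # Assumed combination: log prior + contextual log-likelihood.
        best = max(
            candidates,
            key=lambda e: math.log(candidates[e][0])
            + context_score(sample, candidates[e][1]),
        )
        votes[best] += 1
    winner, count = votes.most_common(1)[0]
    return winner if count > n_classifiers / 2 else None


# Hypothetical toy data for the mention "Kashmir" in a music-related text.
context = ["guitar", "band", "album", "tour", "rock"]
candidates = {
    "Kashmir_(song)": (0.3, {"guitar": 0.2, "band": 0.2, "album": 0.2,
                             "tour": 0.1, "rock": 0.2}),
    "Kashmir_(region)": (0.7, {"mountain": 0.3, "india": 0.3, "border": 0.2}),
}
linked = bel_link(context, candidates)
```

In this toy example the region has the higher prior, but every sampled context favors the song strongly enough that all classifiers vote for it. Returning None when no candidate reaches a majority corresponds to the case where the entity is considered absent from the knowledge base.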

Evaluation

BEL has been evaluated against LED [1], AIDA-GRAPH [2], and AIDA-KORE [3]. For all the scores in the left table, 99% confidence intervals have been calculated.
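The page does not state how the confidence intervals were obtained. As a hedged illustration only, the sketch below computes a 99% normal-approximation (Wald) interval for a proportion such as precision over a set of mentions. The function name and the example counts are hypothetical, and the actual evaluation may have used a different method.

```python
import math
from statistics import NormalDist


def proportion_ci(successes, total, confidence=0.99):
    """Normal-approximation confidence interval for a proportion.

    Returns (lower, upper), clipped to [0, 1].
    """
    if total <= 0:
        raise ValueError("total must be positive")
    p = successes / total
    # Two-sided critical value, about 2.576 for 99% confidence.
    z = NormalDist().inv_cdf(0.5 + confidence / 2)
    half_width = z * math.sqrt(p * (1 - p) / total)
    return max(0.0, p - half_width), min(1.0, p + half_width)


# Hypothetical counts: 80 correctly linked mentions out of 100.
lower, upper = proportion_ci(80, 100)
```

For small samples or proportions close to 0 or 1, an interval such as Wilson's is usually a better choice than this approximation.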

Datasets Used for Evaluation

CoNLL-YAGO: The subset of the CoNLL-YAGO dataset used in our experiments contains 76 articles that were carefully (re-)annotated by hand by our team members.

CUCERZAN: This dataset consists of 350 Wikipedia articles that were randomly selected by Silviu Cucerzan [1]. The original annotations can be found here. Because some of the original Wikipedia articles are missing, we recovered 339 of the 350 articles. In this version of the articles, the mentions are annotated with entities from YAGO instead of the original Wikipedia pages.

KORE: This dataset contains 50 short hand-crafted articles that include highly ambiguous mentions of named entities. It was produced in the context of the AIDA project [3], which is available here.

Publications

  • BEL: Bagging for Entity Linking. Zuo, Zhe; Kasneci, Gjergji; Gruetze, Toni; Naumann, Felix (2014).
     

References

[1] S. Cucerzan. Large-scale named entity disambiguation based on Wikipedia data. In Proceedings of the Joint Conference on Empirical Methods in Natural Language Processing and Computational Natural Language Learning (EMNLP-CoNLL), pages 708–716, 2007.

[2] J. Hoffart, M. A. Yosef, I. Bordino, H. Fürstenau, M. Pinkal, M. Spaniol, B. Taneva, S. Thater, and G. Weikum. Robust disambiguation of named entities in text. In Proceedings of the Conference on Empirical Methods in Natural Language Processing (EMNLP), pages 782–792, 2011.

[3] J. Hoffart, S. Seufert, D. B. Nguyen, M. Theobald, and G. Weikum. KORE: keyphrase overlap relatedness for entity disambiguation. In Proceedings of the International Conference on Information and Knowledge Management (CIKM), pages 545–554, 2012.