Hasso-Plattner-Institut
  
Prof. Dr. Felix Naumann
  
 

CohEEL - Coherent and Efficient Named Entity Linking through Random Walks

In recent years, the ever-growing number of documents on the Web and in digital libraries has led to a considerable increase in valuable textual information about entities. Harvesting entity knowledge from these large text collections is a major challenge. It requires linking textual mentions within the documents to the real-world entities they refer to. This process is called entity linking.

This project aims to automatically create entity links from texts to a knowledge base. Whereas recent research usually balances linking correctness (precision) against linking coverage (recall), this project focuses on creating reliable links by favoring precision. Precision is the decisive factor for downstream tasks that build on the linking results, such as text summarization, document classification, or topic-based clustering.
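To make the two measures concrete, the following minimal sketch computes precision and recall for two hypothetical linkers against a small hand-made gold standard. The function name, the mentions, and the entity identifiers are illustrative assumptions and are not taken from CohEEL or its datasets.

```python
def precision_recall(predicted, gold):
    """Compute (precision, recall) for entity links.

    Both arguments map a mention to an entity identifier.
    Precision is measured over the links the system produced,
    recall over all mentions in the gold standard.
    """
    correct = sum(1 for mention, entity in predicted.items()
                  if gold.get(mention) == entity)
    precision = correct / len(predicted) if predicted else 0.0
    recall = correct / len(gold) if gold else 0.0
    return precision, recall


# Hypothetical gold standard with four ambiguous mentions.
gold = {
    "Paris": "Paris_(France)",
    "Jordan": "Michael_Jordan",
    "Apple": "Apple_Inc.",
    "Amazon": "Amazon_(company)",
}

# A cautious linker: links only two mentions, both correctly.
cautious = {
    "Paris": "Paris_(France)",
    "Apple": "Apple_Inc.",
}

# An eager linker: links every mention, one of them incorrectly.
eager = {
    "Paris": "Paris_(France)",
    "Jordan": "Jordan_(country)",
    "Apple": "Apple_Inc.",
    "Amazon": "Amazon_(company)",
}

print(precision_recall(cautious, gold))  # (1.0, 0.5)
print(precision_recall(eager, gold))     # (0.75, 0.75)
```

In this toy setting the cautious linker covers fewer mentions, but every link it emits is correct. That is the kind of behavior a precision-oriented approach prefers when downstream tasks depend on reliable links.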

CohEEDistributed

This project aims to enable entity recognition and alignment on huge text collections with tens of millions of documents. To reach this goal, this distributed implementation of CohEEL is built on the Apache Flink framework and uses Wikipedia as its knowledge base. The source code is available on GitHub.

Datasets

News: The news article dataset contains 100 randomly picked Reuters articles from the CoNLL-YAGO dataset [1]. Our team members manually annotated the articles with entities from YAGO. The dataset can be found here.

Encyclopedic: The encyclopedic text corpus consists of Wikipedia articles selected in 2006 by Silviu Cucerzan [2]. The original annotations are available here. However, some of the original Wikipedia articles were missing and the YAGO alignments had to be determined. The updated dataset with the annotated YAGO entities can be found here.

Micro: The synthetic micro corpus consists of 50 short text snippets and was introduced in the AIDA project [4]. Each snippet consists of a few (usually one) hand-crafted sentences about different ambiguous mentions of named entities, and has properties similar to content from microblogging platforms such as Twitter. It is available as the KORE dataset here.

Publications

CohEEL: Coherent and Efficient Named Entity Linking through Random Walks

Gruetze, Toni and Kasneci, Gjergji and Zuo, Zhe and Naumann, Felix
Web Semantics: Science, Services and Agents on the World Wide Web, vol. 37(C), pp. 75–89, March 2016
http://dx.doi.org/10.1016/j.websem.2016.03.001

DOI: 10.1016/j.websem.2016.03.001

Abstract:

In recent years, the ever-growing amount of documents on the Web as well as in digital libraries led to a considerable increase of valuable textual information about entities. Harvesting entity knowledge from these large text collections is a major challenge. It requires the linkage of textual mentions within the documents with their real-world entities. This process is called entity linking. Solutions to this entity linking problem have typically aimed at balancing the rate of linking correctness (precision) and the linking coverage rate (recall). While entity links in texts could be used to improve various Information Retrieval tasks, such as text summarization, document classification, or topic-based clustering, the linking precision is the decisive factor. For example, for topic-based clustering a method that produces mostly correct links would be more desirable than a high-coverage method that leads to more but also more uncertain clusters. We propose an efficient linking method that uses a random walk strategy to combine a precision-oriented and a recall-oriented classifier in such a way that a high precision is maintained, while recall is elevated to the maximum possible level without affecting precision. An evaluation on three datasets with distinct characteristics demonstrates that our approach outperforms seminal work in the area and shows higher precision and time performance than the most closely related state-of-the-art methods.

Keywords:

Entity Linking, Named Entity Disambiguation, Random Walk, Machine Learning

BibTeX file

@article{Gruetze2016,
author = { Gruetze, Toni and Kasneci, Gjergji and Zuo, Zhe and Naumann, Felix },
title = { CohEEL: Coherent and Efficient Named Entity Linking through Random Walks },
journal = { Web Semantics: Science, Services and Agents on the World Wide Web },
year = { 2016 },
volume = { 37 },
number = { C },
pages = { 75--89 },
month = { 3 },
abstract = { In recent years, the ever-growing amount of documents on the Web as well as in digital libraries led to a considerable increase of valuable textual information about entities. Harvesting entity knowledge from these large text collections is a major challenge. It requires the linkage of textual mentions within the documents with their real-world entities. This process is called entity linking. Solutions to this entity linking problem have typically aimed at balancing the rate of linking correctness (precision) and the linking coverage rate (recall). While entity links in texts could be used to improve various Information Retrieval tasks, such as text summarization, document classification, or topic-based clustering, the linking precision is the decisive factor. For example, for topic-based clustering a method that produces mostly correct links would be more desirable than a high-coverage method that leads to more but also more uncertain clusters. We propose an efficient linking method that uses a random walk strategy to combine a precision-oriented and a recall-oriented classifier in such a way that a high precision is maintained, while recall is elevated to the maximum possible level without affecting precision. An evaluation on three datasets with distinct characteristics demonstrates that our approach outperforms seminal work in the area and shows higher precision and time performance than the most closely related state-of-the-art methods. },
keywords = { Entity Linking, Named Entity Disambiguation, Random Walk, Machine Learning },
url = { http://dx.doi.org/10.1016/j.websem.2016.03.001 },
publisher = { Elsevier B.V. },
issn = { 1570-8268 },
priority = { 0 }
}

Copyright Notice

This material is presented to ensure timely dissemination of scholarly and technical work. Copyright and all rights therein are retained by authors or by other copyright holders. All persons copying this information are expected to adhere to the terms and constraints invoked by each author's copyright. In most cases, these works may not be reposted without the explicit permission of the copyright holder.

last change: Fri, 12 Aug 2016 17:23:06 +0200

References

[1] Z. Zuo, G. Kasneci, T. Gruetze, and F. Naumann. BEL: Bagging for Entity Linking. In Proceedings of the International Conference on Computational Linguistics (COLING), pages 2075–2086, 2014.

[2] S. Cucerzan. Large-scale named entity disambiguation based on Wikipedia data. In Proceedings of the Joint Conference on Empirical Methods in Natural Language Processing and Computational Natural Language Learning (EMNLP-CoNLL), pages 708–716, 2007.

[3] J. Hoffart, M. A. Yosef, I. Bordino, H. Fürstenau, M. Pinkal, M. Spaniol, B. Taneva, S. Thater, and G. Weikum. Robust disambiguation of named entities in text. In Proceedings of the Conference on Empirical Methods in Natural Language Processing (EMNLP), pages 782–792, 2011.

[4] J. Hoffart, S. Seufert, D. B. Nguyen, M. Theobald, and G. Weikum. KORE: keyphrase overlap relatedness for entity disambiguation. In Proceedings of the International Conference on Information and Knowledge Management (CIKM), pages 545–554, 2012.