Prof. Dr. Felix Naumann

CohEEL - Coherent and Efficient Named Entity Linking through Random Walks

In recent years, the ever-growing number of documents on the Web and in digital libraries has led to a considerable increase in valuable textual information about entities. Harvesting entity knowledge from these large text collections is a major challenge: it requires linking textual mentions within the documents to their real-world entities. This process is called entity linking.

This project aims at the automatic creation of entity links from texts to a knowledge base. In contrast to recent research, which usually balances the rate of linking correctness (precision) against the linking coverage rate (recall), this project focuses on creating reliable links by favoring linking precision. Precision is the decisive factor for subsequent tasks that build upon the linking results, such as text summarization, document classification, or topic-based clustering.
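The precision/recall trade-off described above can be made concrete with a small, hypothetical example: a linker that links every mention it sees versus one that abstains on uncertain mentions. The mentions and entity names below are invented for illustration.

```python
# Hypothetical illustration of the precision/recall trade-off in entity
# linking. Mentions and entities are invented; this is not CohEEL code.

def precision_recall(predicted, gold):
    """Compute linking precision and recall from sets of (mention, entity) pairs."""
    if not predicted:
        return 0.0, 0.0
    correct = len(predicted & gold)
    return correct / len(predicted), correct / len(gold)

gold = {("Merkel", "Angela_Merkel"), ("Berlin", "Berlin"), ("CDU", "CDU")}

# An aggressive linker links every mention, risking wrong links ...
aggressive = {("Merkel", "Angela_Merkel"), ("Berlin", "Berlin_(band)"), ("CDU", "CDU")}
# ... while a precision-oriented linker abstains on the ambiguous mention.
cautious = {("Merkel", "Angela_Merkel"), ("CDU", "CDU")}

print(precision_recall(aggressive, gold))  # precision and recall both 2/3
print(precision_recall(cautious, gold))    # precision 1.0, recall 2/3
```

The cautious linker reaches perfect precision at the cost of coverage, which is exactly the regime CohEEL favors.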


This project aims at enabling entity recognition and alignment for huge text collections with tens of millions of documents. To reach this goal, the distributed implementation of CohEEL is built on the Apache Flink framework, with Wikipedia as the underlying knowledge base. The source code is available on GitHub.
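For the actual method, see the CohEEL source code and publications. As a rough, hypothetical sketch of the random-walk idea named in the project title, the following toy example scores candidate entities by a random walk with restart over a small entity graph: candidates that are well connected to confidently linked seed entities accumulate more probability mass. The graph, entity names, and parameters are all invented and do not reflect CohEEL's implementation.

```python
# Hedged sketch: random walk with restart (personalized-PageRank style)
# over a toy entity graph, as one way to score candidates by coherence.
# The graph and all names are invented; this is NOT CohEEL's actual code.

def random_walk_scores(graph, seeds, alpha=0.15, iterations=50):
    """Power iteration for a random walk that restarts at the seed entities."""
    nodes = list(graph)
    restart = {n: (1.0 / len(seeds) if n in seeds else 0.0) for n in nodes}
    scores = dict(restart)
    for _ in range(iterations):
        nxt = {n: alpha * restart[n] for n in nodes}  # restart mass
        for n in nodes:
            neighbors = graph[n]
            if neighbors:  # distribute the remaining mass along out-links
                share = (1 - alpha) * scores[n] / len(neighbors)
                for m in neighbors:
                    nxt[m] += share
        scores = nxt
    return scores

# Toy link graph between candidate entities (invented).
graph = {
    "Angela_Merkel": ["CDU", "Berlin"],
    "CDU": ["Angela_Merkel", "Berlin"],
    "Berlin": ["Angela_Merkel", "CDU"],
    "Berlin_(band)": [],  # incoherent candidate: isolated from the seeds
}

scores = random_walk_scores(graph, seeds={"Angela_Merkel", "CDU"})
# The coherent candidate "Berlin" outranks the isolated "Berlin_(band)".
print(scores["Berlin"] > scores["Berlin_(band)"])  # True
```

The sketch deliberately lets mass vanish at dangling nodes to keep the code short; a production implementation would handle dangling nodes and run distributed, e.g. as iterative dataflows in Flink.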


News: The news article dataset contains 100 randomly picked Reuters articles from the CoNLL-YAGO dataset [1]. The articles were carefully annotated by hand with YAGO entities by our team members and can be found here.

Encyclopedic: The encyclopedic text corpus consists of Wikipedia articles selected in 2006 by Silviu Cucerzan [2]. The original annotations are available here. However, some of the original Wikipedia articles were missing and the YAGO alignments had to be determined. The updated dataset with the annotated YAGO entities can be found here.

Micro: The synthetic micro corpus consists of 50 short text snippets and was produced within the AIDA project [4]. Every snippet consists of a few (usually one) hand-crafted sentences about ambiguous mentions of named entities and has properties similar to content on microblogging platforms such as Twitter. It is available as the KORE dataset here.


What was Hillary Clinton doing in Katy, Texas?

Gruetze, Toni; Krestel, Ralf; Lazaridou, Konstantina; Naumann, Felix. In Proceedings of the 26th International Conference on World Wide Web (WWW 2017), Perth, Australia, April 3-7, 2017. ACM, 2017.

During the last presidential election in the United States of America, Twitter drew a lot of attention, because many leading figures and organizations, such as U.S. President Donald J. Trump, showed a strong affinity for this medium. In this work we set aside the political content and opinions shared on Twitter and focus on one question: Can we determine and track the physical locations of the presidential candidates based on posts in the Twittersphere?


[1] Z. Zuo, G. Kasneci, T. Gruetze, F. Naumann. BEL: Bagging for Entity Linking. In Proceedings of the International Conference on Computational Linguistics (COLING), pages 2075–2086, 2014.

[2] S. Cucerzan. Large-scale named entity disambiguation based on Wikipedia data. In Proceedings of the Joint Conference on Empirical Methods in Natural Language Processing and Computational Natural Language Learning (EMNLP-CoNLL), pages 708–716, 2007.

[3] J. Hoffart, M. A. Yosef, I. Bordino, H. Fürstenau, M. Pinkal, M. Spaniol, B. Taneva, S. Thater, and G. Weikum. Robust disambiguation of named entities in text. In Proceedings of the Conference on Empirical Methods in Natural Language Processing (EMNLP), pages 782–792, 2011.

[4] J. Hoffart, S. Seufert, D. B. Nguyen, M. Theobald, and G. Weikum. KORE: keyphrase overlap relatedness for entity disambiguation. In Proceedings of the International Conference on Information and Knowledge Management (CIKM), pages 545–554, 2012.