Hasso-Plattner-Institut
Prof. Dr. Felix Naumann
  
 

CohEEL - Coherent and Efficient Named Entity Linking through Random Walks

In recent years, the ever-growing number of documents on the Web and in digital libraries has led to a considerable increase in valuable textual information about entities. Harvesting entity knowledge from these large text collections is a major challenge: it requires linking the textual mentions within the documents to the real-world entities they refer to. This process is called entity linking.

This project aims at the automatic creation of entity links from texts to a knowledge base. In contrast to recent research, which usually balances linking correctness (precision) against linking coverage (recall), this project focuses on creating reliable links by favoring precision. Linking precision is the decisive factor for subsequent tasks that build upon the linking results, such as text summarization, document classification, or topic-based clustering.
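To make the two measures concrete, the following is a minimal sketch of how precision and recall could be computed for a set of predicted entity links against a manually annotated gold standard. The function name, the representation of a mention as a (document id, character offset) pair, and all sample identifiers are assumptions made for illustration; they are not taken from the CohEEL implementation.

```python
from typing import Dict, Tuple

# Hypothetical representation: a mention is identified by
# (document id, character offset); the value is the linked entity.
Mention = Tuple[str, int]


def precision_recall(
    predicted: Dict[Mention, str], gold: Dict[Mention, str]
) -> Tuple[float, float]:
    """Return (precision, recall) of predicted links against gold links."""
    # A predicted link is correct if the gold standard maps the same
    # mention to the same entity.
    correct = sum(1 for m, e in predicted.items() if gold.get(m) == e)
    # Precision: share of created links that are correct.
    precision = correct / len(predicted) if predicted else 0.0
    # Recall: share of gold links that were found.
    recall = correct / len(gold) if gold else 0.0
    return precision, recall


# Made-up sample data: four annotated mentions, of which a cautious
# linker only links the two it is sure about.
gold = {
    ("doc1", 0): "Paris",
    ("doc1", 25): "France",
    ("doc1", 40): "Seine",
    ("doc1", 60): "Louvre",
}
predicted = {
    ("doc1", 0): "Paris",
    ("doc1", 25): "France",
}

p, r = precision_recall(predicted, gold)
print(f"precision={p:.2f} recall={r:.2f}")
```

In this made-up example the linker creates only two links, both correct, so precision is 1.0 while recall is 0.5. A precision-oriented approach accepts this kind of trade-off: fewer links, but links that downstream tasks can rely on.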

CohEEDistributed

This project aims at enabling entity recognition and alignment for huge text collections with tens of millions of documents. To reach this goal, the distributed implementation of CohEEL is built on the Apache Flink framework and uses Wikipedia as its knowledge base. The source code is available on GitHub.

Datasets

News: The news article dataset contains 100 randomly selected Reuters articles from the CoNLL-YAGO dataset [1]. Our team members carefully annotated the articles by hand with entities from YAGO. The dataset can be found here.

Encyclopedic: The encyclopedic text corpus consists of Wikipedia articles selected in 2006 by Silviu Cucerzan [2]. The original annotations are available here. However, some of the original Wikipedia articles were missing, and the YAGO alignments had to be determined. The updated dataset with the annotated YAGO entities can be found here.

Micro: The synthetic micro corpus consists of 50 short text snippets and was introduced in the AIDA project [4]. Every snippet consists of a few (usually one) hand-crafted sentences containing ambiguous mentions of named entities, and has properties similar to content on microblogging platforms such as Twitter. It is available as the KORE dataset here.

Publications

1.
Gruetze, Toni and Kasneci, Gjergji and Zuo, Zhe and Naumann, Felix
Web Semantics: Science, Services and Agents on the World Wide Web, vol. 37(C), pages 75–89, 3 2016.
http://dx.doi.org/10.1016/j.websem.2016.03.001
2.
Gruetze, Toni and Krestel, Ralf and Naumann, Felix
In Proceedings of the 21st International Conference on Applications of Natural Language to Information Systems (NLDB), volume 9612, pages 213–221, 6 2016. Springer.

References

[1] Z. Zuo, G. Kasneci, T. Gruetze, F. Naumann. BEL: Bagging for Entity Linking. In Proceedings of the International Conference on Computational Linguistics (COLING), pages 2075–2086, 2014.

[2] S. Cucerzan. Large-scale named entity disambiguation based on Wikipedia data. In Proceedings of the Joint Conference on Empirical Methods in Natural Language Processing and Computational Natural Language Learning (EMNLP-CoNLL), pages 708–716, 2007.

[3] J. Hoffart, M. A. Yosef, I. Bordino, H. Fürstenau, M. Pinkal, M. Spaniol, B. Taneva, S. Thater, and G. Weikum. Robust disambiguation of named entities in text. In Proceedings of the Conference on Empirical Methods in Natural Language Processing (EMNLP), pages 782–792, 2011.

[4] J. Hoffart, S. Seufert, D. B. Nguyen, M. Theobald, and G. Weikum. KORE: keyphrase overlap relatedness for entity disambiguation. In Proceedings of the International Conference on Information and Knowledge Management (CIKM), pages 545–554, 2012.