Hasso-Plattner-Institut
  
Prof. Dr. Felix Naumann
  
 

CohEEL - Coherent and Efficient Named Entity Linking through Random Walks

In recent years, the ever-growing number of documents on the Web and in digital libraries has led to a considerable increase in valuable textual information about entities. Harvesting entity knowledge from these large text collections is a major challenge. It requires linking the textual mentions within the documents to their real-world entities. This process is called entity linking.

This project aims at the automatic creation of entity links from texts to a knowledge base. In contrast to recent research, which usually balances the rate of correct links (precision) against the linking coverage (recall), this project focuses on creating reliable links by favoring precision. Linking precision is the decisive factor for subsequent tasks that build upon the linking results, such as text summarization, document classification, or topic-based clustering.
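The trade-off between precision and recall can be illustrated with a small sketch. It is not CohEEL's linking method: the confidence scores, the thresholds, the example mentions, and the helper names `precision_recall` and `links_above` are all hypothetical, chosen only to show how keeping fewer but more reliable links raises precision at the cost of recall.

```python
from typing import Dict, Tuple

# A mention is identified by (document id, character offset); both are illustrative.
Mention = Tuple[str, int]


def precision_recall(predicted: Dict[Mention, str],
                     gold: Dict[Mention, str]) -> Tuple[float, float]:
    """Return (precision, recall) of predicted entity links against gold links."""
    correct = sum(1 for mention, entity in predicted.items()
                  if gold.get(mention) == entity)
    precision = correct / len(predicted) if predicted else 0.0
    recall = correct / len(gold) if gold else 0.0
    return precision, recall


def links_above(scored: Dict[Mention, Tuple[str, float]],
                threshold: float) -> Dict[Mention, str]:
    """Keep only links whose (hypothetical) confidence score reaches the threshold."""
    return {mention: entity
            for mention, (entity, score) in scored.items()
            if score >= threshold}


# Hand-made gold annotations (assumed for illustration only).
gold = {
    ("doc1", 0): "Paris",
    ("doc1", 25): "France",
    ("doc2", 4): "Paris_Hilton",
    ("doc2", 40): "Hilton_Hotels",
}

# Hypothetical linker output: entity plus a confidence score per mention.
scored = {
    ("doc1", 0): ("Paris", 0.95),
    ("doc1", 25): ("France", 0.90),
    ("doc2", 4): ("Paris", 0.40),          # wrong entity, low confidence
    ("doc2", 40): ("Hilton_Hotels", 0.55),
}

# A low threshold keeps every link: coverage is high, but one link is wrong.
balanced = precision_recall(links_above(scored, 0.3), gold)

# A high threshold keeps only the most confident links: fewer, but all correct.
precise = precision_recall(links_above(scored, 0.8), gold)

print("balanced (precision, recall):", balanced)
print("precise  (precision, recall):", precise)
```

With these made-up scores, raising the threshold drops the one wrong link together with one correct link, so precision rises while recall falls. A downstream task that relies on every link being correct would prefer the second setting.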

CohEEDistributed

This project aims at enabling entity recognition and alignment on huge text collections with tens of millions of documents. To reach this goal, this distributed implementation of CohEEL is built on the Apache Flink framework and uses Wikipedia as its knowledge base. The source code is available on GitHub.

Datasets

News: The news article dataset contains 100 randomly picked Reuters articles from the CoNLL-YAGO dataset [1]. Our team members manually annotated the articles with entities from YAGO; the annotated articles can be found here.

Encyclopedic: The encyclopedic text corpus consists of Wikipedia articles selected in 2006 by Silviu Cucerzan [2]. The original annotations are available here. However, some of the original Wikipedia articles were missing, and the alignments of the annotations to YAGO entities had to be determined. The updated dataset with the annotated YAGO entities can be found here.

Micro: The synthetic micro corpus consists of 50 short text snippets and was introduced in the AIDA project [4]. Every text snippet consists of a few (usually one) hand-crafted sentences containing different ambiguous mentions of named entities, and has properties similar to the content of microblogging platforms such as Twitter. It is available as the KORE dataset here.

Publications

Topic Shifts in StackOverflow: Ask it like Socrates

Gruetze, Toni and Krestel, Ralf and Naumann, Felix
In Proceedings of the 21st International Conference on Applications of Natural Language to Information Systems (NLDB), volume 9612, pages 213–221. Springer, June 2016.
hpi.de/fileadmin/user_upload/fachgebiete/naumann/publications/2015/tsiso_nldb.pdf

DOI: 10.1007/978-3-319-41754-7_18

Abstract:

Community based question-and-answer (Q&A) sites rely on well posed and appropriately tagged questions. However, most platforms have only limited capabilities to support their users in finding the right tags. In this paper, we propose a temporal recommendation model to support users in tagging new questions and thus improve their acceptance in the community. To underline the necessity of temporal awareness of such a model, we first investigate the changes in tag usage and show different types of collective attention in StackOverflow, a community-driven Q&A website for computer programming topics. Furthermore, we examine the changes over time in the correlation between question terms and topics. Our results show that temporal awareness is indeed important for recommending tags in Q&A communities.

BibTeX file

@inproceedings{GruetzeStackOverflow2016,
author = { Gruetze, Toni and Krestel, Ralf and Naumann, Felix },
title = { Topic Shifts in StackOverflow: Ask it like Socrates },
series = { Lecture Notes in Computer Science },
year = { 2016 },
volume = { 9612 },
pages = { 213--221 },
month = { 6 },
abstract = { Community based question-and-answer (Q&A) sites rely on well posed and appropriately tagged questions. However, most platforms have only limited capabilities to support their users in finding the right tags. In this paper, we propose a temporal recommendation model to support users in tagging new questions and thus improve their acceptance in the community. To underline the necessity of temporal awareness of such a model, we first investigate the changes in tag usage and show different types of collective attention in StackOverflow, a community-driven Q&A website for computer programming topics. Furthermore, we examine the changes over time in the correlation between question terms and topics. Our results show that temporal awareness is indeed important for recommending tags in Q&A communities. },
url = { hpi.de/fileadmin/user_upload/fachgebiete/naumann/publications/2015/tsiso_nldb.pdf },
publisher = { Springer },
booktitle = { Proceedings of the 21st International Conference on Applications of Natural Language to Information Systems (NLDB) },
priority = { 0 }
}

Copyright Notice

This material is presented to ensure timely dissemination of scholarly and technical work. Copyright and all rights therein are retained by authors or by other copyright holders. All persons copying this information are expected to adhere to the terms and constraints invoked by each author's copyright. In most cases, these works may not be reposted without the explicit permission of the copyright holder.

last change: Thu, 02 Mar 2017 13:16:08 +0100

References

[1] Z. Zuo, G. Kasneci, T. Gruetze, and F. Naumann. BEL: Bagging for Entity Linking. In Proceedings of the International Conference on Computational Linguistics (COLING), pages 2075–2086, 2014.

[2] S. Cucerzan. Large-scale named entity disambiguation based on Wikipedia data. In Proceedings of the Joint Conference on Empirical Methods in Natural Language Processing and Computational Natural Language Learning (EMNLP-CoNLL), pages 708–716, 2007.

[3] J. Hoffart, M. A. Yosef, I. Bordino, H. Fürstenau, M. Pinkal, M. Spaniol, B. Taneva, S. Thater, and G. Weikum. Robust disambiguation of named entities in text. In Proceedings of the Conference on Empirical Methods in Natural Language Processing (EMNLP), pages 782–792, 2011.

[4] J. Hoffart, S. Seufert, D. B. Nguyen, M. Theobald, and G. Weikum. KORE: keyphrase overlap relatedness for entity disambiguation. In Proceedings of the International Conference on Information and Knowledge Management (CIKM), pages 545–554, 2012.