Hasso-Plattner-Institut
Prof. Dr. Felix Naumann
  
 

Zhe Zuo

Hasso-Plattner-Institut
für Softwaresystemtechnik
Prof.-Dr.-Helmert-Straße 2-3
D-14482 Potsdam

Phone: +49 331 5509 177
Fax: +49 331 5509 287
Room: G-3.2.09

 


Research Interests

  • Entity Linking
  • Web Mining
  • Information Extraction

Publications

Improving Company Recognition from Unstructured Text by using Dictionaries

Michael Loster, Zhe Zuo, Felix Naumann, Oliver Maspfuhl, Dirk Thomas
Proceedings of the International Conference on Extending Database Technology (EDBT), 2017 (accepted)

Abstract:

While named entity recognition is a much-addressed research topic, recognizing companies in text is particularly difficult. Company names are extremely heterogeneous in structure: a given company can be referenced in many different ways, and company names may include person names, locations, acronyms, numbers, and other unusual tokens. Further, instead of the official company name, the general public frequently uses quite different colloquial names. We present a machine learning (CRF) system that reliably recognizes organizations in German texts. In particular, we construct and employ various dictionaries, regular expressions, text context, and other techniques to improve the results. In our experiments, we achieved a precision of 91.11% and a recall of 78.82%, showing significant improvement over related work. Using our system, we were able to extract 263,846 company mentions from a corpus of 141,970 newspaper articles.

Keywords:

NER, named entity recognition, companies, company names, CRF, conditional random fields, recognition

BibTeX file

@inproceedings{Michael2017a,
author = { Michael Loster and Zhe Zuo and Felix Naumann and Oliver Maspfuhl and Dirk Thomas },
title = { Improving Company Recognition from Unstructured Text by using Dictionaries },
booktitle = { Proceedings of the International Conference on Extending Database Technology (EDBT) },
year = { 2017 },
abstract = { While named entity recognition is a much-addressed research topic, recognizing companies in text is particularly difficult. Company names are extremely heterogeneous in structure: a given company can be referenced in many different ways, and company names may include person names, locations, acronyms, numbers, and other unusual tokens. Further, instead of the official company name, the general public frequently uses quite different colloquial names. We present a machine learning (CRF) system that reliably recognizes organizations in German texts. In particular, we construct and employ various dictionaries, regular expressions, text context, and other techniques to improve the results. In our experiments, we achieved a precision of 91.11% and a recall of 78.82%, showing significant improvement over related work. Using our system, we were able to extract 263,846 company mentions from a corpus of 141,970 newspaper articles. },
affiliation = { Hasso Plattner Institute, Potsdam, Germany },
keywords = { NER, named entity recognition, companies, company names, CRF, conditional random fields, recognition },
isbn = { 978-3-89318-073-8 }
}

Copyright Notice

This material is presented to ensure timely dissemination of scholarly and technical work. Copyright and all rights therein are retained by authors or by other copyright holders. All persons copying this information are expected to adhere to the terms and constraints invoked by each author's copyright. In most cases, these works may not be reposted without the explicit permission of the copyright holder.

last change: Fri, 10 Feb 2017 11:53:00 +0100