Company Name Dataset
Features
The following table belongs to the submission "Dissecting Company Names using Sequence Labeling", in which we have trained a classifier capable of decomposing company names into their constituent parts. It shows the relative and absolute information gain per feature and the corresponding performance improvement of the CRF classifier.
| Features | 1.21% |
|---|---|
Dataset
This dataset contains the documents used to train and test our company name splitting classifier. It comprises 46 documents in total: 39 training documents and 7 test documents.
The training and test documents are text files that contain one company name per line. Each is accompanied by a corresponding *.ann file containing its annotations.
The annotations use the format of the brat annotation tool, and their fields are defined as follows:
[Annotation ID] - The ID of the annotation.
[Tag Name] - The corresponding tag.
[Start Offset] - The starting offset of the labeled mention.
[End Offset] - The ending offset of the labeled mention.
[Surface] - The surface form of the labeled entity.
Example:
Company Name: bSolar GmbH
Elements in the annotation file:
T1 PRO_NO 0 6 bSolar
T2 LEG 7 11 GmbH
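The annotation lines above can be read with a few lines of Python. The following is a minimal sketch, not the code used in the submission: the names `Annotation` and `parse_annotation_line` are hypothetical, and it assumes the five fields are separated by whitespace (spaces or tabs) as in the example. The example also suggests that the end offset is exclusive, so slicing the company name with the two offsets should reproduce the surface form.

```python
from typing import NamedTuple


class Annotation(NamedTuple):
    """One line of an *.ann file (hypothetical helper type)."""
    ann_id: str   # [Annotation ID], e.g. "T1"
    tag: str      # [Tag Name], e.g. "LEG"
    start: int    # [Start Offset]
    end: int      # [End Offset], assumed exclusive
    surface: str  # [Surface]


def parse_annotation_line(line: str) -> Annotation:
    # Split into at most five fields so that a surface form
    # containing spaces stays in one piece.
    ann_id, tag, start, end, surface = line.rstrip("\n").split(None, 4)
    return Annotation(ann_id, tag, int(start), int(end), surface)


# The example from this page.
text = "bSolar GmbH"
ann_lines = ["T1 PRO_NO 0 6 bSolar", "T2 LEG 7 11 GmbH"]

annotations = [parse_annotation_line(line) for line in ann_lines]
for ann in annotations:
    # Sanity check: the offsets must select exactly the surface form.
    assert text[ann.start:ann.end] == ann.surface
```

Checking the offsets against the surface form in this way is a cheap guard against misaligned text and annotation files when loading the dataset.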
Download
- Dataset (171 KB)