Company Name Dataset

Features

The following table belongs to the submission "Dissecting Company Names using Sequence Labeling", in which we trained a classifier that decomposes company names into their constituent parts. The table shows the relative and absolute information gain per feature and the corresponding performance improvement of the CRF classifier.

Features		1.21%

Dataset

This dataset contains the documents used to train and test our company name splitting classifier. It comprises 46 documents in total, divided into 39 training documents and 7 test documents.

The training and test documents are text files that contain one company name per line. Each document is accompanied by a corresponding *.ann file containing the associated annotations.

The annotations follow the format of the brat annotation tool and are defined as follows:

[Annotation ID] - The ID of the annotation.

[Tag Name] - The corresponding tag.

[Start Offset] - The starting offset of the labeled mention.

[End Offset] - The ending offset of the labeled mention.

[Surface] - The surface form of the labeled entity.

Example:

Company Name: bSolar GmbH

Elements in the annotation file:

T1 PRO_NO 0 6 bSolar

T2 LEG 7 11 GmbH
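The annotation lines above can be read with a few lines of Python. The following is a minimal sketch, not part of the dataset: the names Annotation, parse_ann_line and matches_text are hypothetical, and it assumes that each line holds exactly one contiguous span with the five fields listed above, separated by tabs or spaces. It also assumes that offsets count characters from the start of the text file, as is usual for brat, so for documents with several lines the whole file content should be passed as the text.

```python
from typing import NamedTuple


class Annotation(NamedTuple):
    """One line of a *.ann file (hypothetical helper type)."""
    ann_id: str   # [Annotation ID], e.g. "T1"
    tag: str      # [Tag Name], e.g. "LEG"
    start: int    # [Start Offset]
    end: int      # [End Offset]
    surface: str  # [Surface]


def parse_ann_line(line: str) -> Annotation:
    # Split on any whitespace (tabs or spaces) at most four times, so a
    # surface form that itself contains spaces stays in one piece.
    ann_id, tag, start, end, surface = line.strip().split(None, 4)
    return Annotation(ann_id, tag, int(start), int(end), surface)


def matches_text(ann: Annotation, text: str) -> bool:
    # Assumption: offsets index into the full text of the document.
    return text[ann.start:ann.end] == ann.surface


text = "bSolar GmbH"
ann_lines = ["T1 PRO_NO 0 6 bSolar", "T2 LEG 7 11 GmbH"]
annotations = [parse_ann_line(line) for line in ann_lines]

for ann in annotations:
    print(ann.tag, ann.surface, matches_text(ann, text))
```

Checking each surface form against the slice of the text is a cheap sanity test that the offsets were read correctly before the annotations are converted into token labels.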

Download

Dataset (171 KB)