This dataset contains the documents used to train and test our company name splitting classifier. It contains a total of 46 documents, divided into 39 training documents and 7 test documents.
The training and test documents are text files that contain one company name per line. Each text file is accompanied by a corresponding *.ann file containing its annotations.
The annotations were created using the brat annotation tool. Each line of an annotation file describes one annotation and consists of the following fields:
[Annotation ID] - The ID of the annotation.
[Tag Name] - The corresponding tag.
[Start Offset] - The starting offset of the labeled mention.
[End Offset] - The ending offset of the labeled mention.
[Surface] - The surface form of the labeled entity.
Example:
Company Name: bSolar GmbH
Elements in the annotation file:
T1 PRO_NO 0 6 bSolar
T2 LEG 7 11 GmbH
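The field layout above can be read with a few lines of Python. The following is a minimal sketch, not part of the dataset tooling: the names Annotation, parse_annotation and matches_text are hypothetical, and it assumes that every annotation line holds exactly the five fields listed above, separated by spaces or tabs. It treats the end offset as exclusive, which is what the example implies (0 to 6 covers the six characters of "bSolar").

```python
from typing import NamedTuple


class Annotation(NamedTuple):
    """One line of an annotation file (hypothetical helper type)."""
    ann_id: str
    tag: str
    start: int
    end: int
    surface: str


def parse_annotation(line: str) -> Annotation:
    # Split on any whitespace at most four times, so that a surface form
    # containing spaces stays in one piece as the fifth field.
    ann_id, tag, start, end, surface = line.rstrip("\r\n").split(None, 4)
    return Annotation(ann_id, tag, int(start), int(end), surface)


def matches_text(text: str, ann: Annotation) -> bool:
    # The end offset is treated as exclusive, like a Python slice.
    return text[ann.start:ann.end] == ann.surface


if __name__ == "__main__":
    name = "bSolar GmbH"
    for raw in ["T1 PRO_NO 0 6 bSolar", "T2 LEG 7 11 GmbH"]:
        ann = parse_annotation(raw)
        print(ann.tag, name[ann.start:ann.end], matches_text(name, ann))
```

In this sketch the text is a single company name. brat counts offsets from the start of the text file, so for a document with several lines the full file contents, not an individual line, would be passed as the text argument.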