Hasso-Plattner-Institut
Prof. Dr. Felix Naumann
  
 

Features

The following table belongs to the submission "Dissecting Company Names using Sequence Labeling", in which we trained a classifier that decomposes company names into their constituent parts. For each feature, it shows the absolute and relative information gain (IG) and the corresponding change in performance of the CRF classifier, reported separately under Tags and Colloquial Name.

| Feature | Tags: abs. IG | Tags: rel. IG | Tags: CRF | Colloquial Name: abs. IG | Colloquial Name: rel. IG | Colloquial Name: CRF |
|---|---|---|---|---|---|---|
| Surface Form | 1.59 | 0.00 | 0.00% | 0.82 | 0.00 | 0.00% |
| Stemmed Surface | 1.50 | 0.00 | 2.30% | 0.76 | 0.00 | 0.02% |
| Compound Word | 1.62 | 0.12 | 1.71% | 0.85 | 0.09 | 0.03% |
| Prefix 1 | 0.33 | 0.00 | 0.16% | 0.20 | 0.00 | 1.30% |
| Prefix 2 | 0.87 | 0.00 | 2.80% | 0.44 | 0.00 | 0.69% |
| Prefix 3 | 1.30 | 0.00 | 2.54% | 0.65 | 0.00 | 0.45% |
| Prefix 4 | 1.48 | 0.00 | 2.47% | 0.74 | 0.00 | 0.27% |
| Prefix 5 | 1.53 | 0.00 | 1.89% | 0.78 | 0.00 | 0.16% |
| Suffix 1 | 0.40 | 0.00 | 2.35% | 0.22 | 0.00 | 1.18% |
| Suffix 2 | 0.84 | 0.00 | 4.88% | 0.42 | 0.00 | 0.65% |
| Suffix 3 | 1.19 | 0.00 | 4.75% | 0.59 | 0.00 | 0.25% |
| Suffix 4 | 1.42 | 0.00 | 4.24% | 0.71 | 0.00 | 0.07% |
| Suffix 5 | 1.51 | 0.00 | 2.25% | 0.77 | 0.00 | 0.10% |
| Prefix+Suffix 1 | 0.79 | 0.00 | 1.79% | 0.40 | 0.00 | 0.78% |
| Prefix+Suffix 2 | 1.44 | 0.00 | 3.25% | 0.72 | 0.00 | 0.12% |
| Prefix+Suffix 3 | 1.52 | 0.00 | 1.73% | 0.77 | 0.00 | -0.02% |
| Prefix+Suffix 4 | 1.53 | 0.00 | 1.53% | 0.78 | 0.00 | -0.01% |
| Prefix+Suffix 5 | 1.53 | 0.00 | 1.33% | 0.78 | 0.00 | -0.11% |
| Normalized Position | 1.26 | 0.66 | 3.23% | 1.16 | 0.57 | 4.55% |
| Absolute Position | 0.70 | 0.46 | 2.67% | 0.64 | 0.37 | 5.21% |
| Absolute Position (rev.) | 0.54 | 0.50 | 0.02% | 0.46 | 0.37 | 2.40% |
| Length | 0.15 | 0.00 | 1.10% | 0.08 | 0.00 | 1.37% |
| Shape | 0.21 | 0.00 | 2.44% | 0.13 | 0.00 | 1.58% |
| Long Shape | 0.40 | 0.00 | 3.06% | 0.23 | 0.00 | 1.54% |
| Number | 0.00 | 0.00 | -1.93% | 0.00 | 0.00 | 1.59% |
| Special Character | 0.01 | 0.00 | -1.86% | 0.01 | 0.00 | 1.61% |
| Letter Case | 0.16 | 0.00 | -0.10% | 0.08 | 0.00 | 1.31% |
| Abbreviation | 0.04 | 0.02 | -1.67% | 0.00 | 0.01 | 1.41% |
| Parenthesis | 0.02 | 0.00 | -2.03% | 0.01 | 0.00 | 1.46% |
| Window Shape | 1.94 | 0.81 | 5.15% | 1.29 | 0.61 | 2.61% |
| Word ID | 1.00 | 0.45 | 3.80% | 0.83 | 0.37 | 3.80% |
| Legal Form Regex | 0.15 | 0.03 | -1.20% | 0.14 | 0.02 | 2.35% |
| First Name | 0.05 | 0.00 | -1.28% | 0.03 | 0.00 | 1.33% |
| Surname | 0.06 | 0.00 | 1.10% | 0.02 | 0.00 | 1.28% |
| Location | 0.01 | 0.00 | -1.14% | 0.01 | 0.00 | 1.24% |
| Phone Book Sectors | 0.01 | 0.02 | -0.64% | 0.00 | 0.01 | 1.49% |
| Sector SimString | 0.08 | 0.21 | 0.84% | 0.03 | 0.13 | 0.94% |
| Sector Keyword | 0.00 | 0.00 | -2.09% | 0.00 | 0.00 | 1.56% |
| Sector Description Keyword | 0.01 | 0.00 | -1.65% | 0.00 | 0.00 | 1.46% |
| Name Frequency | 0.28 | 0.00 | 0.87% | 0.21 | 0.00 | 1.58% |
| Text Frequency | 0.15 | 0.00 | 0.70% | 0.08 | 0.00 | 1.21% |
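
As a rough illustration of what an absolute information gain value measures, the following sketch computes the standard entropy-based information gain of one feature with respect to the token labels. It is not the code used to produce the table: the toy tokens, the helper names `entropy` and `information_gain`, and the use of base-2 logarithms are assumptions, and the relative variant is left out because this page does not define how it is derived.

```python
import math
from collections import Counter, defaultdict


def entropy(labels):
    """Shannon entropy (in bits, an assumption) of a list of labels."""
    total = len(labels)
    counts = Counter(labels)
    return -sum((c / total) * math.log2(c / total) for c in counts.values())


def information_gain(feature_values, labels):
    """Entropy of the labels minus their entropy conditioned on the feature."""
    groups = defaultdict(list)
    for value, label in zip(feature_values, labels):
        groups[value].append(label)
    total = len(labels)
    conditional = sum(len(g) / total * entropy(g) for g in groups.values())
    return entropy(labels) - conditional


# Hypothetical toy data: four tokens with their tags.
tokens = ["bSolar", "GmbH", "Acme", "GmbH"]
tags = ["PRO_NO", "LEG", "PRO_NO", "LEG"]

# A feature in the spirit of "Suffix 1": the last character of each token.
suffix_1 = [token[-1] for token in tokens]
ig_suffix = information_gain(suffix_1, tags)

# A constant feature carries no information about the tags.
ig_constant = information_gain(["x"] * len(tokens), tags)

print(f"suffix feature: {ig_suffix:.2f}, constant feature: {ig_constant:.2f}")
```

On this toy data the suffix feature separates the two tags perfectly, so its gain equals the full label entropy, while the constant feature yields a gain of zero.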

 

 

 

Dataset

This dataset contains the documents used to train and test our company name splitting classifier. It comprises 46 documents in total, divided into 39 training and 7 test documents.


The training and test documents are text files that contain one company name per line. Each is accompanied by a corresponding *.ann file containing the associated annotations.


The annotations were created using the brat annotation tool. Each annotation line consists of the following fields, in this order:


[Annotation ID] - The ID of the annotation.

[Tag Name] - The tag assigned to the mention.

[Start Offset] - The character offset at which the labeled mention starts.

[End Offset] - The character offset at which the labeled mention ends (exclusive).

[Surface] - The surface form of the labeled mention.


Example:

Company Name: bSolar GmbH


Elements in the annotation file:

T1 PRO_NO 0 6 bSolar

T2 LEG 7 11 GmbH
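
The field layout above can be read with a few lines of Python. The sketch below parses the two example annotation lines and checks that each offset pair selects the stated surface form in the company name. The `Annotation` type and the helpers `parse_ann_line` and `matches_text` are illustrative names, not part of the dataset or of brat. The sketch assumes whitespace-separated fields and a single contiguous span per annotation.

```python
from typing import NamedTuple


class Annotation(NamedTuple):
    ann_id: str
    tag: str
    start: int
    end: int
    surface: str


def parse_ann_line(line: str) -> Annotation:
    """Split one annotation line into its five fields.

    Splitting at most four times keeps surface forms that
    contain spaces intact in the last field.
    """
    ann_id, tag, start, end, surface = line.rstrip("\n").split(None, 4)
    return Annotation(ann_id, tag, int(start), int(end), surface)


def matches_text(text: str, ann: Annotation) -> bool:
    """Check that the offsets select the recorded surface form."""
    return text[ann.start:ann.end] == ann.surface


# The example from this page.
text = "bSolar GmbH"
ann_lines = ["T1 PRO_NO 0 6 bSolar", "T2 LEG 7 11 GmbH"]

annotations = [parse_ann_line(line) for line in ann_lines]
for ann in annotations:
    print(ann.tag, repr(text[ann.start:ann.end]), matches_text(text, ann))
```

For a document with several company names, the offsets would presumably be counted from the start of the whole text file rather than from the start of each line, so `text` should then be the full file content.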
