Hasso-Plattner-Institut
Prof. Dr. Felix Naumann
 

Features

The following table belongs to the submission "Dissecting Company Names using Sequence Labeling", in which we trained a classifier that decomposes company names into their constituent parts. For each feature, it shows the absolute and relative information gain (IG) and the corresponding performance improvement of the CRF classifier, reported separately for Tags and Colloquial Name.

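| Feature                    | Tags: abs. IG | Tags: rel. IG | Tags: CRF | Colloquial Name: abs. IG | Colloquial Name: rel. IG | Colloquial Name: CRF |
|----------------------------|---------------|---------------|-----------|--------------------------|--------------------------|----------------------|
| Surface Form               | 1.59          | 0.00          | 0.00%     | 0.82                     | 0.00                     | 0.00%                |
| Stemmed Surface            | 1.50          | 0.00          | 2.30%     | 0.76                     | 0.00                     | 0.02%                |
| Compound Word              | 1.62          | 0.12          | 1.71%     | 0.85                     | 0.09                     | 0.03%                |
| Prefix 1                   | 0.33          | 0.00          | 0.16%     | 0.20                     | 0.00                     | 1.30%                |
| Prefix 2                   | 0.87          | 0.00          | 2.80%     | 0.44                     | 0.00                     | 0.69%                |
| Prefix 3                   | 1.30          | 0.00          | 2.54%     | 0.65                     | 0.00                     | 0.45%                |
| Prefix 4                   | 1.48          | 0.00          | 2.47%     | 0.74                     | 0.00                     | 0.27%                |
| Prefix 5                   | 1.53          | 0.00          | 1.89%     | 0.78                     | 0.00                     | 0.16%                |
| Suffix 1                   | 0.40          | 0.00          | 2.35%     | 0.22                     | 0.00                     | 1.18%                |
| Suffix 2                   | 0.84          | 0.00          | 4.88%     | 0.42                     | 0.00                     | 0.65%                |
| Suffix 3                   | 1.19          | 0.00          | 4.75%     | 0.59                     | 0.00                     | 0.25%                |
| Suffix 4                   | 1.42          | 0.00          | 4.24%     | 0.71                     | 0.00                     | 0.07%                |
| Suffix 5                   | 1.51          | 0.00          | 2.25%     | 0.77                     | 0.00                     | 0.10%                |
| Prefix+Suffix 1            | 0.79          | 0.00          | 1.79%     | 0.40                     | 0.00                     | 0.78%                |
| Prefix+Suffix 2            | 1.44          | 0.00          | 3.25%     | 0.72                     | 0.00                     | 0.12%                |
| Prefix+Suffix 3            | 1.52          | 0.00          | 1.73%     | 0.77                     | 0.00                     | -0.02%               |
| Prefix+Suffix 4            | 1.53          | 0.00          | 1.53%     | 0.78                     | 0.00                     | -0.01%               |
| Prefix+Suffix 5            | 1.53          | 0.00          | 1.33%     | 0.78                     | 0.00                     | -0.11%               |
| Normalized Position        | 1.26          | 0.66          | 3.23%     | 1.16                     | 0.57                     | 4.55%                |
| Absolute Position          | 0.70          | 0.46          | 2.67%     | 0.64                     | 0.37                     | 5.21%                |
| Absolute Position (rev.)   | 0.54          | 0.50          | 0.02%     | 0.46                     | 0.37                     | 2.40%                |
| Length                     | 0.15          | 0.00          | 1.10%     | 0.08                     | 0.00                     | 1.37%                |
| Shape                      | 0.21          | 0.00          | 2.44%     | 0.13                     | 0.00                     | 1.58%                |
| Long Shape                 | 0.40          | 0.00          | 3.06%     | 0.23                     | 0.00                     | 1.54%                |
| Number                     | 0.00          | 0.00          | -1.93%    | 0.00                     | 0.00                     | 1.59%                |
| Special Character          | 0.01          | 0.00          | -1.86%    | 0.01                     | 0.00                     | 1.61%                |
| Letter Case                | 0.16          | 0.00          | -0.10%    | 0.08                     | 0.00                     | 1.31%                |
| Abbreviation               | 0.04          | 0.02          | -1.67%    | 0.00                     | 0.01                     | 1.41%                |
| Parenthesis                | 0.02          | 0.00          | -2.03%    | 0.01                     | 0.00                     | 1.46%                |
| Window Shape               | 1.94          | 0.81          | 5.15%     | 1.29                     | 0.61                     | 2.61%                |
| Word ID                    | 1.00          | 0.45          | 3.80%     | 0.83                     | 0.37                     | 3.80%                |
| Legal Form Regex           | 0.15          | 0.03          | -1.20%    | 0.14                     | 0.02                     | 2.35%                |
| First Name                 | 0.05          | 0.00          | -1.28%    | 0.03                     | 0.00                     | 1.33%                |
| Surname                    | 0.06          | 0.00          | 1.10%     | 0.02                     | 0.00                     | 1.28%                |
| Location                   | 0.01          | 0.00          | -1.14%    | 0.01                     | 0.00                     | 1.24%                |
| Phone Book Sectors         | 0.01          | 0.02          | -0.64%    | 0.00                     | 0.01                     | 1.49%                |
| Sector SimString           | 0.08          | 0.21          | 0.84%     | 0.03                     | 0.13                     | 0.94%                |
| Sector Keyword             | 0.00          | 0.00          | -2.09%    | 0.00                     | 0.00                     | 1.56%                |
| Sector Description Keyword | 0.01          | 0.00          | -1.65%    | 0.00                     | 0.00                     | 1.46%                |
| Name Frequency             | 0.28          | 0.00          | 0.87%     | 0.21                     | 0.00                     | 1.58%                |
| Text Frequency             | 0.15          | 0.00          | 0.70%     | 0.08                     | 0.00                     | 1.21%                |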
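
As a rough illustration of what an information gain value in the table measures, the following sketch computes the textbook information gain IG(Y; X) = H(Y) - H(Y | X) between a feature and the labels. This page does not state how the absolute and relative values were actually calculated, so the use of base-2 entropy, the function names, and the toy data are all assumptions; the relative variant is left out because its definition is not given here.

```python
import math
from collections import Counter, defaultdict


def entropy(labels):
    """Shannon entropy in bits of a non-empty sequence of labels."""
    total = len(labels)
    counts = Counter(labels)
    return -sum((c / total) * math.log2(c / total) for c in counts.values())


def information_gain(feature_values, labels):
    """Textbook information gain: H(labels) - H(labels | feature)."""
    # Group the labels by the feature value observed for the same token.
    groups = defaultdict(list)
    for value, label in zip(feature_values, labels):
        groups[value].append(label)
    total = len(labels)
    # Conditional entropy is the size-weighted entropy of each group.
    conditional = sum(len(g) / total * entropy(g) for g in groups.values())
    return entropy(labels) - conditional


# Hypothetical toy data: one feature value and one tag per token.
suffixes = ["ar", "bH", "ar", "bH"]
tags = ["PRO_NO", "LEG", "PRO_NO", "LEG"]
print(information_gain(suffixes, tags))
```

In this toy case the feature determines the tag completely, so the gain equals the full label entropy; a feature that is independent of the tags would yield a gain of zero.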

 

 

Dataset

This dataset contains the documents used to train and test our company name splitting classifier. It comprises 46 documents in total, divided into 39 training documents and 7 test documents.


The training and test documents are text files that contain one company name per line. Each document is accompanied by a corresponding *.ann file containing its annotations.
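
A minimal sketch of how the document pairs might be loaded is shown below. It assumes that the text files use a .txt extension, that each *.ann file shares the base name of its text file, and that the files are UTF-8 encoded; none of this is stated on this page, and the function names are hypothetical.

```python
from pathlib import Path


def paired_documents(directory):
    """Yield (text_path, ann_path) for every .txt file with a sibling .ann file."""
    for text_path in sorted(Path(directory).glob("*.txt")):
        # Assumption: the annotation file shares the text file's base name.
        ann_path = text_path.with_suffix(".ann")
        if ann_path.exists():
            yield text_path, ann_path


def read_company_names(text_path):
    """Return the company names of one document, one per line."""
    return Path(text_path).read_text(encoding="utf-8").splitlines()
```

Text files without a matching annotation file are skipped rather than reported, which keeps the sketch short; a stricter loader might raise an error instead.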


The annotations were created using the brat annotation tool. Each annotation consists of the following fields:


[Annotation ID] - The ID of the annotation.

[Tag Name] - The corresponding tag.

[Start Offset] - The starting offset of the labeled mention.

[End Offset] - The ending offset of the labeled mention.

[Surface] - The surface form of the labeled entity.


Example:

Company Name: bSolar GmbH


Elements in the annotation file:

T1 PRO_NO 0 6 bSolar

T2 LEG 7 11 GmbH
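
The example above can be parsed with a few lines of Python. This is a sketch under stated assumptions rather than a full reader for the brat format: it handles only single-span annotations with the five fields listed above, splits on any whitespace (tabs or spaces), and treats the end offset as exclusive, which is what the example offsets imply. The names Annotation, parse_annotation, and matches_text are hypothetical.

```python
from typing import NamedTuple


class Annotation(NamedTuple):
    ann_id: str
    tag: str
    start: int
    end: int
    surface: str


def parse_annotation(line):
    """Parse one annotation line such as 'T1 PRO_NO 0 6 bSolar'."""
    # Split on any whitespace; the fifth part keeps the rest of the line,
    # so surface forms that contain spaces stay intact.
    ann_id, tag, start, end, surface = line.rstrip("\n").split(None, 4)
    return Annotation(ann_id, tag, int(start), int(end), surface)


def matches_text(annotation, text):
    """Check that the offsets select exactly the recorded surface form."""
    return text[annotation.start:annotation.end] == annotation.surface


text = "bSolar GmbH"
for line in ["T1 PRO_NO 0 6 bSolar", "T2 LEG 7 11 GmbH"]:
    ann = parse_annotation(line)
    print(ann.tag, ann.surface, matches_text(ann, text))
```

Because the offsets are character positions in the text file, the consistency check should be run against the full content of a document rather than against a single line when a file holds more than one company name.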
