Features

The following table belongs to the submission "Dissecting Company Names using Sequence Labeling", in which we trained a classifier that decomposes company names into their constituent parts. For each feature, it shows the absolute and relative information gain (IG) and the corresponding performance improvement of the CRF classifier, once for the tags and once for the colloquial name.

Feature	Tags: abs. IG	Tags: rel. IG	Tags: CRF	Colloquial Name: abs. IG	Colloquial Name: rel. IG	Colloquial Name: CRF
Surface Form	1.59	0.00	0.00%	0.82	0.00	0.00%
Stemmed Surface	1.50	0.00	2.30%	0.76	0.00	0.02%
Compound Word	1.62	0.12	1.71%	0.85	0.09	0.03%
Prefix 1	0.33	0.00	0.16%	0.20	0.00	1.30%
Prefix 2	0.87	0.00	2.80%	0.44	0.00	0.69%
Prefix 3	1.30	0.00	2.54%	0.65	0.00	0.45%
Prefix 4	1.48	0.00	2.47%	0.74	0.00	0.27%
Prefix 5	1.53	0.00	1.89%	0.78	0.00	0.16%
Suffix 1	0.40	0.00	2.35%	0.22	0.00	1.18%
Suffix 2	0.84	0.00	4.88%	0.42	0.00	0.65%
Suffix 3	1.19	0.00	4.75%	0.59	0.00	0.25%
Suffix 4	1.42	0.00	4.24%	0.71	0.00	0.07%
Suffix 5	1.51	0.00	2.25%	0.77	0.00	0.10%
Prefix+Suffix 1	0.79	0.00	1.79%	0.40	0.00	0.78%
Prefix+Suffix 2	1.44	0.00	3.25%	0.72	0.00	0.12%
Prefix+Suffix 3	1.52	0.00	1.73%	0.77	0.00	-0.02%
Prefix+Suffix 4	1.53	0.00	1.53%	0.78	0.00	-0.01%
Prefix+Suffix 5	1.53	0.00	1.33%	0.78	0.00	-0.11%
Normalized Position	1.26	0.66	3.23%	1.16	0.57	4.55%
Absolute Position	0.70	0.46	2.67%	0.64	0.37	5.21%
Absolute Position (rev.)	0.54	0.50	0.02%	0.46	0.37	2.40%
Length	0.15	0.00	1.10%	0.08	0.00	1.37%
Shape	0.21	0.00	2.44%	0.13	0.00	1.58%
Long Shape	0.40	0.00	3.06%	0.23	0.00	1.54%
Number	0.00	0.00	-1.93%	0.00	0.00	1.59%
Special Character	0.01	0.00	-1.86%	0.01	0.00	1.61%
Letter Case	0.16	0.00	-0.10%	0.08	0.00	1.31%
Abbreviation	0.04	0.02	-1.67%	0.00	0.01	1.41%
Parenthesis	0.02	0.00	-2.03%	0.01	0.00	1.46%
Window Shape	1.94	0.81	5.15%	1.29	0.61	2.61%
Word ID	1.00	0.45	3.80%	0.83	0.37	3.80%
Legal Form Regex	0.15	0.03	-1.20%	0.14	0.02	2.35%
First Name	0.05	0.00	-1.28%	0.03	0.00	1.33%
Surname	0.06	0.00	1.10%	0.02	0.00	1.28%
Location	0.01	0.00	-1.14%	0.01	0.00	1.24%
Phone Book Sectors	0.01	0.02	-0.64%	0.00	0.01	1.49%
Sector SimString	0.08	0.21	0.84%	0.03	0.13	0.94%
Sector Keyword	0.00	0.00	-2.09%	0.00	0.00	1.56%
Sector Description Keyword	0.01	0.00	-1.65%	0.00	0.00	1.46%
Name Frequency	0.28	0.00	0.87%	0.21	0.00	1.58%
Text Frequency	0.15	0.00	0.70%	0.08	0.00	1.21%
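
As a rough illustration of what an absolute information gain value measures, the sketch below computes the entropy reduction of the labels given one feature. It assumes the standard definition IG(Y; X) = H(Y) - H(Y | X) with logarithm base 2; the text above does not state which definition or base was used for the table, and it does not define the relative information gain, so neither is reproduced here. The function names and the toy tokens are hypothetical.

```python
import math
from collections import Counter


def entropy(labels):
    """Shannon entropy (base 2) of a list of labels."""
    total = len(labels)
    return -sum((c / total) * math.log2(c / total)
                for c in Counter(labels).values())


def information_gain(feature_values, labels):
    """Entropy of the labels minus their entropy conditioned on the feature."""
    total = len(labels)
    groups = {}
    for value, label in zip(feature_values, labels):
        groups.setdefault(value, []).append(label)
    conditional = sum(len(g) / total * entropy(g) for g in groups.values())
    return entropy(labels) - conditional


# Toy data (hypothetical): four tokens, their tags, and two candidate features.
tags = ["PRO_NO", "LEG", "PRO_NO", "LEG"]
surface_form = ["bSolar", "GmbH", "bSolar", "GmbH"]  # separates the tags perfectly
constant = ["x", "x", "x", "x"]                      # carries no information

ig_surface = information_gain(surface_form, tags)
ig_constant = information_gain(constant, tags)
```

In this toy setting a feature that separates the tags perfectly recovers the full label entropy, while a constant feature yields no gain.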

Dataset

This dataset contains the documents used to train and test our company name splitting classifier. It comprises 46 documents in total, divided into 39 training and 7 test documents.

The training and test documents are text files that contain one company name per line. Each text file is accompanied by a corresponding *.ann file containing its annotations.

The annotations were created using the brat annotation tool. Each annotation consists of the following fields:

[Annotation ID] - The ID of the annotation.

[Tag Name] - The corresponding tag.

[Start Offset] - The starting offset of the labeled mention.

[End Offset] - The ending offset of the labeled mention.

[Surface] - The surface form of the labeled entity.

Example:

Company Name: bSolar GmbH

Elements in the annotation file:

T1 PRO_NO 0 6 bSolar

T2 LEG 7 11 GmbH
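
The example above can be read programmatically. The following is a minimal sketch, not part of the dataset's tooling: it assumes that each annotation line holds the five fields in the order listed above, separated by whitespace, and that offsets are character positions in the text with an exclusive end (as in the example, where 0 to 6 covers "bSolar"). The names `Annotation`, `parse_ann_line`, and `offsets_match` are hypothetical.

```python
from typing import List, NamedTuple


class Annotation(NamedTuple):
    ann_id: str
    tag: str
    start: int
    end: int
    surface: str


def parse_ann_line(line: str) -> Annotation:
    """Split one annotation line into its five fields.

    The surface form is taken as the remainder of the line, so it may
    itself contain spaces.
    """
    ann_id, tag, start, end, surface = line.strip().split(None, 4)
    return Annotation(ann_id, tag, int(start), int(end), surface)


def offsets_match(text: str, annotations: List[Annotation]) -> bool:
    """Check that every offset pair selects the annotated surface form."""
    return all(text[a.start:a.end] == a.surface for a in annotations)


company_name = "bSolar GmbH"
ann_lines = ["T1 PRO_NO 0 6 bSolar", "T2 LEG 7 11 GmbH"]
annotations = [parse_ann_line(line) for line in ann_lines]
consistent = offsets_match(company_name, annotations)
```

A check like `offsets_match` is a cheap way to detect misaligned offsets before the annotations are converted into token-level labels for training.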

Download

Dataset (171 KB)
