Hasso-Plattner-Institut
Prof. Dr. Felix Naumann

23.05.2024

Paper accepted at NLDB 2024

We are excited to announce that the paper "Shact: Disentangling and Clustering Latent Syntactic Structures from Transformer Encoders" has been accepted for presentation at the 29th International Conference on Applications of Natural Language to Information Systems (NLDB 2024).

Authors:

Alejandro Sierra-Múnera (Hasso Plattner Institute)
Ralf Krestel (ZBW - Leibniz Information Center for Economics, Kiel University)

Abstract:

Transformer-encoder architectures for language modeling provide rich contextualized vectors, representing both syntactic and semantic information captured during pre-training. These vectors are useful for many downstream tasks, but using only the final-layer representations may hide interesting elements represented in the hidden layers.
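
To make this concrete, the per-layer contextualized vectors of a pre-trained encoder are directly accessible. The following minimal sketch (using the Hugging Face transformers library and the bert-base-cased checkpoint, both illustrative assumptions rather than a toolkit prescribed by the paper) prints the shape of every layer's representations:

    # Minimal sketch: inspect the per-layer contextualized vectors of a
    # transformer encoder. Model and library choices are illustrative.
    import torch
    from transformers import AutoTokenizer, AutoModel

    tokenizer = AutoTokenizer.from_pretrained("bert-base-cased")
    model = AutoModel.from_pretrained("bert-base-cased",
                                      output_hidden_states=True)

    inputs = tokenizer("Transformer encoders capture syntax.",
                       return_tensors="pt")
    with torch.no_grad():
        outputs = model(**inputs)

    # outputs.hidden_states is a tuple: the embedding layer plus one tensor
    # per encoder layer, each of shape (batch, sequence_length, hidden_size).
    for layer_idx, layer in enumerate(outputs.hidden_states):
        print(layer_idx, tuple(layer.shape))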

In this paper, we propose Shact (Syntactic Hierarchical Agglomerative Clustering from Transformer-Encoders), a model that disentangles syntactic span representations from these hidden representations into a latent vector space. In our model, spans are expressed in terms of token distances. We propose a loss function that optimizes the neural disentanglement model from ground-truth spans, and we propose to integrate these latent-space vectors into a two-phase model via hierarchical clustering, suitable for multiple span recognition tasks.
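
The hierarchical clustering phase can be pictured with standard agglomerative clustering over latent vectors. The sketch below uses scipy, with random vectors standing in for the disentangled span representations; the linkage method and cut threshold are illustrative assumptions, not the paper's actual configuration:

    # Illustrative sketch: agglomerative clustering of latent span vectors.
    # The random vectors stand in for the disentangled representations;
    # linkage method and cut threshold are assumptions, not the paper's setup.
    import numpy as np
    from scipy.cluster.hierarchy import linkage, fcluster

    rng = np.random.default_rng(0)
    latent_vectors = rng.normal(size=(20, 16))  # 20 spans, 16-dim latent space

    # Build the dendrogram with average linkage on Euclidean distances.
    dendrogram = linkage(latent_vectors, method="average", metric="euclidean")

    # Cut the hierarchy at a distance threshold to obtain span clusters.
    cluster_ids = fcluster(dendrogram, t=5.0, criterion="distance")
    print(cluster_ids)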
We evaluated our approach on flat and nested named entity recognition as well as chunking, showing the model's ability to discover these spans and achieving competitive results on the full recognition and classification tasks.