Named entity recognition (NER) of artwork titles
Artworks are an essential entity type in the art domain, and their titles are the surface forms used to mention these entities in art-historic documents. However, the nature of artwork titles makes their recognition a difficult task: they can be ambiguous, they may contain mentions of other entities such as locations and persons, and they are often composed of tokens that, without enough context, could be categorized as other syntactic constructs. Take for example "Guernica" by Pablo Picasso: without the proper context, a mention of this artwork might be confused with the place rather than the painting depicting the events that took place there.
Although deep learning models can improve performance on the NER task, they require large amounts of labeled data, which can be costly and time-consuming to obtain. Therefore, one approach we are experimenting with is to adapt models and datasets from other domains in order to reduce the amount of labeled data needed to recognize artwork titles. This approach has previously been defined as cross-domain NER.
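To make the idea concrete, the sketch below shows one common realization of such transfer: reusing an encoder fine-tuned for NER in a source domain (here news text) and re-initializing only its classification head for an art-domain label set. The checkpoint and the BIO label set are illustrative assumptions, not our actual experimental configuration.

```python
# Hedged sketch of cross-domain transfer for NER with Hugging Face
# Transformers; checkpoint and labels are illustrative assumptions.
from transformers import AutoTokenizer, AutoModelForTokenClassification

# Source-domain model: BERT fine-tuned on CoNLL-2003 (news-domain NER).
source = "dbmdz/bert-large-cased-finetuned-conll03-english"

# Hypothetical target-domain label set: BIO tags for artwork titles.
labels = ["O", "B-ARTWORK", "I-ARTWORK"]

tokenizer = AutoTokenizer.from_pretrained(source)
model = AutoModelForTokenClassification.from_pretrained(
    source,
    num_labels=len(labels),
    id2label=dict(enumerate(labels)),
    label2id={l: i for i, l in enumerate(labels)},
    ignore_mismatched_sizes=True,  # re-initialize the classification head
)
# The encoder keeps its source-domain weights; only the token-classification
# head starts from scratch, so fewer art-domain labels are needed to adapt.
```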
Latent Syntactic Structures for Named Entity Recognition
The transformer model, which is the basis of most recent pre-trained language models (PLMs), encodes multiple aspects of text into rich contextualized vectors that represent the lexical, syntactic, and semantic information captured during pre-training.
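The following brief sketch (the checkpoint is an assumption for illustration) shows where these contextualized vectors come from: the encoder yields one vector per sub-word token at every layer, and prior analyses suggest lexical information tends to dominate lower layers, syntax the middle layers, and semantics the upper ones.

```python
import torch
from transformers import AutoTokenizer, AutoModel

tokenizer = AutoTokenizer.from_pretrained("bert-base-cased")
encoder = AutoModel.from_pretrained("bert-base-cased")

inputs = tokenizer("Guernica was painted by Pablo Picasso.",
                   return_tensors="pt")
with torch.no_grad():
    outputs = encoder(**inputs, output_hidden_states=True)

# One contextualized vector per sub-word token, at every layer.
print(outputs.last_hidden_state.shape)  # (1, num_tokens, 768)
print(len(outputs.hidden_states))       # 13: embedding layer + 12 blocks
```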
We define a model that learns to disentangle a latent space from pre-trained transformer encoders, in which multi-word named entities are represented in terms of distances between their tokens. To this end, we define a loss function that clusters the tokens of an entity together in the latent space by jointly optimizing intra- and inter-cluster distances.
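A minimal sketch of such a loss, under our own simplifying assumptions, is given below: tokens of an entity are pulled toward their centroid (intra-cluster term) while non-entity tokens are pushed at least a margin away from it (inter-cluster term). The margin value and the Euclidean pairwise distance are illustrative choices, not the exact formulation.

```python
import torch

def entity_cluster_loss(hidden, entity_mask, margin=1.0):
    """hidden: (num_tokens, dim) latent vectors for one sentence.
    entity_mask: (num_tokens,) bool, True for tokens of one entity."""
    entity = hidden[entity_mask]    # tokens belonging to the entity
    other = hidden[~entity_mask]    # all remaining tokens
    centroid = entity.mean(dim=0)

    # Intra-cluster: mean distance of entity tokens to their centroid.
    intra = (entity - centroid).norm(dim=-1).mean()

    # Inter-cluster: hinge pushing non-entity tokens at least
    # `margin` away from the entity centroid.
    inter = torch.relu(margin - (other - centroid).norm(dim=-1)).mean()
    return intra + inter

# Toy usage: six token vectors; tokens 2-4 form a multi-word entity.
h = torch.randn(6, 768, requires_grad=True)
mask = torch.tensor([False, False, True, True, True, False])
loss = entity_cluster_loss(h, mask)
loss.backward()  # gradients reshape the latent space around the entity
```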