07.03.2020

Two Full Papers Accepted at JCDL 2020

We are happy to announce that our papers "Hierarchical Document Classification as a Sequence Generation Task" and "Visualising Large Document Collections by Jointly Modeling Text and Network Structure" have been accepted as full papers at the ACM/IEEE Joint Conference on Digital Libraries (JCDL 2020).

Hierarchical Document Classification as a Sequence Generation Task

[More information][Preprint]

Authors
Julian Risch, Samuele Garda, Ralf Krestel

Abstract
Hierarchical classification schemes are an effective and natural way to organize large document collections. However, complex schemes make manual classification time-consuming and dependent on domain experts. Current machine learning approaches for hierarchical classification do not exploit all the information contained in the hierarchical schemes. During training, they do not make full use of the inherent parent-child relation of classes. For example, they neglect to tailor document representations, such as embeddings, to each individual hierarchy level.

Our model overcomes these problems by addressing hierarchical classification as a sequence generation task. To this end, our neural network transforms a sequence of input words into a sequence of labels, which represents a path through a tree-structured hierarchy scheme. The evaluation uses a patent corpus that comprises millions of documents and exhibits a complex class hierarchy scheme and high-quality annotations from domain experts. We re-implemented five models from related work and show that our basic model achieves competitive results in comparison with the best approach. A variation of our model that uses the recent Transformer architecture outperforms the other approaches. The error analysis reveals that the encoder of our model has the strongest influence on its classification performance.
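To make the idea of a label sequence as a path through the hierarchy concrete, here is a minimal sketch. It is not the authors' implementation and contains no neural network: the toy hierarchy, its label names, and the helper functions are all hypothetical and only illustrate how a leaf class corresponds to a top-down sequence of labels, and how a generated sequence can be checked against the parent-child relation.

```python
# Toy hierarchy with hypothetical labels: child -> parent.
# Top-level classes have the parent None.
PARENT = {
    "A": None,
    "B": None,
    "A01": "A",
    "A02": "A",
    "B01": "B",
    "A01B": "A01",
    "A01C": "A01",
    "B01D": "B01",
}


def label_path(leaf):
    """Return the labels from the top level down to `leaf`."""
    path = []
    node = leaf
    while node is not None:
        path.append(node)
        node = PARENT[node]
    return path[::-1]


def children(label):
    """Return the labels whose parent is `label` (None means the top level)."""
    return sorted(c for c, p in PARENT.items() if p == label)


def is_valid_path(path):
    """Check that every label is a child of the label before it."""
    previous = None
    for label in path:
        # Unknown labels get a sentinel parent that never matches.
        if PARENT.get(label, "missing") != previous:
            return False
        previous = label
    return True


# Target sequence a sequence generation model would be trained to emit
# for a document annotated with the leaf class "A01C".
target = label_path("A01C")
```

In such a setup, a decoder could restrict its choice at each step to `children(previous_label)`, so that every generated sequence is a valid path; whether the paper's model applies such a constraint is not stated in the abstract.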

Visualising Large Document Collections by Jointly Modeling Text and Network Structure

This work is part of the Mímir Project.

Authors
Tim Repke and Ralf Krestel

Abstract
Many large text collections exhibit graph structures, either inherent to the content itself or encoded in the metadata of the individual documents.
Example graphs extracted from document collections are co-author networks, citation networks, or named-entity-cooccurrence networks.
Furthermore, social networks can be extracted from email corpora, tweets, or social media.
When it comes to visualising these large corpora, either the textual content or the network graph is used.

In this paper, we propose to incorporate both text and graph to visualise not only the semantic information encoded in the documents' content but also the relationships expressed by the inherent network structure.
To this end, we introduce a novel algorithm based on multi-objective optimisation to jointly position embedded documents and graph nodes in a two-dimensional landscape.
We illustrate the effectiveness of our approach with real-world datasets and show that we can capture the semantics of large document collections better than other visualisations based on either the content or the network information.
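
The abstract does not spell out the optimisation itself, so the following is only a rough sketch of the general idea of trading off two objectives in one two-dimensional layout, not the algorithm from the paper. It assumes a simple weighted sum of a text objective (layout distances should match given text distances) and a graph objective (linked items should be close), minimised by plain gradient descent. The function names, the `alpha` weight, and the toy data are hypothetical.

```python
import math
import random


def layout_loss(pos, text_dist, edges, alpha):
    """Weighted sum of the text objective and the graph objective."""
    n = len(pos)
    text_term = 0.0
    for i in range(n):
        for j in range(i + 1, n):
            d = math.hypot(pos[i][0] - pos[j][0], pos[i][1] - pos[j][1])
            text_term += (d - text_dist[i][j]) ** 2
    graph_term = 0.0
    for i, j in edges:
        graph_term += (pos[i][0] - pos[j][0]) ** 2 + (pos[i][1] - pos[j][1]) ** 2
    return alpha * text_term + (1 - alpha) * graph_term


def joint_layout(text_dist, edges, alpha=0.5, steps=500, lr=0.05, seed=0):
    """Place n items in 2D by gradient descent on `layout_loss`.

    text_dist: symmetric n x n matrix of target distances between texts.
    edges: list of (i, j) index pairs taken from the network structure.
    alpha: weight of the text objective; 1 - alpha weights the graph objective.
    """
    rng = random.Random(seed)
    n = len(text_dist)
    pos = [[rng.uniform(-1, 1), rng.uniform(-1, 1)] for _ in range(n)]
    for _ in range(steps):
        grad = [[0.0, 0.0] for _ in range(n)]
        # Gradient of the text objective: 2 * (d - target) * (p_i - p_j) / d.
        for i in range(n):
            for j in range(i + 1, n):
                dx = pos[i][0] - pos[j][0]
                dy = pos[i][1] - pos[j][1]
                d = math.hypot(dx, dy) or 1e-9  # avoid division by zero
                coeff = alpha * 2 * (d - text_dist[i][j]) / d
                grad[i][0] += coeff * dx
                grad[i][1] += coeff * dy
                grad[j][0] -= coeff * dx
                grad[j][1] -= coeff * dy
        # Gradient of the graph objective: 2 * (p_i - p_j) for each edge.
        for i, j in edges:
            dx = pos[i][0] - pos[j][0]
            dy = pos[i][1] - pos[j][1]
            coeff = (1 - alpha) * 2
            grad[i][0] += coeff * dx
            grad[i][1] += coeff * dy
            grad[j][0] -= coeff * dx
            grad[j][1] -= coeff * dy
        for i in range(n):
            pos[i][0] -= lr * grad[i][0]
            pos[i][1] -= lr * grad[i][1]
    return pos


# Toy data: items 0/1 and 2/3 are textually similar; edges link 0-1 and 1-2.
TEXT_DIST = [
    [0.0, 0.2, 1.0, 1.0],
    [0.2, 0.0, 1.0, 1.0],
    [1.0, 1.0, 0.0, 0.2],
    [1.0, 1.0, 0.2, 0.0],
]
EDGES = [(0, 1), (1, 2)]
layout = joint_layout(TEXT_DIST, EDGES, alpha=0.5)
```

Setting `alpha=1.0` in this sketch yields a layout driven by text alone and `alpha=0.0` one driven by the graph alone, which mirrors the two single-source visualisations the abstract compares against.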