Our paper titled "Offensive Language Identification using a German BERT Model" has been accepted for publication at the GermEval workshop co-located with KONVENS (Conference on Natural Language Processing). The paper by Julian Risch, Anke Stoll, Marc Ziegele, and Ralf Krestel results from a research collaboration with colleagues at Heinrich Heine University Düsseldorf and describes our submission to this year's shared task on the identification of offensive language in German tweets. The paper can be found here.
Abstract
Pre-training language representations on large text corpora, for example, with BERT, has recently been shown to achieve impressive performance on a variety of downstream NLP tasks. So far, applying BERT to offensive language identification for German-language texts has failed due to the lack of pre-trained German-language models. In this paper, we fine-tune a BERT model that was pre-trained on 12 GB of German texts on the task of offensive language identification. This model significantly outperforms our baselines and achieves a macro F1 score of 76% on coarse-grained, 51% on fine-grained, and 73% on implicit/explicit classification. We analyze the strengths and weaknesses of the model and derive promising directions for future work.
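To illustrate the general approach, here is a minimal sketch of fine-tuning a pre-trained German BERT model for binary offensive-language classification with the Hugging Face Transformers library. The checkpoint name, hyperparameters, and toy examples below are assumptions for illustration, not the paper's actual setup or data.

import torch
from transformers import BertTokenizer, BertForSequenceClassification

# Assumed checkpoint: a publicly available German BERT; the paper's model
# was pre-trained on 12 GB of German texts.
MODEL_NAME = "bert-base-german-cased"

tokenizer = BertTokenizer.from_pretrained(MODEL_NAME)
# Two labels for the coarse-grained subtask: OTHER vs. OFFENSE.
model = BertForSequenceClassification.from_pretrained(MODEL_NAME, num_labels=2)

# Toy examples; the real training data comes from the GermEval shared task.
texts = ["Das ist ein ganz normaler Tweet.", "Du bist so ein Idiot!"]
labels = torch.tensor([0, 1])  # 0 = OTHER, 1 = OFFENSE

batch = tokenizer(texts, padding=True, truncation=True, max_length=128,
                  return_tensors="pt")

optimizer = torch.optim.AdamW(model.parameters(), lr=2e-5)

# A few fine-tuning steps over the batch (epochs/learning rate are assumed).
model.train()
for _ in range(3):
    optimizer.zero_grad()
    outputs = model(**batch, labels=labels)  # returns cross-entropy loss
    outputs.loss.backward()
    optimizer.step()

The fine-grained and implicit/explicit subtasks follow the same recipe with a different num_labels and label set.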