13.06.2021

Paper on Data Integration for Toxic Comment Classification Accepted at ACL Workshop

We are excited to announce that the paper "Data Integration for Toxic Comment Classification: Making More Than 40 Datasets Easily Accessible in One Unified Format" by Julian Risch, Philipp Schmidt, and Ralf Krestel has been accepted for publication at the Workshop on Online Abuse and Harms, co-located with the Joint Conference of the 59th Annual Meeting of the Association for Computational Linguistics and the 11th International Joint Conference on Natural Language Processing (ACL-IJCNLP 2021).

A preprint of the paper is available, and the code is available on GitHub.

Abstract

With the rise of research on toxic comment classification, more and more annotated datasets have been released. The wide variety of the task (different languages, different labeling processes and schemes) has led to a large number of heterogeneous datasets that can be used for training and testing in very specific settings. Despite recent efforts to create web pages that provide an overview, most publications still use only a single dataset. The datasets are not stored in one central database, they come in many different data formats, and it is difficult to interpret their class labels and to reuse these labels in other projects. To overcome these issues, we present a collection of more than forty datasets in the form of a software tool that automates downloading and processing of the data and presents them in a unified data format that also offers a mapping of compatible class labels. Another advantage of the tool is that it gives an overview of the properties of available datasets, such as languages, platforms, and class labels, to make it easier to select suitable training and test data.
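
To illustrate what a unified data format with a mapping of compatible class labels could look like, here is a minimal sketch in Python. It is not the tool's actual code or API: the dataset names, label names, the `Record` fields, and the `unify` function are all hypothetical and chosen only for illustration.

```python
from dataclasses import dataclass
from typing import Dict, Iterable, List, Tuple

# Hypothetical mapping from dataset-specific labels to a shared label set.
# The dataset names and labels are invented for this sketch; they are
# not taken from the tool described in the paper.
LABEL_MAP: Dict[str, Dict[str, str]] = {
    "dataset_a": {"hate": "hate", "offensive": "offensive", "neither": "none"},
    "dataset_b": {"1": "toxic", "0": "none"},
}


@dataclass
class Record:
    """One comment in an assumed unified format."""
    text: str
    labels: List[str]
    source: str
    language: str


def unify(source: str, language: str,
          rows: Iterable[Tuple[str, str]]) -> List[Record]:
    """Convert (text, raw_label) pairs from one dataset into unified records."""
    mapping = LABEL_MAP[source]
    return [
        Record(text=text, labels=[mapping[raw]], source=source, language=language)
        for text, raw in rows
    ]


records = unify("dataset_b", "en", [("some comment", "1"), ("another one", "0")])
for record in records:
    print(record.source, record.language, record.labels, record.text)
```

Keeping the source dataset and language on every record is one way to support the kind of filtering the abstract mentions, for example selecting training data by language or by class label.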

Data Integration for Toxic Comment Classification: Making More Than 40 Datasets Easily Accessible in One Unified Format. Risch, Julian; Schmidt, Philipp; Krestel, Ralf (2021). 157–163.
