Risch, J., Repke, T., Kohlmeyer, L., Krestel, R.: ComEx: Comment Exploration on Online News Platforms. Joint Proceedings of the ACM IUI 2021 Workshops co-located with the 26th ACM Conference on Intelligent User Interfaces (IUI). pp. 1–7. CEUR-WS.org (2021).
The comment sections of online news platforms have shaped the way in which people express their opinion online. However, due to the overwhelming number of comments, no in-depth discussions emerge. To foster more interactive and engaging discussions, we propose our ComEx interface for the exploration of reader comments on online news platforms. Potential discussion participants can get a quick overview and are not discouraged by an abundance of comments. It is our goal to represent the discussion in a graph of comments that can be used in an interactive user interface for exploration. To this end, a processing pipeline fetches comments from several different platforms and adds edges in the graph based on topical similarity or meta-data and ranks nodes on metrics such as controversy or toxicity. By interacting with the graph, users can explore and react to single comments or entire threads they are interested in.
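The graph construction mentioned in the abstract can be illustrated with a minimal sketch. It assumes that topical similarity is approximated by Jaccard overlap of lowercased word sets and that an edge is added above a fixed threshold; neither choice is taken from the paper, and the names `jaccard`, `build_comment_graph`, and `threshold` are hypothetical.

```python
from itertools import combinations


def jaccard(a, b):
    """Jaccard overlap of two token sets (assumed stand-in for topical similarity)."""
    union = a | b
    return len(a & b) / len(union) if union else 0.0


def build_comment_graph(comments, threshold=0.3):
    """Return weighted edges between comment ids whose similarity reaches the threshold.

    `comments` maps a comment id to its text. The threshold value is an
    illustrative assumption, not a setting reported in the paper.
    """
    tokens = {cid: set(text.lower().split()) for cid, text in comments.items()}
    edges = {}
    for u, v in combinations(sorted(tokens), 2):
        score = jaccard(tokens[u], tokens[v])
        if score >= threshold:
            edges[(u, v)] = score
    return edges
```

In this sketch, ranking nodes by controversy or toxicity would be a separate step that attaches a score to each comment id.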
Risch, J., Künstler, V., Krestel, R.: HyCoNN: Hybrid Cooperative Neural Networks for Personalized News Discussion Recommendation. Proceedings of the International Joint Conferences on Web Intelligence and Intelligent Agent Technologies (WI-IAT). pp. 41–48 (2020).
Many online news platforms provide comment sections for reader discussions below articles. While users of these platforms often read comments, only a minority of them regularly write comments. To encourage and foster more frequent engagement, we study the task of personalized recommendation of reader discussions to users. We present a neural network model that jointly learns embeddings for users and comments encoding general properties. Based on explicit and implicit user feedback, we sample relevant and irrelevant reader discussions to build a representative training dataset. We compare to several baselines and state-of-the-art approaches in an evaluation on two datasets from The Guardian and Daily Mail. Experimental results show that the recommendations of our approach are superior in terms of precision and recall. Further, the learned user embeddings are of general applicability because they preserve the similarity of users who share interests in similar topics.
Risch, J., Krestel, R.: A Dataset of Journalists’ Interactions with Their Readership: When Should Article Authors Reply to Reader Comments?. Proceedings of the International Conference on Information and Knowledge Management (CIKM). pp. 3117–3124. ACM (2020).
The comment sections of online news platforms are an important space to indulge in political conversations and to discuss opinions. Although primarily meant as forums where readers discuss amongst each other, they can also spark a dialog with the journalists who authored the article. A small but important fraction of comments address the journalists directly, e.g., with questions, recommendations for future topics, thanks and appreciation, or article corrections. However, the sheer number of comments makes it infeasible for journalists to follow discussions around their articles in extenso. A better understanding of this data could support journalists in gaining insights into their audience and fostering engaging and respectful discussions. To this end, we present a dataset of dialogs in which journalists of The Guardian replied to reader comments and identify the reasons why. Based on this data, we formulate the novel task of recommending reader comments to journalists that are worth reading or replying to, i.e., ranking comments in such a way that the top comments are most likely to require the journalists' reaction. As a baseline, we trained a neural network model with the help of a pair-wise comment ranking task. Our experiment reveals the challenges of this task and we outline promising paths for future work. The data and our code are available for research purposes from: hpi.de/naumann/projects/repeatability/text-mining.html.
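The pair-wise ranking idea can be sketched as follows. The abstract does not state which loss the baseline uses, so the margin-based hinge loss here is an assumption, and the function names are hypothetical. The sketch only shows how a pair of scores is penalized and how scored comments would then be ordered for a journalist.

```python
def pairwise_hinge_loss(score_preferred, score_other, margin=1.0):
    """Penalty when the preferred comment does not outscore the other by `margin`.

    The hinge form and the margin value are assumptions for illustration.
    """
    return max(0.0, margin - (score_preferred - score_other))


def rank_comments(scores):
    """Order comment ids by model score, highest first."""
    return sorted(scores, key=scores.get, reverse=True)
```

A training loop would sample pairs in which one comment received a journalist reply and the other did not, and minimize this penalty over the model's scores.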
Risch, J., Ruff, R., Krestel, R.: Explaining Offensive Language Detection. Journal for Language Technology and Computational Linguistics (JLCL). 34, 29–47 (2020).
Machine learning approaches have proven to be on or even above human-level accuracy for the task of offensive language detection. In contrast to human experts, however, they often lack the capability of giving explanations for their decisions. This article compares four different approaches to make offensive language detection explainable: an interpretable machine learning model (naive Bayes), a model-agnostic explainability method (LIME), a model-based explainability method (LRP), and a self-explanatory model (LSTM with an attention mechanism). Three different classification methods (SVM, naive Bayes, and LSTM) are paired with appropriate explanation methods. To this end, we investigate the trade-off between classification performance and explainability of the respective classifiers. We conclude that, with the appropriate explanation methods, the superior classification performance of more complex models is worth the initial lack of explainability.
Risch, J., Ruff, R., Krestel, R.: Offensive Language Detection Explained. Proceedings of the Workshop on Trolling, Aggression and Cyberbullying (TRAC@LREC). pp. 137–143. European Language Resources Association (ELRA) (2020).
Many online discussion platforms use a content moderation process, where human moderators check user comments for offensive language and other rule violations. It is the moderator's decision which comments to remove from the platform because of violations and which ones to keep. Research so far focused on automating this decision process in the form of supervised machine learning for a classification task. However, even with machine-learned models achieving better classification accuracy than human experts in some scenarios, there is still a reason why human moderators are preferred. In contrast to black-box models, such as neural networks, humans can give explanations for their decision to remove a comment. For example, they can point out which phrase in the comment is offensive or what subtype of offensiveness applies. In this paper, we analyze and compare four attribution-based explanation methods for different offensive language classifiers: an interpretable machine learning model (naive Bayes), a model-agnostic explanation method (LIME), a model-based explanation method (LRP), and a self-explanatory model (LSTM with an attention mechanism). We evaluate these approaches with regard to their explanatory power and their ability to point out which words are most relevant for a classifier's decision. We find that the more complex models achieve better classification accuracy while also providing better explanations than the simpler models.
Risch, J., Krestel, R.: Bagging BERT Models for Robust Aggression Identification. Proceedings of the Workshop on Trolling, Aggression and Cyberbullying (TRAC@LREC). pp. 55–61. European Language Resources Association (ELRA) (2020).
Modern transformer-based models with hundreds of millions of parameters, such as BERT, achieve impressive results at text classification tasks. This also holds for aggression identification and offensive language detection, where deep learning approaches consistently outperform less complex models, such as decision trees. While the complex models fit training data well (low bias), they also come with an unwanted high variance. Especially when fine-tuning them on small datasets, the classification performance varies significantly for slightly different training data. To overcome the high variance and provide more robust predictions, we propose an ensemble of multiple fine-tuned BERT models based on bootstrap aggregating (bagging). In this paper, we describe such an ensemble system and present our submission to the shared tasks on aggression identification 2020 (team name: Julian). Our submission is the best-performing system for five out of six subtasks. For example, we achieve a weighted F1-score of 80.3% for task A on the test dataset of English social media posts. In our experiments, we compare different model configurations and vary the number of models used in the ensemble. We find that the F1-score drastically increases when ensembling up to 15 models, but the returns diminish for more models.
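Bootstrap aggregating as described above can be sketched in a few lines. This is not the authors' system: fine-tuning a BERT model is replaced by a hypothetical word-count stub (`train_stub_model`) so that the example runs without external libraries, and majority voting is an assumed aggregation rule that the abstract does not specify.

```python
import random
from collections import Counter, defaultdict


def bootstrap_sample(data, rng):
    """Draw len(data) examples with replacement (one bootstrap replicate)."""
    return [rng.choice(data) for _ in range(len(data))]


def train_stub_model(sample):
    """Hypothetical stand-in for fine-tuning one model on a bootstrap sample.

    It counts which labels each word co-occurs with and predicts the label
    with the most word-level votes.
    """
    word_labels = defaultdict(Counter)
    label_counts = Counter()
    for text, label in sample:
        label_counts[label] += 1
        for word in text.lower().split():
            word_labels[word][label] += 1
    default = label_counts.most_common(1)[0][0]

    def predict(text):
        votes = Counter()
        for word in text.lower().split():
            votes.update(word_labels.get(word, {}))
        return votes.most_common(1)[0][0] if votes else default

    return predict


def train_bagging_ensemble(data, n_models, seed=0):
    """Train n_models stub models, each on its own bootstrap sample."""
    rng = random.Random(seed)
    return [train_stub_model(bootstrap_sample(data, rng)) for _ in range(n_models)]


def bagging_predict(models, text):
    """Aggregate the individual predictions by majority vote (assumed rule)."""
    votes = Counter(model(text) for model in models)
    return votes.most_common(1)[0][0]
```

Because every model sees a slightly different resampling of the training data, the aggregated prediction tends to vary less than that of any single model, which is the variance reduction the abstract refers to.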
Risch, J., Krestel, R.: Top Comment or Flop Comment? Predicting and Explaining User Engagement in Online News Discussions. Proceedings of the International Conference on Web and Social Media (ICWSM). pp. 579–589. AAAI (2020).
Comment sections below online news articles enjoy growing popularity among readers. However, the overwhelming number of comments makes it infeasible for the average news consumer to read all of them and hinders engaging discussions. Most platforms display comments in chronological order, which neglects that some of them are more relevant to users and are better conversation starters. In this paper, we systematically analyze user engagement in the form of the upvotes and replies that a comment receives. Based on comment texts, we train a model to distinguish comments that have either a high or low chance of receiving many upvotes and replies. Our evaluation on user comments from TheGuardian.com compares recurrent and convolutional neural network models, and a traditional feature-based classifier. Further, we investigate what makes some comments more engaging than others. To this end, we identify engagement triggers and arrange them in a taxonomy. Explanation methods for neural networks reveal which input words have the strongest influence on our model's predictions. In addition, we evaluate on a dataset of product reviews, which exhibit similar properties as user comments, such as featuring upvotes for helpfulness.
Risch, J., Krestel, R.: Toxic Comment Detection in Online Discussions. In: Agarwal, B., Nayak, R., Mittal, N., and Patnaik, S. (eds.) Deep Learning-Based Approaches for Sentiment Analysis. pp. 85–109. Springer (2020).
With the exponential growth in the use of social media networks such as Twitter, Facebook, and many others, an astronomical amount of data has been generated. A substantial amount of this user-generated data is in the form of text such as reviews, tweets, and blogs, which presents numerous challenges as well as opportunities for NLP (Natural Language Processing) researchers to discover meaningful information used in various applications. Sentiment analysis is the study of people’s opinions and sentiments towards entities such as products, services, persons, and organisations mentioned in a text. Sentiment analysis and opinion mining is a highly popular and interesting research problem. In recent years, deep learning approaches have emerged as powerful computational models and have shown significant success in dealing with massive amounts of data in unsupervised settings. Deep learning is revolutionizing the field because it offers an effective way of learning representations and allows a system to learn features automatically from data without the need to design them explicitly. Deep learning algorithms such as deep autoencoders, convolutional neural networks (CNN), recurrent neural networks (RNN), Long Short-Term Memory (LSTM) networks, and Generative Adversarial Networks (GAN) have been reported to provide significantly improved results in various natural language processing tasks, including sentiment analysis.
Risch, J., Stoll, A., Ziegele, M., Krestel, R.: hpiDEDIS at GermEval 2019: Offensive Language Identification using a German BERT model. Proceedings of the 15th Conference on Natural Language Processing (KONVENS). pp. 403–408. German Society for Computational Linguistics & Language Technology, Erlangen, Germany (2019).
Pre-training language representations on large text corpora, for example, with BERT, has recently been shown to achieve impressive performance at a variety of downstream NLP tasks. So far, applying BERT to offensive language identification for German-language texts failed due to the lack of pre-trained, German-language models. In this paper, we fine-tune a BERT model that was pre-trained on 12 GB of German texts to the task of offensive language identification. This model significantly outperforms our baselines and achieves a macro F1 score of 76% on coarse-grained, 51% on fine-grained, and 73% on implicit/explicit classification. We analyze the strengths and weaknesses of the model and derive promising directions for future work.
Risch, J., Krebs, E., Löser, A., Riese, A., Krestel, R.: Fine-Grained Classification of Offensive Language. Proceedings of GermEval (co-located with KONVENS). pp. 38–44 (2018).
Social media platforms receive massive amounts of user-generated content that may include offensive text messages. In the context of the GermEval task 2018, we propose an approach for fine-grained classification of offensive language. Our approach comprises a Naive Bayes classifier, a neural network, and a rule-based approach that categorize tweets. In addition, we combine the approaches in an ensemble to overcome weaknesses of the single models. We cross-validate our approaches with regard to macro-average F1-score on the provided training dataset.
van Aken, B., Risch, J., Krestel, R., Löser, A.: Challenges for Toxic Comment Classification: An In-Depth Error Analysis. Proceedings of the 2nd Workshop on Abusive Language Online (co-located with EMNLP). pp. 33–42 (2018).
Toxic comment classification has become an active research field with many recently proposed approaches. However, while these approaches address some of the task’s challenges, others still remain unsolved and directions for further research are needed. To this end, we compare different approaches on a new, large comment dataset and propose an ensemble that outperforms all individual models. Further, we validate our findings on a second dataset. The results of the ensemble enable us to perform an extensive error analysis, which reveals open challenges for state-of-the-art methods and directions towards pending future research. These challenges include missing paradigmatic context and inconsistent dataset labels.
Risch, J., Krestel, R.: Aggression Identification Using Deep Learning and Data Augmentation. Proceedings of the First Workshop on Trolling, Aggression and Cyberbullying (co-located with COLING). pp. 150–158 (2018).
Social media platforms allow users to share and discuss their opinions online. However, a minority of user posts is aggressive, thereby hinders respectful discussion, and — at an extreme level — is liable to prosecution. The automatic identification of such harmful posts is important, because it can support the costly manual moderation of online discussions. Further, the automation allows unprecedented analyses of discussion datasets that contain millions of posts. This system description paper presents our submission to the First Shared Task on Aggression Identification. We propose to augment the provided dataset to increase the number of labeled comments from 15,000 to 60,000. Thereby, we introduce linguistic variety into the dataset. As a consequence of the larger amount of training data, we are able to train a special deep neural net, which generalizes especially well to unseen data. To further boost the performance, we combine this neural net with three logistic regression classifiers trained on character and word n-grams, and hand-picked syntactic features. This ensemble is more robust than the individual single models. Our team named “Julian” achieves an F1-score of 60% on both English datasets, 63% on the Hindi Facebook dataset, and 38% on the Hindi Twitter dataset.
Risch, J., Krestel, R.: Delete or not Delete? Semi-Automatic Comment Moderation for the Newsroom. Proceedings of the First Workshop on Trolling, Aggression and Cyberbullying (co-located with COLING). pp. 166–176 (2018).
Comment sections of online news providers have enabled millions to share and discuss their opinions on news topics. Today, moderators ensure respectful and informative discussions by deleting not only insults, defamation, and hate speech, but also unverifiable facts. This process has to be transparent and comprehensive in order to keep the community engaged. Further, news providers have to make sure to not give the impression of censorship or dissemination of fake news. Yet manual moderation is very expensive and becomes more and more infeasible with the increasing number of comments. Hence, we propose a semi-automatic, holistic approach, which includes comment features but also their context, such as information about users and articles. For evaluation, we present experiments on a novel corpus of 3 million news comments annotated by a team of professional moderators.
Ambroselli, C., Risch, J., Krestel, R., Loos, A.: Prediction for the Newsroom: Which Articles Will Get the Most Comments?. Proceedings of the 16th Annual Conference of the North American Chapter of the Association for Computational Linguistics (NAACL). pp. 193–199. ACL, New Orleans, Louisiana, USA (2018).
The overwhelming success of the Web and mobile technologies has enabled millions to share their opinions publicly at any time. But the same success also endangers this freedom of speech due to closing down of participatory sites misused by individuals or interest groups. We propose to support manual moderation by proactively drawing the attention of our moderators to article discussions that most likely need their intervention. To this end, we predict which articles will receive a high number of comments. In contrast to existing work, we enrich the article with metadata, extract semantic and linguistic features, and exploit annotated data from a foreign language corpus. Our logistic regression model improves F1-scores by over 80% in comparison to state-of-the-art approaches.
Godde, C., Lazaridou, K., Krestel, R.: Classification of German Newspaper Comments. Proceedings of the Conference Lernen, Wissen, Daten, Analysen. pp. 299–310. CEUR-WS.org (2016).
Online news has gradually become an inherent part of many people’s everyday life, with the media enabling a social and interactive consumption of news as well. Readers openly express their perspectives and emotions about a current event by commenting on news articles. They also form online communities and interact with each other by replying to other users’ comments. Due to their active and significant role in the diffusion of information, automatically gaining insights into these comments’ content is an interesting task. We are especially interested in finding systematic differences among the user comments from different newspapers. To this end, we propose the following classification task: Given a news comment thread of a particular article, identify the newspaper it comes from. Our corpus consists of six well-known German newspapers and their comments. We propose two experimental settings using SVM classifiers built on comment- and article-based features. We achieve precision of up to 90% for individual newspapers.