Between Discourse and Defamation: HPI Researchers Analyze Online Comments

It has long since disappeared from the sites of “Bloomberg,” “Reuters,” and “The Verge.” In the Swiss newspaper “Neue Zürcher Zeitung” and the German daily “Süddeutsche Zeitung,” it appears only sporadically and under a watchful eye. We are talking about the comments section that accompanies online articles. In recent years, reader comments have become a genuine challenge for almost every journalistic medium. While it initially held the promise of a deliberative democracy, the comment column has now often become a place for defamation, polemics and aggression. In their research, senior researcher Ralf Krestel and HPI PhD candidate Julian Risch are putting Internet user comments to the test. They presented their first research results on 4 June at the NAACL Conference in New Orleans.

HPI Researchers analyze online comments. (Photo: HPI/K. Herschelmann)

The scientists from the Information Systems research group use articles and reader comments from various online newspapers for their analyses. These include British news sites, such as “The Guardian” and “The Telegraph,” and the American sites “The Washington Post” and “Fox News.” For the German-speaking world, the researchers use online articles and readers’ comments from ZEIT ONLINE. The data collected in the process allows for a precise analysis of the comments and shows how commenting behavior and moderation effort change over time.

With the help of variables linked to each comment, different aspects can be evaluated, including the tone of opinions (e.g., research on hate speech), gender differences or the topics themselves. In the first step of their research, Krestel and Risch concentrate on predicting the volume of comments that editorial staff can expect after an article is published. This should enable moderators of journalistic online media to estimate the effort required for interacting with users and to plan ahead more effectively.

“Our goal is to identify the top 10 percent of articles discussed every week,” says Krestel. To meet this objective, the team draws on the metadata of the article, the contextual information, the title and the text of the article itself. “For the evaluation we primarily use decision trees and statistical regression models. While great strides have been made in the field of neural networks, editors not only need a reliable prediction but also a justification of what factors contribute strongly or very strongly to that prediction,” Krestel points out.
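The stated goal can be illustrated with a short Python sketch that marks the top 10 percent of a week's articles by comment count. This is only an illustration under assumptions: the function name and the sample counts are hypothetical, and the source does not say how the researchers handle ties or delimit a week.

```python
def label_top_decile(comment_counts):
    """Mark which articles fall into the top 10 percent by comment count.

    Hypothetical helper, not taken from the HPI study. Articles tied with
    the threshold count are all marked, so slightly more than 10 percent
    can be labeled.
    """
    n = len(comment_counts)
    if n == 0:
        return []
    # Integer ceiling of n / 10, with at least one article selected.
    k = max(1, (n + 9) // 10)
    # The k-th largest count serves as the cut-off.
    threshold = sorted(comment_counts, reverse=True)[k - 1]
    return [count >= threshold for count in comment_counts]


# Made-up comment counts for ten articles published in one week.
weekly_counts = [5, 120, 33, 8, 410, 17, 64, 2, 91, 12]
labels = label_top_decile(weekly_counts)
```

Labels of this kind could serve as the target for a decision tree or regression model, which matches the interpretable model families named in the quote.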

According to the calculations of the HPI researchers, the determining factors for the expected number of comments are, in particular, the words contained in the article (a bag-of-words model was used), the superordinate topic of an article (via topic modeling), the keywords in the title (n-grams, i.e., word groups consisting of 1-3 words, and keywords provided by the author of the article), as well as the metadata of the article (among them, the source and the resource). Krestel and Risch were thus able to determine that, in the period examined, articles containing the terms “Clausnitz,” “Gauland,” “Beatrix” and “Storch” had the comments most likely to violate the netiquette at ZEIT ONLINE. These comments were not objected to because of the terms per se, but because their wording overstepped boundaries (insulting or discriminatory formulations) in the comment text.
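Two of the feature types mentioned above, the bag-of-words representation of the article text and the 1-3 word n-grams of the title, can be sketched in a few lines of Python. The tokenizer, the function names and the sample strings are assumptions made for illustration; the source does not describe the researchers' actual preprocessing.

```python
import re
from collections import Counter


def tokenize(text):
    """Lowercase the text and split it into word tokens (a simple assumption)."""
    return re.findall(r"\w+", text.lower())


def bag_of_words(text):
    """Count how often each word occurs, ignoring word order."""
    return Counter(tokenize(text))


def title_ngrams(title, max_n=3):
    """Return all word groups of 1 to max_n consecutive words in the title."""
    tokens = tokenize(title)
    return [
        " ".join(tokens[i:i + n])
        for n in range(1, max_n + 1)
        for i in range(len(tokens) - n + 1)
    ]


# Made-up example strings, not taken from the ZEIT ONLINE data.
word_counts = bag_of_words("The tax and the debate")
ngrams = title_ngrams("Debate over CO2 tax")
```

Features like these stay readable to editors: each one corresponds to a concrete word or phrase, which fits the need for a justification of the prediction.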

“We were able to evaluate which topics lead to high or low levels of comments that have violated the netiquette according to the moderators,” says HPI PhD student Risch. For instance, under articles containing the term “Clausnitz,” on average every sixteenth comment had to be deleted; for articles with the term “CO2,” only about every hundred and nineteenth comment violated the rules. “Based on this data, we can see, for example, which topics are causing discussions to get out of hand. In this sense, the analysis is also a testimony to the culture of debate within a specific period,” says Risch.
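The two figures quoted above can be converted into deletion shares with simple arithmetic. The sketch below uses only the "every sixteenth" and "every hundred and nineteenth" values from the text; the function name is hypothetical.

```python
def deletion_share(one_in_n):
    """Convert 'every n-th comment deleted' into a share between 0 and 1."""
    return 1 / one_in_n


clausnitz = deletion_share(16)   # every sixteenth comment
co2 = deletion_share(119)        # about every hundred and nineteenth comment

print(f"Clausnitz: {clausnitz * 100:.2f}% of comments deleted")  # 6.25%
print(f"CO2: {co2 * 100:.2f}% of comments deleted")              # 0.84%
print(f"Ratio: {clausnitz / co2:.1f}x")                          # 7.4x
```

In other words, comments under “Clausnitz” articles were deleted roughly seven times as often as comments under “CO2” articles.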

Netiquette in the Network

At ZEIT ONLINE, an especially high number of comments were deleted under articles containing the following keywords. The diagram shows the average number of deleted comments in relation to the average number of total comments.

Diagram of average number of deleted comments in relation to the average number of total comments.
Source: HPI/ZEIT Online Article & Comments January 2016–April 2017

According to the HPI scientists, the way a discussion begins directly after an article is published is also an important indicator of the expected comment volume. “Faced with dozens, sometimes hundreds, of comments, readers often notice only the most visible ones at the top. Similar to what has been found in social media research, we view it as highly likely that particularly controversial opinions at the beginning also lead to a larger total comment volume,” said Risch. In order to test this hypothesis, the team translated the first four comments of each article into English. This made it possible to use scientifically tested classification algorithms that classify the tone of the expressed opinion. “Based on our model, those particularly persuasive, controversial or negative comments are in fact the best indicators of the comment volume to come,” Risch said.
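The idea of deriving features from the first four comments can be sketched as follows. This is a minimal illustration, not the researchers' method: the scoring function is a toy word-list stand-in for the tone classifiers mentioned above, and the word list, function names and sample comments are all hypothetical.

```python
# Toy negative-word list; a stand-in for a real tone classifier (assumption).
NEGATIVE_WORDS = {"outrageous", "lies", "shame"}


def toy_tone_score(comment):
    """Return minus the number of negative words in the comment."""
    words = (w.strip(".,!?") for w in comment.lower().split())
    return -sum(1 for w in words if w in NEGATIVE_WORDS)


def early_comment_features(comments, score_fn, k=4):
    """Summarize the tone of the first k comments of an article."""
    scores = [score_fn(c) for c in comments[:k]]
    if not scores:
        return {"n": 0, "mean": 0.0, "min": 0.0}
    return {
        "n": len(scores),
        "mean": sum(scores) / len(scores),
        "min": min(scores),
    }


# Made-up comments; only the first four are used.
sample = ["Outrageous lies!", "I agree.", "Shame on you", "Interesting read", "A fifth comment"]
features = early_comment_features(sample, toy_tone_score)
```

Passing the scoring function as a parameter keeps the sketch independent of any particular classifier, so a real tone model could be swapped in without changing the aggregation step.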

By applying both prediction models, Krestel and Risch succeeded in raising the prediction accuracy for comment volumes to 81%, an improvement over current models. “On this basis, concepts could be developed on how to make discussions more balanced in the future, for instance with a comment ranking,” Risch suggested. At the same time, journalists can and must critically examine the reactions they trigger in the public, for instance with their choice of headline. In a recent HPI seminar for master's students, Risch examined the comments of the British paper “The Guardian.” “Here we are dealing with about 60 million reader comments, which, for example, provide information on topic trends as well as national differences of reader opinions,” said Risch. In any case, Krestel and Risch will have no lack of issues to tackle in the foreseeable future.