Hasso-Plattner-Institut
Prof. Dr. Felix Naumann
 

This page presents our work on social media mining, specifically our Twitter analyses. We combine millions of tweets with other data sources to gain new insights into various topics and opinions, especially in the political domain. There are currently two ongoing Twitter projects: celebrity tracking via entity mentions in tweets, and political debate analysis in social media. For an overview of all our text mining projects, please visit this page.

Datasets

To retrieve tweets relevant to our tasks and to the political domain in general, we continuously harvested Twitter using the Public API. Because the API is limited to a maximum of 1% of the overall traffic on Twitter, we had to make sure to retrieve as many relevant tweets as possible without exceeding the rate limits.

U.S. Candidates on Twitter

Given the access restrictions of the Twitter API, which only allowed us to analyze a small portion of all published tweets, we designed our query terms carefully so that they cover various political discussions. For example, a query term like Ben for the candidate Ben Carson is not appropriate, because it yields too many false-positive tweets. Our approach was to manually select a set of 241 queries. Based on these queries, we collected a set of over one billion tweets by over 33 million users mentioning candidates and other persons relevant to the U.S. presidential election during the 18-month period from November 2015 to April 2017 (shortly after the inauguration of Donald J. Trump as the 45th President of the United States).
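The effect of an overly short query term can be illustrated with a small sketch. It is not our collection code: the helper names `matches_query` and `precision_of_query` and the four hand-labelled example tweets are made up for illustration, and the sketch only assumes that a query is matched as a whole, case-insensitive phrase.

```python
import re

def matches_query(tweet_text, query):
    """Return True if the query occurs as a whole phrase (case-insensitive)."""
    pattern = r"(?<!\w)" + re.escape(query) + r"(?!\w)"
    return re.search(pattern, tweet_text, flags=re.IGNORECASE) is not None

def precision_of_query(query, labelled_tweets):
    """Fraction of the tweets matched by the query that are labelled relevant.

    Returns None if the query matches no tweet at all.
    """
    matched = [relevant for text, relevant in labelled_tweets
               if matches_query(text, query)]
    if not matched:
        return None
    return sum(matched) / len(matched)

# Hypothetical, hand-labelled examples: (tweet text, relevant to the election?)
sample = [
    ("Ben Carson speaks in Iowa tonight", True),
    ("Big Ben is under renovation", False),
    ("Watching a Ben and Jerry's ad", False),
    ("I agree with Ben Carson on this", True),
]

print(precision_of_query("Ben", sample))         # 0.5
print(precision_of_query("Ben Carson", sample))  # 1.0
```

On this toy sample, the short term matches every tweet but only half of the matches are relevant, whereas the full name matches only the relevant ones. Under a capped sample of the overall traffic, every irrelevant match takes up room that a relevant tweet could have used.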

An overview of the number of tweets extracted per day is shown on the right side. The number of collected tweets grows continuously and exhibits peaks around Super Tuesday 2016, the Republican and Democratic National Conventions, and the presidential debates, up to election day. It then declines before reaching another local maximum around the inauguration day of Donald J. Trump.

Four trails extracted for February 29th, 2016
Google Maps overview of the four trails of Donald J. Trump (red), Hillary Rodham Clinton (blue), Bernie Sanders (green), and Ted Cruz (yellow) one day prior to Super Tuesday 2016.

Project 1: Tracking Celebrities in Social Media

Case Study: Tracking politicians' locations during the U.S. Election in 2016

The goal of this case study is to track the physical locations of U.S. presidential candidates during the 2016 U.S. election campaigns based on Twitter messages. To do so, we analyze the textual content of tweets published during this period (not their geolocation, which is only rarely available) in order to detect the locations of the politicians (and, in the future, possibly other high-profile celebrities).

The applied method is described in more detail in our recent poster publication: "What was Hillary Clinton doing in Katy, Texas?". The results shown below are based on the 'U.S. Candidates on Twitter' dataset, which spans from November 2015 until April 2017. We share the daily extracted politician-location pairs and the IDs of all tweets referring to each location event. On the right side, you can see an excerpt of our tracking results for four prominent politicians on February 29th, 2016. The plot shows the destinations at which each politician appeared according to the information found on Twitter. Since this is a very challenging problem and our approach also produces false positives, we mark the incorrect locations for this day with the thunder symbol. The estimated times are given in EST, and the complete dataset can be found here.
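To make the shape of the shared data concrete, the following sketch groups tweets into daily politician-location pairs together with the IDs of the supporting tweets. It is a deliberately naive stand-in and not the method from the poster: it assumes simple substring matching against two small hypothetical name lists (`POLITICIANS`, `LOCATIONS`), and the tweet IDs and texts are made up.

```python
from collections import defaultdict

# Hypothetical lookup tables: lowercase surface form -> canonical name.
POLITICIANS = {"hillary clinton": "Hillary Clinton", "ted cruz": "Ted Cruz"}
LOCATIONS = {"houston": "Houston", "katy": "Katy"}

def extract_pairs(tweets, min_tweets=1):
    """Group tweet IDs by (day, politician, location) co-mention.

    `tweets` is an iterable of (tweet_id, day, text) tuples. Pairs supported
    by fewer than `min_tweets` tweets are dropped, which is one simple way to
    suppress some false positives.
    """
    events = defaultdict(list)
    for tweet_id, day, text in tweets:
        lowered = text.lower()
        for p_key, p_name in POLITICIANS.items():
            if p_key not in lowered:
                continue
            for l_key, l_name in LOCATIONS.items():
                if l_key in lowered:
                    events[(day, p_name, l_name)].append(tweet_id)
    return {key: ids for key, ids in events.items() if len(ids) >= min_tweets}

# Made-up example tweets: (tweet_id, day, text)
example_tweets = [
    (101, "2016-02-29", "Hillary Clinton is speaking in Houston today"),
    (102, "2016-02-29", "Just saw Hillary Clinton in houston!"),
    (103, "2016-02-29", "Ted Cruz rally tonight"),
    (104, "2016-02-29", "Is Ted Cruz really in Katy?"),
]

pairs = extract_pairs(example_tweets, min_tweets=2)
```

With `min_tweets=2`, only the Hillary Clinton and Houston pair survives, backed by tweets 101 and 102. The single tweet asking about Ted Cruz in Katy is dropped, which shows how a support threshold trades recall for fewer spurious locations.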

Publications

  • What was Hillary Clinton doing in Katy, Texas? Gruetze, Toni; Krestel, Ralf; Lazaridou, Konstantina; Naumann, Felix (2017).
     

Project 2: Large-scale topic-based analysis of political discussions on Twitter

In recent years, social media have emerged as a new venue for various kinds of discussions, including political ones. Users of social networks such as Twitter and Facebook share their thoughts about current affairs, exchange opinions about them, and keep their peers up to date. Politicians also use social media to promote their campaigns or political standpoints.

In this ongoing master's thesis, we are analyzing over one billion tweets related to the 2016 U.S. general election. Our dataset (described above) spans from November 2015 to April 2017. This vast amount of textual information enables us to understand the political debates in the Twittersphere and thus identify the general public's views on essential matters in the U.S., e.g. job creation, nuclear weapon strategy, cyber security, etc. We combine this tweet collection with additional data sources, namely the transcripts of the 2016 presidential candidate debates. We aim to gain insights from two different perspectives: that of the politicians and that of the general public, represented by the Twittersphere.

Our goal consists of two sequential tasks: first, identify the topics discussed on Twitter during this time period, and then discover the opinions expressed towards these topics and the involved politicians. Once each tweet has been assigned a topic, we can analyze people's points of view on a given matter and also discover the sentiment that they express towards the involved politicians. Our target is to answer interesting questions, such as: What do people think about Donald Trump's statements regarding healthcare? How often do users quote Hillary Clinton's tweets with positive and negative comments, respectively? Does this frequency change for different topics, and could it be predicted?
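The two-step structure (topic assignment first, opinion analysis second) can be sketched as follows. This is only an illustration of the pipeline shape, not the approach used in the thesis: the keyword sets in `TOPIC_KEYWORDS`, the tiny `POSITIVE` and `NEGATIVE` word lists, and the example tweets are all made up, and real topic and sentiment models would replace both steps.

```python
import re
from collections import defaultdict

# Hypothetical keyword sets and sentiment lexicon, for illustration only.
TOPIC_KEYWORDS = {
    "healthcare": {"healthcare", "obamacare", "insurance"},
    "jobs": {"jobs", "employment", "unemployment"},
}
POSITIVE = {"great", "good", "support"}
NEGATIVE = {"bad", "terrible", "against"}

def tokenize(text):
    """Lowercase the text and split it into word tokens."""
    return re.findall(r"[a-z']+", text.lower())

def assign_topic(tokens):
    """Step 1: pick the topic with the most keyword hits, or None if no hit."""
    best_topic, best_hits = None, 0
    for topic, keywords in TOPIC_KEYWORDS.items():
        hits = sum(1 for token in tokens if token in keywords)
        if hits > best_hits:
            best_topic, best_hits = topic, hits
    return best_topic

def sentiment_score(tokens):
    """Step 2: count positive minus negative lexicon words."""
    return (sum(1 for t in tokens if t in POSITIVE)
            - sum(1 for t in tokens if t in NEGATIVE))

def summarize(tweets, politicians):
    """Average sentiment per (topic, politician) over tweets mentioning both."""
    scores = defaultdict(list)
    for text in tweets:
        tokens = tokenize(text)
        topic = assign_topic(tokens)
        if topic is None:
            continue
        score = sentiment_score(tokens)
        lowered = text.lower()
        for name in politicians:
            if name.lower() in lowered:
                scores[(topic, name)].append(score)
    return {key: sum(vals) / len(vals) for key, vals in scores.items()}

# Made-up example tweets
example = [
    "Donald Trump's healthcare plan is terrible",
    "I support Donald Trump on healthcare",
    "Hillary Clinton has a good jobs plan",
    "Nice weather today",
]
summary = summarize(example, ["Donald Trump", "Hillary Clinton"])
```

Keeping the two steps as separate functions mirrors the sequential task definition: the sentiment step only ever sees tweets that already carry a topic, so either component can be swapped out independently.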

People