Data Story

Introduction

Social media has gone from an exciting novelty to something ingrained in our societies, and its influence now extends beyond the personal sphere to our legal frameworks and political institutions. Regulating social networks is currently a hot topic, especially with the rise of cyber warfare and what it entails in terms of disruption and propaganda. It is therefore important to study how social media can be used to create narratives and tip the scales one way or the other. In particular, the Internet Research Agency (IRA), a Russian troll factory, was accused of interfering in multiple foreign political processes such as the 2016 US presidential election. We have a dataset of roughly 3 million tweets, posted between 2012 and 2018, from accounts linked to this agency. The aim is to understand what these tweets focus on and how they adapt to major political and politically divisive events.

Datasets

We have a total of 3 datasets: 2 related to the Russian Internet Research Agency (IRA) and one to the Iranian government. The first Russian dataset was released by FiveThirtyEight in July/August 2018 and contains roughly 3 million tweets. The tweets in this dataset were sent between February 2012 and May 2018, with the vast majority posted from 2015 through 2017. This dataset is labeled by account category and has the following structure:

Header              Definition
external_author_id  An author account ID from Twitter
author              The handle sending the tweet
content             The text of the tweet
region              A region classification, as determined by Social Studio
language            The language of the tweet
publish_date        The date and time the tweet was sent
harvested_date      The date and time the tweet was collected by Social Studio
following           The number of accounts the handle was following at the time of the tweet
followers           The number of followers the handle had at the time of the tweet
updates             The number of “update actions” on the account that authored the tweet, including tweets, retweets and likes
post_type           Indicates if the tweet was a retweet or a quote-tweet
account_type        Specific account theme, as coded by Linvill and Warren
retweet             A binary indicator of whether or not the tweet is a retweet
account_category    General account theme, as coded by Linvill and Warren

The two other datasets were released by Twitter in October 2018: an extra 9 million Russian troll tweets and 1 million tweets affiliated with Iran. These datasets are not labeled by account category but contain more fields than the first one. Unfortunately, the location of the Twitter users is unknown for almost all of the data, so it is not possible to visualize where these trolls are located.

We also have an extra dataset of 1.8 million general public tweets that we extracted from a large Internet archive of Twitter data. We filtered 150 GB of tweets from August 2017 using a small list of generic politics-related words. This dataset serves as a baseline against which to compare the Russian and Iranian data, especially for the topics. The choice of keywords for the filtering is very important: if we choose words that are correlated with specific topics or events, we bias the topic comparison. Conversely, a long list that covers all possible political or politically related topics would increase noise in the data and make it hard to get a good separation of topics.
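The keyword filtering described above can be sketched as follows. This is only a minimal illustration: the keyword list is hypothetical (the actual list is not given in this story), and the function names `is_political` and `filter_tweets` are introduced here for the example.

```python
import re

# Hypothetical keyword list: the list actually used is not given in the text.
POLITICAL_KEYWORDS = {"election", "president", "congress", "senate", "vote", "policy"}

TOKEN_RE = re.compile(r"[a-z']+")


def is_political(text, keywords=POLITICAL_KEYWORDS):
    """Return True if the tweet contains at least one keyword as a whole word."""
    tokens = set(TOKEN_RE.findall(text.lower()))
    return not tokens.isdisjoint(keywords)


def filter_tweets(tweets, keywords=POLITICAL_KEYWORDS):
    """Keep only the tweets that mention at least one keyword."""
    return [t for t in tweets if is_political(t, keywords)]


sample = [
    "The senate will vote on the bill tomorrow",
    "Just had the best pizza of my life",
    "New policy announced by the President",
]
baseline = filter_tweets(sample)
print(baseline)  # keeps the first and third tweets
```

Matching whole words rather than substrings keeps the list generic: a word such as "presidential" is not matched by "president", so widening the filter requires an explicit decision to add a keyword.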

Methodology

Our analysis focuses on two dimensions. First, we take the first Russian dataset and compare tweet activity with politically related events in the US, taking advantage of the account category labels to see how right-wing and left-wing trolls compare. Then, we explore the topics that the Russians targeted and compare them to the Iranian ones to better understand each strategy. We also discuss how to classify the unlabeled Russian and Iranian datasets into account categories.

General Analysis

Looking at the first Russian troll Twitter dataset, we find 2772 user accounts, all linked to the IRA. Only 119 of these authors are responsible for roughly half of the tweets, and a few users account for a very large number of tweets.
What is interesting here is that 3 of the top 6 authors are dedicated to news, and the top author in the data is categorized as a Commercial account. This shows that the operation is sophisticated: trolls play both sides, right and left, but also wear the hat of news media, commercial outlets and other more casual categories in order to blend in with the public and get people to react to and follow a carefully selected set of events.
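A count such as "119 authors are responsible for roughly half of the tweets" can be obtained by ranking authors by volume and accumulating until the target share is reached. The sketch below uses toy data and a hypothetical helper name, `authors_covering`; it is not the code used for this story.

```python
from collections import Counter


def authors_covering(authors, share=0.5):
    """Return the most active authors whose tweets together reach `share` of the total."""
    counts = Counter(authors)
    total = sum(counts.values())
    covered, top = 0, []
    for author, n in counts.most_common():
        if covered >= share * total:
            break
        top.append(author)
        covered += n
    return top


# Toy data: one entry per tweet, holding the handle of its author.
tweet_authors = ["A"] * 6 + ["B"] * 3 + ["C"] * 2 + ["D"]
print(authors_covering(tweet_authors))        # ['A']
print(authors_covering(tweet_authors, 0.75))  # ['A', 'B']
```

On the real data, the input would be the `author` column, and the length of the returned list is the number of accounts needed to cover the chosen share.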

Now let’s look at the account categories:

Account category  N° Tweets
Commercial        113256
Fearmonger        10855
HashtagGamer      216048
LeftTroll         413711
NewsFeed          597656
NonEnglish        27632
RightTroll        706120
unknown           7261

Authors          N° Tweets  Account_category
EXQUOTE          59652      Commercial
SCREAMYMONKEY    44041      NewsFeed
WORLDNEWSPOLI    36974      RightTroll
AMELIEBALDWIN    35371      RightTroll
TODAYPITTSBURGH  33602      NewsFeed
SPECIALAFFAIR    32588      NewsFeed

In the above histogram, we can see that the dataset contains a total of 56 languages but is mainly composed of English (71.83%) and Russian (20.72%) tweets. We therefore restrict further analysis to the English tweets only, since it is hard to perform topic extraction without understanding the language and having the tools to process it. However, it is interesting to see that the IRA uses a large number of languages in its disruptive process.

Tweet Activity

[Figure: daily tweet activity]

We can see in the monthly and daily aggregated plots that tweet activity increases significantly during some periods. This suggests that there is a predefined strategy behind the trolls.

[Figure: hourly tweet activity]

We count the tweets per month, aggregating at the last day of each month for which we have tweets. Here we can see the 10 busiest months for the trolls before, during and after the campaign.
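The monthly aggregation can be sketched with the standard library alone. The date layout in `DATE_FORMAT` is an assumption about how `publish_date` is stored, and `monthly_counts` and `busiest_months` are hypothetical helper names; months are keyed by (year, month) rather than by their last day.

```python
from collections import Counter
from datetime import datetime

# Assumed layout of the publish_date field; adjust it to the actual data.
DATE_FORMAT = "%m/%d/%Y %H:%M"


def monthly_counts(publish_dates, date_format=DATE_FORMAT):
    """Count tweets per (year, month)."""
    counts = Counter()
    for raw in publish_dates:
        ts = datetime.strptime(raw, date_format)
        counts[(ts.year, ts.month)] += 1
    return counts


def busiest_months(publish_dates, n=10):
    """Return the n months with the most tweets, busiest first."""
    return monthly_counts(publish_dates).most_common(n)


dates = ["8/12/2017 14:05", "8/13/2017 9:30", "10/7/2016 21:00", "6/16/2015 11:15"]
print(busiest_months(dates, n=1))  # [((2017, 8), 2)]
```

Splitting the result into the periods before, during and after the campaign then only requires comparing each (year, month) key with the chosen period boundaries.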

Before the campaign: Tweet activity during the month in which Donald Trump announced his candidacy (June 2015) was by far the most intense of the pre-campaign period. This may indicate that the Russians were enthusiastic about his campaign from the early days. Before the middle of 2015, there was little activity. We cannot be conclusive here because we do not have logs of all the IRA activity, but it seems that the 2016 US election was the IRA's first important target in the US.

During the campaign: We can see a surge in tweets in September and October 2016, the months preceding the presidential election. Looking closely at the data, we see that the frequency is particularly high on October 6th and 7th. This can probably be explained by the approach of the second presidential debate on October 9th. Similarly, there is a (smaller) increase on October 19th for the third presidential debate.

After the campaign: August 2017 is the trolls' favorite month. We can see a spike in tweets from August 12 to August 18, 2017. This can be explained by the fact that on the 12th a white supremacist rally took place in Charlottesville, Virginia, and was followed by peaceful protests. A car crashed into the protesters, killing one and injuring 28. This incident was intensely covered by the US media and stayed in the spotlight for a while, especially after Trump's famous comment: “You had some very bad people in that group. But you also had people that were very fine people, on both sides.” We focus on this month in more detail in a later section.

[Figure: Charlottesville car attack]

[Figure: tweet activity in the different periods]

After the campaign, the number of tweets at any hour of the day is higher than the maximum hourly number of tweets during the campaign. The same holds for the campaign period compared with the period before it. This is very insightful. It suggests that the IRA was much less interested in interfering with the political process before Donald Trump announced his candidacy. Once he did, they strongly increased their tweeting frequency until he won. After the election, the IRA was even more interested in influencing public opinion. Of course, this is only an approximation since we do not have logs of all the IRA’s activity, but it is a good indicator that they step up their effort every time they achieve something.
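The comparison above (every hour of a later period is busier than the busiest hour of the earlier one) can be checked with a short sketch. The period boundaries `CAMPAIGN_START` and `CAMPAIGN_END` are assumptions based on the candidacy announcement and election day, not necessarily the cut-offs used for the plots, and the helper names are hypothetical. The counts are raw totals per hour of day, so periods of different length are not normalized.

```python
from collections import Counter
from datetime import datetime

# Assumed period boundaries (candidacy announcement and the day after election day).
CAMPAIGN_START = datetime(2015, 6, 16)
CAMPAIGN_END = datetime(2016, 11, 9)


def period_of(ts):
    """Assign a timestamp to the 'before', 'during' or 'after' period."""
    if ts < CAMPAIGN_START:
        return "before"
    if ts < CAMPAIGN_END:
        return "during"
    return "after"


def hourly_profile(timestamps):
    """Count tweets per hour of day (0-23) for each period."""
    profile = {p: Counter() for p in ("before", "during", "after")}
    for ts in timestamps:
        profile[period_of(ts)][ts.hour] += 1
    return profile


def dominates(profile, later, earlier):
    """True if every hour of `later` beats the busiest hour of `earlier`."""
    busiest_earlier = max(profile[earlier].values(), default=0)
    quietest_later = min(profile[later][h] for h in range(24))
    return quietest_later > busiest_earlier


# Toy data: 3 tweets per hour after, 2 per hour during, a single tweet before.
stamps = [datetime(2017, 8, 12, h) for h in range(24) for _ in range(3)]
stamps += [datetime(2016, 10, 7, h) for h in range(24) for _ in range(2)]
stamps += [datetime(2014, 1, 1, 12)]
profile = hourly_profile(stamps)
print(dominates(profile, "after", "during"))   # True
print(dominates(profile, "before", "during"))  # False
```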

[Figure: left vs. right trolls]

Overall, right trolls tweet more than left trolls, and after the election they tweet almost twice as much. The idea is that left trolls divide the left and liberal agenda and, to some extent, make way for the conservative and right agenda to prevail. It is also an effective way to polarize American society, increase stigma and make people focus on what divides them. After the campaign, this is no longer the objective: the goal becomes to better defend the president-elect and discredit his critics.

Topic Extraction

To extract the topics that the trolls target, we used Latent Dirichlet Allocation (LDA) and ran it multiple times with different numbers of topics. Since this is an unsupervised learning algorithm, it is hard to evaluate the resulting topics automatically. As a first measure, we used the coherence score to choose an optimal number of topics. Each topic is characterized by a list of words, and the coherence score assigns a similarity score to each pair of words. The resulting score is the aggregate of the pairwise scores over the topic's word list.

Coherence [1]

t : Topics coming in from the topic model
S : Segmented topics
P : Calculated probabilities
Phi vector : A vector of the “confirmed measures” coming out from the confirmation module
c : The final coherence value
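To make the idea of an aggregated pairwise score concrete, here is a minimal sketch of a UMass-style coherence, which sums log((D(w_l, w_m) + 1) / D(w_l)) over the word pairs of a topic, where D counts the documents containing the given words. This is a simplified stand-in chosen for illustration, not necessarily the coherence measure used in this analysis, and `umass_coherence` is a hypothetical helper name.

```python
import math
from itertools import combinations


def umass_coherence(topic_words, documents):
    """UMass-style coherence of a ranked topic word list over tokenized documents."""
    doc_sets = [set(doc) for doc in documents]

    def doc_freq(*words):
        # Number of documents that contain all the given words.
        return sum(1 for d in doc_sets if all(w in d for w in words))

    score = 0.0
    for l, m in combinations(range(len(topic_words)), 2):
        w_l, w_m = topic_words[l], topic_words[m]  # w_l is ranked higher than w_m
        d_l = doc_freq(w_l)
        if d_l == 0:
            continue  # the word never occurs in the corpus, so the pair is skipped
        score += math.log((doc_freq(w_l, w_m) + 1) / d_l)
    return score


docs = [
    ["trump", "clinton", "debate"],
    ["trump", "debate"],
    ["apple", "music"],
    ["trump", "clinton"],
]
coherent = ["trump", "clinton", "debate"]
mixed = ["trump", "apple", "music"]
print(umass_coherence(coherent, docs) > umass_coherence(mixed, docs))  # True
```

Words that tend to appear in the same tweets push the score up, while a topic that mixes unrelated words scores lower. Comparing the average score across models with different numbers of topics is one way to pick that number.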

We used pyLDAvis [2], an interactive LDA visualization package for Python, to visualize the topics for different periods and datasets. Figure 1 shows a screenshot of a pyLDAvis result. The area of each circle represents the importance of the topic over the entire corpus, and the distance between the centers of the circles indicates the similarity between topics. For each topic, the bar chart on the right-hand side lists the top 30 most relevant terms.

Given a dataset of tweets, we would like to extract the topics related to each tweet. Moreover, our idea is to associate these topics with specific time windows and the events that happened in those periods. We are especially interested in the periods before, during and after the 2016 US presidential campaign, after which Trump took office as the 45th President on January 20, 2017. To simplify the topic extraction, we focus on three months: one before the campaign, one during the campaign and one after the campaign, each being the month with the highest number of tweets in its period. These months were identified during the exploratory analysis of the Russian dataset.

We proceed with topic extraction on the following datasets:

  • Russian troll tweets
  • Iranian troll tweets
  • General public tweets

For the first dataset, the Russian one, we distinguish between right trolls, left trolls, and right and left trolls analyzed together. This is possible because each Russian troll tweet is labeled with its account category. Such an analysis gives us an overall picture of the topic trends for the different categories of trolls. In particular, the questions we want to answer at this stage are the following:

  • Are the topics for the left and the right tweets meaningful?
  • Are the topics for the left and the right tweets similar or are they different?
  • Do we extract more information by analyzing the left and right tweets all together?

After this analysis we consider the Iranian dataset. Here we analyze the entire dataset without distinguishing between left and right trolls, since the Iranian tweets are not labeled by troll category. The main questions we would like to address in this part of the work are the following:

  • Are there differences among the Iranian tweets in the three different periods (i.e. before, during and after the campaign)?
  • Are the Iranian tweets consistent with the Russian ones in the different periods?

Finally, we look at a specific event: the Charlottesville car attack on August 12, 2017. For this event, the entire month of August 2017 was analyzed for the Russian and Iranian trolls' tweets. We also extracted a dataset of general public tweets for August 2017. Here we are interested in answering the following questions:

  • Can we find in each dataset the expected topic, which is the Charlottesville car attack?
  • Are the other topics consistent across the datasets, i.e. Russian, Iranian and general public tweets?

The analysis is structured as follows:

  • Definition of functions for the topic analysis
  • Russian trolls tweets analysis
  • Iranian trolls tweets analysis
  • General public analysis

It is important to underline that, although a consistent procedure was adopted, assigning a topic label to each sequence of keywords involves some heuristics and depends on the judgment of the human annotator. Moreover, topic extraction is more challenging in this context because the whole dataset relates to the same broad topic, mainly politics, so some overlap among subtopics is always present.

Matching
Based on the keywords of each topic, we can manually name the topic using prior knowledge about the subject or simply by googling.

Media on terrorism, Malala in America, Yemen missile attacks, Bomb Kuwait, Nuclear deal with Iran, Ramadan

For what we refer to as the “before campaign” period, the topic analysis of right and left tweets gives a good picture of the principal events of June 2015. The topics are therefore meaningful by themselves. Looking at right and left trolls separately, some interesting observations can be made.
First of all, the right trolls cover more topics than the left trolls. Moreover, the right trolls are more active on political themes, and Obama is frequently cited by both the left and the right trolls. Several political topics are covered. For example, the coalition between Barack Obama and Hillary Clinton appears for both the right trolls (topic 34) and the left trolls (topic 27), as described in the article “Democrats together”. The Asia trade deal appears in topic 1 of the right trolls (see “Asia trade deal”), e.g. in the tweet “ObamaRyanTrade Defenses This write trade rules China”, and same-sex marriage in topic 3 of the right trolls (see “LGBTQ rights”). Obama appears for the left trolls in topic 3, e.g. in the tweet “York Times release President Obama College Transcripts fooled they blew Rubio traffic tickets”. Michelle Obama is also cited by the right trolls in topic 6, as in the tweet “TopVideo there make happen Michelle Obama graduates” (see “Michelle Obama commencement speech”).

Very interestingly, Trump is cited only by the right trolls, in topic 25 (e.g. in the tweet “Trump Obama MANY Decisions Anybody That Incompetent Intentional Degradation tcot”).

Topics related to the news of that period are also covered, for example the case of Charleston shooter Dylann Roof, cited by the right trolls (see “Charleston shooting”), or the Detroit pastor charged with killing a transgender woman, again cited by the right trolls in topic 21 (see “Pastor indicted for murder”). On the left trolls' side, we can see the family member charged with a pregnant teen’s beating and forced abortion in topic 0 (see “Teen abortion”) and the McKinney video in topic 16 (see “McKinney”).

Several other topics can also be found. For example, both right and left trolls commented on Apple: the right trolls on Apple investments (“Spotify gets investment faces competition from Apple business,United States,English”) and the left trolls on Apple events (“Apple biggest event year later already heard lots about what happening”). Finally, the right trolls also commented on a Florida city ban (see “Florida officials ban drinking on the beach”).

Finally, as expected, considering all the trolls together did not add any new information.

In the period during the campaign, topics related to Trump are more frequent than in the previous period. Right and left trolls commented on different events related to Trump. For example, among the right trolls it is possible to find themes related to the Trump and Pence relationship (topic 0, see “Trump and Pence”), the presidential election (topic 7), the final Trump-Clinton debate (topic 26, see “Final presidential debate”), Donald Trump's trip to Gettysburg (see “Trump in Gettysburg”) and many others. The left trolls, on the other hand, have only two topics related to Trump, one of which also appears among the right trolls' topics: the final Trump-Clinton debate (topic 6). The right trolls therefore clearly pay the most attention to the figure of Trump.
The rest of the left trolls' topics are really sparse: some are related to movies (e.g. topic 13), others to magazines and gossip (e.g. topic 35), etc. Obama is still a commented topic, but less frequently than in the period before the campaign.

Hillary Clinton also appears, as in topic 3 of the right trolls, which comments on the possibility that WikiLeaks was working with Russian state actors seeking to elect Donald Trump.

Also in this case the topics for each troll category are clear and there is some overlap, for example in the case of Trump. However, clear differences can be found between the right and left trolls, for example in how frequently some topics are discussed and in the fact that, again, the right trolls' tweets seem more related to politics.

Again, considering all trolls together did not add a lot to the analysis.

The period after the campaign is rich in topics for both the right and the left trolls. 54% of the topics in the right trolls' tweets concern Donald Trump in different contexts, for example his support for the white nationalist movement (topic 6, see “Charlottesville white nationalists”) or his pardon of the notorious Sheriff Joe Arpaio (topic 16, see “Trump pardon Sheriff Arpaio”). By contrast, only one topic of the left trolls relates to Trump, specifically to the Obama-Trump pairing (e.g. in the tweets “Congressional eager investigate Obama administration notably cooler idea that Trump” and “first month Trump family trips security will cost more than roughly that years Obama”). The other topics in the left trolls' tweets are really varied.

Very interesting is the Charlottesville topic, commented on by right trolls in topics 3 and 8 and by left trolls in topic 27. In Charlottesville, Virginia, white nationalists descended on Saturday to protest the removal of a statue of a Confederate general, but the protest soon spiraled out of control into a violent, chaotic and deadly scene as the day went on (http://sunshinestatenews.com/story/florida-voices-react-violent-charlottesville-protest).
Let’s dig deeper into August 2017 and in particular the Charlottesville attack.

In all three cases (Russian trolls, Iranian trolls and general public) one can find something related to Charlottesville. The association with Trump can also be found in most cases for each category. Looking at some tweets associated with Charlottesville, it is possible to find a few differences, mostly between the Russian trolls on one side and the Iranian trolls and the general public on the other. From a first qualitative analysis, the Russian trolls' tweets related to Charlottesville seem to be more politically oriented (e.g. some tweets are Charlottesville Terrorist ANTI TRUMP - President Trump Reacts Charlottesville Tragedy - Hillary Reacts Charlottesville Blaming White Supremacists - Marine Veteran Reveals TRUTH about Charlottesville Tragedy). In the other cases the tweets seem to be less aggressive, both for the Iranian trolls (e.g. some tweets are trust Facebook more than news Trump supporters explain they think Charlottesville setup - Trump says racism evil condemns Nazis white supremacists following Charlottesville protester death - Sessions concedes Charlottesville domestic terrorism DomesticTerrorism JeffSessions Social year since Charlottesville Time bring back cartoon drew last year Rather than condemn racist - White supremacists hold torch rally Charlottesville) and for the general public (e.g. Republicans conservatives defending Trump Charlottesville morally bankrupt - Phoenix Mayor Greg Stanton Calls Trump Delay Rally Wake Charlottesville - heartbroken over Charlottesville will step down from Trump Manufacturing Council - Rosa Parks Daughter Praises Trump Response Charlottesville).
We can then conclude that, in that period, topics related to Charlottesville are present among other topics and are treated in slightly different ways by the three categories analyzed.

Let’s now look at some of the differences between the Russian and Iranian trolls:

After analysing the Russian and Iranian trolls' tweets during the three periods before, during and after the campaign, we first notice that the Iranian topics are more centered on Iran-related problems and on Muslim matters. The US component is more marginal, especially in the period before the campaign. In that period we also notice that the Iranian trolls are less active than the Russian ones in terms of the number of different topics covered. The Iranian trolls' activity starts increasing in the other two periods, during and after the campaign, where Trump-related topics are well covered.

[Figure: word cloud of the Iranian trolls dataset]
The word cloud above highlights the most frequent words in the Iranian trolls dataset. We can see that the subjects are mostly regional, such as the Israeli-Iranian tension, Saudi Arabia and the conflict in Yemen, Syria and ISIS, Turkey and Russia for their direct influence over the situation in Syria and, of course, the United States through Trump and Obama. The Iranian trolls' focus is narrow and concentrates on matters that are closely related to Iran’s interests.

[Figure: word cloud of the Russian trolls dataset]
On the other hand, the Russian trolls dataset shows a wider spectrum of subjects such as US politics (Trump, Obama, Clinton), sports, news and business.

Comparing both word clouds shows that the Russian operation is much more sophisticated than the Iranian one, in the sense that the Russian trolls tried to blend in with the public and seem more casual by tweeting about apolitical topics, whereas the Iranian trolls seem more robotic and more straightforward.

We used word2vec on the most frequent words in both datasets:

In the IRA data, we find it interesting that many tweets talk about the wider world, invoking words like Xinjiang, Tibet, Myanmar and Sudan, all zones of conflict with the respective governments.

Concerning the Iranian troll data, we see the same regional interests as before, with a focus on Israel, Yemen and Saudi Arabia. Both tables show Trump as a frequent word, but the correlated words are different: in the IRA tweets, Trump does not carry any particular connotation, whereas in the Iranian tweets we see “impeachtrump”, “notmypresident” and “lynch”. This is very telling in that the Russian and Iranian perspectives are different: one tries to boost Trump and aid him, the other tries to disparage him. Looking at the geopolitical situation, we know that Russia meddled in the US election to help Trump win against Clinton. On the other hand, Trump withdrew from the nuclear agreement with Iran and reinstated economic sanctions on the Iranian government. What is ironic is that even though the two operations have contradictory goals, Russia and Iran are close allies in the Syrian conflict.

Labeling account categories

As we have said, only the first Russian dataset is labeled with account categories. It would be interesting to see the distribution of left and right trolls in the unlabeled data. The idea is to build a classifier using the labeled account categories of the Russian trolls and then classify the second Russian dataset into right troll, left troll and other classes. Although we get a pretty good accuracy (89%) when testing the model on unseen labeled data, it is hard to estimate the model’s performance on unlabeled data. Hashtags usually give a good idea of the political affiliation of the tweet sender. We therefore mapped hashtags to the three classes, discarding noisy hashtags that are ironic or whose meaning depends on the context. We then used this noisy labeling to evaluate the performance of the model. We only get 60.3% accuracy.
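The noisy-label evaluation can be sketched as follows. The hashtag mapping below is purely hypothetical (the mapping actually used is not listed in this story), as are the helper names `noisy_label` and `noisy_accuracy`; the predictions stand in for the output of whatever classifier is being evaluated.

```python
import re
from collections import Counter

# Hypothetical hashtag-to-class mapping, for illustration only.
HASHTAG_CLASS = {
    "maga": "RightTroll",
    "tcot": "RightTroll",
    "impeachtrump": "LeftTroll",
    "notmypresident": "LeftTroll",
    "news": "Other",
}

HASHTAG_RE = re.compile(r"#(\w+)")


def noisy_label(text, mapping=HASHTAG_CLASS):
    """Majority vote over mapped hashtags; None if nothing maps or the vote ties."""
    votes = Counter(mapping[h] for h in HASHTAG_RE.findall(text.lower()) if h in mapping)
    if not votes:
        return None
    ranked = votes.most_common(2)
    if len(ranked) == 2 and ranked[0][1] == ranked[1][1]:
        return None
    return ranked[0][0]


def noisy_accuracy(texts, predictions, mapping=HASHTAG_CLASS):
    """Accuracy of the predictions on the tweets that received a noisy label."""
    pairs = [(noisy_label(t, mapping), p) for t, p in zip(texts, predictions)]
    labeled = [(y, p) for y, p in pairs if y is not None]
    if not labeled:
        return None
    return sum(y == p for y, p in labeled) / len(labeled)


texts = [
    "Make it happen #MAGA #tcot",
    "#NotMyPresident march today",
    "Good morning everyone",
    "#maga #impeachtrump",
]
predictions = ["RightTroll", "RightTroll", "Other", "LeftTroll"]
print(noisy_accuracy(texts, predictions))  # 0.5
```

Tweets without a mapped hashtag, or with conflicting hashtags, are left out of the evaluation, so the resulting accuracy only describes the subset of tweets that carry a clear hashtag signal.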

We also wanted to run the classifier on the Iranian trolls dataset, but we noticed that the overlap between the hashtags is small, so the noisy labeling we did does not work in this situation. Below we can see the most popular hashtags in both the second Russian trolls dataset and the Iranian trolls dataset.

[Figure: most popular hashtags in the second Russian trolls dataset]
[Figure: most popular hashtags in the Iranian trolls dataset]

Thank you for reading our data story!

References

[1] Coherence score
[2] pyLDAvis
[3] Theme
[4] Code repository