Using tweets from the 2016 US presidential election period, this project analyzes groups of Twitter users based on political preferences and feelings about the election. Tweets fell into five clusters: three pertaining to candidate support and two referring to more general topics (voting and the election as a whole). Finally, the subset of tweets with geolocation enabled was mapped by sentiment and cluster.
Motivation and Application
Twitter acts as a canvas for spontaneous and raw emotional expression. Unlike most social platforms, it attracts users with its simplicity of content production rather than a breadth of features. Impulsive thoughts are often our most honest thoughts, and Twitter is a massive space where people speak their minds spontaneously. The world had much to say about the 2016 US election, and Twitter offers opportunities for exploration.
The data was gathered from the Twitter Streaming API over a three-day period, starting the day before the 2016 presidential election and ending the day after it. Tweets were loaded as a collection of JSON objects. Each tweet contains user information, geolocation (if enabled), the hashtags used, the tweet body text, and additional fields pertaining to the tweet and user.
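As a rough illustration of the tweet structure described above, the sketch below parses one tweet and pulls out the fields used later. The sample tweet, the `extract_fields` helper, and the choice of fields are hypothetical; the key names follow the Twitter API v1.1 tweet layout, where `coordinates` is a GeoJSON point in longitude, latitude order and is null when geolocation is disabled.

```python
import json

# Hypothetical, heavily trimmed tweet in the Twitter API v1.1 JSON layout.
raw = '''{
  "text": "Just voted! #ivoted #election2016",
  "user": {"screen_name": "example_user", "location": "Boston, MA"},
  "coordinates": {"type": "Point", "coordinates": [-71.06, 42.36]},
  "entities": {"hashtags": [{"text": "ivoted"}, {"text": "election2016"}]}
}'''


def extract_fields(tweet):
    """Pull out the body text, hashtags, user name, and location of a tweet."""
    coords = tweet.get("coordinates")  # None unless geolocation was enabled
    lon, lat = coords["coordinates"] if coords else (None, None)
    hashtags = [h["text"].lower()
                for h in tweet.get("entities", {}).get("hashtags", [])]
    return {
        "user": tweet.get("user", {}).get("screen_name"),
        "text": tweet.get("text", ""),
        "hashtags": hashtags,
        "lon": lon,
        "lat": lat,
    }


record = extract_fields(json.loads(raw))
print(record)
```

Tweets without a location simply yield `None` for `lon` and `lat`, which makes it easy to filter for the geolocated subset later.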
The entire stream of data contained approximately 12 million tweets. This was not feasible to process in full, so we used a reservoir sampling algorithm (Algorithm R) to draw a random sample of 300,000 tweets representative of the total. The sample was then clustered according to tf-idf scores. After calculating the tf-idf scores, stemming and vectorizing the tweets, and applying principal component analysis, K-means clustering was used to find distinct groups in the tweets. Lastly, sentiment analysis was conducted on the geolocated tweets so that they could be mapped by sentiment and cluster.
The dataset was built from tweets streamed through the Twitter Streaming API using tweepy, a Python interface for this API. The tweets span 7 November 2016, 13:27 to 9 November 2016, 20:50, covering the period before, during, and after the 2016 US presidential election. In total, 102 GB of tweets were collected.
The following keywords and trending hashtags were used as filters in our data retrieval:
'election', 'donald', 'trump', 'hillary', 'clinton', 'debates', 'vote', 'politics', 'ballot', 'obama', 'equality', '#election2016', '#electionday', '#ivoted', '#imwithher', '#makeamericagreatagain', '#2016election', '#lockherup', '#deleteyouraccount', '#crookedhillary', '#nevertrump', '#feelthebern', '#blacklivesmatter', '#imvotingbecause', '#thirdparty', '#garyjohnson', '#electionfinalthoughts'
To retrieve a representative sample from the full set of 12 million tweets, a reservoir sampling algorithm called Algorithm R was used. It is an O(n) algorithm that requires only one pass through the dataset S to draw a random sample of the desired size. The algorithm creates a reservoir array of size k and fills it with the first k items of S. Each later item S[i] then replaces a randomly chosen reservoir entry with probability k/i. The pseudo-code for the algorithm is as follows:
ReservoirSample(S[1..n], R[1..k])
    for i = 1 to k:              // Initially fill Reservoir R
        R[i] := S[i]
    for i = k+1 to n:
        j := random(1, i)        // randomly replace items in R
        if j <= k:
            R[j] := S[i]
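The pseudo-code above translates almost line for line into Python. The following is a minimal sketch rather than the project's actual code: the function name, the optional `rng` argument, and the small example stream are assumptions made for illustration.

```python
import random


def reservoir_sample(stream, k, rng=random):
    """Return k items chosen uniformly at random from an iterable, in one pass."""
    reservoir = []
    for i, item in enumerate(stream, start=1):
        if i <= k:
            # Fill the reservoir with the first k items.
            reservoir.append(item)
        else:
            # Item i replaces a random slot with probability k / i.
            j = rng.randint(1, i)  # inclusive on both ends, like random(1, i)
            if j <= k:
                reservoir[j - 1] = item
    return reservoir


# Small stand-in for the tweet stream; a seeded generator keeps runs repeatable.
sample = reservoir_sample(range(1, 1001), 10, random.Random(0))
print(sample)
```

Because the stream is consumed item by item, the same function works on a file of tweets that is too large to hold in memory; only the k sampled items are ever stored.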
After retrieving the sample, tweets were clustered based on tf-idf scores. Tf-idf was helpful because it allowed us to re-weight the word counts in tweets and focus on the words most likely to differentiate between clusters. After tf-idf scores were calculated for the tweets, the terms were stemmed and vectorized. Then, principal component analysis was applied to reduce the dimensionality of the vectors. A tool was needed to group multivariate data in which many topics contribute to tweet similarity. An appropriate solution was K-means clustering, which was run to find inherent groups in the tweets. To find an appropriate number of clusters, the silhouette score was plotted (on a sample of 10,000 tweets) for cluster counts from two to ten. Five clusters was a good compromise between efficiency and accuracy.
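To make the re-weighting concrete, here is a minimal sketch of tf-idf on three toy documents. It uses the plain textbook formula (term frequency times the log of inverse document frequency); the `tf_idf` function and the toy documents are hypothetical, and library implementations typically add smoothing, so their numbers will differ from these.

```python
import math
from collections import Counter


def tf_idf(docs):
    """Return one {term: weight} dict per tokenized document."""
    n_docs = len(docs)
    # Document frequency: in how many documents each term appears.
    df = Counter(term for doc in docs for term in set(doc))
    weights = []
    for doc in docs:
        counts = Counter(doc)
        weights.append({
            # Term frequency in this document times inverse document frequency.
            term: (count / len(doc)) * math.log(n_docs / df[term])
            for term, count in counts.items()
        })
    return weights


docs = [
    "vote trump election".split(),
    "vote hillary election".split(),
    "vote early election day".split(),
]
for doc_weights in tf_idf(docs):
    print(doc_weights)
```

Words that occur in every document ("vote", "election") receive a weight of zero, while words that occur in only one document ("trump", "hillary") keep a positive weight. This is the property that makes tf-idf vectors useful for separating clusters.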
After well-defined clusters were found, a compound sentiment score was calculated for each geolocation-enabled tweet using NLTK’s VADER. The compound score is a weighted, normalized value between -1 (most negative) and 1 (most positive) that summarizes the overall sentiment of a text.
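The normalization step behind the compound score can be sketched in a few lines. This is an illustration only, assuming the normalization used by the reference VADER implementation (the summed valence divided by the square root of its square plus a constant alpha of 15). The `label` helper and its 0.05 cutoffs follow a common convention for reading compound scores and are not taken from this project.

```python
import math


def normalize(score, alpha=15):
    """Squash an unbounded summed valence score into the range [-1, 1]."""
    norm = score / math.sqrt(score * score + alpha)
    return max(-1.0, min(1.0, norm))


def label(compound, threshold=0.05):
    """Bucket a compound score (assumed convention: +/-0.05 cutoffs)."""
    if compound >= threshold:
        return "positive"
    if compound <= -threshold:
        return "negative"
    return "neutral"


# Hypothetical summed valence scores for three tweets.
for raw_score in (4.0, 0.1, -2.5):
    compound = normalize(raw_score)
    print(raw_score, round(compound, 3), label(compound))
```

Because the function flattens out only for large inputs, short texts with few sentiment-bearing words tend to land near zero, which is consistent with many tweets being scored as neutral.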
Five distinct clusters of tweets were found:
These clusters fall along the lines we expected, including a main supporting cluster for each candidate. Based on our similarity plot of the clusters below, it’s apparent that the “Pro-Hillary” and “Pro-Trump” clusters are farthest from each other. It’s also interesting that the general election news cluster overlaps with the pro-Trump cluster, which may indicate that Trump was mentioned more often in news sources. It’s also notable that the anti-Trump cluster is nearly as large as the pro-Hillary cluster and larger than the pro-Trump cluster. Out of our reservoir sample of 300,000 tweets, 76,058 fell in the pro-Hillary cluster, 74,636 in the anti-Trump cluster, and 48,333 in the pro-Trump cluster.
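The cluster counts are easier to compare as shares of the sample. The short calculation below uses only the numbers reported above and assumes every tweet in the 300,000-tweet sample was assigned to exactly one of the five clusters, as K-means does.

```python
sample_size = 300_000
cluster_sizes = {
    "pro-Hillary": 76_058,
    "anti-Trump": 74_636,
    "pro-Trump": 48_333,
}

# Share of the sample in each candidate-related cluster.
shares = {name: n / sample_size for name, n in cluster_sizes.items()}

# Whatever is left belongs to the two general-topic clusters combined.
remaining = sample_size - sum(cluster_sizes.values())

for name, share in shares.items():
    print(f"{name}: {share:.1%}")
print(f"two general clusters combined: {remaining} ({remaining / sample_size:.1%})")
```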
The sentiment of the tweets around the election was surprisingly neutral. However, we suspect this has more to do with the sentiment analysis algorithm’s limited ability to detect subtle nuances in tweet language. This election was extremely emotionally charged and left few people across the United States feeling neutral, so improvements to the algorithm are necessary.
In general, most tweets came from the Eastern United States, especially along the coasts. International mapping of tweets by location produced fascinating results. The visualization below shows that tweets from countries outside the US express little support for Trump, and that support is localized to the US.