R: Twitter Sentiment Analysis

Having a solid understanding of current public sentiment can be a great tool. When deciding if a new marketing campaign is being met warmly, or if a news release about the CEO is causing customers get angry, people in charge of handling a company’s public image need these answers fast. And in the world of social media, we can get those answers fast. One simple, yet effective, tool for testing the public waters is to run a sentiment analysis.

A sentiment analysis works like this. We take a bunch of tweets about whatever we are looking for (in this example we will be looking at President Obama). We then parse those tweets out into individual words and we count the number of positive words and compare it to the number of negative words.

Now the simplicity of this model misses out on some things. Sarcasm can easily missed. Ex. “Oh GREAT job Obama. Thanks for tanking the country once again”. Our model will count 2 positive words (Great and Thanks) and 1 negative word (tanking) giving us an overall score of positive 1.

There are more complex methods for dealing with the issue above, but you’ll be surprised at how good the system works all by itself. While, yes we are going to misread a few tweets, we have the ability to read thousands of tweets, so the larger volume of data negates the overall effect of the sarcastic ones.

First thing we need to do is go get a list of good and bad words. You could make your own up, but there are plenty of pre-populated lists on the Internet for free. The one I will be using is from the University of Illinois at Chicago. You can find the list here:

http://www.cs.uic.edu/~liub/FBS/sentiment-analysis.html

Once you go to the page, click on Opinion Lexicon and then download the rar file.

You can dowload from the link below, but I want you to know the source in case this link breaks.

Now open the rar file and move the two text files to a folder you can work from.

Next let’s make sure we have the right packages installed. For this we will need, TwitteR, plyr, stringr, and xlsx. If you do not  have these packages installed, you can do so using the following code. (just change out TwitteR for whatever package you need to install)

install.packages("TwitteR")

Now load the libraries

library(stringr)
library(twitteR)
library(xlsx)
library(plyr)

and connect to the Twitter API. If you do not already have a connection set up, check out my lesson on connecting to Twitter: R: Connect to Twitter with R

api_key<- "insert consumer key here"
api_secret <- "insert consumer secret here"
access_token <- "insert access token here"
access_token_secret <- "insert access token secret here
setup_twitter_oauth(api_key,api_secret,access_token,access_token_secret)

Okay, so now remember where you stored the text files we just downloaded and set that location as your working directory (wd). Note that we use forward slashes here, even if you are on a Windows box.

setwd("C:/Users/Benjamin/Documents")
neg = scan("negative-words.txt", what="character", comment.char=";")
pos = scan("positive-words.txt", what="character", comment.char=";")

scan looks through the text files and pulls words that start with characters and ignores comment lines that start with ;

You should now have 2 lists of positive and negative words.

You can add words to either list using a  vector operation. Below I added wtf – a popular Internet abbreviation for What the F@#$@ to the negative words

neg = c(neg, 'wtf')

Okay, now here is the engine that runs our analysis. I have tried to comment on what certain commands you may not recognize do.  I have lessons on most features listed here, and will make more lessons on anything missing. If I were to try to explain this step by step, this page would be 10000 lines long and no one would read it.

score.sentiment = function(tweets, pos.words, neg.words)
 
{
 
require(plyr)
require(stringr)

scores = laply(tweets, function(tweet, pos.words, neg.words) {



tweet = gsub('https://','',tweet) # removes https://
tweet = gsub('http://','',tweet) # removes http://
tweet=gsub('[^[:graph:]]', ' ',tweet) ## removes graphic characters 
       #like emoticons 
tweet = gsub('[[:punct:]]', '', tweet) # removes punctuation 
tweet = gsub('[[:cntrl:]]', '', tweet) # removes control characters
tweet = gsub('\\d+', '', tweet) # removes numbers
tweet=str_replace_all(tweet,"[^[:graph:]]", " ") 

tweet = tolower(tweet) # makes all letters lowercase

word.list = str_split(tweet, '\\s+') # splits the tweets by word in a list
 
words = unlist(word.list) # turns the list into vector
 
pos.matches = match(words, pos.words) ## returns matching 
          #values for words from list 
neg.matches = match(words, neg.words)
 
pos.matches = !is.na(pos.matches) ## converts matching values to true of false
neg.matches = !is.na(neg.matches)
 
score = sum(pos.matches) - sum(neg.matches) # true and false are 
                #treated as 1 and 0 so they can be added
 
return(score)
 
}, pos.words, neg.words )
 
scores.df = data.frame(score=scores, text=tweets)
 
return(scores.df)
 
}

Now let’s get some tweets and analyze them. Note, if your computer is slow or old, you can lower the number of tweets to process. Just change n= to a lower number like 100 or 50

tweets = searchTwitter('Obama',n=2500)
Tweets.text = laply(tweets,function(t)t$getText()) # gets text from Tweets

analysis = score.sentiment(Tweets.text, pos, neg) # calls sentiment function

Now lets look at the results. The quickest method available to us is to simply run a histogram

hist(analysis$score)

My results looks like this

2016-11-25_14-39-39

If 0 is completely neutral most people are generally neutral about the president and more people have positives tweets then negatives ones. This is not uncommon for an outgoing president. They generally seem to get a popularity boost after the election is over.

Finally, if you want to save your results, you can export them to excel.

write.xlsx(analysis, "myResults.xlsx")

And you will end up with a file like this

2016-11-25_14-44-06.jpg

11 thoughts on “R: Twitter Sentiment Analysis

  1. Dear Mr. Larson,
    I really enjoy your tutorial on R. But right now I have a little problem: If I try to use the above code with the given instructions an error occurs that says, that the object ‘analysis.score’ couldn’t be found (used in the hist() method). Everything else works fine.

    Hope you can help me and that you keep going with tutorials on analytics with R.

    Warm regards
    Florian Dahlitz.

    1. Florian,

      Great Catch!! I fixed the code on the page. The issue was I was using hist(analysis.score) when I should have been using hist(analysis$score). Analysis is a dataframe and you call elements of a dataframe using the $ not a period “.”

      Sorry if this caused any frustration.

  2. Dear Mr. Larson,
    I really enjoy your tutorial on R. But right now I have a little problem: how to extract particular tweets from a particular user timeline,like if i want to extract the tweets of Infosys based on the topic manufacturing,and how can i extract the comments of a tweet to perform sentiment analysis on it.

  3. Hi please help me out with this error

    Error in sort.list(y) :
    invalid input ‘RT @FriendlyJMC: Pardons Granted by President Barack Obama (2009-2017) | PARDON | Department of Justice

    Lí ½í±€k how many drug dealers were on…’ in ‘utf8towcs’

  4. Nitesh

    Hi,
    It will create a dataframe scores.df with two columns “score” and “text”. Where scores and tweets variables are used to feed data into score and text which they have. Finally return(scores.df) is the value returned by the function and it will be returning scored.df as final result.

  5. DISHA MALHOTRA

    Error in curl::curl_fetch_memory(url, handle = handle) :
    Couldn’t connect to server

    i am getting this error at below execution

    setup_twitter_oauth(customer_key,customer_secret,access_token = NULL,access_secret = NULL)

    can you plz suggest????

Leave a Reply