R is an interpreted language, meaning the code is run as it is read, kind of like a musician who plays music while reading it off the sheet (note by note). As a result, R does not run loops as efficiently as compiled languages like C or Java. To address this issue, R has some interesting workarounds. One of my favorites is gsub, which works on a whole vector in a single call, no loop required.
Here is how gsub works. Take the sentence “Bob likes dogs”. Using gsub I can replace any element of that sentence. So I can say replace “dogs” with “cats” and the sentence would read “Bob likes cats”. Kind of cool all by itself, but it is even cooler when dealing with a larger data set.
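As a quick sketch of that sentence example (the variable names are mine, picked just for illustration):

```r
# Hypothetical variable name, used only for this illustration
sentence <- "Bob likes dogs"

# gsub(pattern, replacement, x): swap every "dogs" for "cats"
new_sentence <- gsub("dogs", "cats", sentence)

print(new_sentence)
# [1] "Bob likes cats"
```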
Let’s set x to a vector of 3 elements:
x <- c("the green ball", "Bob likes the dog", "Sally is the best runner in the group")
Now let’s run a gsub command. The syntax is gsub(what to replace, what to replace it with, data source):
x <- gsub("the", "a", x)
This short line of code replaces every “the” in the vector with “a”. It works just as well on a vector of 1000 elements as it does on this small vector of 3.
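One thing worth knowing before going further: gsub treats its first argument as a regular expression and matches it anywhere, including inside longer words. The vector above happens to avoid that, but here is a small sketch of what can go wrong and one way around it. The vector y is made up for this example, and using word boundaries with perl = TRUE is my own choice, not the only option.

```r
# Made-up example vector (not from the data used in this post)
y <- c("the other dog", "bathe the cat")

# Plain pattern: also hits the "the" hiding inside "other" and "bathe"
plain <- gsub("the", "a", y)

# Word boundaries (\\b) limit the match to the standalone word "the"
whole_word <- gsub("\\bthe\\b", "a", y, perl = TRUE)

print(plain)
# [1] "a oar dog" "baa a cat"
print(whole_word)
# [1] "a other dog" "bathe a cat"
```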
Okay, now for my personal pet peeve when it comes to learning this stuff: show me a practical use for it. So here we go, a practical, data-science-based use for gsub.
Check out this selection of tweets I pulled from Twitter. Notice the annoying RT (“retweet”) at the beginning of most of the tweets. I want to get rid of it. When doing a sentiment analysis, knowing that something is a retweet does little for me.
So I use gsub to get rid of the RT.
Now I have an empty space I want to get rid of.
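A minimal sketch of that step, assuming the tweets sit in a character vector. The tweets below are made up to stand in for the real data, and anchoring the pattern with ^ (so only an RT at the very start is removed) is my assumption about what is wanted here.

```r
# Made-up tweets standing in for the real Twitter data
tweets <- c("RT @user1: R is great",
            "Loving gsub today",
            "RT @user2: vectors rock")

# "^RT" matches RT only at the very start of a tweet
no_rt <- gsub("^RT", "", tweets)

print(no_rt)
# [1] " @user1: R is great"   "Loving gsub today"     " @user2: vectors rock"
```

Notice that the tweets that started with RT now start with a space.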
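One way to sketch the cleanup is another gsub call that strips whitespace from the start of each string. The input below is made up to mimic tweets with a leftover leading space, and the variable names are only for illustration.

```r
# Made-up strings mimicking tweets with a leftover leading space
no_rt <- c(" @user1: R is great",
           "Loving gsub today",
           " @user2: vectors rock")

# "^\\s+" matches one or more whitespace characters at the start
clean <- gsub("^\\s+", "", no_rt)

print(clean)
# [1] "@user1: R is great"   "Loving gsub today"    "@user2: vectors rock"
```

Base R's trimws() does a similar job if you would rather not write the pattern yourself.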
And if you are wondering how I got that Twitter data, don’t worry: you don’t need any expensive software. I did it all with R for free and I will show you how to do it too. Stay tuned.