R: Text Mining (Pre-processing)

This is part 2 of my Text Mining Lesson series. If you haven’t already, please check out part 1 that covers Term Document Matrix: R: Text Mining (Term Document Matrix)

Okay, now I promise to get to the fun stuff soon enough here, but I feel that in most tutorials I have seen online, the pre-processing of text is often glanced over. It was always (and often still is) a real sore spot for me when assumptions are made as to my knowledge level. If you are going to throw up a block of code, at least give a line or two explanation as to what the code is there for, don’t just assume I know.

I remember working my way through many tutorials where I was able to complete the task by simply copying the code, but I didn’t have a full grasp of what was happening in the middle. I this lesson, I am going to cover some of the more common text pre-processing steps used in the TM library. I am going to go into some level of detail and make some purposeful mistakes so hopefully when you are done here you will have a firm grasp on this very important step in the text mining process.

Let’s start by getting our libraries

install.packages("tm") # if not already installed
install.packages("SnowballC")

library(tm)
library(SnowballC)

Now, let’s load our data. For this lesson we are going to use a simple vector.

wordVC <- c("I like dogs!", "A cat chased the dog.", "The dog ate a bone.", 
            "Cats make fun pets.", "The cat did what? Don't tell the dog.", 
            "A cat runs when it has too, but most dogs love running!")

Now let’s put this data into a corpus for text processing.

corpus <- (VectorSource(wordVC))
corpus <- Corpus(corpus)
summary(corpus)

Here is the output

2016-12-22_09-11-53.jpg

Now for a little frustration. Let’s say you want to see what is in text document 4. You could try

inspect(corpus[4])

But this will be your output, not really what you are looking for.

2016-12-22_09-23-30.jpg

If you want to see the actual text- try this instead

corpus[[4]]$content

Now you can see the text

2016-12-22_09-26-21.jpg

As we go through the text pre-processing, we are going to use the following For Loop to examine our corpus

for (i in 1:6) print (corpus[[i]]$content)

Output

2016-12-22_09-30-41.jpg

Punctuation

Punctuation generally adds no value to text mining when utilizing standard numeric based data mining algorithms like clustering or classification. So it behooves us to just remove it.

To do so, the tm package has a cool function called tm_map() that we can pass arguments to, such as removePunctuation

corpus <- tm_map(corpus, content_transformer(removePunctuation))

for (i in 1:6) print (corpus[[i]]$content)

Note, you do not need the for loop, I am simply running it each time to show you the progress. 

Notice all the punctuation is gone now.

2016-12-22_09-34-28.jpg

Stopwords

Next we are going to get rid of what are known as stopwords. Stopwords are common words such as (the, an, and, him, her). These words are so commonly used that they provide little insight as to the actual meaning of the given text.  To get rid of them, we use the following code.

corpus <- tm_map(corpus, content_transformer(removeWords), 
          stopwords("english"))
for (i in 1:6) print (corpus[[i]]$content)

If you look at line 2, “A cat chase  dog”, you will the word “the” has been removed. However, if you look at the next line down, you will notice “The” is still there.

2016-12-22_09-39-52

WHY?

Well it comes down to the fact that computers do not treat T and t as the same letter, even though they are. Capitalized letters are viewed by computers as separate entities. So “The” doesn’t match “the” found in the list of stopwords to remove.

For a full list of R stopwords, go to: https://github.com/arc12/Text-Mining-Weak-Signals/wiki/Standard-set-of-english-stopwords

So how do we fix this?

tolower

Using tm_map with the “tolower” argument will make all the letters lowercase. If we then re-run our stopwords command, you will see all the “the” are gone

corpus <- tm_map(corpus, content_transformer(tolower))
corpus <- tm_map(corpus, content_transformer(removeWords), 
          stopwords("english"))
for (i in 1:6) print (corpus[[i]]$content)

Output

2016-12-22_09-49-25

stemming

Next we will stem our words. I covered this in the last lesson, but it bears repeating. What stemming does is attempt to remove variants of words. In our example, pay attention to the following words (dog, dogs, cat, cats, runs, running)

corpus <- tm_map(corpus, stemDocument)
for (i in 1:6) print (corpus[[i]]$content)

Notice the words are now (dog, cat, run)

2016-12-22_09-55-02.jpg

Whitespace

Finally, let’s get rid of all this extra white space we have now.

corpus <- tm_map(corpus, stripWhitespace) 
for (i in 1:6) print (corpus[[i]]$content)

Output

2016-12-22_10-01-01.jpg

removeNumbers

I didn’t use this argument with my tm_map() function today because I did not have any numbers in my text. But if I, the command would be as follows

corpus <- tm_map(corpus, content_transformer(removeNumbers))