This is part 2 of my Text Mining Lesson series. If you haven’t already, please check out part 1, which covers the Term Document Matrix: R: Text Mining (Term Document Matrix)
Okay, now I promise to get to the fun stuff soon enough here, but I feel that in most tutorials I have seen online, the pre-processing of text is often glossed over. It was always (and often still is) a real sore spot for me when assumptions are made as to my knowledge level. If you are going to throw up a block of code, at least give a line or two of explanation as to what the code is there for. Don’t just assume I know.
I remember working my way through many tutorials where I was able to complete the task by simply copying the code, but I didn’t have a full grasp of what was happening in the middle. In this lesson, I am going to cover some of the more common text pre-processing steps used in the tm library. I am going to go into some level of detail and make some purposeful mistakes, so hopefully when you are done here you will have a firm grasp on this very important step in the text mining process.
Let’s start by getting our libraries
install.packages("tm") # if not already installed
install.packages("SnowballC")
library(tm)
library(SnowballC)
Now, let’s load our data. For this lesson we are going to use a simple vector.
wordVC <- c("I like dogs!", "A cat chased the dog.", "The dog ate a bone.", "Cats make fun pets.", "The cat did what? Don't tell the dog.", "A cat runs when it has too, but most dogs love running!")
Now let’s put this data into a corpus for text processing.
corpus <- VectorSource(wordVC)
corpus <- Corpus(corpus)
summary(corpus)
Here is the output
Now for a little frustration. Let’s say you want to see what is in text document 4. You could try
inspect(corpus[4])
But this will be your output, not really what you are looking for.
If you want to see the actual text, try this instead:
corpus[[4]]$content
Now you can see the text
As we go through the text pre-processing, we are going to use the following For Loop to examine our corpus
for (i in 1:6) print (corpus[[i]]$content)
Output
Punctuation
Punctuation generally adds no value to text mining when you are using standard numeric-based data mining algorithms like clustering or classification. So it behooves us to just remove it.
To do so, the tm package has a cool function called tm_map() that we can pass arguments to, such as removePunctuation
corpus <- tm_map(corpus, content_transformer(removePunctuation))
for (i in 1:6) print (corpus[[i]]$content)
Note: you do not need the for loop. I am simply running it each time to show you the progress.
Notice all the punctuation is gone now.
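If you want to see the idea without the tm package, here is a minimal base R sketch. It assumes that stripping punctuation amounts to deleting every character in the POSIX [[:punct:]] class; I am not claiming this is exactly what removePunctuation does internally, only that it gives a comparable result on our sample text. The names sample_text and no_punct are made up for this example.

```r
# Base R only; no packages needed.
# sample_text and no_punct are illustrative names, not part of tm.
sample_text <- c("I like dogs!", "The cat did what? Don't tell the dog.")

# [[:punct:]] matches punctuation characters such as ! ? . and '
no_punct <- gsub("[[:punct:]]+", "", sample_text)

print(no_punct)
```

Note that the apostrophe counts as punctuation too, so “Don't” becomes “Dont”. That is worth remembering when you look at your own output.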
Stopwords
Next we are going to get rid of what are known as stopwords. Stopwords are common words such as (the, an, and, him, her). These words are so commonly used that they provide little insight as to the actual meaning of the given text. To get rid of them, we use the following code.
corpus <- tm_map(corpus, content_transformer(removeWords), stopwords("english"))
for (i in 1:6) print (corpus[[i]]$content)
If you look at line 2, “A cat chased dog”, you will see the word “the” has been removed. However, if you look at the next line down, you will notice “The” is still there.
WHY?
Well it comes down to the fact that computers do not treat T and t as the same letter, even though they are. Capitalized letters are viewed by computers as separate entities. So “The” doesn’t match “the” found in the list of stopwords to remove.
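You can check this for yourself in base R. The sketch below uses a tiny, made-up stopword list (stop_list is my own name, and the real list from stopwords("english") is much longer) just to show that string matching is case-sensitive.

```r
# A hypothetical mini stopword list, for illustration only.
stop_list <- c("the", "an", "and")

# String comparison in R is case-sensitive.
is_equal  <- "The" == "the"          # FALSE
upper_hit <- "The" %in% stop_list    # FALSE: "The" is not in the list
lower_hit <- "the" %in% stop_list    # TRUE:  "the" is in the list

print(c(is_equal, upper_hit, lower_hit))
```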
For a full list of R stopwords, go to: https://github.com/arc12/Text-Mining-Weak-Signals/wiki/Standard-set-of-english-stopwords
So how do we fix this?
tolower
Using tm_map with the “tolower” argument will make all the letters lowercase. If we then re-run our stopwords command, you will see that every “the” is gone.
corpus <- tm_map(corpus, content_transformer(tolower))
corpus <- tm_map(corpus, content_transformer(removeWords), stopwords("english"))
for (i in 1:6) print (corpus[[i]]$content)
Output
Stemming
Next we will stem our words. I covered this in the last lesson, but it bears repeating. What stemming does is attempt to reduce variants of a word to a common root. In our example, pay attention to the following words (dog, dogs, cat, cats, runs, running):
corpus <- tm_map(corpus, stemDocument)
for (i in 1:6) print (corpus[[i]]$content)
Notice the words are now (dog, cat, run)
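To look at stemming on its own, outside of a corpus, you can call wordStem() from the SnowballC package we loaded earlier. As far as I know this is the stemmer that stemDocument leans on, but treat that as my assumption. The names words and stems are just for this sketch.

```r
library(SnowballC)  # provides wordStem()

# The word variants from our example text.
words <- c("dog", "dogs", "cat", "cats", "runs", "running")

# wordStem() returns one stem per input word.
stems <- wordStem(words)

# Collapse the duplicates to see the distinct roots.
print(unique(stems))
```

Running the stemmer on a plain character vector like this is a handy way to check how a particular word will be treated before you apply it to your whole corpus.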
Whitespace
Finally, let’s get rid of all this extra white space we have now.
corpus <- tm_map(corpus, stripWhitespace)
for (i in 1:6) print (corpus[[i]]$content)
Output
removeNumbers
I didn’t use this argument with my tm_map() function today because I did not have any numbers in my text. But if I did, the command would be as follows:
corpus <- tm_map(corpus, content_transformer(removeNumbers))
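Since our sample text has no numbers, here is a small base R sketch of the same idea on a made-up sentence. It assumes that removing numbers means deleting every run of digits; with_numbers and no_numbers are names I invented for the example.

```r
# A made-up sentence containing digits (not part of our corpus).
with_numbers <- "I have 2 dogs and 3 cats"

# [[:digit:]] matches the characters 0 through 9.
no_numbers <- gsub("[[:digit:]]+", "", with_numbers)

print(no_numbers)
```

Notice that deleting the digits leaves double spaces behind, so it makes sense to strip whitespace after this step rather than before it.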