There is a bounty of well-known machine learning algorithms, both supervised (Decision Tree, K Nearest Neighbor, Logistic Regression) and unsupervised (clustering, anomaly detection). The only catch is that these algorithms are designed to work with numbers, not text. The act of using numeric-based data mining methods on text is known as duo-mining.
So before you can use these algorithms, you first have to transform text into a format they can work with. One of the most popular methods people first learn is how to create a Term Document Matrix (TDM). The easiest way to understand this concept is, of course, to work through an example.
Let’s start by loading the required library:
install.packages("tm") # if not already installed
library(tm)
Now let’s create a simple vector of strings.
wordVC <- c("I like dogs", "A cat chased the dog", "The dog ate a bone", "Cats make fun pets")
Now we are going to place the strings into a data type designed for text mining (from the tm package) called a corpus. A corpus simply means the full collection of text you want to work with.
corpus <- VectorSource(wordVC) # treat each string in the vector as one document
corpus <- Corpus(corpus) # build the corpus from that source
summary(corpus)
  Length Class             Mode
1 2      PlainTextDocument list
2 2      PlainTextDocument list
3 2      PlainTextDocument list
4 2      PlainTextDocument list
As you can see from the summary, the corpus classified each string in the vector as a PlainTextDocument.
Now let’s create our first Term Document Matrix:
TDM <- TermDocumentMatrix(corpus)
inspect(TDM)
As you can see, we now have a numeric representation of our text. Each row represents a word in our text and each column represents an individual sentence. So if you look at the word dog, you can see it appears once each in sentences 2 and 3 (0110), while bone appears once, in sentence 3 only (0010).
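If you want to see where those counts come from, here is a minimal base R sketch that builds the same kind of matrix by hand, without tm. The name manual_tdm is made up for this illustration, and the rule of dropping words shorter than three characters is an assumption meant to mirror tm's default behavior, so treat it as a rough picture rather than what TermDocumentMatrix does internally.

```r
# Same four sentences used above
wordVC <- c("I like dogs", "A cat chased the dog",
            "The dog ate a bone", "Cats make fun pets")

# Lowercase each sentence and split it on whitespace
tokens <- strsplit(tolower(wordVC), "\\s+")

# Assumption: keep only words of 3+ characters, mirroring tm's default
tokens <- lapply(tokens, function(w) w[nchar(w) >= 3])

# Vocabulary: every distinct word, sorted alphabetically
terms <- sort(unique(unlist(tokens)))

# Count each term in each sentence: rows are terms, columns are sentences
manual_tdm <- vapply(
  tokens,
  function(w) as.integer(table(factor(w, levels = terms))),
  integer(length(terms))
)
dimnames(manual_tdm) <- list(Terms = terms,
                             Docs = as.character(seq_along(wordVC)))

print(manual_tdm)
```

In this hand-built version, the dog row reads 0 1 1 0 and the bone row reads 0 0 1 0, matching the counts described above.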
One immediate issue that jumps out at me is that R now sees cat and cats, and dog and dogs, as different words, when really cats and dogs are just the plural versions of cat and dog. There may be some more advanced applications of text mining where you would want to keep the two forms separate, but in most basic text mining applications, you would want to keep only one version of the word.
Luckily for us, R makes that simple. Use the function tm_map with the argument stemDocument:
# stemDocument relies on the SnowballC package; install it if you get an error
corpus2 <- tm_map(corpus, stemDocument)
Now make a new TDM:
TDM2 <- TermDocumentMatrix(corpus2)
inspect(TDM2)
Now you can see that only the singular forms of cat and dog exist in our list.
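To get a feel for what stemming does to the word list, here is a deliberately crude sketch in base R. The function naive_stem is hypothetical and is not the algorithm stemDocument uses; it only strips a single trailing "s" from longer words, which is enough to show how plural and singular forms collapse into one term.

```r
# Crude stand-in for stemming: drop one trailing "s" from words
# longer than 3 letters. This is NOT what stemDocument does internally;
# it only illustrates the idea of merging word variants.
naive_stem <- function(words) {
  ifelse(nchar(words) > 3, sub("s$", "", words), words)
}

words <- c("cat", "cats", "dog", "dogs", "pets")
stemmed <- naive_stem(words)

print(stemmed)          # each word after the crude stemming rule
print(unique(stemmed))  # the distinct terms that remain
```

A real stemmer handles far more than plurals, which is why it is worth letting stemDocument do the work in practice.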
If you would like, you can also work with the transpose of the TDM, called the Document Term Matrix (DTM).
dtm <- t(TDM2) # rows become sentences, columns become words
inspect(dtm)
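The effect of the transpose is easy to see on a tiny plain matrix. The sketch below uses only base R and the dog and bone counts from earlier; the names tdm_small and dtm_small are made up for this illustration.

```r
# Two rows taken from the example above: terms as rows, sentences as columns
tdm_small <- matrix(c(0, 1, 1, 0,
                      0, 0, 1, 0),
                    nrow = 2, byrow = TRUE,
                    dimnames = list(Terms = c("dog", "bone"),
                                    Docs = as.character(1:4)))

# Transposing swaps the roles: sentences become rows, terms become columns
dtm_small <- t(tdm_small)

print(dim(tdm_small))  # terms by sentences
print(dim(dtm_small))  # sentences by terms
print(dtm_small)
```

The DTM layout, with one row per sentence, is usually the more convenient one when each sentence is treated as an observation and each word as a feature.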
I’ll get deeper into more pre-processing tasks, as well as ways to work with your TDM, in future lessons. But for now, practice making TDMs and see if you can think of ways to use TDMs and DTMs with some machine learning algorithms you might already know (decision trees, logistic regression).
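As one possible starting point, here is a small base R sketch that treats each row of a DTM as a numeric feature vector and measures how far apart the sentences are, which is the building block behind K Nearest Neighbor. The matrix holds hand-entered counts for three stemmed terms from the example sentences, and the names dtm_small and nearest are made up for this illustration.

```r
# Hand-entered counts for three stemmed terms (rows = sentences)
dtm_small <- matrix(c(0, 1, 0,
                      1, 1, 0,
                      0, 1, 1,
                      1, 0, 0),
                    nrow = 4, byrow = TRUE,
                    dimnames = list(Docs = as.character(1:4),
                                    Terms = c("cat", "dog", "bone")))

# Euclidean distance between every pair of sentences
d <- as.matrix(dist(dtm_small))

# Ignore each sentence's zero distance to itself
diag(d) <- Inf

# For each sentence, the index of its closest other sentence
# (which.min keeps the first one when there is a tie)
nearest <- apply(d, 1, which.min)
print(nearest)
```

With only four short sentences there are several ties, so the result is not very meaningful, but the same steps carry over to a larger corpus.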