R: Text Mining (Term Document Matrix)

There is a bounty of well-known machine learning algorithms, both supervised (decision trees, k-nearest neighbors, logistic regression) and unsupervised (clustering, anomaly detection). The only catch is that these algorithms are designed to work with numbers, not text. Applying numeric data mining methods to text, so that data mining and text mining are used together, is sometimes called duo-mining.

So before you can use these algorithms, you first have to transform text into a format suitable for number-based algorithms. One of the first methods most people learn is creating a Term Document Matrix (TDM). The easiest way to understand this concept is, of course, to work through an example.

Let’s start by installing and loading the required library:

install.packages("tm") # if not already installed
library(tm)

Now let’s create a simple vector of strings.

wordVC <- c("I like dogs", "A cat chased the dog", "The dog ate a bone", 
            "Cats make fun pets")

Now we are going to place the strings into a data type designed for text mining (from the tm package) called a corpus. A corpus simply means the full collection of text you want to work with.

# Wrap the character vector in a source, then build the corpus from it
corpus <- Corpus(VectorSource(wordVC))
summary(corpus)

Output:

  Length Class             Mode
1 2      PlainTextDocument list
2 2      PlainTextDocument list
3 2      PlainTextDocument list
4 2      PlainTextDocument list

As you can see from the summary, the corpus stores each string in the vector as a PlainTextDocument.
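If you want to peek inside a corpus rather than just summarize it, here is a minimal sketch. Note that it uses VCorpus() instead of Corpus(); that swap is my assumption, made because VCorpus() always stores each document as a PlainTextDocument, and the variable names (vc, n_docs, and so on) are just for illustration.

```r
library(tm)

# Same toy sentences as in the walkthrough
wordVC <- c("I like dogs", "A cat chased the dog", "The dog ate a bone",
            "Cats make fun pets")

# VCorpus() is an assumption here (the post itself calls Corpus());
# it keeps every document as a PlainTextDocument
vc <- VCorpus(VectorSource(wordVC))

n_docs     <- length(vc)         # number of documents in the corpus
second_doc <- content(vc[[2]])   # raw text of the second document
doc_class  <- class(vc[[1]])     # class vector of a single document

print(n_docs)
print(second_doc)
print(doc_class)
```

Double brackets pull out a single document, and content() returns its text, which is handy for checking that nothing was mangled on the way in.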

Now let’s create our first Term Document Matrix:

TDM <- TermDocumentMatrix(corpus)
inspect(TDM)

The output of inspect() gives us a numeric representation of our text. Each row represents a word (a term) in our text and each column represents an individual sentence (a document). So if you look at the row for dog, you can see it appears once each in sentences 2 and 3 (0 1 1 0), while bone appears once, in sentence 3 only (0 0 1 0).
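To make the row and column layout concrete, here is a rough base R sketch that builds the same kind of counts by hand, without tm. The lowercasing and the rule of dropping words shorter than three characters are my assumptions about tm's defaults, and manual_tdm is just an illustrative name.

```r
# Toy sentences from the walkthrough
docs <- c("I like dogs", "A cat chased the dog", "The dog ate a bone",
          "Cats make fun pets")

# Lowercase each sentence and split it on whitespace
tokens <- strsplit(tolower(docs), "\\s+")

# Keep words of three or more characters (assumed to mirror tm's default)
tokens <- lapply(tokens, function(w) w[nchar(w) >= 3])

# Vocabulary: the sorted set of distinct words across all sentences
terms <- sort(unique(unlist(tokens)))

# Count every vocabulary word in every sentence:
# one row per term, one column per sentence
manual_tdm <- sapply(tokens, function(w) table(factor(w, levels = terms)))
colnames(manual_tdm) <- seq_along(docs)

print(manual_tdm["dog", ])    # 0 1 1 0
print(manual_tdm["bone", ])   # 0 0 1 0
```

This is only meant to show where the numbers come from. tm does the same bookkeeping for you and stores the result as a sparse matrix.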

Stemming

One issue that jumps out immediately is that R now treats cat and cats, and dog and dogs, as different words, when really cats and dogs are just the plural forms of cat and dog. There may be more advanced text mining applications where you would want to keep the two forms separate, but in most basic applications you would want to keep only one version of the word.
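You can confirm the duplication yourself with a quick check. This is a small sketch that rebuilds the matrix from scratch so it runs on its own; all_terms and plural_pairs are illustrative names of mine.

```r
library(tm)

wordVC <- c("I like dogs", "A cat chased the dog", "The dog ate a bone",
            "Cats make fun pets")

TDM <- TermDocumentMatrix(Corpus(VectorSource(wordVC)))

# Terms() returns the row labels (the vocabulary) of the matrix
all_terms <- Terms(TDM)

# Singular and plural forms each get their own row
plural_pairs <- c("cat", "cats", "dog", "dogs")
print(plural_pairs %in% all_terms)
```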

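Luckily for us, R makes that simple. Use the function tm_map with the argument stemDocument (stemDocument relies on the SnowballC package, so install it first if it is not already installed):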

corpus2 <- tm_map(corpus, stemDocument)

Now make a new TDM:

TDM2 <- TermDocumentMatrix(corpus2) 
inspect(TDM2)

Now only the singular forms, cat and dog, appear in our list of terms.
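If you are curious what the stemmer does to individual words, here is a small sketch that calls the SnowballC package directly. It assumes SnowballC is installed, and the word list is just an illustration.

```r
# Assumes SnowballC is installed: install.packages("SnowballC")
library(SnowballC)

words <- c("cats", "cat", "dogs", "dog", "chased")

# wordStem() applies the Porter stemmer to each word in the vector
stems <- wordStem(words, language = "porter")

# Show each word next to its stem
print(data.frame(word = words, stem = stems))
```

Both cats and cat collapse to the same stem, and likewise dogs and dog, which is why they end up sharing a single row in the new TDM.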

If you would like, you can also work with the transpose of the TDM, called the Document Term Matrix (DTM).

dtm <- t(TDM2)
inspect(dtm)
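The DTM layout (one row per document, one column per term) is the shape most modeling functions expect. Here is a hedged sketch of turning it into an ordinary data frame. To keep it free of extra dependencies I skip the stemming step, so this is the unstemmed matrix, and dtm_mat and dtm_df are illustrative names.

```r
library(tm)

wordVC <- c("I like dogs", "A cat chased the dog", "The dog ate a bone",
            "Cats make fun pets")

corpus <- Corpus(VectorSource(wordVC))

# Transpose the TDM to get documents as rows and terms as columns
dtm <- t(TermDocumentMatrix(corpus))

# as.matrix() converts the sparse DTM into a regular dense matrix,
# which as.data.frame() can then turn into a data frame
dtm_mat <- as.matrix(dtm)
dtm_df  <- as.data.frame(dtm_mat)

print(dim(dtm_df))
print(dtm_df[, c("dog", "bone")])
```

Each term is now a numeric column, so the data frame can be handed to other R functions like any other table. Be careful with as.matrix() on a large corpus, since the dense version can get very big.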

I’ll get deeper into more pre-processing tasks, as well as ways to work with your TDM, in future lessons. But for now, practice making TDMs and see if you can think of ways to use TDMs and DTMs with some machine learning algorithms you might already know (decision trees, logistic regression).

Analytics4All
