Ensemble Modeling

In the world of analytics,modeling is a general term used to refer to the use of data mining (machine learning) methods to develop predictions. If you want to know what ad a particular user is more likely to click on, or which customers are likely to leave you for a competitor, you develop a predictive model.

There are a lot of models to choose from: Regression, Decision Trees, K Nearest Neighbor, Neural Nets, etc. They all will provide you with a prediction, but some will do better than others depending on the data you are working with. While there are certain tricks and tweaks one can do to improve the accuracy of these models, it never hurts to remember the fact that there is wisdom to be found in the masses.

The Jelly Bean Jar

I am sure everyone has come across some version of this in their life: you are at a fair or school fund raising event and someone has a large see-through jar full of jelly beans (or marbles or nickles). Next to the jar are some slips of paper with the instructions to “Guess the number of jelly beans in the jar you win!”

An interesting thing about this game, and you can try this out for yourself, is that given a reasonable number of participants, more often than not, the average guess of the group will perform better than the best individual guesser. Or in other words, imagine there are 200 jelly beans in the jar and the best guesser (the winner) guesses 215. More often than not, the average of all the guesses might be something like 210 or 190. The group cancels out their over and under guessing, resulting in a better answer than anyone individual.

How Do We Get the Average in Models?

There are countless ways to do it, and researchers are constantly trying new approaches to get that extra 2% improvement over the last model. For ease of understanding though, I am going to focus on 2 very popular methods of ensemble modeling : Random Forests & Boosted Trees.

2016-11-22_22-05-26

Random Forests:

Imagine you have a data set containing 50,000 records. We will start by randomly selecting 1000 records and creating a decision tree from those records. We will then put the records back into the data set and draw another 1000 records, creating another decision tree. The process is repeated over and over again for a predefined number of iterations (each time the data used is returned to the pool where it could possibly be picked again).

After all the sample decision trees have been created (let’s say we created 500 for the sake of argument), the model then takes the mean or average of all the models if you are looking at a regression or the mode of all the models if you are dealing with a classification.

For those unfamiliar with the terminology, a regression model looks for a numeric value as the answer. It could be the selling price of a house, a person’s weight, the price of a stock, etc. While a classification looks for classifying answers: yes or no, large – medium – small, fast or slow, etc.

Boosted Trees:

Another popular method of ensemble modeling is known as boosted trees. In this method, a simple (poor learner) model tree is created – usually 3-5 splits maybe. Then another small tree (3-5 splits) is built by using incorrect predictions for the first tree. This is repeated multiple times (say 50 in this example), building layers of trees, each one getting a little bit better than the one before it. All the layers are combined to make the final predictive model.

Oversimplified?

Now I know this may be an oversimplified explanation, and I will create some tutorials on actually building ensemble models, but sometimes I think just getting a feel for the concept is important.

So are ensemble models always the best? Not necessarily.

One thing you will learn when it comes to modeling is that no one method is the best. Each has their own strengths. The more complex the model, the longer it takes to run, so sometimes you will find speed outweighs desire for the added 2% accuracy bump. The secret is to be familiar with the different models, and to try them out in different scenarios. You will find that choosing the right model can be as much of an art as a science.

The Monty Hall Problem

The Monty Hall problem is an interesting exercise in conditional probability. It focuses on a 1970’s American television show called Let’s Make a Deal hosted by television personality Monty Hall.

The game would end with a contestant being shown 3 doors. Behind one of those doors, there was a prize. Behind the other 2, a goat. Monty Hall would ask the contestant to pick a door. He would then open the one of the two remaining doors showing the contestant a goat. Now with 2 doors remaining, he would ask the contestant if they wanted to change their selection. The question was – should they?

Let’s look at it. In the beginning of the game, we have 3 closed doors. We know 1 of these 3 doors has a prize behind it, the other 2 do not. So the probability of picking the right door at this point is 1/3.

monty.jpg

Let’s say you chose door number 1. Once you chose, Monty Hall opens door number 3 and shows you the prize is not behind that door. He then asks if you would like to choose door number 2 instead.

So what do you think the probability the prize is behind door number 2? You have 2 doors left, so you might think it would 1/2. Well, you would be wrong.

monty1.jpg

The reality is, the probability that the prize is behind door number 2 is actually 2/3.

Don’t believe me? Don’t worry, you aren’t alone. When Marilyn vos Savant wrote about this in her newspaper column, outraged mathematics professors from all over wrote her letters protesting her faulty logic. Her response was simple, play the game and see what results you come up with.

The trick to this neat little math problem is the unique constraints we are faced with. We start with 3 doors, from which we must pick 1. Giving us a the already agreed upon probability of 1/3.

Next, one of the two remaining options is removed the equations leaving only the door you have picked and 1 other door. Common sense would tell you that you now have a 50 – 50 situation where either door gives you the same odds of winning.

But common senses fails you. If you stick to your original choice, you will have a 1/3 probability of winning… but if you change your choice to the remaining door, your probability now shifts to 2/3.

Let’s see it in action.

Below is the layout of the three possible outcomes of the game.

2017-01-12_19-39-35.jpg

For the sake of argument, let us say that we have decided to choose Door 1 to start the game.

2017-01-12_19-41-10.jpg

After we have made that choice, one of the remaining doors with a goat behind it will be revealed.

2017-01-12_19-44-47.jpg

So, now we are left with the option to stick with door 1 or try the remaining door.

If we stick to our original door, we will win 1 out of the three possible games.

2017-01-12_19-47-16.jpg

If, however, you change your door after the reveal, you will win 2 times out of the 3 possible games.

2017-01-12_19-47-38.jpg

 

So what is the take away here?

If you ever find yourself on Let’s Make a Deal, always change your door after the reveal.

More importantly, if you are working with data for a living, make sure you have some healthy respect for the math behind statistics and probability. The Monty Hall Problem is, at its very heart, a conditional probability problem. Issues like this and Simpons’ Paradox can lead you to develop what appears to be logically sound models that are sure to fail once they come face to face with reality.

There is a lot of great software packages out there that make building machine learning models as simple as point and click. But without someone with some healthy skepticism as well as a firm handle on the math, many of these models are sure to fail.

 

Simpson’s Paradox: How to Lie with Statistics

We’ve all heard the saying from Benjamin Disreali, “Lies, damn lies, and statistics.”

While statistics has proven to be of great benefit to mankind in almost every endeavor, inexperienced, sloppy, and downright unscrupulous statisticians have made some pretty wild claims. And because these wild claims are often presented as statistical fact, people in all industries – from business, to healthcare, to education -have chased these white elephants right down the rabbit hole.

Anyone who has taken even an introductory statistics course can tell you how easily statistics can be misrepresented. One of my favorite examples involves using bar charts to confuse the audience. Look at the chart below. It represents the number of games won by two teams in a season of beer league softball.

2017-01-10_10-28-21

At first glance, you might think Team B won twice as many games as Team A, and that is indeed the intention of the person who made this chart. But when you look at the numbers to the left, you will see Team A won 15 games to Team B’s 20. While I am no mathematician, even I know 15 is not half of 20.

This deception was perpetrated by simply adjusting the starting point of the Y – Axis. When you reset it to 0, the chart tells a different story.

2017-01-10_10-31-54.jpg

Even Honest People Can Lie by Accident

In the example above, the person creating the chart was manipulating the data on purpose to achieve a desired effect. You may look at this and say I would never deceive people like that, the truth is – you just might do it by accident.

What do I mean? Let’s take an example from an industry fraught with horrible statistics – our education system.

Below you will find a chart depicting the average math scores on a standardized test since 2000 for Happy Town, USA. You will notice the test scores are significantly lower now than they were back in 2000.

2017-01-10_10-40-48

What does this mean? Are the kids getting stupider? Has teacher quality gone down? Who should be held accountable for this? Certainly those lazy tenured teachers who are only there to collect their pensions and leach off the tax payers.

I mean look at the test scores. The average score has dipped from around 90 to close to 70. Surely something in the system is failing.

Now what if I were to tell you that the chart above – while correct – does not tell the whole story. Test scores in Happy Town, USA are actually up – if you look at the data correctly.

What we are dealing with is something known in statistics as Simpson Paradox, and even some of the brightest academic minds have published research that ignored this very important concept.

What do I mean?

Let me tell you the whole story about Happy Town, USA. Happy Town was your average American middle class town. The economic make-up of this town in 2000 was 20% of the families made over $150K, 60% made between $150K and $50K, with 20% earning less than $50K a year.

In 2008, that all changed. The recession hit causing people to lose their jobs and default on their mortgages. Families moved out, housing prices fell. Due to the new lower housing prices, families from Non-So Happy Town, USA were able to afford houses in Happy Town. They moved their families there in hopes of a better education and better life for their children.

While the schools in Happy Town were better, the teachers were not miracle workers. These kids from Not So Happy Town did not have the strong educational foundation the pre-recession residents of Happy Town did. Many teachers found themselves starting almost from scratch.

No matter how hard these new kids and their teachers tried, they could never be expected to jump right in and perform as well as the pre-2008 Happy Town kids. The economic makeup of the town shifted. The under $50K’s now represent 60% of the town’s population, with 150K-50K making up only 30% and the top earners dwindling down to 10%.

So while taking an average of all the students is not a sign of someone necessarily trying to pull the wool over your eyes, it does not tell the whole story.

To see the whole story, and to unravel Simpson Paradox, you need to look at the scores across the different economic sectors of this town which has undergone drastic changes.

2017-01-10_11-26-34

Looking at from the standpoint of economic sector, you will see the scores in each sector have improved. With the under $50K improving at an impressive rate. Clearly the teachers and staff at Happy Town School are doing their job and then some.

So while the person who took the average of the whole school may not have intended to lie with their statistics, a deeper dive into the numbers showed that the truth was hidden inside the aggregate.

Keep this in mind next time someone shows you falling SAT scores, crime stats, or disease rates. All of these elements are easily affected by a shift in demographics. If you don’t see the breakdown, don’t believe the hype.

SQL: Case Statement

The Case Statement is SQL’s version of IF/THEN logic statements.

In this example, I am using the AdventureWorks2012 database, which is free to download. You do not need this to follow the lesson, but if you want to play along, follow this link: SQL: SELECT Statement

 

Let’s start by looking at our data, using the following Select Statement:

SELECT 
 [NationalIDNumber]
 ,[LoginID]
 ,[JobTitle]
 ,[BirthDate]
 ,[MaritalStatus]
 ,[Gender]
 ,[HireDate] 
 FROM [AdventureWorks2012].[HumanResources].[Employee]

Here is the output

sqlagg3

Now what if you don’t really care about the particular employee birthday, you simply want to know who is over 40 and who is 40 or younger.

To do that, we need a mechanism to determine an age above or below 40 from the birth date. In most programming languages, you would simply use an If / Then statement. In SQL, we need to use a Case Statement.

The Case Syntax is simple. Let’s assume a variable A, set to either 1 or 0 representing True or False.

Case
     when A = 1 Then "True"
     when B = 0 Then "False"
End

Note the statement is started with the keyword Case and closed with the keyword End.

In the middle, the logic is controlled using When and Then. When <condition> Then  <Action>

If needed, you can append an Else to your statement to cover conditions not specified in your statements.

Case
     when A = 1 Then "True"
     when B = 0 Then "False"
     else "Neither" 
End

 

 

Now, let’s apply it to our case above. Notice the Case statement is put right into my Select statement just as if it were a common field.

select loginID,
 [NationalIDNumber]
 ,[LoginID]
 ,[JobTitle]
 ,
 case 
    when BirthDate < '1977-01-01' then 'Over40'
    when BirthDate >= '1977-01-01' then 'Under40'
 end
 ,[MaritalStatus]
 ,[Gender]
 ,[HireDate]
 FROM [AdventureWorks2012].[HumanResources].[Employee]

Here are my results. You will note a new column with values Over40 or Under40

2017-01-09_11-29-42.jpg

But what does that column mean? Let’s name it to make it clear.  After the End keyword add as Over40

 select loginID,
 [NationalIDNumber]
 ,[LoginID]
 ,[JobTitle],
 case 
        when BirthDate < '1977-01-01' then 'Over40'
         when BirthDate >= '1977-01-01' then 'Under40'
 end as Over40,
 [MaritalStatus]
 ,[Gender]
 ,[HireDate]
 FROM [AdventureWorks2012].[HumanResources].[Employee]

Here, now the column has a name.

2017-01-09_11-32-09.jpg

 

 

Feedback Loops in Predictive Models

Predictive models are full of perilous traps for the uninitiated. With the ease of use of some modeling tools like JMP or SAS, you can literally point and click your way into a predictive model. These models will give you results. And a lot of times, the results are good. But how do you measure the goodness of the results?

I will be doing a series of lessons on model evaluation. This is one of the more difficult concepts for many to grasp, as some of it may seem subjective. In this lesson I will be covering feedback loops and showing how they can sometimes improve, and other times destroy, a model.

What is a feedback loop?

A feedback loop in modeling is where the results of the model are somehow fed back into the model (sometimes intentionally, other times not). One simple example might be an ad placement model.

Imagine you built a model determining where  on a page to place an ad based on the webpage visitor. When a visitor in group A sees an ad on the left margin, he clicks on it. This click is fed back into the model, meaning left margin placement will have more weight when selecting where to place the ad when another group A visitor comes to your page.

This is good, and in this case – intentional. The model is constantly retraining itself using a feedback loop.

When feedback loops go bad…

Gaming the system.

Build a better mousetrap.. the mice get smarter.

Imagine a predictive model  developed to determine entrance into a university. Let’s say when you initially built the model, you discovered that students who took German in high school seemed to be better students overall. Now as we all know, correlation is not causation. Perhaps this was just a blip in your data set, or maybe it was just the language most commonly offered at the better high schools. The truth is, you don’t actually know.

How can this be a problem?

Competition to get into universities (especially highly sought after universities) is fierce to say the least. There are entire industries designed to help students get past the admissions process. These industries use any insider knowledge they can glean, and may even try reverse engineering the admissions algorithm.

The result – a feedback loop

These advisers will learn that taking German greatly increases a student’s chance of admission at this imaginary university. Soon they will be advising prospective students (and their parents) who otherwise would not have any chance of being accepted into your school, to sign up for German classes. Well now you have a bunch of students, who may no longer be the best fit, making their way past your model.

What to do?

Feedback loops can be tough to anticipate, so one method to guard against them is to retrain your model every once in a while. I even suggest retooling the model (removing some factors in an attempt to determine if a rogue factor – i.e. German class, is holding too much weight in your model).

And always keep in mind that these models are just that – models. They are not fortune tellers. Their accuracy should constantly be criticized and methods questioned. Because while ad clicks or college admissions are one thing, policing and criminal sentencing algorithms run the risk of being much more harmful.

Left unchecked, the feedback loop of a predictive criminal activity model in any large city in the United States will almost always teach the computer to emulate the worst of human behavior – racism, sexism, and class discrimination.

Since minority males from poor neighborhoods dis-proportionally make up our current prison population, any model that takes race, sex, and economic status into account will inevitably determine a 19 year old black male from a poor neighborhood is a criminal. We will have then violated the basic tenant of our justice system – innocent until proven guilty.

 

Factor Analysis: Picking the Right Variables

Factor Analysis, what is it?

In layman’s terms, it means choosing which factors (variables) in a data set you should use for your model. Consider the following data set:

2017-01-03_19-43-48.jpg

In the above example, the columns (highlighted in light orange) would be our Factors. It can be very tempting, especially for new data science students, to want to include as many factors as possible. In fact, as you add more factors to a model, you will see many classic statistical markers for model goodness increase. This can give you a false sense of trust in the model.

The problem is, with too many poorly chosen factors, you model is almost guaranteed to under perform. To avoid this issue, try approaching a new model with the idea of minimizing factors, only using the factors that drive the greatest impact.

It may seem overwhelming at first. I mean where do you start? Looking at the list above, what do you get rid of? Well, for those who really love a little self torture, there are entire statistics textbooks dedicated to factor analysis. For the rest of us, consider some of the following concepts. While not an exhaustive list, these should get you started in the right direction.

Collinearity

In terms of regression analysis, collinearity concerns itself with factors that have strong correlations with each other. In my example above, think Height and Weight. In general, as Height increases so does Weight. You would expect a 6’4 senior to easily outweigh a 4’11 freshman. So as one factor (Height) increases or decreases (Weight) follows in kind. Correlations can also be negative with one factor decreasing as another  factors increases or visa versa.

2017-01-03_19-57-58.jpg

The problem with these factors is that when used in a model, they tend to amplify their effect. So the model is skewed placing too much weight on what is essentially a single factor.

So what do you do about it?

Simply enough, in cases like this. You pick one. Height or Weight will do. In more complex models you can use mathematical techniques like Singular Value Decomposition (SVD), but I won’t cover that in this lesson.

I am also not going to cover any of the methods for detecting collinearity in this lesson, I will be covering those in further lessons. But it should be noted that a lot of times domain knowledge is really all you need to spot it. It doesn’t take a doctor to realize that taller people are generally heavier.

But wait…

I know what you are thinking, what about the 250 lb 5’1 kid or the 120 lb 6’2 kid? Well if you have enough of these outliers in your data and you feel that being over or under weight is an important variable to consider, I would recommend using a proxy. In this case, you could substitute BMI (body mass index – a calculation based on height and weight) to replace both height and weight as factors.

Stepwise

Stepwise regression is a method for determining which factors provide value to the model. The way it works (in the most basic definition I can offer) is you run your regression model with all your factors, removing the weakest factor each time (based on statistical evaluation methods like R^2 values and P values). This is done repeatedly until only high value factors are left in the model.

 

NEXT….Not “technically Factor Analysis” but can be useful in removing bad factors from your model.

 

Binning or Categorizing Data

Let’s say, looking at the data example above, our data covered all grades from 1-12. What if you want to look a kids in a two year period. You would want to bin the data into equal groups of 2: 1-2,3-4,5-6,7-8,9-10,11-12. You can now analyze the data in these blocks.

What if you wanted to measure the effectiveness of certain schools in the system. You might be wise to categorize the data. What that means is, we will take grades 1-6 and place them in one category (elementary), 7-8 in another(middle school), 9-12(high school).

Incomplete Data

Imagine a factor called household income. This is a field that very likely may not be readily answered by parents. If there are only a few missing fields, some algorithms won’t be too affected, but if there are a lot, say 5%, you need to do something about it.

What are you options?

You could perform a simple mean or median replacement for all missing values, or try to calculate a best guess based on other factors. You could delete records missing this value. Or, as I often do, just toss this factor away. Most likely any value this adds to your model is going to be questionable at best. Don’t fall for the Big Data more is always better trap. Sometimes simplicity wins out in the end.

Outliers and Erroneous Data

Outliers can really skew you model, but even worse, erroneous data can make you model absolutely worthless. Look out for outliers, question strange looking data. Unless you can come up with a real good reason why these should stay in your model, I say chuck the records containing them.

 

 

 

 

Python 2.xx VS 3.xx

If you are trying to learn Python, especially for Data Science, you are going to come across a bunch a people (myself included) who have been very hesitant to move from Python 2.xx to Python 3.xx.

The reason?

Not Backwards Compatible

Well, when Python 3 first came out, it was made very clear that it was not backwards compatible. While most of the code remained the same, there were some changes that made programs coded in Python 2.xxx fail. A prime example is Print. In Python 2,

Print ‘Hello World!’

was perfectly acceptable, but in Python 3, it fails and throws up an exception. Python 3 had added the requirement for () with print statements

Print(‘Hello World’)   — Python 3 friendly

Well I have a whole bunch of code sitting on my hard drive I like to refer to, and having to go through it cleaning up all the new changes did not exactly seem like the best use of my time, since I am able to just continue using Python 2.xxx.

Libraries

Python is so great for Data Science because of the community of libraries out there providing the data horsepower we all love. (Pandas, Numpy, SciKit Learn, ect).  Well guess what, the changes to Python 3 made some of these libraries unstable. Strike 2, another reason not to waste my time with this new version.

Moving On

Well, it appears that enough time has passed and some people tell me all the bugs have been worked out with Python 3. While Python 2.7 will be supported til 2020 (those who love it, just keep using it I say), I have decided to try putting Python 3 through it’s paces.

My lessons will be reviewed one by one, with the 2.7 code being tested in a 3.4 environment. I will note any changes that need to made to the code to make it 3.4 compatible and add them to the  lesson. Each lesson that has been reviewed will have the following heading.

*Note: This lesson was written using Python 2.xx. If you are using Python 3.xxx any changes to the code will be annotated under headings: Python 3.xxx

2.7??

If you love 2, just keep using it. You’ve got 3 more years. By that point, whatever revision of 3 we are on may not even look like the current version. But if you want to look ahead, follow along with me as I update my code. (Note, all the original 2.7 code will remain on my site until the time that it is no longer supported)

Qlik Sense: Sorting by Expression

This can be a very frustrating problem for many Qlik users.

2016-12-29_10-49-34.jpg

I have a list of the days of the week, but as you can see they are not in order. Unfortunately, sorting them alphabetically will not help your situation either. The solution is to Sort by Expression

Go into to edit mode and click on the object in question (in this case a Filter). Now click on Sorting in the right side bar and unclick Auto Sorting

2016-12-29_10-51-26.jpg

Uncheck Sort numerically and Sort alphabetically, and check Sort by expression

2016-12-29_10-51-59.jpg

Click on the Expression box and the formula window opens.

2016-12-29_10-52-30.jpg

The formula we are going to use is the Match() formula:

Syntax: match(field, condition 1, condition 2, etc)

2016-12-29_10-53-26.jpg

Hit apply and check your Filter box.

2016-12-29_10-53-39.jpg

 

QLIK Sense: If Then Conditional Logic

One place Qlik really shines in my opinion is in their data modeling. While Qlik Sense brought user friendliness to the mix by allowing automatic data model creation, it still allows almost unlimited customization to those looking to peek under the hood.

In today’s example, I am going show you how to utilize a simple IF THEN statement in the data model to make building your visualizations much easier.

You can download the practice data set here: qlikalarms

This data set is made up data modeled off of cardiac patient alarms. Let’s look at our data:

As you can see, we have 5 columns. Count is just a distinct number for each row, day = day of the week, alarm trim down1 = the alarm condition, Alarm Class = severity of alarm – Red is the most severe with INOP being the least, timeday = time of day the event occurred.

2016-12-28_09-54-14.jpg

The Task

The task at hand here is simple enough. We want to analyze alarms on the weekend versus weekday.

Loading the data in Qlik is easy enough. If you have any questions on how to do it, refer to my earlier tutorial on building a Dashboard: QLIK: Build a Dashboard Part 1

Once your data is loaded:

Go to the App Overview. Select your new sheet, hit Edit and grab a Bar Chart from the left utility bar. We are going to set the Dimension to Alarm Class and the Measure to Count of alarm trim down1

2016-12-28_09-42-49

2016-12-28_10-19-45

This will give you the following Bar Chart

2016-12-28_09-43-38

Now grab a Filter Pane and drag it onto the sheet. Set its dimension to day.

Now by selecting days from the filter pane you can effectively compare weekdays to weekends. However, it involves a lot of unnecessary clicking for the end user. Let’s try a better method.

2016-12-28_10-10-15.jpg

Let’s go to the Data load editor

2016-12-28_09-25-05

click on the Auto-generated section in the left pane.

2016-12-28_09-25-24.jpg

Next click on the Unlock box in the upper right corner. You will be met with a warning window. Just click Okay.

2016-12-28_09-25-39

Let’s take a look at the Load script. This script was auto-generated by Qlik when you uploaded the Excel file. Note that it looks similar to an SQL script. We are going to LOAD the columns listed below FROM the Excel workbook.

2016-12-28_12-17-24

What we are going to do next is add a new line to the loading script. This line will be an IF THEN statement.

The syntax is as a follows: if ( conditional statement, THEN, ELSE) as [NAME FOR NEW COLUMN]

In our example I am stating if day is equal to ‘Saturday’ or ‘Sunday’ then 1 else 0 and I am naming this new column Weekend

*** note the , at the end of [timeday]. Make sure you add that there. Qlik will throw an error if the correct syntax is not used.

2016-12-28_12-18-21.jpg

Now select Load data. If successful, go back to your App Overview > edit sheet. If not successful, check your syntax!!

2016-12-28_12-25-20.jpg

Let’s replace the day filter pane with a Weekend filter pane

2016-12-28_09-44-17

Now you can compare weekdays to weekends with just a single click. 1 for weekends and 0 for weekdays.

This still is not ideal. The goal of a good BI solution is usability. The end user should be able to dig into their data without having to spend too much time trying to decipher what built.

Let’s correct this. Go back to data load editor. Let’s change our IF THEN statement to read if day = Saturday or Sunday, then ‘Weekend’, else ‘Weekday’

2016-12-28_12-17-45

Click Load data and go back to your sheet. Notice the filter pane now shows Weekday and Weekend as your options.

2016-12-28_09-45-31

 

 

 

 

R: Text Mining (Pre-processing)

This is part 2 of my Text Mining Lesson series. If you haven’t already, please check out part 1 that covers Term Document Matrix: R: Text Mining (Term Document Matrix)

Okay, now I promise to get to the fun stuff soon enough here, but I feel that in most tutorials I have seen online, the pre-processing of text is often glanced over. It was always (and often still is) a real sore spot for me when assumptions are made as to my knowledge level. If you are going to throw up a block of code, at least give a line or two explanation as to what the code is there for, don’t just assume I know.

I remember working my way through many tutorials where I was able to complete the task by simply copying the code, but I didn’t have a full grasp of what was happening in the middle. I this lesson, I am going to cover some of the more common text pre-processing steps used in the TM library. I am going to go into some level of detail and make some purposeful mistakes so hopefully when you are done here you will have a firm grasp on this very important step in the text mining process.

Let’s start by getting our libraries

install.packages("tm") # if not already installed
install.packages("SnowballC")

library(tm)
library(SnowballC)

Now, let’s load our data. For this lesson we are going to use a simple vector.

wordVC <- c("I like dogs!", "A cat chased the dog.", "The dog ate a bone.", 
            "Cats make fun pets.", "The cat did what? Don't tell the dog.", 
            "A cat runs when it has too, but most dogs love running!")

Now let’s put this data into a corpus for text processing.

corpus <- (VectorSource(wordVC))
corpus <- Corpus(corpus)
summary(corpus)

Here is the output

2016-12-22_09-11-53.jpg

Now for a little frustration. Let’s say you want to see what is in text document 4. You could try

inspect(corpus[4])

But this will be your output, not really what you are looking for.

2016-12-22_09-23-30.jpg

If you want to see the actual text- try this instead

corpus[[4]]$content

Now you can see the text

2016-12-22_09-26-21.jpg

As we go through the text pre-processing, we are going to use the following For Loop to examine our corpus

for (i in 1:6) print (corpus[[i]]$content)

Output

2016-12-22_09-30-41.jpg

Punctuation

Punctuation generally adds no value to text mining when utilizing standard numeric based data mining algorithms like clustering or classification. So it behooves us to just remove it.

To do so, the tm package has a cool function called tm_map() that we can pass arguments to, such as removePunctuation

corpus <- tm_map(corpus, content_transformer(removePunctuation))

for (i in 1:6) print (corpus[[i]]$content)

Note, you do not need the for loop, I am simply running it each time to show you the progress. 

Notice all the punctuation is gone now.

2016-12-22_09-34-28.jpg

Stopwords

Next we are going to get rid of what are known as stopwords. Stopwords are common words such as (the, an, and, him, her). These words are so commonly used that they provide little insight as to the actual meaning of the given text.  To get rid of them, we use the following code.

corpus <- tm_map(corpus, content_transformer(removeWords), 
          stopwords("english"))
for (i in 1:6) print (corpus[[i]]$content)

If you look at line 2, “A cat chase  dog”, you will the word “the” has been removed. However, if you look at the next line down, you will notice “The” is still there.

2016-12-22_09-39-52

WHY?

Well it comes down to the fact that computers do not treat T and t as the same letter, even though they are. Capitalized letters are viewed by computers as separate entities. So “The” doesn’t match “the” found in the list of stopwords to remove.

For a full list of R stopwords, go to: https://github.com/arc12/Text-Mining-Weak-Signals/wiki/Standard-set-of-english-stopwords

So how do we fix this?

tolower

Using tm_map with the “tolower” argument will make all the letters lowercase. If we then re-run our stopwords command, you will see all the “the” are gone

corpus <- tm_map(corpus, content_transformer(tolower))
corpus <- tm_map(corpus, content_transformer(removeWords), 
          stopwords("english"))
for (i in 1:6) print (corpus[[i]]$content)

Output

2016-12-22_09-49-25

stemming

Next we will stem our words. I covered this in the last lesson, but it bears repeating. What stemming does is attempt to remove variants of words. In our example, pay attention to the following words (dog, dogs, cat, cats, runs, running)

corpus <- tm_map(corpus, stemDocument)
for (i in 1:6) print (corpus[[i]]$content)

Notice the words are now (dog, cat, run)

2016-12-22_09-55-02.jpg

Whitespace

Finally, let’s get rid of all this extra white space we have now.

corpus <- tm_map(corpus, stripWhitespace) 
for (i in 1:6) print (corpus[[i]]$content)

Output

2016-12-22_10-01-01.jpg

removeNumbers

I didn’t use this argument with my tm_map() function today because I did not have any numbers in my text. But if I, the command would be as follows

corpus <- tm_map(corpus, content_transformer(removeNumbers))