Ensemble Modeling

In the world of analytics, modeling is a general term for using data mining (machine learning) methods to develop predictions. If you want to know which ad a particular user is more likely to click on, or which customers are likely to leave you for a competitor, you develop a predictive model.

There are a lot of models to choose from: Regression, Decision Trees, K-Nearest Neighbors, Neural Nets, etc. They will all provide you with a prediction, but some will do better than others depending on the data you are working with. While there are certain tricks and tweaks you can use to improve the accuracy of these models, it never hurts to remember that there is wisdom to be found in the masses.

The Jelly Bean Jar

I am sure everyone has come across some version of this in their life: you are at a fair or school fundraising event and someone has a large see-through jar full of jelly beans (or marbles or nickels). Next to the jar are some slips of paper with the instructions: “Guess the number of jelly beans in the jar and you win!”

An interesting thing about this game, and you can try this out for yourself, is that given a reasonable number of participants, more often than not the average guess of the group will beat the best individual guesser. In other words, imagine there are 200 jelly beans in the jar and the best guesser (the winner) guesses 215. More often than not, the average of all the guesses will be something like 210 or 190. The group cancels out its over- and under-guessing, resulting in a better answer than any one individual.

How Do We Get the Average in Models?

There are countless ways to do it, and researchers are constantly trying new approaches to get that extra 2% improvement over the last model. For ease of understanding, though, I am going to focus on two very popular methods of ensemble modeling: Random Forests and Boosted Trees.


Random Forests:

Imagine you have a data set containing 50,000 records. We will start by randomly selecting 1000 records and creating a decision tree from those records. We will then put the records back into the data set and draw another 1000 records, creating another decision tree. The process is repeated over and over again for a predefined number of iterations (each time the data used is returned to the pool where it could possibly be picked again).

After all the sample decision trees have been created (let’s say we created 500 for the sake of argument), the model takes the mean (average) of all the trees if you are doing a regression, or the mode of all the trees if you are doing a classification.

For those unfamiliar with the terminology, a regression model predicts a numeric value: the selling price of a house, a person’s weight, the price of a stock, etc. A classification model predicts a class: yes or no, large – medium – small, fast or slow, etc.
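To make this concrete, here is a minimal sketch of a random forest using scikit-learn. The file name and column names below are made up purely for illustration:

# A minimal random forest sketch using scikit-learn.
# The file and column names are made up for illustration.
import pandas as pd
from sklearn.ensemble import RandomForestRegressor
from sklearn.model_selection import train_test_split

df = pd.read_csv("houses.csv")              # hypothetical data set
X = df[["sqft", "bedrooms", "age"]]         # hypothetical predictor columns
y = df["selling_price"]                     # numeric target, so this is a regression

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2)

# 500 trees, each built from a random sample of the training records
rf = RandomForestRegressor(n_estimators=500)
rf.fit(X_train, y_train)

print(rf.predict(X_test[:5]))               # each prediction is the average of the 500 trees
print(rf.score(X_test, y_test))             # R-squared on held-out data

For a classification problem you would swap in RandomForestClassifier, and the trees vote (the mode) rather than being averaged.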

Boosted Trees:

Another popular method of ensemble modeling is known as boosted trees. In this method, a simple (weak learner) tree is created – usually only 3-5 splits. Then another small tree (3-5 splits) is built, this time focusing on the records the first tree predicted incorrectly. This is repeated multiple times (say 50 in this example), building layers of trees, each one getting a little bit better than the one before it. All the layers are combined to make the final predictive model.
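Here is the same idea as a minimal sketch, using scikit-learn's GradientBoostingRegressor on the same made-up housing data as the random forest sketch above:

# A minimal boosted trees sketch using scikit-learn.
# Again, the file and column names are made up for illustration.
import pandas as pd
from sklearn.ensemble import GradientBoostingRegressor
from sklearn.model_selection import train_test_split

df = pd.read_csv("houses.csv")
X = df[["sqft", "bedrooms", "age"]]
y = df["selling_price"]
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2)

# 50 boosting rounds; max_depth=3 keeps each tree a small weak learner,
# and every new tree is fit to the errors left by the trees before it
gbt = GradientBoostingRegressor(n_estimators=50, max_depth=3, learning_rate=0.1)
gbt.fit(X_train, y_train)

print(gbt.score(X_test, y_test))            # R-squared on held-out data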

Oversimplified?

Now I know this may be an oversimplified explanation, and I will create some tutorials on actually building ensemble models, but sometimes I think just getting a feel for the concept is important.

So are ensemble models always the best? Not necessarily.

One thing you will learn when it comes to modeling is that no one method is the best. Each has its own strengths. The more complex the model, the longer it takes to run, so sometimes you will find that speed outweighs the desire for that extra 2% accuracy bump. The secret is to be familiar with the different models and to try them out in different scenarios. You will find that choosing the right model can be as much an art as a science.

Feedback Loops in Predictive Models

Predictive models are full of perilous traps for the uninitiated. With the ease of use of some modeling tools like JMP or SAS, you can literally point and click your way into a predictive model. These models will give you results. And a lot of times, the results are good. But how do you measure the goodness of the results?

I will be doing a series of lessons on model evaluation. This is one of the more difficult concepts for many to grasp, as some of it may seem subjective. In this lesson I will be covering feedback loops and showing how they can sometimes improve, and other times destroy, a model.

What is a feedback loop?

A feedback loop in modeling is where the results of the model are somehow fed back into the model (sometimes intentionally, other times not). One simple example might be an ad placement model.

Imagine you built a model determining where on a page to place an ad based on the webpage visitor. When a visitor in group A sees an ad on the left margin and clicks on it, that click is fed back into the model, meaning left margin placement will carry more weight when selecting where to place the ad the next time a group A visitor comes to your page.

This is good, and in this case – intentional. The model is constantly retraining itself using a feedback loop.
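Just to make the mechanics concrete, here is a toy sketch of that kind of click-feedback update. The groups, placements, and weighting scheme are all invented for illustration – this is not any real ad platform's logic:

# Toy click-feedback loop: a click nudges up the weight of a placement for a group
import random
from collections import defaultdict

# placement weights per visitor group; everything starts out equal
weights = defaultdict(lambda: {"left": 1.0, "right": 1.0, "top": 1.0})

def choose_placement(group):
    # pick a placement, favoring those with higher weights
    options = weights[group]
    names = list(options)
    return random.choices(names, weights=[options[n] for n in names])[0]

def record_click(group, placement):
    # the feedback loop: a click makes that placement more likely next time
    weights[group][placement] += 1.0

# a group A visitor gets the left-margin ad and clicks it...
record_click("A", "left")
# ...so "left" now carries more weight the next time a group A visitor arrives
print(choose_placement("A"))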

When feedback loops go bad…

Gaming the system.

Build a better mousetrap… the mice get smarter.

Imagine a predictive model developed to determine entrance into a university. Let’s say that when you initially built the model, you discovered that students who took German in high school seemed to be better students overall. Now, as we all know, correlation is not causation. Perhaps this was just a blip in your data set, or maybe German was simply the language most commonly offered at the better high schools. The truth is, you don’t actually know.

How can this be a problem?

Competition to get into universities (especially highly sought-after universities) is fierce to say the least. There are entire industries designed to help students get past the admissions process. These industries use any insider knowledge they can glean, and may even try to reverse engineer the admissions algorithm.

The result – a feedback loop

These advisers will learn that taking German greatly increases a student’s chance of admission at this imaginary university. Soon they will be advising prospective students (and their parents) who otherwise would not have any chance of being accepted into your school to sign up for German classes. Now you have a bunch of students, who may no longer be the best fit, making their way past your model.

What to do?

Feedback loops can be tough to anticipate, so one way to guard against them is to retrain your model every once in a while. I even suggest retooling the model – removing some factors to determine whether a rogue factor (e.g. the German class) is carrying too much weight in your model.

And always keep in mind that these models are just that – models. They are not fortune tellers. Their accuracy should be constantly scrutinized and their methods questioned. Because while ad clicks or college admissions are one thing, policing and criminal sentencing algorithms run the risk of being much more harmful.

Left unchecked, the feedback loop of a predictive criminal activity model in any large city in the United States will almost always teach the computer to emulate the worst of human behavior – racism, sexism, and class discrimination.

Since minority males from poor neighborhoods disproportionately make up our current prison population, any model that takes race, sex, and economic status into account will inevitably determine that a 19-year-old black male from a poor neighborhood is a criminal. We will then have violated the basic tenet of our justice system – innocent until proven guilty.

 

R: K-Means Clustering

Note: This is an introductory lesson with a made up data set. After you are finished with this tutorial, if you want to see a nice real world example, head on over to Michael Grogan’s website:

http://www.michaeljgrogan.com/k-means-clustering-example-stock-returns-dividends/

K-means clustering will be our introduction to unsupervised machine learning. What is unsupervised machine learning exactly? Well, the simplest explanation I can offer is that, unlike supervised learning, where our data set contains a result, unsupervised learning does not.

Think of a simple regression where I have the square footage and selling prices (the result) of 100 houses. Taking that data, I can easily create a prediction model that will predict the selling price of a house based on square footage – this is supervised machine learning.

Now, take a data set containing 100 houses with the following data: square footage, house style, garage/no garage, but no selling price. We can’t create a prediction model since we have no knowledge of prices, but we can group the houses together based on commonalities. These groupings (clusters) can be used to gain knowledge of your data set.

I think seeing it in action will help.

Here is the data set: cluster

The data we will be looking at is test results for 149 students.

[Screenshot: the test score data – student ID, TestA, and TestB for 149 students]

The task at hand is to group the students into 3 groups based on the test results. Now, one thing any teacher will let you know is that some kids perform well in one subject and perhaps not so well in another. So we can’t simply group them based on their performance on one test. And when you are dealing with real-world data, you might be looking at 20-100 test/quiz scores per student.

So what we are going to do is let the computer decide how to group (or cluster) them.

To do so, we are going to use k-means clustering. K-means clustering works by choosing random points (centroids). It then groups the data points around the centroids based on which centroid each point is closest to.

Let’s get started

Let’s start by loading the data

st <- read.csv(file.choose())
head(st)

Our data:

[Screenshot: head(st) showing the student ID, TestA, and TestB columns]

Now let’s run the data through the kmeans() algorithm.

First, we only want to focus on columns 2 and 3 in the data set, since column 1 (studentID) is basically a label and provides no value for prediction.

To do this, we subset the data: st[,2:3] – which means I want all rows ([,) and columns 2-3 (2:3])

[Screenshot: the subset st[, 2:3] – just the TestA and TestB columns]

Now the code to make our clusters

stCl <- kmeans(st[, 2:3], 3, nstart = 20)
stCl

The syntax is kmeans(DATA, number of clusters, number of random starts).

I picked 3 as the number of clusters because I know it works well with this data; picking the right number usually takes a little trial and error in real life.

The number of random starts is how many times you want the algorithm to be rerun (choosing new centroids each time), keeping the result where the clusters are tightest.

Below is the output of our kmeans() – note the cluster means; these tell us the mean score for TestA and TestB in each cluster.

[Screenshot: kmeans() output – the cluster sizes and the cluster means for TestA and TestB]

Hey, if you are a math junkie, this may be all you want. But if you are looking for some more practical value here, let’s move on.

First, we need to add a column to our data set that shows the cluster each student was assigned to.

Now since we read our data from a csv, it is a data frame. If you can’t remember that, you can always run the command is.data.frame(st) to test it out.

Do you remember how to add a column to a data frame?

Well, there are multiple ways, but this is, in my opinion, the easiest way.

st$cluster <- stCl$cluster

is.data.frame(st)
st$cluster <- stCl$cluster
head(st)

Here is the result

[Screenshot: head(st) with the new cluster column added]

Now, with the clusters, you can group your students based on their assigned cluster.

Technically we are done here. We have successfully grouped the students. But what if you want to make sure you did a good job? One quick check is to graph your work.

Before we can graph, we have to make sure our st$cluster column is set as a factor; then, using ggplot, we can graph it. (If you don’t have ggplot2 installed, you will need to run this line first: install.packages("ggplot2").)

library(ggplot2)
st$cluster <- as.factor(st$cluster)
ggplot(st, aes(TestA, TestB, color = cluster)) + geom_point()

And here is our output. The groups look pretty good.

[Plot: TestA vs. TestB scatter plot with points colored by cluster]

Python: Linear Regression

Regression is still one of the most widely used predictive methods. If you are unfamiliar with linear regression, check out my Linear Regression using Excel lesson. It will explain more of the math behind what we are doing here. This lesson is focused more on how to code it in Python.

We will be working with the following data set: Linear Regression Example File 1

Import the data

Using pandas .read_excel()
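Something like the sketch below – the file name and column names (years_worked, salary) are placeholders, so adjust them to match the actual file:

# Load the spreadsheet into a pandas data frame
# (the file and column names here are placeholders)
import pandas as pd

df = pd.read_excel("linear_regression_example_1.xlsx")
print(df.head())                            # first few rows: years worked and salary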


What we have is a data set representing years worked at a company and salary.

Let’s plot it

Before we go any further, let’s plot the data.
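A quick scatter plot does the job (using the same placeholder column names as the import sketch above):

# Scatter plot of years worked vs. salary
import matplotlib.pyplot as plt

plt.scatter(df["years_worked"], df["salary"])
plt.xlabel("Years worked")
plt.ylabel("Salary")
plt.show()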

Looking at the plot, it looks like there is a possible correlation.

[Scatter plot: years worked vs. salary]

Linear Regression using scipy

The scipy library contains some easy-to-use math and science tools. In this case, we are importing stats from scipy.

The method stats.linregress() produces the following outputs: slope, y-intercept, r-value, p-value, and standard error.


I set the slope to m and the y-intercept to b so we match the linear formula y = mx + b.
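Here is a sketch of that call, again with the placeholder column names:

# Fit the regression: slope (m), intercept (b), r-value, p-value, standard error
from scipy import stats

x = df["years_worked"]
y = df["salary"]

m, b, r_value, p_value, std_err = stats.linregress(x, y)
print(m, b, r_value, p_value, std_err)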

Using the results of our regression, we can create an easy function to predict a salary. In the example below, I want to predict the salary of a person who has been working there for 10 years.
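A sketch of that function, assuming the m and b from the linregress call above:

def pred(x):
    # predict salary from years worked using y = m*x + b
    return m * x + b

print(pred(10))                             # predicted salary after 10 years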


Our p-value is nice and low. This means our variables do have an effect on each other.


Our standard error is 250, but this can be misleading depending on the size of the values in your regression. A better measurement is r-squared. We find that by squaring our r output.
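Using the r_value from the linregress sketch above, that is simply:

r_squared = r_value ** 2
print(r_squared)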


R-squared runs from 0 (bad) to 1 (good). Our r-squared is .44, so our regression is not that great. I prefer to keep the r-squared value at .6 or above.

Plot the Regression Line

We can use our pred() function to find the y-coords needed to plot our regression line.

Passing pred() an x value of 0 gives us our bottom value; passing it an x value of 35 gives us our top value.

I then redo my scatter plot just like above. Then I plot my line using plt.plot().
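Putting that together, assuming df, pred(), and plt from the earlier sketches:

# Scatter plot plus the fitted regression line
plt.scatter(df["years_worked"], df["salary"])
plt.plot([0, 35], [pred(0), pred(35)], color="red")   # line through two predicted points
plt.xlabel("Years worked")
plt.ylabel("Salary")
plt.show()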

[Plot: the scatter plot with the fitted regression line added]


If you enjoyed this lesson, click LIKE below, or even better, leave me a COMMENT. 

Follow this link for more Python content: Python