Ensemble Modeling

In the world of analytics, modeling is a general term used to refer to the use of data mining (machine learning) methods to develop predictions. If you want to know which ad a particular user is more likely to click on, or which customers are likely to leave you for a competitor, you develop a predictive model.

There are a lot of models to choose from: Regression, Decision Trees, K Nearest Neighbor, Neural Nets, etc. They all will provide you with a prediction, but some will do better than others depending on the data you are working with. While there are certain tricks and tweaks one can do to improve the accuracy of these models, it never hurts to remember that there is wisdom to be found in the masses.

The Jelly Bean Jar

I am sure everyone has come across some version of this in their life: you are at a fair or school fundraising event and someone has a large see-through jar full of jelly beans (or marbles or nickels). Next to the jar are some slips of paper with the instructions: "Guess the number of jelly beans in the jar and you win!"

An interesting thing about this game, and you can try this out for yourself, is that given a reasonable number of participants, more often than not the average guess of the group will perform better than the best individual guesser. In other words, imagine there are 200 jelly beans in the jar and the best guesser (the winner) guesses 215. More often than not, the average of all the guesses will be something like 210 or 190. The group cancels out its over- and under-guessing, resulting in a better answer than any one individual.

How Do We Get the Average in Models?

There are countless ways to do it, and researchers are constantly trying new approaches to get that extra 2% improvement over the last model. For ease of understanding though, I am going to focus on two very popular methods of ensemble modeling: Random Forests & Boosted Trees.


Random Forests:

Imagine you have a data set containing 50,000 records. We will start by randomly selecting 1,000 records and creating a decision tree from those records. We will then put the records back into the data set and draw another 1,000 records, creating another decision tree. The process is repeated over and over again for a predefined number of iterations (each time, the data used is returned to the pool, where it could possibly be picked again).

After all the sample decision trees have been created (let's say we created 500 for the sake of argument), the model then takes the mean of all the trees if you are looking at a regression, or the mode of all the trees if you are dealing with a classification.

For those unfamiliar with the terminology, a regression model looks for a numeric value as the answer. It could be the selling price of a house, a person's weight, the price of a stock, etc. A classification, on the other hand, looks for categorical answers: yes or no; large, medium, or small; fast or slow; etc.
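To make the Random Forest idea concrete, here is a minimal sketch in R using the randomForest package (an assumption on my part – it is one common implementation, not the only one). It builds 500 bootstrapped trees and averages them for a regression:

install.packages("randomForest") # if not already installed
library(randomForest)

# regression forest: 500 trees, each grown on a bootstrap sample of the data
rf <- randomForest(Sepal.Length ~ ., data = iris, ntree = 500)
predict(rf, iris[1, ]) # the prediction is the average across all 500 trees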

Boosted Trees:

Another popular method of ensemble modeling is known as boosted trees. In this method, a simple (weak learner) tree is created – usually with just 3-5 splits. Then another small tree (3-5 splits) is built from the errors of the first tree. This is repeated multiple times (say 50 in this example), building layers of trees, each one getting a little bit better than the one before it. All the layers are combined to make the final predictive model.
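Again a minimal sketch, this time assuming the gbm package (one of several boosting implementations in R): 50 shallow trees, each a few splits deep, fit in sequence.

install.packages("gbm") # if not already installed
library(gbm)

# 50 small trees, each one fit to the errors of the ensemble so far
boost <- gbm(Sepal.Length ~ ., distribution = "gaussian", data = iris,
             n.trees = 50, interaction.depth = 3)
predict(boost, iris[1, ], n.trees = 50)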


Now I know this may be an oversimplified explanation, and I will create some tutorials on actually building ensemble models, but sometimes I think just getting a feel for the concept is important.

So are ensemble models always the best? Not necessarily.

One thing you will learn when it comes to modeling is that no one method is the best. Each has its own strengths. The more complex the model, the longer it takes to run, so sometimes you will find speed outweighs the desire for that added 2% accuracy bump. The secret is to be familiar with the different models and to try them out in different scenarios. You will find that choosing the right model can be as much of an art as a science.

Factor Analysis: Picking the Right Variables

Factor Analysis, what is it?

In layman’s terms, it means choosing which factors (variables) in a data set you should use for your model. Consider the following data set:


In the above example, the columns of the data set (Height, Weight, Grade, and so on) would be our factors. It can be very tempting, especially for new data science students, to want to include as many factors as possible. In fact, as you add more factors to a model, you will see many classic statistical markers of model goodness increase. This can give you a false sense of trust in the model.

The problem is, with too many poorly chosen factors, your model is almost guaranteed to underperform. To avoid this issue, try approaching a new model with the idea of minimizing factors, using only the factors that drive the greatest impact.

It may seem overwhelming at first. I mean, where do you start? Looking at the list above, what do you get rid of? Well, for those who really love a little self-torture, there are entire statistics textbooks dedicated to factor analysis. For the rest of us, consider some of the following concepts. While not an exhaustive list, these should get you started in the right direction.


Collinearity

In terms of regression analysis, collinearity concerns itself with factors that have strong correlations with each other. In my example above, think Height and Weight. In general, as Height increases, so does Weight. You would expect a 6'4 senior to easily outweigh a 4'11 freshman. So as one factor (Height) increases or decreases, the other (Weight) follows in kind. Correlations can also be negative, with one factor decreasing as another increases, or vice versa.
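A quick sketch with made-up height and weight values shows how to spot this with R's cor() function:

height <- c(59, 62, 65, 68, 71, 74)       # inches (made up)
weight <- c(110, 125, 140, 160, 175, 195) # pounds (made up)
cor(height, weight) # close to 1: a strong positive correlation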


The problem with these factors is that, when used together in a model, they tend to amplify each other's effect. The model is skewed, placing too much weight on what is essentially a single factor.

So what do you do about it?

Simply enough, in cases like this, you pick one. Height or Weight will do. In more complex models you can use mathematical techniques like Singular Value Decomposition (SVD), but I won't cover that in this lesson.

I am also not going to cover any of the methods for detecting collinearity in this lesson; I will be covering those in future lessons. But it should be noted that a lot of the time, domain knowledge is really all you need to spot it. It doesn't take a doctor to realize that taller people are generally heavier.

But wait…

I know what you are thinking: what about the 250 lb 5'1 kid or the 120 lb 6'2 kid? Well, if you have enough of these outliers in your data and you feel that being over- or underweight is an important variable to consider, I would recommend using a proxy. In this case, you could substitute BMI (body mass index – a calculation based on height and weight) to replace both height and weight as factors.
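A sketch of that substitution, reusing the made-up height and weight vectors from above (703 is the standard conversion factor when working in pounds and inches):

# BMI = 703 * weight (lbs) / height (in)^2
BMI <- 703 * weight / height^2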


Stepwise Regression

Stepwise regression is a method for determining which factors provide value to the model. The way it works (in the most basic definition I can offer) is that you run your regression model with all your factors, removing the weakest factor each time (based on statistical evaluation methods like R^2 values and p-values). This is done repeatedly until only high-value factors are left in the model.
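For a concrete sketch, base R's step() function automates something similar, though it drops factors by AIC rather than p-values (mtcars stands in for our data set here):

fullModel <- lm(mpg ~ ., data = mtcars) # start with every factor
step(fullModel, direction = "backward") # drop the weakest factor each pass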


Next up: a couple of techniques that are not technically factor analysis, but that can be useful in removing bad factors from your model.


Binning or Categorizing Data

Let's say, looking at the data example above, our data covered all grades from 1-12. What if you wanted to look at kids in two-year periods? You would want to bin the data into equal groups of 2: 1-2, 3-4, 5-6, 7-8, 9-10, 11-12. You can now analyze the data in these blocks.

What if you wanted to measure the effectiveness of certain schools in the system? You might be wise to categorize the data. What that means is we will take grades 1-6 and place them in one category (elementary), 7-8 in another (middle school), and 9-12 in a third (high school).
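Both operations are one-liners with R's cut() function; the grade values here are made up for illustration:

grades <- c(1, 4, 7, 9, 12, 3)
# bin into equal two-year blocks
bins <- cut(grades, breaks = seq(0, 12, by = 2))
# categorize by school level
school <- cut(grades, breaks = c(0, 6, 8, 12),
              labels = c("elementary", "middle school", "high school"))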

Incomplete Data

Imagine a factor called household income. This is a field that very likely may not be readily answered by parents. If there are only a few missing fields, some algorithms won’t be too affected, but if there are a lot, say 5%, you need to do something about it.

What are your options?

You could perform a simple mean or median replacement for all missing values, or try to calculate a best guess based on other factors. You could delete records missing this value. Or, as I often do, just toss the factor away. Most likely, any value it adds to your model is going to be questionable at best. Don't fall for the Big Data "more is always better" trap. Sometimes simplicity wins out in the end.
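As a sketch, a median replacement is one line in R (the income values are made up):

income <- c(52000, 61000, NA, 48000, NA, 75000)
income[is.na(income)] <- median(income, na.rm = TRUE) # fill NAs with the median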

Outliers and Erroneous Data

Outliers can really skew your model, but even worse, erroneous data can make your model absolutely worthless. Look out for outliers and question strange-looking data. Unless you can come up with a really good reason why they should stay in your model, I say chuck the records containing them.





R: Text Mining (Term Document Matrix)

There are a bounty of well-known machine learning algorithms, both supervised (Decision Tree, K Nearest Neighbor, Logistic Regression) and unsupervised (clustering, anomaly detection). The only catch is that these algorithms are designed to work with numbers, not text. Combining text mining with these numeric data mining methods is sometimes called duo-mining.

So before you can utilize these algorithms, you first have to transform text into a format suitable for these number-based algorithms. One of the most popular methods people first learn is creating a Term Document Matrix (TDM). The easiest way to understand this concept is, of course, to do an example.

Let's start by loading the required library.

install.packages("tm") # if not already installed
library(tm)

Now let’s create a simple vector of strings.

wordVC <- c("I like dogs", "A cat chased the dog", "The dog ate a bone", 
            "Cats make fun pets")

Now we are going to place the strings into a data type designed for text mining (from the tm package) called a corpus. A corpus simply means the full collection of text you want to work with.

corpus <- Corpus(VectorSource(wordVC))
summary(corpus)

  Length Class             Mode
1 2      PlainTextDocument list
2 2      PlainTextDocument list
3 2      PlainTextDocument list
4 2      PlainTextDocument list

As you can see from the summary, the corpus classified each string in the vector as a PlainTextDocument.

Now let's create our first Term Document Matrix. The tm package's inspect() function will display it.

TDM <- TermDocumentMatrix(corpus)
inspect(TDM)



As you can see, we now have a numeric representation of our text. Each row represents a word in our text and each column represents an individual sentence. So if you look at the word dog, you can see it appears once each in sentences 2 and 3 (0110), while bone appears once in sentence 3 only (0010).


One immediate issue that jumps out at me is that R sees cat & cats and dog & dogs as different words, when really cats and dogs are just the plural versions of cat and dog. There may be some more advanced text mining applications where you would want to keep the two words separate, but in most basic text mining applications you only want to keep one version of each word.

Luckily for us, R makes that simple. Use the function tm_map with the argument stemDocument. (One note: stemming in tm relies on the SnowballC package, so you may need to install that as well.)

corpus2 <- tm_map(corpus, stemDocument)

Then make a new TDM:

TDM2 <- TermDocumentMatrix(corpus2) 

Now you see only the singular forms of cat and dog in our list.


If you would like, you can also work with the transpose of the TDM, called a Document Term Matrix.

dtm = t(TDM2)
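If you want to feed a TDM or DTM into one of those machine learning algorithms, one simple approach (a sketch, not the only way) is to flatten it into a plain matrix or data frame:

dtmMatrix <- as.matrix(dtm)       # plain numeric matrix
dtmDF <- as.data.frame(dtmMatrix) # one row per sentence, one column per word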




I'll get deeper into more pre-processing tasks, as well as ways to work with your TDM, in future lessons. For now, practice making TDMs and see if you can think of ways to use TDMs and DTMs with some machine learning algorithms you might already know (decision trees, logistic regression).

R: K-Means Clustering – Deciding How Many Clusters

In a previous lesson I showed you how to do a K-means cluster in R. You can visit that lesson here: R: K-Means Clustering.

Now in that lesson I chose 3 clusters. I did that because I was the one who made up the data, so I knew 3 clusters would work well. In the real world it doesn't work that way. Choosing the right number of clusters is one of the trickier parts of performing a k-means cluster.

If you go over to Michael Grogan’s site, you will see he has a great method for figuring out how many clusters to choose. http://www.michaeljgrogan.com/k-means-clustering-example-stock-returns-dividends/

wss <- (nrow(sample_stocks)-1)*sum(apply(sample_stocks,2,var))
for (i in 2:20) wss[i] <- sum(kmeans(sample_stocks, centers = i)$withinss)
plot(1:20, wss, type="b", xlab="Number of Clusters",
     ylab="Within groups sum of squares")

If you understand the code above, then great. That is a great solution for choosing the number of clusters. If, however, you are not 100% sure what is going on above, keep reading. I’ll walk you through it.

K-Means Clustering

We need to start by getting a better understanding of what k-means clustering means. Consider this simplified explanation of clustering.

The way it works is that each of the rows in our data is placed into a vector.


These vectors are then plotted out in space. Centroids (the yellow stars in the picture below) are chosen at random. The plotted vectors are then placed into clusters based on which centroid they are closest to.


So how do you measure how good your clusters fit? (Do you need more clusters? Fewer?) One popular metric is the within-cluster sum of squares. R provides this as kmeans()$withinss. It measures how far the vectors in each cluster are from their respective centroid.

The goal is to get this number as small as possible. One approach is to run your kmeans clustering multiple times, raising the number of clusters each time, then compare the withinss from each run, stopping when the rate of improvement drops off. The idea is to find a low withinss while still keeping the number of clusters low.


This is, in effect, what Michael Grogan has done above.

Break down the code

Okay, now let's break down Mr. Grogan's code and see what he is doing.

wss <- (nrow(sample_stocks)-1)*sum(apply(sample_stocks,2,var))
for (i in 2:20) wss[i] <- sum(kmeans(sample_stocks, centers = i)$withinss)
plot(1:20, wss, type="b", xlab="Number of Clusters",
     ylab="Within groups sum of squares")


The first line of code is a little tricky. Let’s break it down.

wss <- (nrow(sample_stocks)-1)*sum(apply(sample_stocks,2,var))

sample_stocks – the data set

wss <-  – This simply assigns a value to a variable called wss

(nrow(sample_stocks)-1)  – the number of rows (nrow) in sample_stocks – 1. So if there are 100 rows in the data set, then this will return 99

sum(apply(sample_stocks,2,var)) – let’s break this down deeper and focus on the apply() function. apply() is kind of like a list comprehension in Python. Here is how the syntax works.

apply(data, (1=rows, 2=columns), function you are passing the data through)

So, let’s create a small array and play with this function. It makes more sense when you see it in action.

tt <- array(1:20, dim=c(10,2)) # create array with data 1-20,
                               # 10 rows, 2 columns
> tt
 [,1] [,2]
 [1,] 1 11
 [2,] 2 12
 [3,] 3 13
 [4,] 4 14
 [5,] 5 15
 [6,] 6 16
 [7,] 7 17
 [8,] 8 18
 [9,] 9 19
[10,] 10 20

Now let's try running this through apply.

> apply(tt, 2, mean)
[1] 5.5 15.5

Apply took the mean of each column. Had I used 1 as the second argument, it would have taken the mean of each row.

> apply(tt, 1, mean)
 [1] 6 7 8 9 10 11 12 13 14 15

Also, keep in mind, you can create your own functions to be used in apply:

apply(tt,2, function(x) x+5)
 [,1] [,2]
 [1,] 6 16
 [2,] 7 17
 [3,] 8 18
 [4,] 9 19
 [5,] 10 20
 [6,] 11 21
 [7,] 12 22
 [8,] 13 23
 [9,] 14 24
[10,] 15 25

So, what is Mr. Grogan doing with his apply function? apply(sample_stocks,2,var) – he is taking the variance of each column in his data set. Running the equivalent on our tt array:

[1] 9.166667 9.166667

And by summing it – sum(apply(tt,2,var)) – we are simply adding the two values together.

[1] 18.33333

So, the entire first line of code, run against our tt data, is:

wss <- (nrow(tt)-1)*sum(apply(tt,2,var))

which works out to wss <- (10-1) * (18.333):

> wss
[1] 165

This number is effectively the within-cluster sum of squares for the data set treated as a single cluster.

Next section of code

Next we will tackle the next two lines of code.

for (i in 2:20) wss[i] <- sum(kmeans(sample_stocks, centers = i)$withinss)

The first part is a for loop and should be simple enough. Note he doesn’t use {} to denote the inside of his loop. You can do this when your for loop is a single line, but I am going to use the {}’s anyway, as I think it makes the code a bit neater.

for (i in 2:20) – a for loop iterating from 2 to 20

for (i in 2:20) {
  wss[i] <- ...
}

wss[i] <- – inside the loop, we assign more values to the vector wss, starting at position 2 and working our way up to 20.

Remember, a single value variable in R is actually a single value vector.

> c <- 5
> c
[1] 5
> c[2] <- 7
> c
[1] 5 7

Okay, so now to the trickier code: sum(kmeans(sample_stocks, centers = i)$withinss)

What he is doing is running a kmeans clustering of the data once for each value of centers (the number of centroids we want) from 2 to 20, reading the $withinss vector from each run. Finally, sum() adds all the withinss values together (you get one withinss value for every cluster created – one per center).

Plot the results

The last part of the code is plotting the results

plot(1:20, wss, type="b", xlab="Number of Clusters",
     ylab="Within groups sum of squares")

plot(x, y, type = type of graph, xlab = label for x axis, ylab = label for y axis)

Let’s try it with our data

If you already did my Kmeans lesson, you should already have the file; if not, you can download it here: cluster

 myData <- read.csv('cluster.csv')
> head(myData)
 StudentId TestA TestB
1 2355645.1 134 24
2 8718152.6 155 32
3 8303333.6 130 25
4 6352972.5 185 86
5 3381543.2 153 95
6 817332.4 153 81
> myData <- myData [,2:3] # get rid of StudentId column
> head(myData)
 TestA TestB
1 134 24
2 155 32
3 130 25
4 185 86
5 153 95
6 153 81

Now let's feed this through Mr. Grogan's code.

wss <- (nrow(myData)-1)*sum(apply(myData,2,var))
for (i in 2:20) {
  wss[i] <- sum(kmeans(myData, centers = i)$withinss)
}
plot(1:20, wss, type="b", xlab="Number of Clusters",
     ylab="Within groups sum of squares")

Here is our output (a scree plot for you math junkies out there):


Now Mr. Grogan's plot shows a nice dramatic drop-off, which is unfortunately not how most real-world data I have seen works. I am going to choose 5 as my cutoff point because, while the withinss does continue to decrease, it doesn't seem to do so at a rate great enough to justify the added complexity of more clusters.

If you want to see how 5 clusters look next to the three I had originally created, you can run the following code.

library(ggplot2) # needed for the plot
myCluster <- kmeans(myData, 5, nstart = 20)
myData$cluster <- as.factor(myCluster$cluster)
ggplot(myData, aes(TestA, TestB, color = cluster)) +
  geom_point()

5 Clusters


3 Clusters


I see some improvement in the 5 cluster model. So Michael Grogan’s trick for finding the number of clusters works.


R: K-Means Clustering

Note: This is an introductory lesson with a made-up data set. After you are finished with this tutorial, if you want to see a nice real-world example, head on over to Michael Grogan's website: http://www.michaeljgrogan.com/k-means-clustering-example-stock-returns-dividends/


K-means clustering will be our introduction to Unsupervised Machine Learning. What is unsupervised machine learning exactly? Well, the simplest explanation I can offer is that unlike supervised learning, where our data set contains a result, an unsupervised data set does not.

Think of a simple regression where I have the square footage and selling prices (the result) of 100 houses. Taking that data, I can easily create a prediction model that will predict the selling price of a house based on square footage. This is supervised machine learning.

Now, take a data set containing 100 houses with the following data: square footage, house style, garage/no garage, but no selling price. We can’t create a prediction model since we have no knowledge of prices, but we can group the houses together based on commonalities. These groupings (clusters) can be used to gain knowledge of your data set.

I think seeing it in action will help.

Here is the data set: cluster

The data we will be looking at is test results for 149 students.


The task at hand is to group the students into 3 groups based on the test results. Now, one thing any teacher will tell you is that some kids perform well in one subject and perhaps not so well in another, so we can't simply group them on their performance on one test. And when you are dealing with real-world data, you might be looking at 20-100 test/quiz scores per student.

So what we are going to do is let the computer decide how to group (or cluster) them.

To do so, we are going to use K-means clustering. K-means clustering works by choosing random points (centroids). It then groups the data points around the centroids based on which centroid the points are closest to.

Let’s get started

Let’s start by loading the data

st <- read.csv(file.choose())

our data


Now let's run the data through the kmeans() algorithm

First, we are only going to focus on columns 2 and 3 in the data set, since column 1 (StudentId) is basically a label and provides no value in prediction.

To do this, we subset the data: st[,2:3] – which means we want all rows ([,) and columns 2-3 (2:3]).


Now the code to make our clusters

stCl <- kmeans(st[, 2:3], 3, nstart = 20)

The syntax is kmeans(DATA, number of clusters, number of random starts).

I picked 3 as the number of clusters because I know that works well with the data; picking the right number usually takes a little trial and error in real life.

The number of random starts is how many times you want the algorithm rerun (choosing new centroids each time), keeping the result where the clusters are tightest.

Below is the output of our kmeans run – note the cluster means, which tell us the mean scores for TestA and TestB in each cluster.
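If you want to reproduce that output, just print the model object:

stCl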


Hey, if you are a math junkie, this may be all you want. But if you are looking for some more practical value here, let's move on.

First, we need to add a column to our data set that shows our clusters.

Now since we read our data from a csv, it is a data frame. If you can’t remember that, you can always run the command is.data.frame(st) to test it out.

Do you remember how to add a column to a data frame?

Well, there are multiple ways, but this is, in my opinion, the easiest way.

st$cluster <- stCl$cluster

Here is the result


Now, with the clusters, you can group your students based on their assigned cluster.

Technically we are done here. We have successfully grouped the students. But what if you want to make sure you did a good job? One quick check is to graph your work.

Before we can graph, we have to make sure our st$cluster column is set as a factor; then, using ggplot, we can graph it. (If you don't have ggplot2 installed, you will need to run this line: install.packages("ggplot2"))

library(ggplot2)
st$cluster <- as.factor(st$cluster)
ggplot(st, aes(TestA, TestB, color = cluster)) + geom_point()

And here is our output. The groups look pretty good.


R: Decision Trees (Regression)

Decision Trees are popular supervised machine learning algorithms. You will often find the abbreviation CART when reading up on decision trees. CART stands for Classification and Regression Trees.

In this example we are going to create a Regression Tree. Meaning we are going to attempt to build a model that can predict a numeric value.

We are going to start by taking a look at the data. In this example we are going to be using the Iris data set native to R, which contains measurements for 150 iris flowers.



As you can see, our data has 5 variables – Sepal.Length, Sepal.Width, Petal.Length, Petal.Width, and Species. The first 4 variables refer to measurements of flower parts and the species identifies which species of iris this flower represents.

In the Classification example, we tried to predict the Species of flower. In this example we are going to try to predict the Sepal.Length.

In order to build our decision tree, first we need to install the correct package.
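Since the fit call below uses rpart(), the package we need is rpart:

install.packages("rpart") # if not already installed
library(rpart)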



Next we are going to create our tree. Since we want to predict Sepal.Length – that will be the first element in our fit equation.

fit <- rpart(Sepal.Length ~ Sepal.Width + Petal.Length + Petal.Width + Species,
             method = "anova", data = iris)

Note the method in this model is anova. This means we are going to try to predict a number value. If we were doing a classifier model, the method would be class.

Now let’s plot out our model

plot(fit, uniform = TRUE,
     main = "Regression Tree for Sepal Length")
text(fit, use.n = TRUE, cex = .6)

Note the splits are marked – for example, the top split is Petal.Length < 4.25.

Also, at the terminating point of each branch, you see an n=. The number following it is the number of records from the data file that land at the end of that branch.


While this model actually works out pretty well, one thing to look for is overfitting. A good sign of overfitting is a bunch of branches terminating with n values of 1 or 2; it means the model is tuned too tightly to the training data and will most likely make poor predictions when run against a new set of data.

Of course we can look at some of the numbers if you are so inclined.
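One way to pull up these numbers is rpart's printcp() function, which prints the complexity (CP) table for the fit:

printcp(fit)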


Notice the xerror (cross-validation error) gets better with each split. That is something you want to watch: if that number starts to creep up as the splits increase, it is a sign you may want to prune some of the branches. I will show how to do that in another lesson.

To get a better picture of the change in xerror as the splits increase, let’s look at a new visualization
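One function that draws the two charts described below is rpart's rsq.rpart() (an assumption on my part, since there are other ways to plot these values):

rsq.rpart(fit)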


This produces 2 charts. The first one shows how R-squared improves as the splits increase (remember, R-squared gets better as it approaches 1, so this model is improving with each split).

The second chart shows how xerror decreases with each split. For a model that needs pruning, you would see the curve start to climb back up as the splits increase – imagine if split 6 were higher than split 5.


Okay, so finally now that we know the model is good, let’s make a prediction.

testData <- data.frame(Species = 'setosa', Sepal.Width = 4, Petal.Length = 1.2,
                       Petal.Width = 0.3) # value completed from the classification lesson's test data
predict(fit, testData, method = "anova")


So as you can see, based on our test data, the model predicts our Sepal.Length will be approximately 5.17.


R: Decision Trees (Classification)

Decision Trees are popular supervised machine learning algorithms. You will often find the abbreviation CART when reading up on decision trees. CART stands for Classification and Regression Trees.

In this example we are going to create a Classification Tree, meaning we are going to attempt to classify our data into one of the three classes in this data set.

We are going to start by taking a look at the data. In this example we are going to be using the Iris data set native to R, which contains measurements for 150 iris flowers.



As you can see, our data has 5 variables – Sepal.Length, Sepal.Width, Petal.Length, Petal.Width, and Species. The first 4 variables refer to measurements of flower parts and the species identifies which species of iris this flower represents. What we are going to attempt to do here is develop a predictive model that will allow us to identify the species of iris based on measurements.

The species we are trying to predict are setosa, virginica, and versicolor. These are our three classes we are trying to classify our data as.

In order to build our decision tree, first we need to install the correct package.
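As in the regression lesson, the tree comes from the rpart package:

install.packages("rpart") # if not already installed
library(rpart)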



Next we are going to create our tree. Since we want to predict Species – that will be the first element in our fit equation.

fit <- rpart(Species ~ Sepal.Length + Sepal.Width + Petal.Length + Petal.Width,
             method = "class", data = iris)

Now, let’s take a look at the tree.
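The plotting commands are the same plot()/text() pair from the regression lesson (the title string here is just illustrative):

plot(fit, uniform = TRUE,
     main = "Classification Tree for Iris Species")
text(fit, use.n = TRUE, cex = .6)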


To understand what the output says: according to our model, if Petal.Length < 2.45, the flower is classified as setosa. If not, it moves to the next split, Petal.Width. If Petal.Width < 1.75, then versicolor; else virginica.


Now, we want to take a look at how good the model is.
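Again, rpart's printcp() is one way to produce the table discussed below:

printcp(fit)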


I am not going to harp too much on the stats here, but let's look at the table at the bottom. The first row has a CP = 0.50. This means (approximately) that the first split reduced the relative error by 0.5. You can see this in the rel error of the second row.

Now the 2nd row CP = 0.44, so the second split improved the rel error in the third row to 0.06.

Now personally, when just trying to get a quick overview of the goodness of the model, I look at the xerror (cross-validation error) of the final row. 0.10 is a nice low number.


Okay, now let's make a prediction. Start by creating some test data.

testData <- data.frame(Sepal.Length = 1, Sepal.Width = 4, Petal.Length = 1.2,
                       Petal.Width = 0.3)

Now let’s predict

predict(fit, testData, type="class")

Here is the output:


As you can see, the model predicted setosa. If you look back at the tree, you will see why.

Let's do one more prediction. This time, assume newdata is a data frame of four new flower measurements, built the same way as testData above.

predict(fit, newdata, type="class")

Here is the output


The model predicts rows 1, 2, and 3 are virginica and row 4 is versicolor.

Now go find some more data and try this out.

R: Simple Linear Regression

Linear Regression is a very popular prediction method and most likely the first predictive algorithm most people learn. To put it simply, in linear regression you try to place a line of best fit through a data set and then use that line to predict new data points.


If you are new to linear regression or are in need of a refresher, check out my lesson on Linear Regression in Excel where I go much deeper into the mechanics: Linear Regression using Excel

Get the Data

You can download our data set for this lesson here: linear1

Let’s upload our file into R

df <- read.csv(file.choose())


Now our data file contains a listing of Years a person has worked for company A and their Salary.

Check for linear relationship

With a 2 variable data set, often it is quickest just to graph the data to check for a possible linear relationship.

#plot data
plot(df$Years, df$Salary)

Looking at the plot, there definitely appears to be a linear relationship. I can easily see where I could draw a line through the data.


An even better way to do it is to check for correlation. Remember, the closer to 1, the stronger the correlation in the data.

#check for correlation
cor(df$Years, df$Salary)


Since our correlation is so high, I think it is a good idea to perform a linear regression.

Linear Regression in R

A linear regression in R is pretty simple. The syntax is lm(y ~ x, data).

#perform linear regression
fit <- lm(Salary ~ Years, data = df)
summary(fit) # prints the output we break down below


Now let’s take a second to break down the output.

First, check the p-values (the Pr(>|t|) column of the summary). I want to make sure they are under my threshold (usually 0.05). This becomes more important in multiple regression.

Next, the R-squared values. Since this is a simple regression, the multiple and adjusted R-squared are pretty much the same, and it really doesn't matter which one you look at. What these numbers tell me is how accurate my prediction line is. A good way for a beginner to look at them is as rough percentages: in our example, the prediction model is about 75-76% accurate.

Finally, the coefficients. You can use these numbers to create your predictive model. Remember the linear equation Y = mX + b? Using our coefficients, the equation now reads Y = 1720.7X + 43309.7.
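For example, an employee with 40 years at the company would be predicted to earn 1720.7 * 40 + 43309.7 = 112,137.7 – which is exactly what the predict() call further down returns.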


You can use fitted() to show you how your model would predict your existing data
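For example, to list the model's predictions for each row of our existing data:

fitted(fit)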



You can also use the predict command to try a new value

predict(fit, newdata =data.frame(Years= 40))


Let’s graph our regression now.

plot(df$Years, df$Salary)
abline(fit, col = 'red')


The Residuals Plot

I am not going to go too deep into the weeds here, but I want to show you something cool.

layout(matrix(c(1,2,3,4),2,2)) # c(1,2,3,4) gives us 4 graphs on the page,
                               # 2,2 - graphs are arranged 2x2
plot(fit)                      # draw the model's four diagnostic plots

I promise to go more into this in a later lesson, but for now, I just want you to note the numbers popping up inside the graphs (38, 18, 9). These represent outliers. One of the biggest problems with any linear system is that it is easily thrown off by outliers. So you need to know where your outliers are.


If you look up those flagged points in your data, you will see why they are outliers. Now, while this doesn't tell you what to do about your outliers – that decision has to come from you – it is a great way of finding them quickly.


The Code

# upload file
df <- read.csv(file.choose())

#plot data
plot(df$Years, df$Salary)

#check for correlation
cor(df$Years, df$Salary)

#perform linear regression
fit <- lm(Salary ~ Years, data = df)
summary(fit)

#see predictions
predict(fit, newdata = data.frame(Years = 40))

#plot regression line
plot(df$Years, df$Salary)
abline(fit, col = 'red')




Python: Naive Bayes’

Naive Bayes' is a supervised machine learning classification algorithm based on Bayes' Theorem. If you don't remember Bayes' Theorem, here it is:

P(A|B) = P(B|A) * P(A) / P(B)

Seriously though, if you need a refresher, I have a lesson on it here: Bayes' Theorem

The naive part comes from the idea that the probability of each column is computed alone. They are “naive” to what the other columns contain.

You can download the data file here: logi2

Import the Data

import pandas as pd
df = pd.read_excel(r"C:\Users\Benjamin\Documents\logi2.xlsx") # raw string keeps the backslashes literal


Let’s look at the data. We have 3 columns – Score, ExtraCir, Accepted. These represent:

  • Score – Student Test Score
  • ExtraCir – Was the Student in an Extracurricular Activity
  • Accepted – Was the Student Accepted

Now the Accepted column is our result column – the column we are trying to predict. Having a result column in your data set is what makes this a supervised machine learning algorithm.

Split the Data

Next, split the data into inputs (Score and ExtraCir) and results (Accepted).

y = df.pop('Accepted')
X = df




Fit Naive Bayes

Lucky for us, scikit-learn has a built-in Naive Bayes algorithm (MultinomialNB).

Import MultinomialNB and fit our split columns to it (X,y)

from sklearn.naive_bayes import MultinomialNB
classifier = MultinomialNB()
classifier.fit(X, y)


Run some predictions

Let's run the predictions below. The results show 1 (Accepted) or 0 (Not Accepted).

#--score of 1200, ExtraCir = 1
print(classifier.predict([[1200, 1]]))

#--score of 1000, ExtraCir = 0
print(classifier.predict([[1000, 0]]))


The Code

import pandas as pd
df = pd.read_excel(r"C:\Users\Benjamin\Documents\logi2.xlsx")

y = df.pop('Accepted')
X = df

from sklearn.naive_bayes import MultinomialNB
classifier = MultinomialNB()
classifier.fit(X, y)

#--score of 1200, ExtraCir = 1
print(classifier.predict([[1200, 1]]))

#--score of 1000, ExtraCir = 0
print(classifier.predict([[1000, 0]]))


Python: K Means Clustering Part 2

In part 2 we are going to focus on checking our assumptions. So far we have learned how to perform a K-means cluster. When running a K-means cluster, you first have to choose how many clusters you want. But what is the optimal number of clusters? This is the "art" part of an algorithm like this.

One thing you can do is check the distance from your points to their cluster center. We can measure this using the inertia_ attribute from scikit-learn.

Let’s start by building our K Means Cluster:

Import the data

import pandas as pd

df = pd.read_excel(r"C:\Users\Benjamin\Documents\KMeans1.xlsx") # raw string keeps the backslashes literal


Drop unneeded columns

df1 = df.drop(["ID Tag", "Model", "Department"], axis = 1)


Create the model – here I set clusters to 4

from sklearn.cluster import KMeans
km = KMeans(n_clusters=4, init='k-means++', n_init=10)

Now fit the model and read the inertia_ attribute:

km.fit(df1)
print km.inertia_

The answer you get is the sum of distances from your sample points to their cluster center.

What does the number mean? Well, on its own, not much. What you need to do is look at a list of inertia_ values for a range of cluster choices.

To do so, I set up a for loop.

n = int(raw_input("Enter Starting Cluster: "))
n1 = int(raw_input("Enter Ending Cluster: "))
for i in range(n, n1):
    km = KMeans(n_clusters=i, init='k-means++', n_init=10)
    km.fit(df1)  # fit before reading inertia_
    print i, km.inertia_


The trick to reading the results is to look for the point of diminishing returns. The area I am pointing to with the arrow is where I would look – the changes in values start slowing down there.

I am using this example because I feel it is more real-world. Working with real data takes time to get a feel for. If you are having trouble seeing why I chose this point, consider the following textbook example:

See how, at the highlighted part, the drop in the number goes from hundreds down to 25? That is a diminished return: the new result is not that much better than the earlier result, as opposed to clusters 1 and 2, where 2 clusters perform 1000 units better.