R: K-Means Clustering

On December 7, 2016, December 30, 2016 By Ben Larson Ph.D. In machine learning, R

Note: This is an introductory lesson with a made up data set. After you are finished with this tutorial, if you want to see a nice real world example, head on over to Michael Grogan’s website:

http://www.michaeljgrogan.com/k-means-clustering-example-stock-returns-dividends/

K-Means Clustering will be our introduction to Unsupervised Machine Learning. What is Unsupervised Machine Learning exactly? Well, the simplest explanation I can offer is that, unlike supervised learning, where our data set contains a result, an unsupervised data set does not.

Think of a simple regression where I have the square footage and selling prices (result) of 100 houses. Taking that data, I can easily create a prediction model that will predict the selling price of a house based off of square footage. – This is supervised machine learning

Now, take a data set containing 100 houses with the following data: square footage, house style, garage/no garage, but no selling price. We can’t create a prediction model since we have no knowledge of prices, but we can group the houses together based on commonalities. These groupings (clusters) can be used to gain knowledge of your data set.

I think seeing it in action will help.

Here is the data set: cluster

The data we will be looking at is test results for 149 students.

The task at hand is to group the students into 3 groups based on the test results. Now one thing any teacher will let you know is that some kids perform well in one subject and perhaps not so well in another. So we can’t simply group them on the score performance on one test. And when you are dealing with real world data, you might be looking at 20 -100 test/quiz scores per student.

So what we are going to do is let the computer decide how to group (or cluster) them.

To do so, we are going to be using K-means clustering. K-means clustering works by choosing random points (centroids). It then groups the data points around the centroids based on which centroid each point is closest to, moves each centroid to the mean of its group, and repeats until the groups stop changing.

Let’s get started

Let’s start by loading the data

st <- read.csv(file.choose())
head(st)

our data

Now let’s run the data through the kmeans() function

First, we are only going to want to focus on columns 2 and 3 in the data set since column 1 (studentID) is basically a label and provides no value in prediction.

To do this, we subset the data: st[,2:3], which means I want all rows ([,) and columns 2-3 (2:3])

Now the code to make our clusters

stCl <- kmeans(st[, 2:3], 3, nstart = 20)
stCl

The syntax is kmeans(DATA, Number of clusters, Numbers of random starts)

I picked 3 for the number of clusters because I know this works well with the data. Picking the right number usually takes a little trial and error in real life.

Number of random starts is how many times you want the algorithm to be rerun (choosing new centroids each time). kmeans() keeps the result where the clusters are tightest.

Below is the output of our kmeans(). Note the cluster means: these tell us the mean score for TestA and TestB in each cluster.

Hey, if you are a math junkie, this may be all you want. But if you are looking for some more practical value here, let’s move on.

First, we need to add a column to our data set that shows our clusters.

Now since we read our data from a csv, it is a data frame. If you can’t remember that, you can always run the command is.data.frame(st) to test it out.

Do you remember how to add a column to a data frame?

Well, there are multiple ways, but this is, in my opinion, the easiest way.

st$cluster <- stCl$cluster

is.data.frame(st)
st$cluster <- stCl$cluster
head(st)

Here is the result

Now with the clusters, you can group your students based on their assigned cluster.

Technically we are done here. We have successfully grouped the students. But what if you want to make sure you did a good job? One quick check is to graph your work.

Before we can graph, we have to make sure our st$cluster column is set as a factor. Then, using ggplot, we can graph it. (If you don’t have ggplot2 installed, you will need to run this line: install.packages("ggplot2"))

library(ggplot2)
st$cluster <- as.factor(st$cluster)
ggplot(st, aes(TestA, TestB, color = cluster)) + geom_point()

And here is our output. The groups look pretty good.

R: Decision Trees (Regression)

On November 23, 2016 By Ben Larson Ph.D. In machine learning, R, Regression

Decision Trees are popular supervised machine learning algorithms. You will often find the abbreviation CART when reading up on decision trees. CART stands for Classification and Regression Trees.

In this example we are going to create a Regression Tree. Meaning we are going to attempt to build a model that can predict a numeric value.

We are going to start by taking a look at the data. In this example we are going to be using the Iris data set native to R. This data set

iris

In the Classification example, we tried to predict the Species of flower. In this example we are going to try to predict the Sepal.Length

In order to build our decision tree, first we need to install the correct package.

install.packages("rpart")

library(rpart)

Next we are going to create our tree. Since we want to predict Sepal.Length – that will be the first element in our fit equation.

fit <- rpart(Sepal.Length ~ Sepal.Width + Petal.Length + Petal.Width+ Species, 
 method="anova", data=iris )

Note the method in this model is anova. This means we are going to try to predict a number value. If we were doing a classifier model, the method would be class.

Now let’s plot out our model

plot(fit, uniform=TRUE,
 main="Regression Tree for Sepal Length")
text(fit, use.n=TRUE, cex = .6)

Note the splits are marked – like the top split is Petal.Length < 4.25

Also, at the terminating point of each branch, you see an n= value. The number following this is the number of elements from the data file that fit at the end of that branch.

2016-11-22_22-05-26

While this model actually works out pretty well, one thing to look for is overfitting. A good sign of that would be having a bunch of branches terminating with n values of 1 or 2. This means the model is tuned too much to the training data, and when run up against a new set of data it will most likely result in poor predictions.

Of course we can look at some of the numbers if you are so inclined.

2016-11-22_22-11-14

Notice the xerror (cross validation error) gets better with each split. That is something you want to look out for. If that number starts to creep up as the splits increase, that is a sign you may want to prune some of the branches. I will show how to do that in another lesson.

To get a better picture of the change in xerror as the splits increase, let’s look at a new visualization

par(mfrow=c(1,2)) 
rsq.rpart(fit)

This produces 2 charts. The first one shows how R-squared improves as splits increase (remember R-squared gets better as it approaches 1, so this model is improving with each split).

The second chart shows how xerror decreases with each split. For models that need pruning, you would see the curve starting to go back up as the splits increase. Imagine if split 6 was higher than split 5.

2016-11-22_22-26-51

Okay, so finally now that we know the model is good, let’s make a prediction.

testData <- data.frame(Species = 'setosa', Sepal.Width = 4, Petal.Length = 1.2,
 Petal.Width = 0.3)
# method is fixed when the tree is fit, so predict() does not need it
predict(fit, testData)

2016-11-22_22-32-30

So as you can see, based on our test data, the model predicts our Sepal.Length will be approx 5.17.

R: Decision Trees (Classification)

On November 22, 2016 By Ben Larson Ph.D. In machine learning, R

Decision Trees are popular supervised machine learning algorithms. You will often find the abbreviation CART when reading up on decision trees. CART stands for Classification and Regression Trees.

In this example we are going to create a Classification Tree. Meaning we are going to attempt to classify our data into one of the (three in this case) classes.

We are going to start by taking a look at the data. In this example we are going to be using the Iris data set native to R. This data set

iris

As you can see, our data has 5 variables – Sepal.Length, Sepal.Width, Petal.Length, Petal.Width, and Species. The first 4 variables refer to measurements of flower parts and the species identifies which species of iris this flower represents. What we are going to attempt to do here is develop a predictive model that will allow us to identify the species of iris based on measurements.

The species we are trying to predict are setosa, virginica, and versicolor. These are our three classes we are trying to classify our data as.

In order to build our decision tree, first we need to install the correct package.

install.packages("rpart")

library(rpart)

Next we are going to create our tree. Since we want to predict Species – that will be the first element in our fit equation.

fit <- rpart(Species ~ Sepal.Length + Sepal.Width + Petal.Length + Petal.Width,
 method="class", iris)

Now, let’s take a look at the tree.

plot(fit)
text(fit)

To understand what the output says: according to our model, if Petal.Length is < 2.45 then the flower is classified as setosa. If not, it goes to the next split, Petal.Width. If < 1.75 then versicolor, else virginica.

2016-11-21_21-55-29

Now, we want to take a look at how good the model is.

printcp(fit)

I am not going to harp too much on the stats here, but let’s look down at the table on the bottom. The first row has a CP = 0.50. This means (approx) that the first split reduced the relative error by 0.5. You can see this in the rel error in the second row.

Now the 2nd row CP = 0.44, so the second split improved the rel error in the third row to 0.06.

Now personally, when just trying to get a quick overview of the goodness of the model, I look at the xerror (cross validation error) of the final row. 0.10 is a nice low number.

2016-11-21_21-58-52

Okay, now let’s make a prediction. Start by creating some test data

testData <- data.frame(Sepal.Length = 1, Sepal.Width = 4, Petal.Length = 1.2,
 Petal.Width = 0.3)

Now let’s predict

predict(fit, testData, type="class")

Here is the output:

As you can see, the model predicted setosa. If you look back at the tree, you will see why.

Let’s do one more prediction

newdata<-data.frame(Sepal.Length=c(3,8,7,5),
 Sepal.Width=c(2,3,2,6),
 Petal.Length=c(5.4,3.2,4.6,5.3),
 Petal.Width=c(4,3,6,1.3))
 
predict (fit, newdata, type="class")

Here is the output

The model predicts 1,2,3 are virginica and 4 is versicolor.
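The two splits read off the tree earlier can also be written out by hand. Here is a minimal sketch in Python (not R, purely for illustration) that hard-codes the thresholds from the plotted tree (2.45 and 1.75) and applies them to the Petal.Length and Petal.Width values from testData and newdata. The function name classify_iris is made up for this example and is not part of rpart.

```python
def classify_iris(petal_length, petal_width):
    """Apply the two splits read off the plotted tree."""
    if petal_length < 2.45:
        return "setosa"
    if petal_width < 1.75:
        return "versicolor"
    return "virginica"

# (Petal.Length, Petal.Width) pairs: the testData row, then the four newdata rows
samples = [(1.2, 0.3), (5.4, 4), (3.2, 3), (4.6, 6), (5.3, 1.3)]
predictions = [classify_iris(pl, pw) for pl, pw in samples]
print(predictions)
```

The sepal measurements never appear in the function, which is why the odd-looking Sepal values in the test rows make no difference to the result.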

Now go find some more data and try this out.

R: Simple Linear Regression

On June 7, 2016 By Ben Larson Ph.D. In machine learning, R, Regression

Linear Regression is a very popular prediction method and most likely the first predictive algorithm most people learn. To put it simply, in linear regression you try to place a line of best fit through a data set and then use that line to predict new data points.

linear1

If you are new to linear regression or are in need of a refresher, check out my lesson on Linear Regression in Excel where I go much deeper into the mechanics: Linear Regression using Excel

Get the Data

You can download our data set for this lesson here: linear1

Let’s upload our file into R

df <- read.csv(file.choose())
head(df)

linReg1

Now our data file contains a listing of Years a person has worked for company A and their Salary.

Check for linear relationship

With a 2 variable data set, often it is quickest just to graph the data to check for a possible linear relationship.

#plot data
attach(df)
plot(Years, Salary)

Looking at the plot, there definitely appears to be a linear relationship. I can easily see where I could draw a line through the data.

An even better way to do it is to check for correlation. Remember, the closer the value is to 1 (or to -1 for a negative relationship), the stronger the linear correlation found in the data.

#check for correlation
cor(Years, Salary)

Since our correlation is so high, I think it is a good idea to perform a linear regression.
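If you are curious what cor() is computing, here is a rough sketch of the Pearson correlation formula in plain Python (not R, purely for illustration). The years and salary values are invented for the example and are not the lesson’s data file, and pearson_r is a made-up helper name.

```python
import math

def pearson_r(xs, ys):
    """Pearson correlation: covariance scaled by both spreads."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / math.sqrt(var_x * var_y)

# Hypothetical values on a perfectly straight line
years = [1, 3, 5, 7, 9]
salary = [45000, 48500, 52000, 55500, 59000]
r = pearson_r(years, salary)
print(round(r, 4))
```

Because the made-up points sit exactly on a line, the result is 1; real data like ours will land somewhere below that.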

Linear Regression in R

A linear regression in R is pretty simple. The syntax is lm(y ~ x, data)

#perform linear regression
fit <- lm(Salary~Years, data= df)
summary(fit)

Now let’s take a second to break down the output.

The red box shows my P values. I want to make sure they are under my threshold (usually 0.05). This becomes more important in multiple regression.

The orange box shows my R-squared values. Since this is a simple regression, both of these numbers are pretty much the same, and it really doesn’t matter which one you look at. What these numbers tell me is how well my prediction line fits the data. A good way to look at them for a beginner is as percentages: in our example, the model explains about 75-76% of the variation in Salary.

Finally, the blue box shows your coefficients. You can use these numbers to create your predictive model. Remember the linear equation: Y = mX + b? Well using your coefficients here our equation now reads Y = 1720.7X + 43309.7
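To see the equation at work, here is a small Python sketch (not R, purely for illustration) that plugs a value of Years into Y = 1720.7X + 43309.7. It uses the rounded coefficients quoted in this lesson, so R’s own predictions, which use the unrounded values, may differ slightly. The function name predict_salary is hypothetical.

```python
SLOPE = 1720.7       # coefficient for Years, as rounded in the lesson
INTERCEPT = 43309.7  # intercept, as rounded in the lesson

def predict_salary(years):
    """Y = mX + b with the lesson's coefficients."""
    return SLOPE * years + INTERCEPT

salary_at_40 = predict_salary(40)
print(round(salary_at_40, 1))
```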

Predictions

You can use fitted() to show you how your model would predict your existing data

fitted(fit)

You can also use the predict command to try a new value

predict(fit, newdata =data.frame(Years= 40))

linReg7

Let’s graph our regression now.

plot(Years, Salary)

abline(fit, col = 'red')

linReg8

The Residuals Plot

I am not going to go too deep into the weeds here, but I want to show you something cool.

# c(1,2,3,4) gives us 4 graphs on the page; 2,2 arranges them in a 2x2 grid
layout(matrix(c(1,2,3,4),2,2))
plot(fit)

I promise to go more into this in a later lesson, but for now, I just want you to note the numbers you see popping up inside the graphs (38, 18, 9). These are the rows R flags as the most extreme points, our likely outliers. One of the biggest problems with any linear system is that it is easily thrown off by outliers. So you need to know where your outliers are.

If you look at the points listed in your graphs in your data, you will see why these are outliers. Now while this doesn’t tell you what to do about your outliers, that decision has to come from you, it is a great way of finding them quickly.

The Code

# upload file
df <- read.csv(file.choose())
head(df)

#plot data
attach(df)
plot(Years, Salary)

#check for correlation
cor(Years, Salary)

#perform linear regression
fit <- lm(Salary~Years, data= df)
summary(fit)

#see predictions
fitted(fit)

predict(fit, newdata =data.frame(Years= 40))

#plot regression line 
plot(Years, Salary)

abline(fit, col = 'red')

layout(matrix(c(1,2,3,4),2,2)) 
plot(fit)

df

Python: Naive Bayes’

On June 7, 2016 By Ben Larson Ph.D. In machine learning, Python

Naive Bayes’ is a supervised machine learning classification algorithm based off of Bayes’ Theorem. If you don’t remember Bayes’ Theorem, here it is:

bayes

Seriously though, if you need a refresher, I have a lesson on it here: Bayes’ Theorem

The naive part comes from the idea that the probability of each column is computed alone. They are “naive” to what the other columns contain.
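To make the “naive” idea concrete, here is a rough sketch in plain Python. It is a simple counting version for categorical columns, not the MultinomialNB algorithm used later in this lesson, and it skips smoothing. The training rows (a score band plus an extracurricular flag) are invented for illustration.

```python
from collections import Counter, defaultdict

# Hypothetical training rows: ((score_band, extra_curricular), accepted)
rows = [
    (("high", 1), 1), (("high", 0), 1), (("high", 1), 1),
    (("low", 1), 0), (("low", 0), 0), (("low", 0), 0),
    (("high", 0), 0), (("low", 1), 1),
]

class_counts = Counter(label for _, label in rows)
# feature_counts[label][column][value] = how often value appears for that label
feature_counts = defaultdict(lambda: defaultdict(Counter))
for features, label in rows:
    for col, value in enumerate(features):
        feature_counts[label][col][value] += 1

def posterior(features):
    """Return normalized P(label | features) under the naive assumption."""
    scores = {}
    for label, count in class_counts.items():
        score = count / len(rows)  # prior P(label)
        for col, value in enumerate(features):
            # each column contributes its own P(value | label), on its own
            score *= feature_counts[label][col][value] / count
        scores[label] = score
    total = sum(scores.values())
    return {label: s / total for label, s in scores.items()}

result = posterior(("high", 1))
print(result)
```

Each column’s probability is looked up separately and the results are simply multiplied together; no column ever looks at what another column contains.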

You can download the data file here: logi2

Import the Data

import pandas as pd
df = pd.read_excel(r"C:\Users\Benjamin\Documents\logi2.xlsx")  # raw string so \U is not read as an escape
df.head()

Let’s look at the data. We have 3 columns – Score, ExtraCir, Accepted. These represent:

Score – Student Test Score
ExtraCir – Was Student in an Extracurricular Activity
Accepted – Was the Student Accepted

Now the Accepted column is our result column – or the column we are trying to predict. Having a result in your data set makes this a supervised machine learning algorithm.

Split the Data

Next split the data into input(score and extracir) and results (accepted).

y = df.pop('Accepted')
X = df

y.head()

X.head()

Fit Naive Bayes

Lucky for us, scikit-learn has a built-in Naive Bayes algorithm (MultinomialNB).

Import MultinomialNB and fit our split columns to it (X,y)

from sklearn.naive_bayes import MultinomialNB
classifier = MultinomialNB()
classifier.fit(X,y)

Run some predictions

Let’s run the predictions below. The results show 1 (Accepted) or 0 (Not Accepted)

#--score of 1200, ExtraCir = 1 (predict expects a 2D list: one inner list per row)
print(classifier.predict([[1200,1]]))

#--score of 1000, ExtraCir = 0
print(classifier.predict([[1000,0]]))

nb3

The Code

import pandas as pd
df = pd.read_excel(r"C:\Users\Benjamin\Documents\logi2.xlsx")
df.head()

y = df.pop('Accepted')
X = df

y.head()
X.head()

from sklearn.naive_bayes import MultinomialNB
classifier = MultinomialNB()
classifier.fit(X,y)

#--score of 1200, ExtraCir = 1
print(classifier.predict([[1200,1]]))

#--score of 1000, ExtraCir = 0
print(classifier.predict([[1000,0]]))

Python: K Means Clustering Part 2

On June 4, 2016 By Ben Larson Ph.D. In machine learning, Python

In part 2 we are going to focus on checking our assumptions. So far we have learned how to perform a K Means Cluster. When running a K Means Cluster, you first have to choose how many clusters you want. But what is the optimal number of clusters? This is the “art” part of an algorithm like this.

One thing you can do is check the distance from your points to the cluster center. We can measure this using the inertia_ attribute from scikit-learn.

Let’s start by building our K Means Cluster:

Import the data

import pandas as pd

df = pd.read_excel(r"C:\Users\Benjamin\Documents\KMeans1.xlsx")
df.head()

kmeans1

Drop unneeded columns

df1 = df.drop(["ID Tag", "Model", "Department"], axis = 1)
df1.head()

kmeans2

Create the model – here I set clusters to 4

from sklearn.cluster import KMeans
km = KMeans(n_clusters=4, init='k-means++', n_init=10)

Now fit the model and read the inertia_ attribute

km.fit(df1)
km.inertia_

Now the answer you get is the sum of squared distances from your sample points to their nearest cluster center.

What does the number mean? Well, on its own, not much. What you need to do is look at a list of inertia_ values for a range of cluster choices.

To do so, I set up a for loop.

# Python 3: input() replaces raw_input(), and print is a function
n = int(input("Enter Starting Cluster: "))
n1 = int(input("Enter Ending Cluster: "))
for i in range(n, n1):  # range() stops before n1
    km = KMeans(n_clusters=i, init='k-means++', n_init=10)
    km.fit(df1)
    print(i, km.inertia_)

The trick to reading the results is look for the point of diminishing returns. The area I am pointing to with the arrow is where I would look. The changes in values start slowing down here.

I am using this example because I feel it is more real world. Working with real data takes time to get a feel for. If you are having trouble seeing why I chose this point, consider the following textbook example:

See how at the highlighted part, the drop in number goes from hundreds to 25. That is a diminished return. The new result is not that much better than the earlier result. As opposed to 1 and 2 where 2 clusters perform 1000 units better.
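If you would rather not eyeball the list, the same reasoning can be sketched in a few lines of plain Python. The inertia values below are invented to mimic the textbook pattern (they are not output from this lesson’s data), and the 10% cutoff is only a rule of thumb I am assuming for the example, not a standard setting.

```python
# Hypothetical inertia values for k = 1..6
inertias = {1: 2400.0, 2: 1400.0, 3: 700.0, 4: 300.0, 5: 275.0, 6: 260.0}

ks = sorted(inertias)
# Improvement gained by going from k-1 clusters to k clusters
drops = {k: inertias[prev] - inertias[k] for prev, k in zip(ks, ks[1:])}

for k in ks[1:]:
    print(k, drops[k])

# Assumed rule of thumb: keep the last k whose drop is still
# at least 10% of the very first drop
threshold = 0.10 * drops[ks[1]]
elbow = max(k for k in ks[1:] if drops[k] >= threshold)
print("suggested k:", elbow)
```

With these numbers the drops run 1000, 700, 400, 25, 15, so the sketch settles on 4 clusters, the last step before the improvement falls off a cliff.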

Python: K Means Cluster

On May 21, 2016 By Ben Larson Ph.D. In machine learning, Python

I think seeing it in action will help.

If you want to play along, download the data set here: KMeans1

The data set contains a 1 year repair history of 197 Ultrasound medical devices.

Data dictionary (ID Tag – asset number assigned device, Model – model name of device, WO Count – count of repair work orders, AVG Labor – average labor minutes per repair, Labor Cost – average labor cost per repair, No Problem- count of repairs where no problem was found, Avg Cost -average cost of parts, Travel – average travel hours per repair, Travel Cost – average travel cost per repair, Department – department that owns the ultrasound device)

kmeans

We want to see what kind of information we can extract from this data.

To do so, we are going to use K Means Clustering.

How does K Means Clustering work? Each row in the table is converted to a vector. Imagine the vectors now graphed in N-dimension space. Next pick the number of clusters you want to create. For each cluster, you will place a point(a centroid) in space and the vectors are grouped based on their proximity to their nearest centroid.

Proximity is measured with straight-line (Euclidean) distance, and each centroid is then moved to the arithmetic mean of the vectors assigned to it, hence the name K-Means Cluster

(each dot below is a row in your table, the colors represent a cluster)

kmeans2
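Before handing the work to a library, it may help to see the loop spelled out. This is a minimal sketch in plain Python, assuming Euclidean distance and arithmetic-mean centroids (the standard k-means setup); it is not scikit-learn’s implementation, and the six 2-D points are made up for illustration.

```python
import math
import random

def kmeans(points, k, iterations=20, seed=0):
    """Plain k-means: assign each point to its nearest centroid, then move
    each centroid to the arithmetic mean of its assigned points."""
    rng = random.Random(seed)
    centroids = rng.sample(points, k)  # random starting centroids
    labels = [0] * len(points)
    for _ in range(iterations):
        # Assignment step: nearest centroid by Euclidean distance
        labels = [
            min(range(k), key=lambda c: math.dist(p, centroids[c]))
            for p in points
        ]
        # Update step: centroid = mean of the points assigned to it
        for c in range(k):
            members = [p for p, lab in zip(points, labels) if lab == c]
            if members:  # keep the old centroid if a cluster is empty
                centroids[c] = tuple(sum(dim) / len(members) for dim in zip(*members))
    return labels, centroids

# Hypothetical 2-D rows forming two obvious groups
points = [(1.0, 1.0), (1.5, 2.0), (1.0, 1.5), (8.0, 8.0), (9.0, 8.5), (8.5, 9.0)]
labels, centroids = kmeans(points, k=2)
print(labels)
```

The cluster numbers themselves (0 or 1) are arbitrary; what matters is that the first three points share one label and the last three share the other.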

Let’s do it in Python

Import the data.

import pandas as pd

df = pd.read_excel(r"C:\Users\Benjamin\Documents\KMeans1.xlsx")
df.head()

kmeans1

Now, we are going to drop a few columns. ID Tag is a random number and has no value in clustering. Model and Department are text, and while there are ways to work with text, it is more complicated, so for now we are just going to drop the columns

df1 = df.drop(["ID Tag", "Model", "Department"], axis = 1)
df1.head()

kmeans2

Now lets import KMeans from sklearn.cluster

We then initialize KMeans (n_clusters=4: number of clusters you want; init='k-means++': sets how the centroids are placed, and k-means++ is one of the faster methods of centroid placement; n_init=10: number of times the algorithm will run, placing new centroids each time)

from sklearn.cluster import KMeans
km = KMeans(n_clusters=4, init='k-means++', n_init=10)

Choosing number of clusters is a bit of an art. Play with it a bit and see how different values play out for you.

Now fit the model

km.fit(df1)

Now, export the cluster identifiers to a list. Notice my values are 0-3. One value for each cluster.

x = km.fit_predict(df1)
x

Create a new column on the original dataframe called Cluster and place your results (x) in that column

df["Cluster"]= x
df.head()

Sort your dataframe by cluster

df1 = df.sort_values('Cluster')  # df.sort() was removed from newer pandas versions
df1

Now as you start to examine the data in each cluster, you should start to see patterns emerge.

Below is an example of the patterns I found in the clusters.

Now remember, this is just an INTRODUCTION to unsupervised learning. We will learn more tricks to help you discover the patterns as we move forward.

Python: K Nearest Neighbor

On May 20, 2016 By Ben Larson Ph.D. In machine learning, Python

K Nearest Neighbor (Knn) is a classification algorithm. It falls under the category of supervised machine learning. It is supervised machine learning because the data set we are using to “train” with contains results (outcomes). It is easier to show you what I mean.

Here is our training set: logi

Let’s import our set into Python

This data set contains 42 student test scores (Score) and whether or not they were accepted (Accepted) in a college program. It is the presence of the Accepted column that makes supervised machine learning possible. Knowing the outcomes of past events, we can create a prediction model for future events. So you could use the finished model to predict whether someone will be accepted based on their test score.

So how does Knn work?

Look at the chart below. Imagine this represents our data set. Each blue dot is accepted (1) while each red dot is not(0).

knn1

What if I want to know about my new data point (green star)? Is it a 1 or a 0?

knn2

I start by choosing a neighbor count – in this example I will choose 3, and I find the 3 nearest neighbors to my new point.

Let’s look at the results, I have 2 red(0) and 1 blue(1). Using basic probability, I am 67% (2/3) certain that you will not get in.
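The neighbor vote is simple enough to sketch by hand. The plain Python below is only an illustration of the idea, not the scikit-learn classifier used later in this lesson; the scores are invented and knn_predict is a made-up helper name.

```python
from collections import Counter

def knn_predict(train, new_score, k=3):
    """train is a list of (score, accepted) pairs; returns (label, share of votes)."""
    # Sort training rows by distance to the new score and keep the k closest
    nearest = sorted(train, key=lambda row: abs(row[0] - new_score))[:k]
    votes = Counter(label for _, label in nearest)
    label, count = votes.most_common(1)[0]
    return label, count / k

# Hypothetical scores, not the lesson's logi file
train = [(700, 0), (900, 0), (1000, 0), (1150, 1), (1300, 1), (1500, 1)]
label, share = knn_predict(train, new_score=1020, k=3)
print(label, round(share, 2))
```

Here the three closest scores to 1020 are 1000, 900 and 1150, so the vote is two 0s to one 1, the same 2/3 split as in the picture.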

Now, let’s code it!

First we need to separate our data into 2 dataframes: Our training set X (Score) and our target set y (Accepted)

df.pop() removes the Accepted column from your dataframe and places it in a newly created one.

Import sklearn

sklearn is a massive library of machine learning algorithms available for Python. Today we are going to use KNeighborsClassifier

So below I imported KNeighborsClassifier from sklearn.neighbors

Next I set my neighbor count to 5. You can experiment with other numbers and see how it works out for you. Setting the neighbor count is something you kind of have to develop a feel for.

Now let’s fit the model with our training set(X) and target set(y)

knn7

Now we can use our model to make predictions.

ne.predict() will return 1 or 0 – (Accepted or Not)

while ne.predict_proba() will return the probability of each class. Results below read as (40% chance of Not Accepted(0), 60% chance of Accepted(1))

So there you go, you have now built a prediction model using K Nearest Neighbor.

Logistic Regression with Gretl

On April 6, 2016 By Ben Larson Ph.D. In machine learning, Regression

One of the most popular machine learning algorithms, Logistic Regression is actually a classification algorithm. Broken down to its simplest terms, binary logistic regression (the one we will be focusing on here) is answering a yes or no question. Will the customer buy or not? Is the email SPAM or not?

Score Accepted
982 0
1304 1
1256 1
1562 1
703 0

Above is a small sample from the data set we will be using for this lesson. In this set, student scores for an entrance test are listed in the first column and whether they were Accepted (1) or Not(0) is in the second column.

Download sample Excel file here: logi

I ran a scatter plot on the data with Scores on the X axis. As you can see, the dots form 2 horizontal lines at 1 and 0. You may notice that the 1 (Accepted) dots seem to cluster towards higher scores and 0 (Not Accepted) dots cluster towards lower scores.

logi

Well, since the point of Logistic Regression is to help us make predictions, here is how the predictions work. The Logistic Regression, represented by my crudely drawn red S, goes from 1 to 0. And just like with Linear Regression, if we take a value for X, to make our prediction, we look for the value of Y on the line at that point.

logi1

In the case of a 1200 score, if we check the value of Y on the line, we get .80. This roughly translates to mean, that with a score of 1200, a student has an 80% chance of being accepted.

Let’s meet Gretl

While there are third party add-ons you can download for Excel that will do Logistic Regression, in its native form, Excel does not do a good job in this area. So I thought this would be a great opportunity to introduce you to a neat piece of FREE software called Gretl.

Here is the website to download Gretl: Gretl Download

So why Gretl? Why not R or Python? I mean those are the languages real data scientists use right?

That is true, and R and Python can easily do a Logistic Regression. The problem is, however, that in order to use R and Python, you need to know how to program. Gretl, on the other hand, is GUI based. Think of it as a point and click lightweight R. It is nowhere near as robust as R, but for learning how to do Logistic Regression, Gretl does a fine job.

Loading in the Data

After you install and start Gretl, the next step is to load in the data. Go to File>Open Data>User File. Search for the Excel file you downloaded previously in this lesson. Make sure you then select Excel from the file type at the bottom of the screen.

logi2

Select logi.xlsx. Leave the Start Import window at 1 and 1. This is where the data starts in our Excel file: first column, first row. You will get a message letting you know how much data was imported.

The next pop up will note that the data is undated. Click No on this window.

logi3

Your data columns (Score, Passed) will appear in the Gretl window. If you click on one, the data from that column will appear in a pop-up window. **Note: in the file you download, column 2 will be Accepted, not Passed.

Let’s Model

Without further ado, let us do some modeling. From the menu bar Model>Limited dependent variable>Logit>Binary…

logi4

Now you have to select your Dependent variable and Regressors. Here is a hint: the dependent variable is what we want to find. What are we looking for? Will the person be Accepted? So Accepted goes in Dependent variable and Score goes in Regressors. Pick the Show p-values radio button and then click Okay.

logi5

Below are the results of your Logistic Regression model

I am not going to give a Stats lesson here, but I will cover the important points.

logi6

The top red box contains some important information. First, the coefficients represent the m and b values from the linear equation we will be using later: y = mX + b, which here is y = 0.0105216X - 11.2757
The p-value of Score = 0.0009. This is important, as the p-value indicates how likely it is that we would see an effect this strong if the regressor variable had no real relationship with the dependent variable. The most common p-value threshold you are likely to come across is 0.05. If your regressor variable has a p-value above 0.05, you will want to reconsider your model.
The matrix at the bottom of the screen shows you how successfully your model predicted outcomes from the training data set. It translates to: of the 0’s (not accepted) the model got 19 out of 21 right. For 1’s (accepted) the model got 19 out of 21 right. That is a 90% success rate. Not bad.
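The 90% figure is just the correct predictions divided by all predictions. A quick Python check, using only the counts quoted above (the variable names are my own):

```python
# Counts read from the lesson's results: 19 of 21 correct in each class
correct_not_accepted, total_not_accepted = 19, 21
correct_accepted, total_accepted = 19, 21

accuracy = (correct_not_accepted + correct_accepted) / (total_not_accepted + total_accepted)
print(round(accuracy, 3))
```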

Let’s Use the Model

Okay, so maybe you jumped ahead and tried 1200 in the linear formula we developed above. It is 1.325?? How is that? Isn’t this supposed to be between 0 and 1.

Well the problem is, we are not looking for Y we are looking for probability (p). Y in this case is not the Y intercept, but instead:

logis1

Well, we know Y = 1.325 for a score of 1200, how do we find p from that? We solve for p. Now feel free to go and do the math yourself if you want, but I already did the work for you. The equation below solves for p. If you don’t trust me and want to do it yourself, be my guest, but I assure you the equation below is good.

logis2

Let’s Make a Prediction

Let us put the formulas we have found into Excel

logi8

Now you have a working prediction model. Any value you place in the score cell will be calculated to Y and p (probability). As the example above shows, a score of 1200 gives us a probability of .79.
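If you prefer code to a spreadsheet, the same two steps can be sketched in Python. I am assuming the standard logit relationship here, Y = ln(p / (1 - p)), which solves to p = 1 / (1 + e^(-Y)); the coefficients are the ones from the Gretl output, and the function name acceptance_probability is hypothetical.

```python
import math

SLOPE = 0.0105216     # coefficient for Score from the Gretl output
INTERCEPT = -11.2757  # constant from the Gretl output

def acceptance_probability(score):
    """Return (Y, p): the linear value (log-odds) and the probability."""
    y = SLOPE * score + INTERCEPT          # Y = mX + b
    return y, 1.0 / (1.0 + math.exp(-y))   # solved for p

y, p = acceptance_probability(1200)
print(round(y, 3), round(p, 2))
```

With the full-precision coefficient, Y comes out near 1.35 instead of the 1.325 quoted earlier (which is what you get if the slope is rounded to 0.0105). Either way, p rounds to .79.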

Turns out my crummy drawing wasn’t so bad after all.

logi1

Linear Regression using Excel

On March 31, 2016, April 2, 2016 By Ben Larson Ph.D. In excel, machine learning

Link to video on Linear Regression using Excel

Regression Analysis is still the most popular method used in Predictive Analytics. The main reason is that it works. It is well known and understood. With its different flavors, regression analysis covers a wide swath of problems. Another great reason to use it is that regression tools are easy to find.

Today we are going to use Excel to tackle a simple regression problem. I have uploaded a spreadsheet to this page. If you would like to follow along with the exercise, please download it from the link below:

Excel File Download: Linear Regression Example File 1

What is Linear Regression?

Linear Regression is a method of statistical modeling where the value of a dependent variable can be calculated based on the value of one or more independent variables. The general idea, as seen in the picture below, is finding a line of best fit through the data. Using that line, you can then predict the value of Y given X.

linear1

I am not going to go too deep into the math here. I highly recommend the Khan Academy video posted below if you are looking to brush up on your statistics.

Khan Academy – Linear Regression

Let’s Start by Looking at the Data

If you download the Excel file at the top of the page, you will find 2 columns labeled Years and Salary. This example data set shows us the years of service and salary of 39 employees for an imaginary company.

linear2

What we are going to attempt to do is to develop a model using Linear Regression that will allow us to predict the salary of an employee given their years of service.

Step 1: Build a Scatter Plot

The first thing we want to do is build a scatter plot. Excel makes this simple enough. Just highlight all of your data > select the Insert Tab from the Ribbon > Select Scatter from Charts:

What you will get should look something like this:

linear4

We have a scatter chart with Salary on the Y Axis and Years on the X Axis. **Excel scatter charts set the left most column of the data set to the X Axis by default.

Before we move on, I want to take a moment to look at the scatter plot. Do you see a pattern? Can you see where you might be able to draw a line through the data?

I am not trying to just fill space here. I am asking a serious question, because the answer is that sometimes you will not see a pattern. Sometimes the scattering of data will be so random that there will be no need to go forward with a linear regression. Learning to look for patterns in data visualizations is a skill worth developing.

linear5

In this example there is a general pattern, or more accurately, we see what looks like Positive Correlation. We call it positive because it appears that as X increases so does Y. So now that our scatter chart has passed the visual test, it is time to perform our regression.
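If you want a number to back up the visual test, the usual choice is the Pearson correlation coefficient, which runs from -1 (perfect negative correlation) to +1 (perfect positive correlation). Here is a minimal sketch in plain Python. The years and salary values are made up for illustration and are not the ones in the spreadsheet, and pearson_r is just a helper name I picked.

```python
from math import sqrt


def pearson_r(xs, ys):
    """Pearson correlation coefficient between two equal-length sequences."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    # Co-variation of X and Y around their means
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    # Variation of each variable around its own mean
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / sqrt(var_x * var_y)


# Hypothetical values, not taken from the downloadable spreadsheet.
years = [1, 3, 5, 7, 9, 11]
salary = [52000, 51000, 60000, 58000, 66000, 63000]

r = pearson_r(years, salary)
print(f"r = {r:.3f}")  # a value above 0 indicates positive correlation
```

A value close to 0 would be the numeric version of "I can't see a line in this scatter plot."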

Trend Line

Performing a simple linear regression in Excel is ridiculously easy. Simply click on your scatter plot > from the Ribbon select Chart Tools – Design > Add Chart Element > Trendline > Linear

Your trendline appears on your chart. I personally find the line a little hard to see as is, so I am going to format it a bit.

linear7

Start by double clicking on the trendline and the Format Trendline window will open on the right.

I made the following changes:

Line: — Color: Red — Width: 3pt — Dash type: Solid Line

Trendline Options — Select Display Equation on chart and Display R-squared value on chart

linear8

Alright, that line is much easier to read. Now let us talk about the numbers in the circle. Now I know I said I was not going to get too deep into the math, but I feel I can’t do this subject justice without at least a cursory explanation of what is going on.

linear9

What exactly did Excel do when it added the trendline? Technically it performed a statistical method known as Ordinary Least Squares. What does that mean? Well, if you wanted to attempt this by hand, one approach you could take would be to start by drawing a line that looked best to you. You would then measure the Residuals (the vertical distances between the actual data points and the line you drew).

linear10

You then repeat the process (picking a new line and measuring residuals) until you find the line that results in the lowest overall residual, which for Ordinary Least Squares means the smallest sum of squared residuals. Once you have it, you get the equation for your line: y = 1357.9x + 50974. (Luckily for us, Excel makes the process a lot easier.)
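To make the "try a line, measure the residuals" idea concrete, here is a rough sketch in plain Python. It scores two lines picked by eye against the least squares line, which for a single X variable can be computed directly from the means instead of by trial and error. The data values, the two guessed lines, and the function names are all assumptions for illustration; they are not from the spreadsheet.

```python
def sum_squared_residuals(xs, ys, slope, intercept):
    """Sum of squared vertical distances between the points and the line."""
    return sum((y - (slope * x + intercept)) ** 2 for x, y in zip(xs, ys))


def least_squares_line(xs, ys):
    """Closed-form slope and intercept that minimize the squared residuals."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    slope = (sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
             / sum((x - mean_x) ** 2 for x in xs))
    intercept = mean_y - slope * mean_x
    return slope, intercept


# Hypothetical values, not taken from the downloadable spreadsheet.
years = [1, 3, 5, 7, 9, 11]
salary = [52000, 51000, 60000, 58000, 66000, 63000]

# Two lines "drawn by eye", followed by the least squares line.
guesses = [(1000.0, 52000.0), (1800.0, 48000.0)]
best = least_squares_line(years, salary)

for slope, intercept in guesses + [best]:
    ssr = sum_squared_residuals(years, salary, slope, intercept)
    print(f"y = {slope:.1f}x + {intercept:.0f} -> sum of squared residuals = {ssr:,.0f}")
```

Whatever lines you guess, the last line printed should always have the smallest sum of squared residuals.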

Now a quick refresher on the line formula: Y = mX + b (where m = Slope and b = Y-Intercept). This equation is what you would use to make predictions. In our equation, a person with 0 years of service would have a salary of 50974: Y = 1357.9(0) + 50974, so Y = 50974. Each additional year of service adds 1357.90 to the predicted salary.
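The same arithmetic can be wrapped in a small function. This sketch uses the slope and intercept from the trendline equation above. The function name predict_salary and the sample inputs of 0, 5 and 10 years are my own choices for illustration.

```python
# Coefficients read off the trendline equation shown on the chart.
SLOPE = 1357.9      # salary added per year of service (m)
INTERCEPT = 50974   # predicted salary at 0 years of service (b)


def predict_salary(years_of_service):
    """Apply Y = mX + b to a number of years of service."""
    return SLOPE * years_of_service + INTERCEPT


for years in (0, 5, 10):
    print(f"{years:>2} years -> predicted salary {predict_salary(years):,.2f}")
```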

Before we start using our equation to make predictions, we still need to discuss the R² you see below your line equation. I won’t bore you with how R² is calculated. You don’t really need to know how it is calculated to use linear regression, but you do need to know how to read it.

The simplest explanation I can give you for R² is that a value of 1 means a perfect fit: every point in your data falls on your line. A value of 0, on the other hand, means your line explains none of the variation in Y. Our R² is 0.4423, meaning the line accounts for about 44% of the variation in salary, which really is not that great. I generally prefer to aim for an R² value above 0.6.
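For the curious, R² is not hard to compute: it is 1 minus the ratio of the squared residuals around the line to the squared deviations around the mean of Y. Below is a minimal sketch in plain Python. The data values are hypothetical (not the 39 employees from the spreadsheet), and fit_line and r_squared are names I made up for this example.

```python
def fit_line(xs, ys):
    """Least squares slope and intercept for a single X variable."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    slope = (sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
             / sum((x - mean_x) ** 2 for x in xs))
    return slope, mean_y - slope * mean_x


def r_squared(xs, ys, slope, intercept):
    """Share of the variation in Y that the line accounts for."""
    mean_y = sum(ys) / len(ys)
    # Variation left over after using the line
    ss_res = sum((y - (slope * x + intercept)) ** 2 for x, y in zip(xs, ys))
    # Total variation of Y around its mean
    ss_tot = sum((y - mean_y) ** 2 for y in ys)
    return 1 - ss_res / ss_tot


# Hypothetical values, not taken from the downloadable spreadsheet.
years = [1, 3, 5, 7, 9, 11]
salary = [52000, 51000, 60000, 58000, 66000, 63000]

slope, intercept = fit_line(years, salary)
r2 = r_squared(years, salary, slope, intercept)
print(f"R squared = {r2:.4f}")
```

If every point sat exactly on the line, the leftover variation would be 0 and R² would come out as 1.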

How can we improve our R² value? My preference would be to get more data. We currently only have 39 rows, one per employee. More data could improve our accuracy. If more data is not available, though, you can look at your outliers, as Linear Regression can be greatly affected by them. Unfortunately, outliers are often tricky to deal with. A person with 1 year of service making 100,000 a year would definitely be an outlier, but it is not an impossibility. If this employee is a highly experienced individual who just transferred from another company, it is totally feasible they could be earning 100,000.
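Here is a rough sketch of how much a single point like that can move the fit. It fits a least squares line twice, once without and once with a 1-year, 100,000 employee added. The six base values are hypothetical and small on purpose, so the effect is exaggerated compared to what you would see with 39 rows. The function name fit_and_score is my own.

```python
def fit_and_score(xs, ys):
    """Return (slope, intercept, r_squared) for a least squares line."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    slope = (sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
             / sum((x - mean_x) ** 2 for x in xs))
    intercept = mean_y - slope * mean_x
    ss_res = sum((y - (slope * x + intercept)) ** 2 for x, y in zip(xs, ys))
    ss_tot = sum((y - mean_y) ** 2 for y in ys)
    return slope, intercept, 1 - ss_res / ss_tot


# Hypothetical values, not taken from the downloadable spreadsheet.
years = [1, 3, 5, 7, 9, 11]
salary = [52000, 51000, 60000, 58000, 66000, 63000]

# The same data plus one outlier: 1 year of service, 100,000 salary.
years_out = years + [1]
salary_out = salary + [100000]

slope_without, _, r2_without = fit_and_score(years, salary)
slope_with, _, r2_with = fit_and_score(years_out, salary_out)

print(f"without outlier: slope = {slope_without:.1f}, R squared = {r2_without:.4f}")
print(f"with outlier:    slope = {slope_with:.1f}, R squared = {r2_with:.4f}")
```

Running both fits side by side like this is a quick way to judge how sensitive your model is to a point before you decide whether to keep it.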

The hard truth is, considering only the data we have, we cannot rightfully develop a reliable model. This happens more often than you might think. That is okay though, we will chalk this up as a learning experience and move on.

