Statistics: Range, Variance, and Standard Deviation

Measuring spread is a useful tool when analyzing a set of numbers. Three common measures of spread are range, variance, and standard deviation.

Here is the data set we will be working with: [2,4,6,7,8,10,15,18,22,26]

Range

Range is the simplest of the three measures. To find the range, all you need to do is subtract the smallest number in the set from the largest number.

range = largest - smallest

range = 26-2 = 24
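If you want to check this on a computer, here is a minimal sketch in R (the variable names x and data_range are just my own picks for illustration):

```r
# Data set from the top of this post
x <- c(2, 4, 6, 7, 8, 10, 15, 18, 22, 26)

# Largest value minus smallest value
data_range <- max(x) - min(x)
print(data_range)      # 24

# range() returns c(min, max), so diff() of it gives the same spread
print(diff(range(x)))  # 24
```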

Variance

Variance is the average of the squared differences between each value in the set and the mean. We square the differences so that values above and below the mean do not cancel each other out.

Let’s find the mean. Add up all 10 values (the total is 118) and divide by 10, which gives 11.8:

statSpread

If you haven’t seen the x with the line over it before, this is referred to as x bar and it is used to represent the mean.

To find the variance, you take the first number in your set, subtract the mean, and square the result. You repeat that for each number in your list. Finally, you add up all the results and divide by n (the number of items in your list).

ex: ((2-11.8)^2 + (4-11.8)^2 + ... + (26-11.8)^2) / 10

statSpread1

variance = 58.56
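The steps above can be sketched in R as follows. This is just one way to write it, and the names x_bar and pop_variance are my own:

```r
x <- c(2, 4, 6, 7, 8, 10, 15, 18, 22, 26)

n <- length(x)      # number of items in the list
x_bar <- mean(x)    # the mean, 11.8

# Subtract the mean from each value, square, add up, divide by n
pop_variance <- sum((x - x_bar)^2) / n
print(pop_variance) # 58.56
```

Note that this divides by n on purpose, so it is the population version of the variance.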

Now I know what you are thinking: how can the average distance from the mean be 58.56 when the furthest point from the mean (26) is only 14.2 away? This is because we are squaring the differences, so the variance is measured in squared units. To get a number more in line with the data set, we have another measure called the standard deviation.

Standard Deviation

The standard deviation returns a value more in line with what you would expect based on your data. To find the standard deviation, simply take the square root of the variance.

std dev = sqrt(58.56) = 7.65
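Here is a small R sketch of the same calculation, again dividing by n (the names pop_variance and pop_sd are my own):

```r
x <- c(2, 4, 6, 7, 8, 10, 15, 18, 22, 26)

# Population variance (divide by n), then take the square root
pop_variance <- sum((x - mean(x))^2) / length(x)
pop_sd <- sqrt(pop_variance)

print(round(pop_sd, 2))  # 7.65
```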

Population vs Sample

The equations above work great if you have the entire population. What I mean by that is your data contains every value in the group you care about. Using our data, imagine the numbers were the ages of children in a large family. If there are 10 kids in the family, then I have all the ages, so I am dealing with the population.

However, suppose we instead sampled 10 random ages from all the kids in a large extended family where the total number of kids is 90. In this case, since we are looking at 10 out of 90, we are not dealing with the population, but with a sample.

When working with a sample, you need to make an adjustment to your variance and standard deviation equations. The change is simple. Instead of dividing by n, you will now divide by n-1. This adjustment makes up for the fact that a sample tends to underestimate the spread of the full population.

statSpread2
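As a quick check, here is a sketch of the sample versions in R. One thing worth knowing: R's built-in var() and sd() functions divide by n-1, so they give the sample variance and sample standard deviation. The names sample_variance and sample_sd are my own.

```r
x <- c(2, 4, 6, 7, 8, 10, 15, 18, 22, 26)
n <- length(x)

# Same steps as before, but divide by n - 1 instead of n
sample_variance <- sum((x - mean(x))^2) / (n - 1)
sample_sd <- sqrt(sample_variance)

print(round(sample_variance, 2))  # 65.07
print(round(sample_sd, 2))        # 8.07

# The built-in functions agree with the n - 1 versions
print(isTRUE(all.equal(sample_variance, var(x))))  # TRUE
print(isTRUE(all.equal(sample_sd, sd(x))))         # TRUE
```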

 

R: Intro to Statistics – Central Tendency

Central Tendency

One of the primary purposes of statistics is to find a way to summarize data. To do so, we often look for numbers known collectively as measures of central tendency (mean, median, mode).

Look at the list below. This is a list of weekly gas expenditures for two vehicles. Can you tell me if one is better than the other, or are they both about the same?

rCentral

How about if I show you this?

rCentral1

Using the average, you can clearly see car2 is more cost efficient. At a difference of approx $21 a week, that is a savings of $630 over the course of the 30 weeks in the chart. That is a big difference, and one that is easy to see. We can see this using one of the 3 main measures of central tendency: the arithmetic mean, popularly called the average.

Mean

The mean, or more accurately the arithmetic mean, is calculated by adding up the elements in a list and dividing by the number of elements in the list.

rCentral2.jpg

In R, finding a mean is simple. Just put the values you want to average into a vector (**note: to make a vector in R, use var <- c(x,y,z)). We then pass the vector to the function mean().

rCentral3
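In case the screenshot is hard to read, here is a minimal sketch of the idea. The numbers are made up for illustration and are not the values from the gas chart:

```r
# Hypothetical weekly gas costs in dollars (not the data from the chart)
car1 <- c(52, 48, 55, 50, 45)

# Built-in mean
avg <- mean(car1)
print(avg)         # 50

# Same thing by hand: add up the elements, divide by how many there are
manual_avg <- sum(car1) / length(car1)
print(manual_avg)  # 50
```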

Median

The median is, simply enough, the middle value of an ordered list. This is also easy to find using R. As you can see, you do not even need to sort the list numerically first. Just feed the vector to the function median().

rCentral4.jpg
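Here is a small sketch using made-up numbers. I also added a list with an even number of items, where median() averages the two middle values:

```r
# Deliberately unsorted, odd number of values
odd_values <- c(9, 3, 7, 1, 5)
odd_median <- median(odd_values)
print(odd_median)   # 5

# Even number of values: the two middle values (4 and 6) are averaged
even_values <- c(8, 2, 4, 6)
even_median <- median(even_values)
print(even_median)  # 5
```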

Mode

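This is the last of the 3 main measures. Mode returns the most common value found in a list. In the list 2,3,2,4,2, the mode is 2. Unfortunately, R does not have a built-in function for the statistical mode (the built-in mode() function reports how an object is stored, not its most common value), so for this, we will have to build our own function.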

For those familiar with functions in programming, this shouldn’t be too foreign of a concept. However, if you don’t understand functions yet, don’t fret. We will get to them soon enough. For now, just read over the code and see if you can figure any of it out for yourself.

rCentral5
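If the screenshot does not come through, here is one possible way to write such a function. This is a sketch of my own and not necessarily the same code as in the image. The name get_mode is made up, and when there is a tie it simply returns the first of the tied values.

```r
# Returns the most common value in a vector (first one wins on ties)
get_mode <- function(v) {
  uniq <- unique(v)                    # the distinct values, in order of appearance
  counts <- tabulate(match(v, uniq))   # how many times each distinct value occurs
  uniq[which.max(counts)]              # the value with the highest count
}

print(get_mode(c(2, 3, 2, 4, 2)))   # 2
print(get_mode(c("a", "b", "b")))   # "b"
```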

 

 

 

R: An Introduction

R is a programming language focused on statistics, data visualization, and data analysis. It is open source, which means there is a rich trove of libraries and add-ons constantly being developed by the open source community.

R is free to download and use. Follow the links below to download R if you would like to try it.

  • Windows
  • Linux Binaries
  • RStudio: Not required, but anyone looking for a more fully featured development environment may want to check it out. I will be using just the base install of R for the following lessons, though.

Starting R

When R first starts up, this is what you will see. I am not going to focus too much on a grand tour, as most of the menus in R are pretty self-explanatory. Instead, let’s jump right into the coding. Move your cursor to where my big red arrow is:

rIntro

Basic Syntax

There is an unwritten rule that states the first line of code you need to learn in any language is Hello World. Well, I am not going to do that. R is a stats program. Why don’t we start with some numbers instead?

rIntro1

As you can see, R uses your standard arithmetic operators (+, -, *, /, ^).
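Here are a few lines you can type at the prompt to try each operator. The numbers are just ones I picked:

```r
7 + 3   # addition: 10
7 - 3   # subtraction: 4
7 * 3   # multiplication: 21
7 / 2   # division: 3.5
2 ^ 5   # exponent: 32
```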

Variables

Assigning variables in R is easy. “<-” is the designated syntax used to assign a variable. One great thing about R is that you do not need to declare variables in advance. R assigns the data type based on the input you give the variable.

rIntro2
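A minimal sketch of assignment, using variable names of my own choosing:

```r
# No declarations needed, just assign with <-
a <- 5
b <- 2.5

total <- a + b
print(total)         # 7.5

# R worked out the type from the value
print(class(total))  # "numeric"
```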

Assigning Strings

rIntro3.jpg
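Strings are assigned the same way, just wrapped in quotes. Here is a short sketch (the variable names are mine):

```r
greeting <- "Hello"
name <- 'R user'   # single or double quotes both work

# paste() joins strings, separated by a space by default
message_text <- paste(greeting, name)
print(message_text)      # "Hello R user"

print(nchar(greeting))   # 5 characters
```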

Data Types

The main data types in R are:

  • Numeric: 1, 2.33, etc
  • Integer: 2L
  • Logical: TRUE, FALSE
  • Character: "string"
  • Complex: 2+3i (remember those from Trig class?)
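You can ask R what type a value is with the class() function. A quick sketch covering each type in the list:

```r
class(2.33)      # "numeric"
class(2L)        # "integer" (the L suffix marks an integer)
class(TRUE)      # "logical"
class("string")  # "character"
class(2+3i)      # "complex"
```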

Vectors

Vectors allow you to group multiple elements under one name. Use the syntax <- c() when creating a vector.

rIntro4
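A short sketch with made-up values, showing how to create a vector and pull things back out of it:

```r
ages <- c(2, 4, 6, 7)

print(ages[1])        # 2 (R starts counting at 1, not 0)
print(length(ages))   # 4

# Arithmetic is applied to every element at once
older <- ages + 1
print(older)          # 3 5 7 8
```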

Lists

Lists allow you to group unlike items – even vectors and strings:

rIntro5.jpg
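Here is a small sketch of a list that mixes a string, an integer, and a vector. The names and values are made up for illustration:

```r
my_list <- list(name = "car1", weeks = 30L, costs = c(52, 48, 55))

print(my_list$name)            # "car1"
print(my_list$weeks)           # 30

# Double brackets pull out one item; then index into the vector as usual
print(my_list[["costs"]][2])   # 48

# str() gives a compact overview of what the list holds
str(my_list)
```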

 

 

 

Logistic Regression with Gretl

One of the most popular machine learning algorithms, Logistic Regression is actually a classification algorithm. Broken down to its simplest terms, binary logistic regression (the one we will be focusing on here) is answering a yes or no question. Will the customer buy or not? Is the email SPAM or not?

Score Accepted
982          0
1304         1
1256         1
1562         1
703          0

Above is a small sample from the data set we will be using for this lesson. In this set, student scores for an entrance test are listed in the first column, and whether they were Accepted (1) or Not (0) is in the second column.

Download sample Excel file here: logi

I ran a scatter plot on the data with Scores on the X axis. As you can see, the dots form 2 horizontal lines at 1 and 0. You may notice that the 1 (Accepted) dots seem to cluster towards higher scores and the 0 (Not Accepted) dots cluster towards lower scores.

logi

Well, since the point of Logistic Regression is to help us make predictions, here is how the predictions work. The Logistic Regression curve, represented by my crudely drawn red S, runs between 0 and 1. And just like with Linear Regression, to make our prediction for a value of X, we look for the value of Y on the line at that point.

logi1

In the case of a 1200 score, if we check the value of Y on the line, we get about .80. This roughly translates to mean that, with a score of 1200, a student has an 80% chance of being accepted.

Let’s meet Gretl

While there are third-party add-ons you can download for Excel that will do Logistic Regression, in its native form, Excel does not do a good job in this area. So I thought this would be a great opportunity to introduce you to a neat piece of FREE software called Gretl.

Here is the website to download Gretl: Gretl Download

So why Gretl? Why not R or Python? I mean, those are the languages real data scientists use, right?

That is true, and R and Python can easily do a Logistic Regression. The problem, however, is that in order to use R and Python, you need to know how to program. Gretl, on the other hand, is GUI based. Think of it as a point-and-click, lightweight R. It is nowhere near as robust as R, but for learning how to do Logistic Regression, Gretl does a fine job.

Loading in the Data

After you install and start Gretl, the next step is to load in the data. Go to File>Open Data>User File. Search for the Excel file you downloaded previously in this lesson. Make sure you then select Excel from the file type at the bottom of the screen.

logi2

Select logi.xlsx. Leave the Start Import At values at 1 and 1. This is where the data starts in our Excel file: 1st column, 1st row. You will get a message letting you know how much data was imported.

The next pop-up will note that the data is undated. Click No on this window.

logi3

Your data columns (Score, Passed) will appear in the Gretl window. If you click on one, the data from that column will appear in a pop-up window. **Note: in the file you download, column 2 will be Accepted, not Passed.

Let’s Model

Without further ado, let us do some modeling. From the menu bar, select Model>Limited dependent variable>Logit>Binary…

logi4

Now you have to select your Dependent variable and Regressors. Here is a hint: the dependent variable is what we want to find. What are we looking for? Whether the person will be Accepted. So Accepted goes in Dependent variable and Score goes in Regressors. Pick the Show p-values radio button and then click OK.

logi5

Below are the results of your Logistic Regression model.

I am not going to give a Stats lesson here, but I will cover the important points.

logi6

  1. The top red box contains some important information. First, the coefficients represent the b and m values for the linear equation we will be using later: y = mX + b, which here is y = 0.0105216X - 11.2757
  2. The p-value of Score = 0.0009. This is important because the p-value indicates whether the regressor variable has a statistically significant effect on the dependent variable. The most common p-value threshold you are likely to come across is 0.05. If your regressor variable has a p-value above 0.05, you will want to reconsider your model.
  3. The matrix at the bottom of the screen shows how successfully your model predicted outcomes from the training data set. Of the 0’s (not accepted), the model got 19 out of 21 right. For the 1’s (accepted), the model also got 19 out of 21 right. That is 38 out of 42, roughly a 90% success rate. Not bad.
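The success rate in point 3 is simple arithmetic on the counts from the matrix. Here is a quick sketch of it in R (the variable names are my own):

```r
# Counts taken from the matrix in the Gretl output
correct_not_accepted <- 19   # 0's predicted correctly, out of 21
correct_accepted     <- 19   # 1's predicted correctly, out of 21
total_cases          <- 21 + 21

accuracy <- (correct_not_accepted + correct_accepted) / total_cases
print(round(accuracy * 100, 1))   # 90.5 (percent)
```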

Let’s Use the Model

Okay, so maybe you jumped ahead and tried 1200 in the linear formula we developed above, and got about 1.35. How is that? Isn’t this supposed to be between 0 and 1?

Well, the problem is that we are not looking for Y; we are looking for the probability (p). Y in this case is not the value on the S curve, but instead the log of the odds:

logis1

 

Well, we know Y is about 1.35 for a score of 1200, so how do we find p from that? We solve for p. Feel free to go and do the algebra yourself if you want, but I already did the work for you. The equation below solves for p.

logis2
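For anyone who wants to see the algebra in action, here is a sketch in R. It assumes the standard logistic regression relationship, Y = ln(p / (1 - p)), which rearranges to p = 1 / (1 + e^(-Y)). The function names logit and inv_logit are my own.

```r
# Y as a function of p: the log of the odds
logit <- function(p) log(p / (1 - p))

# Solved for p: turns any Y back into a probability between 0 and 1
inv_logit <- function(y) 1 / (1 + exp(-y))

# Round trip: start with p = 0.8, convert to Y, then convert back
y <- logit(0.8)
print(inv_logit(y))   # 0.8

# A Y of 0 means even odds
print(inv_logit(0))   # 0.5
```

Base R also ships these as qlogis() (the logit) and plogis() (its inverse) if you would rather not define your own.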

Let’s Make a Prediction

Let us put the formulas we have found into Excel.

logi8

Now you have a working prediction model. Any value you place in the score cell will be converted to Y and p (probability). As the example above shows, a score of 1200 gives us a probability of .79.
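If you would rather not use Excel, the same model can be sketched in a few lines of R. This uses the coefficients as printed in the Gretl output above, and assumes the usual logistic formula p = 1 / (1 + e^(-Y)). The function name predict_acceptance is made up for this example.

```r
# Coefficients as reported in the Gretl output
intercept <- -11.2757
slope     <- 0.0105216

# Convert a test score into a probability of acceptance
predict_acceptance <- function(score) {
  y <- slope * score + intercept   # the linear part, y = mX + b
  1 / (1 + exp(-y))                # convert Y to a probability
}

p_1200 <- predict_acceptance(1200)
print(round(p_1200, 2))   # 0.79

# It also works on a whole vector of scores at once
scores <- c(900, 1000, 1100, 1200, 1300)
probs <- predict_acceptance(scores)
print(round(probs, 2))
```

As you would expect from the S curve, the probability climbs as the score goes up.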

Turns out my crummy drawing wasn’t so bad after all.

logi1