Statistics: Range, Variance, and Standard Deviation

On April 29, 2016April 29, 2016 By Ben Larson Ph.D.In statisticsLeave a comment

Measuring the spread can be a useful tool when analyzing a set of numbers. Three common measures of spread of range, variance, and standard deviation.

Here is the data set we will be working with: [2,4,6,7,8,10,15,18,22,26]

Range

Range is the simplest of the three measures. To find the range, all you need to do is subtract the smallest number in the set from the largest number

range = large-small

range = 26-2 = 24

Variance

Variance is created by taking the average of the squared difference between each value in the set minus the mean. We square the differences so that values above and below the mean do not cancel each other out.

Let’s find the mean:

statSpread

If you haven’t seen the x with the line over it before, this is referred to as x bar and it is used to represent the mean.

To find the variance you take the first number in your set, subtract the mean and square the result. You repeat that for each number in your list. Finally you add up all the results and divide by n (the number of items in your list)

ex – ((2-12)^2 + (4-12)^2 +….+(26-12)^2) / 10

statSpread1

variance = 58.56

Now I know what you are thinking, how can the average distance from the mean be 58.56 when the furthest point from the mean (26) is only 14? This is because we are squaring the differences. To get a number more in line with the data set, we have another measure called the standard deviation.

Standard Deviation

The standard deviation returns a value more in-line with what you would expect based on your data. To find the standard deviation – simply take the square root of the variance.

std dev = 7.65

Population vs Sample

The equations above work great if you have the entire population. What I mean by that is, if your data contains all the data in the set. Using our data, imagine if the numbers were ages of children in a large family. If there are 10 kids in the family, then I have all the ages, so I am dealing with the population.

However, if we instead have sampled 10 random ages from all the kids in a large extended family where the total number of kids is 90. In this case, since we are looking at 10 out of 90, we are not dealing with the population, but the sample.

When working with the sample, you need to make an adjustment to your variance and standard deviation equations. The change is simple. Instead of dividing by n you will now divide by n-1. This offset makes up for the fact you do not have all the data.

statSpread2

R: Intro to Statistics – Central Tendency

On April 16, 2016 By Ben Larson Ph.D.In R, UncategorizedLeave a comment

Central Tendency

One of the primary purposes of statistics is to find a way to summarize data. To do so, we often look for numbers known as collectively as measurements of central tendency (mean, median, mode).

Look at the list below. This is a list of weekly gas expenditures for two vehicles. Can you tell me if one is better than the other, or are they both about the same?

rCentral

How about if I show you this?

rCentral1

Using the average, you can clearly see car2 is more cost efficient. At approx $21 a week, that is a savings of $630 over the course of the 30 weeks in the chart. That is a big difference, and one that is easy to see. We can see this using one of the 3 main measures of central tendency – the arithmetic mean – popularly called the average.

Mean

The mean – more accurately the arithmetic mean – is calculated by adding up the elements in a list and dividing by the number of elements in the list.

In R, finding a mean is simple. Just put the values you want to average into a vector (**note to make a vector in R: var <-c(x,y,z)) We then put the vector through the function mean()

rCentral3

Median

The median means, simply enough, the middle of an ordered list. This is also easy to find using R. As you can see, you do not even need to sort the list numerically first. Just feed the vector to the function median()

Mode

This is the last of 3 main measure. Mode returns the most common value found in a list. In the list 2,3,2,4,2 – the mode is 2. Unfortunately R does not have a built in mode function, so for this, we will have build our own function.

For those familiar with functions in programming, this shouldn’t be too foreign of a concept. However, if you don’t understand functions yet, don’t fret. We will get to them soon enough. For now, just read over the code and see if you can figure any of it out for yourself.

rCentral5

R: An Introduction

On April 15, 2016April 15, 2016 By Ben Larson Ph.D.In R3 Comments

R is a programming language focused on statistics, data visualization, and data analysis. It is open source, which means there is a rich trove of libraries and add-ons constantly being developed by the open source community.

R is free to download and use. Follow the links below to download R if you would like to try it.

Windows
Linux Binaries
RStudio – Not required, but anyone looking for a better developed environment may want to check it out. I will be using just the base install of R for the following lessons though

Starting R

When R first starts up, this is what you will see. I am not going to focus too much on a grand tour, as most of the menus in R are pretty self explanatory. Instead, let’s jump right into the coding. Move your cursor to where my big red arrow is:

rIntro

Basic Syntax

There is an unwritten rule that states the first line of code you need to learn in any language is Hello World. Well, I am not going to do that. R is a stats program. Why don’t we start with some numbers instead.

rIntro1

As you can see, R uses your standard arithmetic operations (+,-,*,/,^)

Variables

Assigning variables in R is easy. “<-” is the designated syntax used to assign a variable. One great thing about R is that you do not need to declare variables in advance. R assigns the data type based on the input you give the variable.

rIntro2

Assigning Strings

Data Types

The main data types in R are:

Numeric: 1, 2.33, etc
Integer: 2L
Logical: TRUE, FALSE
Character: string
Complex: 2+3i (remember those from Trig class)

Vectors

Vectors allow you group multiple elements under one name. Use the syntax <-c() when creating a vector

rIntro4

Lists

Lists allow you to group unlike items – even vectors and strings:

Logistic Regression with Gretl

On April 6, 2016April 6, 2016 By Ben Larson Ph.D.In machine learning, RegressionLeave a comment

One of the most popular machine learning algorithms, Logistic Regression is actually a classification algorithm. Broken down to its simplest terms, binary logistic regression (the one we will be focusing on here) is answering a yes or no question. Will the customer buy or not? Is the email SPAM or not?

Score Accepted
982 0
1304 1
1256 1
1562 1
703 0

Above is a small sample from the data set we will be using for this lesson. In this set, student scores for an entrance test are listed in the first column and whether they were Accepted (1) or Not(0) is in the second column.

Download sample Excel file here: logi

I ran a scatter plot on the data with Scores on the X axis. As you can see the dots for 2 horizontal lines at 1 and 0. You may notice that the 1 (Accepted) dots seem to cluster towards higher scores and 0 (Not Accepted) dots cluster towards lower scores.

logi

Well since the point of Logistic Regression is help us make predictions, here is how the predictions work. The Logistic Regression, represented by my crudely drawn red S, goes from 1 to 0. And just like with Linear Regression, if we take a value for X, to make our prediction, we look for the value of Y on the line at that point.

logi1

In the case of a 1200 score, if we check the value of Y on the line, we get .80. This roughly translates to mean, that with a score of 1200, a student has an 80% chance of being accepted.

Let’s meet Gretl

While there are third party add-ons you can download for Excel that will do Logistical Regression, in its native form, Excel does not do a good job in this area. So I thought this would be a great opportunity to introduce you to a neat piece of FREE software called Gretl.

Here is the website to download Gretl: Gretl Download

So why Gretl? Why not R or Python? I mean those are the languages real data scientists use right?

That is true, and R and Python can easily do a Logical Regression. The problem is however, in order to use R and Python, you need to know how to program. Gretl, on the other hand, is GUI based. Think of it as a point and click light weight R. It is no where near as robust as R, but for learning how to do Logistical Regression, Gretl does a fine job.

Loading in the Data

After you install and start Gretl, the next step is to load in the data. Go to File>Open Data>User File. Search for the Excel file you downloaded previously in this lesson. Make sure you then select Excel from the file type at the bottom of the screen.

logi2

Select logi.xlsx. Leave the Start Import at window at 1 and 1. This is where the data starts in our Excel file: 1rst column, 1rst row. You will get a message letting you know how much data was imported.

The next pop up will noted that the data is undated. Click No on this window.

logi3

You data columns (Score, Passed) will appear in the Gretl window. If you click on one, the data from that column will appear in a pop-up window. **note in the file you download, column 2 will be Accepted not Passed.

Let’s Model

Without further ado, let us do some modelings. From the menu bar Model>Limited dependent variable>Logit>Binary…

logi4

Now you have to select you Dependent variable and Regressors. Here is a hint, the dependent variable is what we want to find. What are we looking for? Will the person be Accepted. So Accepted goes in Depentdent variable and Score goes in Regressors. Pick the Show p-values radio button and then click Okay.

logi5

Below are the results of your Logistic Regression model

I am not going to give a Stats lesson here, but I will cover the important points.

logi6

The top red box contains some important information. First the coefficients represent the b and m values from the linear equation we will be using later: y=mX+b =y=0.0105216X + -11.2757
The p-value of Score = 0.0009 This is important as the p-value is a probabilistic value that determines whether or not the regressor variable truly affects the dependent variable. The most common p-value threshold you are likely to come across is 0.05. If your regressor variable has a p-value above 0.05, you will want to reconsider your model.
The matrix at the bottom of the screen. This shows you how successfully your model predicted outcomes from the training data set. It translates of the 0’s (not accepted) the model got 19 out of 21 right. For 1’s(accepted) the model got 19 out of 21 right. That is a 90% success rate. Not bad.

Let’s Use the Model

Okay, so maybe you jumped ahead and tried 1200 in the linear formula we developed above. It is 1.325?? How is that? Isn’t this supposed to be between 0 and 1.

Well the problem is, we are not looking for Y we are looking for probability (p). Y in this case is not the Y intercept, but instead:

logis1

Well, we know Y = 1.325 for a score of 1200, how do we find p from that? We solve for p. Now feel free to go and do the math yourself if you want, but I already did the work for you. The equation below solves for p. If you don’t trust me and want to do it yourself, be my guest, but I assure you the equation below is good.

logis2

Let’s Make a Prediction

Let us put the formula’s we have found into Excel

logi8

Now you have a working prediction model. Any value you place in the score cell will be calculated to Y and p (probability). As the example above shows, a score of 1200 give us a probability of .79.

Turns out my crummy drawing wasn’t so bad after all.

logi1

	Anonymous on Python: Accessing a SQL databa…
	Anonymous on Top 7 skills a Data Analyst ha…
	lovingfox4e1d0e653e on Data Jobs: What does a Data An…
	Anonymous on Top 7 skills a Data Analyst ha…
	Anonymous on Python Web Scraping / Automati…

	Anonymous on Python: Accessing a SQL databa…
	Anonymous on Top 7 skills a Data Analyst ha…
	lovingfox4e1d0e653e on Data Jobs: What does a Data An…
	Anonymous on Top 7 skills a Data Analyst ha…
	Anonymous on Python Web Scraping / Automati…

	Anonymous on Python: Accessing a SQL databa…
	Anonymous on Top 7 skills a Data Analyst ha…
	lovingfox4e1d0e653e on Data Jobs: What does a Data An…
	Anonymous on Top 7 skills a Data Analyst ha…
	Anonymous on Python Web Scraping / Automati…

	Anonymous on Python: Accessing a SQL databa…
	Anonymous on Top 7 skills a Data Analyst ha…
	lovingfox4e1d0e653e on Data Jobs: What does a Data An…
	Anonymous on Top 7 skills a Data Analyst ha…
	Anonymous on Python Web Scraping / Automati…

	Anonymous on Python: Accessing a SQL databa…
	Anonymous on Top 7 skills a Data Analyst ha…
	lovingfox4e1d0e653e on Data Jobs: What does a Data An…
	Anonymous on Top 7 skills a Data Analyst ha…
	Anonymous on Python Web Scraping / Automati…

Analytics4All

Tag: statistics

Statistics: Range, Variance, and Standard Deviation

Range

Variance

Standard Deviation

Population vs Sample

R: Intro to Statistics – Central Tendency

Central Tendency

Mean

Median

Mode

R: An Introduction

Starting R

Basic Syntax

Variables

Data Types

Vectors

Lists

Logistic Regression with Gretl

Score Accepted
982 0
1304 1
1256 1
1562 1
703 0

Let’s meet Gretl

Loading in the Data

Let’s Model

Let’s Use the Model

Let’s Make a Prediction

Range

Variance

Standard Deviation

Population vs Sample

Central Tendency

Mean

Median

Mode

Starting R

Basic Syntax

Variables

Data Types

Vectors

Lists

Score Accepted 982 0 1304 1 1256 1 1562 1 703 0

Let’s meet Gretl

Loading in the Data

Let’s Model

Let’s Use the Model

Let’s Make a Prediction

Score Accepted
982 0
1304 1
1256 1
1562 1
703 0