Python: Hypothesis Testing(T Test)

Hypothesis testing is a first step into really understanding how to use statistics.

The purpose of the test is to tell if there is any significant difference between two data sets.

Consider the follow example:

Let’s say I am trying to decide between two computers. I want to use the computer to run advanced analytics, so the only thing I am concerned with is speed.

I pick a sorting algorithm and a large data set and run it on both computers 10 times, timing each run in seconds.

Now I put the results into two lists. A and B

a = [10,12,9,11,11,12,9,11,9,9]
b = [13,11,9,12,12,11,12,12,10,11]

A quick look at the data makes me think b is slower than a. But is it slower enough to mean something or are these results just a matter of chance (meaning if I ran the test 200 more times would the end result be closer to equal or further apart).

Hypothesis test

To find out, let’s do a hypothesis test.

Set our Hypothesis:

  • H0 = H1 – there is no significant difference between data sets
  • H0 <> H1 – there is a significant difference

To test our hypothesis, let’s run a t-test

import stats from scipy and run stats.ttest_ind().

Our output is the z-statistic and the p-value.

Our p-value is 0.08 – greater than the common significance value of 0.05. Since it is greater, we cannot reject H0=H1. This means both computers are effectively the same speed.

hypoTest

Let’s try a third computer – d

d = [13,12,9,12,12,13,12,13,10,11]

Now, let’s run a second T-test.  This one comes back with a p-value of 0.026 – under 0.05. This means we can reject our hypothesis that a=d. The speed differences between a and d are significant.

hypoTest1

Advertisements

Python: Printing with .format()

Another way to print variable values in Python is the .format(). This method replaces %d,%s, and %r, but I still showed them to you in an earlier lesson as you will still see them in use, I wanted you to know about them.

When using .format() – {} represents your place holder in the string.

pythonFormat.jpg

You can also use numbers to tell format which order to use values in.

pythonFormat1.jpg

You can name your place holders:

pythonFormat2

You can use user input

pythonFormat3.jpg

And you can even use dictionaries

pythonFormat4

Python: Logistic Regression

This lesson will focus more on performing a Logistic Regression in Python. If you are unfamiliar with Logistic Regression, check out my earlier lesson: Logistic Regression with Gretl

If you would like to follow along, please download the exercise file here: logi2

Import the Data

You should be good at this by now, use Pandas .read_excel().

df.head() gives us a the first 5 rows.

What we have here is a list of students applying to a school. They have a Score that runs from 0 -1600,  ExtraCir (extracurricular activity) 0 = no 1 = yes, and finally Accepted 0 = no 1 = yes

logi1

Create Boolean Result

We are going to create a True/False column for our dataframe.

What I did was:

  • df[‘Accept’]   — create a new column named Accept
  • df[‘Accepted’]==1  — if my Accepted column is 1 then True, else False

logi1

What are we modeling?

The goal of our model is going to be to predict and output – whether or not someone gets Accepted based on some input – Score, ExtraCir.

So we feed our model 2 input (independent)  variables and 1 result (dependent) variable. The model then gives us coefficients. We place these coefficients(c,c1,c2) in the following formula.

y = c + c1*Score + c2*ExtraCir

Note the first c in our equation is by itself. If you think back to the basic linear equation (y= mx +b), the first c is b or the y intercept. The Python package we are going to be using to find our coefficients requires us to have a place holder for our y intercept. So, let’s do that real quick.

logi2

 

Let’s build our model

Let’s import statsmodels.api

From statsmodels we will use the Logit function. First giving it the dependent variable (result) and then our independent variables.

After we perform the Logit, we will perform a fit()

logi3.jpg

The summary() function gives us a nice chart of our results

logi4.jpg

If you are a stats person, you can appreciate this. But for what we need, let us focus on our coef.

logi45.jpg

remember our formula from above: y = c + c1*Score + c2*ExtraCir

Let’s build a function that solves for it.

Now let us see how a student with a Score of 1125 and a ExCir of 1 would fair.

logi9

okayyyyyy. So does 3.7089 mean they got in?????

Let’s take a quick second to think about the term logistic. What does it bring to mind?

Logarithms!!!

Okay, but our results equation was linear — y = c+ c1*Score + c2*ExCir

So what do we do.

So we need to remember y is a function of probability.

logis1

So to convert y our into a probability, we use the following equation

logis2

So let’s import numpy so we can make use of e (exp() in Python)

logi8.jpg

Run our results through the equation. We get .97. So we are predicting a 97% chance of acceptance.

logi10.jpg

Now notice what happens if I drop the test score down to 75. We end up with only a 45% chance of acceptance.

logi11.jpg


If you enjoyed this lesson, click LIKE below, or even better, leave me a COMMENT. 

Follow this link for more Python content: Python

 

 

 

 

 

 

Python: Linear Regression

Regression is still one of the most widely used predictive methods. If you are unfamiliar with Linear Regression, check out my: Linear Regression using Excel lesson. It will explain the more of the math behind what we are doing here. This lesson is focused more on how to code it in Python.

We will be working with the following data set: Linear Regression Example File 1

Import the data

Using pandas .read_excel()

linReg.jpg

What we have is a data set representing years worked at a company and salary.

Let’s plot it

Before we go any further, let’s plot the data.

Looking at the plot, it looks like there is a possible correlation.

linReg1.jpg

Linear Regression using scipy

scipy library contains some easy to use maths and science tools. In this case, we are importing stats from scipy

the method stats.linregress() produces the following outputs: slope, y-intercept, r-value, p-value, and standard error.

linReg2

I set slope to m and y-intercept to b: so we match the linear formula y = mx+b

Using the results of our regression, we  can create an easy function to predict a salary. In the example below, I want to predict the salary of a person who has been working there 10 years.

linReg3.jpg

Our p value is nice and low. This means our variables do have an effect on each other

linReg4.jpg

Our standard error is 250, but this can be misleading based on the size of the values in your regression. A better measurement is r squared. We find that by squaring our r output

linReg5.jpg

R squared runs from 0 (bad) to 1 (good). Our R squared is .44. So our regression is not that great. I prefer to keep a r squared value at least .6 or above.

Plot the Regression Line

We can use our pred() function to find the y-coords needed to plot our regression line.

Passing pred() a x value of 0 I get our bottom value. I pass pred() a x value of 35 to get our top value.

I then redo my scatter plot just like above. Then I plot my line using plt.plot().

linReg6.jpg


If you enjoyed this lesson, click LIKE below, or even better, leave me a COMMENT. 

Follow this link for more Python content: Python

 

 

Python: Accessing a SQL database

If you really want to do data work, you need to be able to connect to a database. In this example I will show you how to connect to and query data from MS SQL Server with the AdventureWorks2012 database installed.

This lesson assumes some very basic knowledge of SQL. If SQL is a complete mystery, head over to my SQL page: SQL  If you check out the first 4 intro lessons, you will know everything about SQL you need to know for this lesson.

Install pyodbc

To connect to the database, we need to install pyodbc. Go to your Anaconda terminal and type: pip install pyodbc

sqlpython.jpg

Now open up your jupyter notebook and start a new notebook

Connect to Database

import pyodbc

cnxn is our variable – it is commonly used as a shorten version of connection

syntax: pyodbc.connect(‘DRIVER={SQL Server}; SERVER=server name; DATABASE=database name;UID = user name; PWD = password’)

finally cursor =cnxn.cursor() creates a cursor for us. In SQL, a cursor is used to step through your results one row at a time.

sqlpython1.jpg

cursor.execute(place sql query here)  – this is how you pass a sql query – note query goes in quotes

tables = cursor.fetchall() – fetch all the rows in your query results

sqlpython2

We can now iterate through the rows in tables.

sqlpython3.jpg

I don’t like the layout of this. Also we can’t really work with the data.

Pandas Dataframe

first import pandas

  • d= [] – create empty dictionary
  • d.dappend({‘Name’:row.Name, ‘Class’: row.GroupName}) – fill dictionary 1 row at a time
  • df = pd.DataFrame(d) – convert your dictionary to a dataframe

sqlpython4.jpg


If you enjoyed this lesson, click LIKE below, or even better, leave me a COMMENT. 

Follow this link for more Python content: Python

Python: Read CSV and Excel with Pandas

Earlier is showed you how to use the Python CSV library to read and write to CSV files. While CSV does work, and I still use elements of it occasionally, you will find working with Pandas to be so much easier.

Practice Files Excel: Linear Regression Example File 1

CSV: heightWeight_w_headers

Let’s start with our CSV file. Unzip it and place it in a folder where you will be able to find it.

If you want to make your life a little easier, when you open your Jupyter Notebook, type pwd

This will give you your present working directory. If you place your csv file in this directory, all you have to do is call it by name.

csv1

I am not going to do that however. I want to show you how to work with directories.

Get Full Directory Path

For Windows Users, if you hold down the Shift key while right clicking on your file, you will see an option that says: Copy as Path

csv2

Read the CSV file

First import pandas as pd

Then assign a variable = pd.read_csv(file name) – paste the full path of your CSV file here.

variable.head() = the first 5 rows from your data frame.

pandaCSV

Write CSV file

Okay, let’s write a CSV file. First, let’s add some rows to current dataframe.

We will do this be first creating a new dataframe with 3 rows of data.

pandaCSV1.jpg

Now we will use the .append() function to append this new dataframe to end of our existing dataframe.

The ignore_index = True argument tells the new rows to continue using the index of the main dataframe. Otherwise, the new rows would start counting at 0 again. If you want to try it, just leave out the argument: HgtWgt_df.append(df2)

pandaCSV2.jpg

Now, we just use the df.to_csv() command to write our appended dataframe to a new CSV file

pandaCSV3

And there is our new file

pandaCSV4.jpg

Read Excel File

I’d love to be able to wow you with how complicated reading an Excel file is, but the difference between the Excel file reading and CSV is one word – excel.

pandaexcel.jpg

The only caveat is if your Excel file has multiple sheets. By default pd.read_excel() goes to sheet 1. If you want it to read sheet 4 instead, you would add: pd.read_excel(filename, sheetname= 4)

Write to Excel File

pandaexcel1.jpg

Bonus note

If you haven’t already, please check out my earlier CSV lesson: Python: Working with CSV Files

This will really give you an appreciation for how powerful and time saving pandas really is.


 

If you enjoyed this lesson, click LIKE below, or even better, leave me a COMMENT. 

Follow this link for more Python content: Python