Python 2.x vs 3.x

If you are trying to learn Python, especially for Data Science, you are going to come across a bunch of people (myself included) who have been very hesitant to move from Python 2.x to Python 3.x.

The reason?

Not Backwards Compatible

Well, when Python 3 first came out, it was made very clear that it was not backwards compatible. While most of the code remained the same, there were some changes that made programs coded in Python 2.x fail. A prime example is print. In Python 2,

print 'Hello World!'

was perfectly acceptable, but in Python 3, it fails and throws an exception. Python 3 made print a function, so it requires parentheses:

print('Hello World')   # Python 3 friendly

Well, I have a whole bunch of code sitting on my hard drive that I like to refer to, and having to go through it cleaning up all the new changes did not exactly seem like the best use of my time, since I am able to just continue using Python 2.x.

Libraries

Python is so great for Data Science because of the community of libraries out there providing the data horsepower we all love (Pandas, NumPy, scikit-learn, etc.). Well, guess what – the changes in Python 3 made some of these libraries unstable. Strike 2: another reason not to waste my time with this new version.

Moving On

Well, it appears that enough time has passed, and some people tell me all the bugs have been worked out of Python 3. While Python 2.7 will be supported until 2020 (those who love it, just keep using it, I say), I have decided to try putting Python 3 through its paces.

My lessons will be reviewed one by one, with the 2.7 code being tested in a 3.4 environment. I will note any changes that need to be made to the code to make it 3.4 compatible and add them to the lesson. Each lesson that has been reviewed will have the following heading:

*Note: This lesson was written using Python 2.x. If you are using Python 3.x, any changes to the code will be annotated under headings: Python 3.x

2.7??

If you love 2, just keep using it. You've got 3 more years. By that point, whatever revision of 3 we are on may not even look like the current version. But if you want to look ahead, follow along with me as I update my code. (Note: all the original 2.7 code will remain on my site until the time that it is no longer supported.)

Python: Pivot Tables with Pandas

Pandas provides Python with lots of advanced data management tools. Being able to create pivot tables is one of the cooler ones.

You can download the data set I will be working with here: KMeans1

First, let's load our data into Python.

import pandas as pd
df = pd.read_excel(r"C:\Users\Benjamin\Documents\KMeans1.xlsx")   # raw string so the backslashes aren't treated as escapes
df.head()

pandasPivotTable.jpg

Let’s create our first pivot table.

The syntax is: pd.pivot_table(dataframe, index = columns you want to group by, values = columns you want to aggregate, aggfunc = type of aggregation)

pd.pivot_table(df, index='Department', values = 'AVG Labor', aggfunc = 'sum')

pandasPivotTable1.jpg

We can group by more than one column

pd.pivot_table(df, index=['Department','Model'], values = 'AVG Labor',  
aggfunc= 'mean')

pandasPivotTable2.jpg

You can also have multiple value columns

 pd.pivot_table(df, index='Department', values = ['AVG Labor','Labor Cost'],
  aggfunc= 'sum')

pandasPivotTable3.jpg

Now, one catch: what if you want different aggregate functions for different columns? Say I want the mean of AVG Labor, but the sum of the Labor Cost column.

In this case, we are going to use groupby().aggregate()

import numpy as np
df.groupby('Department').aggregate({'AVG Labor':np.mean, 'Labor Cost': np.sum})

pandasPivotTable4.jpg
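As a side note, pivot_table itself can also take a dictionary for aggfunc in reasonably recent versions of pandas, so a sketch like the following should produce the same result without groupby:

pd.pivot_table(df, index='Department', values=['AVG Labor','Labor Cost'],
               aggfunc={'AVG Labor': 'mean', 'Labor Cost': 'sum'})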

Python: An Interesting Problem with Pandas

I was writing a little tongue-in-cheek article for LinkedIn on fraud detection using frequency distributions (you can read the article here: LinkedIn). While this was a non-technical article, I wanted to use some histograms from a real data set, so I loaded a spreadsheet into Python and went to work.

While working with the data I ran into an interesting problem that had me chasing my tail for about 10 minutes before I figured it out. It is a fun little problem involving Series and Dataframes.

As always, you can download the data set here: FraudCheck1

Upload the data.

import pandas as pd
df = pd.read_excel("C:\\Users\\Benjamin\\OneDrive\\Documents\\article\\python\\FraudCheck1.xlsx")
df.head()

pandasProb.jpg

The data is pretty simple here. We are concerned with the answer column and CreatedBy (which is the employee ID). What I am trying to do is see if the "answers" (readings from an electric meter) are really random or if they have been contrived by someone trying to fake the data.

First, I want to get the readings for all the employees, so I used pop() to place the answer column into a separate variable.

df1 = df.copy()   # copy the dataframe so popping "answer" doesn't also remove it from df

y = df1.pop("answer")

pandasProb1.jpg

Then, to make my histogram more pleasant-looking, I decided to only use the last digit before the decimal. That way I will have 10 bars (0-9). (Remember, this is solely for making charts for an article, so I was not concerned with any more stringent methods of normalization.)

What I am doing below is int(199.7 % 10). Remember, % is the modulo operator – it leaves you with the remainder – and int() converts your float to an integer. So 199.7 % 10 = 9.7, and int() truncates that to 9.

a = []
i = 0
while i < len(y):
    a.append(int(y[i] % 10))
    i += 1
a[1:10]

pandasProb2

Then I created my histogram.

%matplotlib inline
from matplotlib import pyplot as plt
plt.hist(a)

pandasProb3.jpg

Now my problem

Now I want to graph only the answers from employee 619, so first I filter out all rows but the ones for employee 619.

df2 = df.query('CreatedBy == 619')
y1 = df2.pop("answer")

Then I ran my loop to turn my answers into a single digit.

And I get an error. Why?

pandasProb4.jpg

Well, the answer lies in the datatypes we are working with. Pandas' read_excel function creates a DataFrame.

When you pop a column from a DataFrame, you end up with a Series. And remember, a Series is an indexed listing of values.

Let's look at our Series. Check out the point my arrow is pointing to below. Notice how my index jumps from 31 to 62. My while loop counts by 1, so after 31 it went looking for y1[32], which doesn't exist.

pandasProb5.jpg

Using .tolist() converts our Series to a list and now our while loop works.

pandasProb6.jpg
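Since that step lives in the screenshot, here is the fix written out (it also appears in the full listing below):

y2 = y1.tolist()   # a plain list is indexed 0 to len(y2)-1, so the while loop works again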

And now we can build another histogram.

pandasProb7.jpg

The Code

import pandas as pd
df = pd.read_excel("C:\\Users\\Benjamin\\OneDrive\\Documents\\article\\python\\FraudCheck1.xlsx")
df.head()

df1 = df.copy()   # copy so popping "answer" doesn't also remove it from df
y = df1.pop("answer")

a = []
i = 0
while i < len(y):
    a.append(int(y[i] % 10))
    i += 1
a[1:10]

%matplotlib inline
from matplotlib import pyplot as plt
plt.hist(a)

df2 = df.query('CreatedBy == 619')
y1 = df2.pop("answer")

y2 = y1.tolist()
type(y2)

a1 = []
i = 0
while i < len(y2):
    a1.append(int(y2[i]) % 10)
    i = i + 1
a1[1:10]

plt.hist(a1)


Python: Naive Bayes’

Naive Bayes' is a supervised machine learning classification algorithm based on Bayes' Theorem. If you don't remember Bayes' Theorem, here it is:

P(A|B) = P(B|A) * P(A) / P(B)

Seriously though, if you need a refresher, I have a lesson on it here: Bayes’ Theorem

The naive part comes from the idea that the probability of each column is computed on its own – the columns are "naive" to what the other columns contain.

You can download the data file here: logi2

Import the Data

import pandas as pd
df = pd.read_excel(r"C:\Users\Benjamin\Documents\logi2.xlsx")   # raw string so the backslashes aren't treated as escapes
df.head()

nb.jpg

Let’s look at the data. We have 3 columns – Score, ExtraCir, Accepted. These represent:

  • Score – Student Test Score
  • ExtraCir – Was Student in an Extracurricular Activity
  • Accepted – Was the Student Accepted

Now the Accepted column is our result column – or the column we are trying to predict. Having a result in your data set makes this a supervised machine learning algorithm.

Split the Data

Next, split the data into inputs (Score and ExtraCir) and results (Accepted).

y = df.pop('Accepted')
X = df

y.head()

X.head()

nb1.jpg

Fit Naive Bayes

Lucky for us, scikit-learn has a built-in Naive Bayes algorithm – MultinomialNB.

Import MultinomialNB and fit our split columns to it (X,y)

from sklearn.naive_bayes import MultinomialNB
classifier = MultinomialNB()
classifier.fit(X,y)

nb2.jpg

Run some predictions

Let's run the predictions below. The results show 1 (Accepted) or 0 (Not Accepted).

#--score of 1200, ExtraCir = 1 (predict expects a 2D array: one row per sample)
print(classifier.predict([[1200, 1]]))

#--score of 1000, ExtraCir = 0
print(classifier.predict([[1000, 0]]))

nb3

The Code

import pandas as pd
df = pd.read_excel(r"C:\Users\Benjamin\Documents\logi2.xlsx")
df.head()

y = df.pop('Accepted')
X = df

y.head()
X.head()

from sklearn.naive_bayes import MultinomialNB
classifier = MultinomialNB()
classifier.fit(X,y)

#--score of 1200, ExtraCir = 1
print(classifier.predict([[1200, 1]]))

#--score of 1000, ExtraCir = 0
print(classifier.predict([[1000, 0]]))


Python: K Means Clustering Part 2

In part 2, we are going to focus on checking our assumptions. So far we have learned how to perform a K Means Cluster. When running a K Means Cluster, you first have to choose how many clusters you want. But what is the optimal number of clusters? This is the "art" part of an algorithm like this.

One thing you can do is check how far your points are from their cluster centers. We can measure this using the inertia_ attribute from scikit-learn.

Let’s start by building our K Means Cluster:

Import the data

import pandas as pd

df = pd.read_excel(r"C:\Users\Benjamin\Documents\KMeans1.xlsx")   # raw string avoids backslash escape issues
df.head()

kmeans1

Drop unneeded columns

df1 = df.drop(["ID Tag", "Model", "Department"], axis = 1)
df1.head()

kmeans2

Create the model – here I set clusters to 4

from sklearn.cluster import KMeans
km = KMeans(n_clusters=4, init='k-means++', n_init=10)

Now fit the model and check the inertia_ attribute

km.fit(df1)
km.inertia_

kmeaninter.jpg

Now, the number you get is the sum of squared distances from your sample points to their cluster centers.

What does the number mean? Well, on its own, not much. What you need to do is look at a list of inertia_ values for a range of cluster choices.

To do so, I set up a for loop.

n = int(raw_input("Enter Starting Cluster: "))
n1 = int(raw_input("Enter Ending Cluster: "))
for i in range(n, n1):
    km = KMeans(n_clusters=i, init='k-means++', n_init=10)
    km.fit(df1)
    print i, km.inertia_

kmeaninter1.jpg
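Python 3.x note: the loop above uses Python 2's raw_input and print statement. A 3.x-friendly version would be:

n = int(input("Enter Starting Cluster: "))
n1 = int(input("Enter Ending Cluster: "))
for i in range(n, n1):
    km = KMeans(n_clusters=i, init='k-means++', n_init=10)
    km.fit(df1)
    print(i, km.inertia_)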

The trick to reading the results is to look for the point of diminishing returns. The area I am pointing to with the arrow is where I would look. The changes in values start slowing down here.

I am using this example because I feel it is more real world. Working with real data takes time to get a feel for. If you are having trouble seeing why I chose this point, consider the following textbook example:

See how at this highlighted part, the drop goes from hundreds down to 25. That is a diminishing return – the new result is not that much better than the earlier result, as opposed to going from 1 to 2 clusters, where 2 clusters performed 1000 units better.

kmeaninter2.jpg


Python: K Means Cluster

K Means Cluster will be our introduction to Unsupervised Machine Learning. What is Unsupervised Machine Learning exactly? Well, the simplest explanation I can offer is that unlike supervised learning, where our data set contains a result, unsupervised does not.

Think of a simple regression where I have the square footage and selling prices (result) of 100 houses. Taking that data, I can easily create a prediction model that will predict the selling price of a house based on square footage. – This is supervised machine learning.

Now, take a data set containing 100 houses with the following data: square footage, house style, garage/no garage, but no selling price. We can’t create a prediction model since we have no knowledge of prices, but we can group the houses together based on commonalities. These groupings (clusters) can be used to gain knowledge of your data set.

I think seeing it in action will help.

If you want to play along, download the data set here: KMeans1

The data set contains a 1-year repair history of 197 ultrasound medical devices.

Data dictionary:

  • ID Tag – asset number assigned to the device
  • Model – model name of the device
  • WO Count – count of repair work orders
  • AVG Labor – average labor minutes per repair
  • Labor Cost – average labor cost per repair
  • No Problem – count of repairs where no problem was found
  • Avg Cost – average cost of parts
  • Travel – average travel hours per repair
  • Travel Cost – average travel cost per repair
  • Department – department that owns the ultrasound device

kmeans

We want to see what kind of information we can extract from this data.

To do so, we are going to use K Means Clustering.

How does K Means Clustering work? Each row in the table is converted to a vector. Imagine the vectors now graphed in N-dimensional space. Next, pick the number of clusters you want to create. For each cluster, you place a point (a centroid) in space, and the vectors are grouped based on their proximity to their nearest centroid.

Proximity is measured using distance, and each centroid is recomputed as the mean of the points assigned to it – hence the name K-Means Cluster.

(each dot below is a row in your table, the colors represent a cluster)

kmeans2

Let’s do it in Python

Import the data.

import pandas as pd

df = pd.read_excel(r"C:\Users\Benjamin\Documents\KMeans1.xlsx")   # raw string avoids backslash escape issues
df.head()

kmeans1

Now, we are going to drop a few columns. ID Tag is a random number, so it has no value in clustering. Then Model and Department, as they are text – and while there are ways to work with text, it is more complicated, so for now we are just going to drop those columns.

df1 = df.drop(["ID Tag", "Model", "Department"], axis = 1)
df1.head()

kmeans2

Now lets import KMeans from sklearn.cluster

We then initialize KMeans (n_clusters=4 – the number of clusters you want; init='k-means++' – sets how the centroids are placed, and k-means++ is one of the faster methods of centroid placement; n_init=10 – the number of times the algorithm will run, placing new centroids each iteration).

from sklearn.cluster import KMeans
km = KMeans(n_clusters=4, init='k-means++', n_init=10)

kmeans3.jpg

Choosing the number of clusters is a bit of an art. Play with it a bit and see how different values play out for you.

Now fit the model

km.fit(df1)

kmeans4.jpg

Now, export the cluster identifiers to a list. Notice my values are 0-3, one value for each cluster.

x = km.fit_predict(df1)
x

kmeans5.jpg

Create a new column on the original dataframe called Cluster and place your results (x) in that column

df["Cluster"]= x
df.head()

kmeans6.jpg

Sort your dataframe by cluster

df1 = df.sort_values('Cluster')   # note: df.sort() from the original lesson was removed in newer pandas
df1

kmeans7.jpg

Now as you start to examine the data in each cluster, you should start to see patterns emerge.

Below is an example of the patterns I found in the clusters.

kmeans9.jpg

Now remember, this is just an INTRODUCTION to unsupervised learning. We will learn more tricks to help you discover the patterns as we move forward.

Python: K Nearest Neighbor

K Nearest Neighbor (Knn) is a classification algorithm. It falls under the category of supervised machine learning. It is supervised machine learning because the data set we are using to “train” with contains results (outcomes). It is easier to show you what I mean.

Here is our training set: logi

Let’s import our set into Python

knn.jpg
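The import step is only shown in the screenshot above; a minimal equivalent would look like the code below (the file name and path are illustrative – point it at wherever you saved logi):

import pandas as pd
df = pd.read_excel(r"C:\Users\Benjamin\Documents\logi.xlsx")   # path is an assumption
df.head()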

This data set contains 42 student test scores (Score) and whether or not each student was accepted (Accepted) into a college program. It is the presence of the Accepted column that makes supervised machine learning possible. Knowing the outcomes of past events, we can create a prediction model for future events. So you could use the finished model to predict whether someone will be accepted based on their test score.

So how does Knn work?

Look at the chart below. Imagine this represents our data set. Each blue dot is accepted (1) while each red dot is not (0).

knn1

What if I want to know about my new data point (green star)? Is it a 1 or a 0?

knn2

I start by choosing a neighbor count – in this example I chose 3 – and I find the 3 nearest neighbors to my new point.

Let's look at the results: I have 2 red (0) and 1 blue (1). Using basic probability, I am 67% (2/3) certain that you will not get in.

knn3.jpg

Now, let’s code it!

First we need to separate our data into 2 dataframes: Our training set X (Score) and our target set y (Accepted)

df.pop() removes the Accepted column from your dataframe and places it in a newly created Series.

knn4.jpg

knn5.jpg
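Written out, that split looks like this (mirroring the screenshots above):

y = df.pop('Accepted')   # target set: the known outcomes
X = df                   # training set: the remaining Score column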

Import sklearn

sklearn is a massive library of machine learning algorithms available for Python. Today we are going to use KNeighborsClassifier.

So below, I imported KNeighborsClassifier from sklearn.neighbors.

Next, I set my neighbor count to 5. You can experiment with other numbers and see how it works out for you. Setting the neighbor count is something you kind of have to develop a feel for.

knn6.jpg
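A sketch of that setup, using the variable name ne that the prediction step below refers to:

from sklearn.neighbors import KNeighborsClassifier
ne = KNeighborsClassifier(n_neighbors=5)   # neighbor count of 5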

Now let's fit the model with our training set (X) and target set (y).

knn7
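In code, the fit step from the screenshot is simply:

ne.fit(X, y)   # train on the scores (X) and outcomes (y)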

Now we can use our model to make predictions.

ne.predict() will return 1 or 0 – (Accepted or Not)

while ne.predict_proba() will return a probability range. Results below read as (40% chance of Not Accepted (0), 60% chance of Accepted (1)).

knn8.jpg
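A sketch of those calls – note that scikit-learn expects a 2D array (one row per sample), and the test score of 1200 is just an illustrative value:

print(ne.predict([[1200]]))        # returns 1 or 0
print(ne.predict_proba([[1200]]))  # returns [P(0), P(1)]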

So there you go, you have now built a prediction model using K Nearest Neighbor.


Python: Object Oriented Programming

In the argument between R and Python, the fact that Python is a full-blown Object Oriented Programming (OOP) language gives it a solid advantage to me. Why? OOP gives you the ability to create and use objects in your programs.

Now I remember my first OOP class was in Java. It was a college course and started with reading 4 dense chapters on what made a language OOP. It talked about Objects and methods and data encapsulation. In the end, I just had a headache, and I really didn’t understand anything until I built my first Object.

So, let’s just jump right in and build our first object.

Class

The first step in building an object is to build a class. Now, I am going to move quickly here, and then work my way back to better explain what we are doing.

So, I am building a 2-function calculator here.

  • class myCalc: – First I create my class and name it.
  • def __init__(self): – this is called a constructor. You need to have one, but you don't need to use it. Notice I have the word pass on the next line – that means just move on without doing anything. Don't worry about constructors for now; we will cover them in a later lesson.
  • def myAdd(self, x, y): – this is basically a function inside my class, but inside a class we call functions methods. The first argument (self) is a reference to the instance the method is called on. Again, don't worry about it yet – just know it has to be there or you will get an error.
  • def mySub(self, x, y): – same as above.

OOP.jpg
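Since the class itself only appears in the screenshot, here is a minimal reconstruction from the bullet points above (the add/subtract bodies are the obvious choice for a 2-function calculator, though the screenshot may phrase them slightly differently):

class myCalc:
    def __init__(self):
        pass                  # constructor does nothing for now

    def myAdd(self, x, y):
        return x + y          # a method: a function that lives inside a class

    def mySub(self, x, y):
        return x - y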

Create our Object

By calling our class and assigning the result to a variable name, what we are doing is creating an Instance.

OOP1

Now that I have created an Instance, I can call my Methods.

Notice when I call myAdd and mySub, I only provide 2 arguments. Like I said earlier, we aren't using the self argument – Python fills it in automatically, so we don't pass anything for it.

OOP2
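A sketch of those two steps (the instance name is illustrative):

calc = myCalc()             # create an Instance of the class
print(calc.myAdd(3, 4))     # 7  - only 2 arguments; Python supplies self
print(calc.mySub(3, 4))     # -1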

Okay, so what is the big deal, you ask? I mean, all I did was show you a more complicated way to make a function.

Why Bother

So I am going to attempt to show you a practical use for a class without resorting to totally impractical examples or overly complicated ones.

myExp works by asking you for a number when you create an Instance. Then when you call the exp1() method, it raises whatever value you give exp1() to the power of the number you started with.

examples:

  • ms = myExp(2) – self.x is set to 2 in the __init__ method
  • ms.exp1(3) – 3 ** 2
  • ms.exp1(4) – 4 ** 2
  • ms1 = myExp(3) – self.x is set to 3
  • ms1.exp1(3) – 3 ** 3
  • ms1.exp1(4) – 4 ** 3
  • ms.exp1(3) – 3 ** 2 — this is the cool part: self.x for my ms object is still 2, and self.x for ms1 is 3

OOP3.jpg
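Here is a minimal version of myExp consistent with the examples above:

class myExp:
    def __init__(self, x):
        self.x = x            # the number you start with, stored on the instance

    def exp1(self, y):
        return y ** self.x    # raise the value you pass in to the stored power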

Here it is in running code:

OOP4
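And a sketch of that running code:

ms = myExp(2)          # self.x is 2 for this instance
print(ms.exp1(3))      # 9  (3 ** 2)
ms1 = myExp(3)         # self.x is 3 for this one
print(ms1.exp1(3))     # 27 (3 ** 3)
print(ms.exp1(3))      # 9  - ms still remembers its own self.x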

I can create as many Instances as I want, giving each Instance whatever value of self.x I want. And I can reuse them in any fashion I can imagine.

Here I pass one Instance as an argument to another.

OOP5.jpg

It is the reusability and the ability to assign different values to objects that make them so useful.

This is just an intro

Don't worry if you are a little confused. This is just an intro, and I will create more OOP lessons to hopefully help clear everything up.


Python: Create a Box whisker plot

Box whisker plots are used in stats to graphically view the spread of a data set, as well as to compare data sets.

If you would like to follow along with this example, here is the data set: sensors

Using pandas, let’s load the data set

%matplotlib inline
import pandas as pd
import matplotlib as mp
import matplotlib.pyplot as plt

sensorDF = pd.read_excel(r"C:\Users\Benjamin\Documents\sensors.xlsx")   # raw string avoids backslash escape issues
sensorDF.head()

Our data set represents monthly readings taken from 4 sensors over the span of a year.

boxplot

We need to convert the dataframe to a list of values for our box plot function.

To do this, first we need to flatten() our dataframe's values. The flatten() method places all the values from the dataframe into one flat list.

boxplot1.jpg

Now let us chop the list into the four sensors represented by the rows in our dataframe.

boxplot2.jpg

Finally, we need to make a list of these lists

boxplot3.jpg
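The prep steps live in the screenshots, so here is a minimal sketch of them, assuming the dataframe holds just the 4 sensor rows of 12 numeric monthly readings:

vals = sensorDF.values.flatten()                  # all 48 readings in one flat array
data = [vals[i*12:(i+1)*12] for i in range(4)]    # chop into one list per sensor - a list of lists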

I know that seemed like a lot, but you will spend more time cleaning and prepping data than any other task. It is just the nature of the job.

Let’s Plot

The code for creating a boxplot is now easy.

boxplot4.jpg
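A sketch of that call, using the list of lists built above:

plt.boxplot(data)
plt.show()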

Let’s label our chart a little better now.

boxplot5.jpg
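And a labeled version (the label text is illustrative):

plt.boxplot(data)
plt.title('Monthly Sensor Readings')
plt.xlabel('Sensor')
plt.ylabel('Reading')
plt.show()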


Python: Hypothesis Testing (T-Test)

Hypothesis testing is a first step into really understanding how to use statistics.

The purpose of the test is to tell if there is any significant difference between two data sets.

Consider the following example:

Let’s say I am trying to decide between two computers. I want to use the computer to run advanced analytics, so the only thing I am concerned with is speed.

I pick a sorting algorithm and a large data set and run it on both computers 10 times, timing each run in seconds.

Now I put the results into two lists, a and b.

a = [10,12,9,11,11,12,9,11,9,9]
b = [13,11,9,12,12,11,12,12,10,11]

A quick look at the data makes me think b is slower than a. But is it enough slower to mean something, or are these results just a matter of chance (meaning if I ran the test 200 more times, would the end result be closer to equal or further apart)?

Hypothesis test

To find out, let’s do a hypothesis test.

Set our Hypothesis:

  • H0 (the null hypothesis) – there is no significant difference between the data sets
  • H1 (the alternative hypothesis) – there is a significant difference

To test our hypothesis, let’s run a t-test

Import stats from scipy and run stats.ttest_ind().

Our output is the t-statistic and the p-value.

Our p-value is 0.08 – greater than the common significance level of 0.05. Since it is greater, we cannot reject H0. Statistically speaking, we can't call either computer faster.

hypoTest
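Here is the test written out; the p-value of roughly 0.08 matches the screenshot:

from scipy import stats

t_stat, p_value = stats.ttest_ind(a, b)
print(t_stat, p_value)   # p is about 0.08 - cannot reject H0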

Let’s try a third computer – d

d = [13,12,9,12,12,13,12,13,10,11]

Now, let's run a second t-test. This one comes back with a p-value of 0.026 – under 0.05. This means we can reject the null hypothesis: the speed difference between a and d is significant.

hypoTest1
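And the second test, written the same way:

t_stat, p_value = stats.ttest_ind(a, d)
print(t_stat, p_value)   # p is about 0.026 - reject H0; the difference is significant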