Python: Confusion Matrix

What is a confusion matrix?

A confusion matrix is a supervised machine learning evaluation tool that provides more insight into the overall effectiveness of a machine learning classifier. Unlike a simple accuracy metric, which is calculated by dividing the number of correctly predicted records by the total number of records, confusion matrices return 4 unique metrics for you to work with.

Accuracy is not always misleading, but there are times, especially when working with imbalanced data, when it can be all but useless.

Let’s consider credit card fraud. Given a list of credit card transactions, it is not uncommon for fraud events to make up as little as 1 in 10,000 records. This is referred to as severely imbalanced data. Now imagine a simple machine learning classifier running through that data and labeling everything as not fraudulent. When you checked the accuracy, it would come back as 99.99% accurate. Sounds great, right? Except you missed the fraud event, which was the only reason to create the model in the first place.

A confusion matrix will show you more detail, letting you know that you completely missed the fraud event. Instead of a single number, a confusion matrix provides you with 4 metrics to evaluate. (Note: the minority class, in this case the fraudulent events, is labeled positive by convention, so a non-fraud event is a negative. This is not a judgement between the classes, only a naming convention.)

TP = true positive – minority class (fraud) correctly predicted as positive

FP = false positive – majority class (not fraud) incorrectly predicted as positive

FN = false negative – minority class (fraud) incorrectly predicted as negative

TN = true negative – majority class (not fraud) correctly predicted as negative
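To make the fraud example concrete, here is a minimal pure-Python sketch. The data is made up (10,000 transactions with a single fraud, labeled 1), and the "model" is simply the lazy classifier described above that labels everything as not fraud.

```python
# Made-up data: 10,000 transactions, exactly one of which is fraud (label 1).
y_test = [0] * 9999 + [1]
# A "model" that labels every transaction as not fraud (label 0).
y_pred = [0] * 10000

# Count each of the four outcomes by comparing actual vs. predicted labels.
TP = sum(1 for a, p in zip(y_test, y_pred) if a == 1 and p == 1)
FP = sum(1 for a, p in zip(y_test, y_pred) if a == 0 and p == 1)
FN = sum(1 for a, p in zip(y_test, y_pred) if a == 1 and p == 0)
TN = sum(1 for a, p in zip(y_test, y_pred) if a == 0 and p == 0)

accuracy = (TP + TN) / len(y_test)
print(accuracy)        # 0.9999
print(TP, FP, FN, TN)  # 0 0 1 9999
```

The accuracy looks excellent, but TP = 0 and FN = 1 show that the one event we cared about was missed.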

In matrix form:

[Image: the four outcomes arranged as a 2x2 confusion matrix]

To run a confusion matrix in Python, scikit-learn (sklearn) provides a function called confusion_matrix(y_test, y_pred)

y_test = actual results from the test data set

y_pred = predictions made by model on test data set

So, in a pseudocode example:

model.fit(X_train, y_train)
y_pred = model.predict(X_test)

If this is at all confusing, refer to my Python SVM lesson where I create the training and testing set and run a confusion matrix (Python: Support Vector Machine (SVM))

To run a confusion matrix in Python, first run a model, then run predictions (as shown above) and then follow the code below:

from sklearn.metrics import confusion_matrix
confusion_matrix(y_test, y_pred)

Output looks like this:

[Image: confusion matrix output, a 2x2 array of counts]

Now, if you want to capture the TP, TN, FP, FN into individual variables to work with, you can add the ravel() function to your confusion matrix:

TN,FP,FN,TP = confusion_matrix(y_test, y_pred).ravel()
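As a quick sanity check on the unpacking order, here is a small sketch with made-up labels (not from any real model). It assumes scikit-learn is installed and that the labels are binary 0/1.

```python
from sklearn.metrics import confusion_matrix

# Made-up labels: 1 = positive class, 0 = negative class.
y_test = [0, 0, 0, 1, 1]
y_pred = [0, 1, 0, 1, 0]

# For binary labels, scikit-learn returns [[TN, FP], [FN, TP]],
# so ravel() flattens the matrix in the order TN, FP, FN, TP.
TN, FP, FN, TP = confusion_matrix(y_test, y_pred).ravel()
print(TN, FP, FN, TP)  # 2 1 1 1
```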

Thank you for taking the time to read this, and good luck on your analytics journey.

Python: Support Vector Machine (SVM)

Support Vector Machine (SVM):

A Support Vector Machine, or SVM, is a popular binary classification algorithm in machine learning. For those who may not know, a binary classifier is a predictive tool that returns one of two values as the result: (YES – NO), (TRUE – FALSE), (1 – 0). Think of it as a simple decision maker:

Should this applicant be accepted to college? (Yes – No)

Is this credit card transaction fraudulent? (Yes – No)

An SVM predictive model is built by feeding a labeled data set to the algorithm, making this a supervised machine learning model. Remember, when the training data contains the answer you are looking for, you are using a supervised learning model. The goal, of course, of a supervised learning model is that once built, you can feed the model new data which you do not know the answer to, and the model will give you the answer.

Brief explanation of an SVM:

An SVM is a discriminative classifier. It is actually an adaptation of a previously designed classifier called the perceptron. (The perceptron algorithm also helped to inform the development of artificial neural networks.)

The SVM works by finding the optimal hyperplane that can be used to discriminate between classes in the data set. (Classes refers to the label or “answer” column of each record: the true/false, yes/no column in a binary set.) When considering a two dimensional model, the hyperplane simply becomes a line that divides the two classes of data.

The hyperplane (or line in 2 dimensions) is informed by what are known as Support Vectors. A record from the data set is converted into a vector when fed through the algorithm (this is where a basic understanding of linear algebra comes in handy). Vectors (data records) closest to the decision boundary are called Support Vectors. It is on either side of this decision boundary that a vector is labeled by the classifier.

The focus on the support vectors, and where they place the decision boundary, is what tells the SVM where to put the optimal hyperplane. It is this focus on the support vectors, as opposed to the data set as a whole, that gives an SVM an advantage over a simple learner like a linear regression when dealing with complex data sets.
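To see support vectors in practice, here is a small sketch on made-up 2-D points. It assumes scikit-learn is installed and uses a linear kernel so the boundary stays a straight line; the points are purely illustrative.

```python
from sklearn.svm import SVC

# Made-up 2-D points: class 0 sits near the origin, class 1 near (5, 5).
X = [[0, 0], [0, 1], [1, 0], [1, 1], [4, 4], [4, 5], [5, 4], [5, 5]]
y = [0, 0, 0, 0, 1, 1, 1, 1]

# A linear kernel keeps the decision boundary a straight line in 2-D.
model = SVC(kernel="linear")
model.fit(X, y)

# The records closest to the boundary, which are the ones that define it.
print(model.support_vectors_)

# New points are labeled by which side of the boundary they fall on.
predictions = model.predict([[0.5, 0.5], [4.5, 4.5]])
print(predictions)
```

Only a few of the eight points end up as support vectors, which is the sense in which the SVM focuses on the records near the boundary rather than on the data set as a whole.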

Coding Exercise:

Libraries needed:

sklearn

pandas

This is the main reason I recommend the Anaconda distribution of Python, because it comes prepackaged with the most popular data science libraries.

from sklearn.model_selection import train_test_split
from sklearn.svm import SVC
from sklearn import metrics
from sklearn.metrics import confusion_matrix
import pandas as pd

Next, let’s look at the data set. This is the Pima Indians Diabetes data set. It is a publicly available data set consisting of 768 records. Columns are as follows:

  1. Number of times pregnant.
  2. Plasma glucose concentration at 2 hours in an oral glucose tolerance test.
  3. Diastolic blood pressure (mm Hg).
  4. Triceps skinfold thickness (mm).
  5. 2-Hour serum insulin (mu U/ml).
  6. Body mass index (weight in kg/(height in m)^2).
  7. Diabetes pedigree function.
  8. Age (years).
  9. Class variable (0 or 1).

Data can be downloaded with the link below

pima_indians

Once you download the file, load it into Python (your file path will be different):

df = pd.read_excel('C:\\Users\\blars\\Documents\\pima_indians.xlsx')

Now look at the data:

df.head()

[Image: output of df.head() showing the first rows of the data set]

Now keep in mind, class is our target. That is what we want to predict.

So let us start by separating the target class.

We use the pandas method .pop() to remove the Class column from the dataframe and assign it to the y variable. The remainder of the dataframe becomes X.
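The original code for this step is not shown, so here is a minimal sketch of what it might look like. It uses a tiny made-up dataframe in place of the real file, and the column names (including "Class") are assumptions.

```python
import pandas as pd

# Tiny stand-in for the diabetes data; the column names here are assumptions.
df = pd.DataFrame({
    "Glucose": [148, 85, 183, 89],
    "BMI": [33.6, 26.6, 23.3, 28.1],
    "Class": [1, 0, 1, 0],
})

# pop() removes the column from df and returns it as a Series.
y = df.pop("Class")
# Whatever is left in df is the feature matrix.
X = df

print(list(X.columns))  # ['Glucose', 'BMI']
print(y.tolist())       # [1, 0, 1, 0]
```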

Let’s now split the data into training and test sets:

X_train, X_test, y_train, y_test = train_test_split(X,y, test_size=0.33)

Now we will train (fit) the model. In this example I am using sklearn’s SVC() model. There are a lot of SVMs available to try if you would like to explore deeper.

Code for fitting the model:

model = SVC()
model.fit(X_train, y_train)

Now using the testing subset we withheld, we will test our model

y_pred = model.predict(X_test)

Now to see how good the model is, we will perform an accuracy test. This simply takes all the correct guesses and divides them by the total number of guesses.

As you can see below, we compare y_pred (predicted values) against y_test (actual values) and get 0.7677, or 77% accuracy. That is not a bad model for simply using defaults.

[Image: accuracy score output, 0.7677]

Let’s look at a confusion matrix to get just a little more in-depth info

[Image: confusion matrix output, [[151, 15], [44, 44]]]

For those not familiar with a confusion matrix, this will help you to interpret results:

First number 151 = True Negatives: the number of 0’s (non-diabetics) correctly predicted

Second number 15 = False Positives: the number of 0’s (non-diabetics) falsely predicted to be a 1

Third number 44 = False Negatives: the number of 1’s (diabetics) falsely predicted to be a 0

Fourth number 44 = True Positives: the number of 1’s (diabetics) correctly predicted

So, the model correctly identified 44 of the 88 diabetics in the test data, and misdiagnosed 15 of the 166 non-diabetics.
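The four counts are enough to recompute the accuracy by hand and to see where the model struggles. This is a small sketch using only the numbers listed above; the names recall and specificity are standard terms that I am introducing here for illustration.

```python
# The four counts reported in the confusion matrix above.
TN, FP, FN, TP = 151, 15, 44, 44

total = TN + FP + FN + TP        # 254 test records
accuracy = (TP + TN) / total     # share of all predictions that were right
recall = TP / (TP + FN)          # share of actual diabetics that were caught
specificity = TN / (TN + FP)     # share of non-diabetics correctly cleared

print(round(accuracy, 4))     # 0.7677
print(round(recall, 4))       # 0.5
print(round(specificity, 4))  # 0.9096
```

So the 77% accuracy hides the fact that the model only catches half of the diabetics, which is exactly the kind of detail a confusion matrix is meant to expose.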

To see a video version of this lesson, click the link here: Python: Build an SVM

R: Installing Packages with Dependencies

Usually installing packages in R is as simple as

install.packages("package name")

However, sometimes you will run into errors. This could be because the package you are trying to install has what is known as a dependency: in order for the package to properly install and run, it requires another package to already be installed.

You can think of this like trying to install an add-on for Excel like PowerQuery without having Excel installed in the first place. Clearly it would not work.

Now if you are lucky enough to know exactly what package(s) needs to be installed first, then you can simply install it and be on your way. However, most of the time we are not that lucky. And in the case of some packages, you may need to install up to a dozen packages up front to get it to work.

The easier way is to add the following argument to your command:

install.packages("package name", dependencies = TRUE)

Remember in R, Boolean (TRUE and FALSE) must be all capital letters or R will not recognize them as Boolean.

If you are using RStudio, you can install the package using the GUI

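At the top, go to Tools and select Install Packages from the drop-down.

[Image: RStudio Tools menu with Install Packages selected]

Start typing the name of the package you want in the box, and it will pop up in the window.

[Image: Install Packages dialog suggesting package names]

Finally, make sure Install dependencies is checked and click Install.

[Image: Install Packages dialog with Install dependencies checked]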

R: Manually entering data

You can use the data frame edit() function to manually enter / edit data in R.

Start by creating a data frame.

Note that I am initializing each of the columns with a zero-length vector, such as character(0) or numeric(0). This tells R that while I want the name column to be character and the age column to be numeric, I am leaving the size dynamic. You can set size limits during the initialization phase if you so choose.

dfe <- data.frame(name = character(0), age = numeric(0), jobTitle = character(0))

Now, let’s use the edit() function to add some data to our data frame

dfe <- edit(dfe)

When the new window pops up, fill in the data and simply click the X when you are done

[Image: data editor window with name, age, and jobTitle filled in]

You may get warning messages when you close out your edit window. These particular messages I got simply informed me that name and jobTitle were set as factors by R. Remember in R, warnings just want you to be aware of something, they are not errors.

Now if you run dfe, you can see your data frame

[Image: console output of dfe]

By running the edit() function again, you can edit the values that currently exist in the data frame.  In this example I am going to change Philip’s age from 28 to 29

If you want to add a column to the data frame, just add data to the next empty column

[Image: data editor with values typed into the next empty column]

You can just close out now and rename the column in R, or just click on the column header and you will be able to rename it there.

[Image: renaming the new column header in the data editor]

Now we have a new column listing pets

  name   age jobTitle pet
1 Ben    42  Data Sc  cat
2 Philip 29  Data Ana dog
3 Julia  36  Manager  frog

You can use the edit() function to manually edit existing data sets or data imported from other sources.

Below, I am editing the ChickWeight data set

[Image: ChickWeight data set open in the data editor]

R: Connecting to SQL Server Database

You can query data from a SQL Server database directly from R using the RODBC package.

install.packages("RODBC")

First you need to form a connection

library(RODBC)
##connection string
cn <- odbcDriverConnect(connection="Driver={SQL Server Native Client 11.0};server=localhost; database=SSRSTraining;trusted_connection=yes;")

We use the odbcDriverConnect() function. Inside we pass a connection = value

Driver = {SQL Server Native Client 11.0};  — this is based on the version of SQL Server you have

server=localhost;  — I used localhost because the SQL Server was on the same computer I was working with. Otherwise, pass the server name

database=SSRSTraining; — name of database I want to work with

trusted_connection=yes; — this means I am able to pass my Windows credentials.

If you don’t have a trusted connection, pass the user ID and password like this:

uid = userName; pwd = Password;

Note each parameter is separated by a semicolon

Query the database

> ##passes query to SQL Server
> df <- sqlQuery(cn, "select * FROM [SSRSTraining].[dbo].[JobDataSet]")
> head(df)

    Name              Job Hours Complete
1  Sally Predictive Model     1        n
2 Philip      Maintanence    10        n
3    Tom    Ad-hoc Report    12        y
4    Bob             SSRS     3        y
5 Philip         Tableau      7        n
6    Tom         Tableau      9        n

Using sqlQuery(), pass in the connection object (cn) and enclose your query in double quotes.

R: Importing Excel Files

To work with Excel files in R, you can use the readxl library

install.packages("readxl")

Use the read_excel() function to read an Excel workbook

> library(readxl)
>
> ## read excel file
> df1 <- read_excel("r_excel.xlsx")
> head(df1)
# A tibble: 6 x 4
  Name   Job              Hours Complete
  <chr>  <chr>            <dbl> <chr>  
1 Sally  Predictive Model    1. n      
2 Philip Maintanence        10. n      
3 Tom    Ad-hoc Report      12. y      
4 Bob    SSRS                3. y      
5 Philip Tableau             7. n      
6 Tom    Tableau             9. n

By default, read_excel() reads only the first sheet in the Excel file. To read other sheets, use the sheet argument.

> ## read  sheet 2
> df2 <- read_excel("r_excel.xlsx", sheet =2)
> head(df2)

# A tibble: 6 x 2
  Animal Num_Legs
  <chr>     <dbl>
1 Dog          4.
2 Duck         2.
3 Snake        0.
4 Horse        4.
5 Spider       8.
6 Human        2.

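In the next example, we are reading range B2:C6 on sheet 1 (the same notation as an Excel cell range). Note that read_excel() treats the first row of the range as the header, which is why Predictive Model and 1 show up as column names.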

> ## read sheet 1, range B2 - C6
> df3 <- read_excel("r_excel.xlsx", sheet =1, range = "B2:C6")
> df3
# A tibble: 4 x 2
  `Predictive Model`   `1`
  <chr>              <dbl>
1 Maintanence          10.
2 Ad-hoc Report        12.
3 SSRS                  3.
4 Tableau               7.

In this last example, we are importing only the first 4 rows on sheet 2

> ## read sheet 2, first 4 rows only
> df4 <- read_excel("r_excel.xlsx", sheet = 2, n_max = 4)
> df4
# A tibble: 4 x 2
  Animal Num_Legs
  <chr>     <dbl>
1 Dog          4.
2 Duck         2.
3 Snake        0.
4 Horse        4.

Python: Simulate Blockchain Mining

In my earlier tutorial, I demonstrated how to use the Python library hashlib to create a sha256 hash function. Now, using Python, I am going to demonstrate the principle of blockchain mining. Again using BitCoin as my model, I will be trying to find a nonce value that will result in a hash value below a predetermined target.

We will start by simply passing an incrementing integer through our SHA-256 hash function until we find a hash with 4 leading zeros.

I used a while loop, passing the variable “y” through my hashing function each time the loop runs. I then inspect the first 4 characters [:4] of the hash value. If they equal 0000, I exit the loop by setting the found variable to 1.

(*note, a hash value is a string – hence the need for quotes around ‘0000’)
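The original code lives in a screenshot, so here is a minimal sketch of the loop as described. The helper name sha256_hex and the choice to hash str(y) are my assumptions, so the iteration count may not match the one reported in this post.

```python
import hashlib

def sha256_hex(text):
    # Hash a string with SHA-256 and return the 64-character hex digest.
    return hashlib.sha256(text.encode()).hexdigest()

y = 0       # the integer we keep incrementing
found = 0   # flips to 1 once a matching hash turns up

while found == 0:
    hash_value = sha256_hex(str(y))
    # The digest is a string, so compare against the string '0000'.
    if hash_value[:4] == "0000":
        found = 1
    else:
        y += 1

print(y)
print(hash_value)
```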

[Image: while loop code and output, with a matching hash found after 88445 iterations]

As you can see in the version above, it took 88445 iterations to find an acceptable hash value

Now, using the basic example of a blockchain I gave in an earlier lesson, let’s simulate mining a block

[Image: the basic blockchain example from the earlier lesson]

You’ll see, I am now combining the block number, nonce, data, and previous hash of my simulated block and passing it through my hashing function. Just like in BitCoin, the only value I change per iteration is the Nonce. I keep passing my block through the hashing function until I find the Nonce that gives me a hash below the target.
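Here is a rough sketch of that idea. The function name mine_block, the way the fields are joined into one string, and the block contents are all assumptions for illustration; real BitCoin blocks are structured differently.

```python
import hashlib

def mine_block(block_number, data, previous_hash, prefix="0000"):
    # Try nonce = 0, 1, 2, ... until the block's hash starts with the prefix.
    nonce = 0
    while True:
        block_text = str(block_number) + str(nonce) + data + previous_hash
        block_hash = hashlib.sha256(block_text.encode()).hexdigest()
        if block_hash.startswith(prefix):
            return nonce, block_hash
        nonce += 1

# Made-up block contents for illustration only.
nonce, block_hash = mine_block(1, "Bob pays Philip 0.2", "0" * 64)
print(nonce, block_hash)
```

Requiring a longer prefix (for example "000000") makes the same function run much longer, since each extra hexadecimal zero makes a match 16 times less likely.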

[Image: block mining code and output with a 4-zero target]

Now, let’s lower the target value to 6 leading zeros. This should result in a longer runtime to get your hash

[Image: block mining output with a 6-zero target]

To measure the run time difference, let’s add some time stamps to our code

[Image: mining code with timestamps added]

So, I am taking a timestamp twice: d1 is our start time and d2 is our end time. Subtracting d1 from d2 gives our elapsed time. In the example below, my elapsed time was 5 secs.
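A minimal sketch of the timing approach follows. I am assuming datetime.now() for the timestamps, and I use a short 4-zero prefix and a made-up block string so it finishes quickly; your elapsed time will vary by machine.

```python
import hashlib
from datetime import datetime

def mine(data, prefix):
    # Brute-force search for a nonce whose hash starts with the prefix.
    nonce = 0
    while not hashlib.sha256((data + str(nonce)).encode()).hexdigest().startswith(prefix):
        nonce += 1
    return nonce

d1 = datetime.now()              # start time
nonce = mine("my block", "0000")
d2 = datetime.now()              # end time

elapsed = d2 - d1                # a timedelta
print(nonce, elapsed.total_seconds())
```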

[Image: output showing an elapsed time of about 5 seconds]

Now, let’s bump the target down to 7 leading zeros. This brings my elapsed computing time to 20 minutes. That is a considerable commitment of resources. You can see why they call it a “proof of work” now.

[Image: output for a 7-zero target, about 20 minutes elapsed]

Blockchain: Mining

This was, without question, the most confusing aspect of blockchain to me when I first tried to learn it. Maybe it was because of all the buzz around BitCoin mining. I usually find that when a topic becomes popular, misinformation spreads just as fast, if not faster, than actual information.

In an attempt to explain this topic I am going to be using BitCoin as my primary example. I am doing this mainly because I am sure that is where most of you first heard of the term blockchain mining. I will however, begin with a more generic explanation.

First off, let us separate blockchain mining from the idea of financial reward. Yes, in BitCoin, the miner who successfully mines a block is rewarded with BitCoin, but that is not a required part of a blockchain environment.

Mining, at its simplest form, just means successfully adding a new block to the blockchain.

So why can’t you just add a new block like you add a new element to a list or array in any other programming language? This has to do with the decentralized nature of blockchain. Since there is no centralized authority, blockchains rely on group consensus to verify that a new block added to the chain is valid. Keep in mind, this group is made up of anonymous nodes all over the world who do not know each other, and have no good reason to trust one another.

So, if you were to add a new block to the chain, the rest of the chain would need a mechanism in place that gives them time to update their copy of the chain and verify the block you added is good. So effectively we need a pause button.

How do we do that fairly and in a random manner that doesn’t allow for gaming of the system? We force anyone wishing to add a new block to show what is known as proof of work. Proof of work is proof that the person wishing to add the new block has completed a complex mathematical puzzle that required some level of resource allocation on their end. This effectively means they have skin in the game. There is a “cost” associated on their end. This “cost” deters the typical denial of service or blasting type attack.

In BitCoin the mining process goes like this:

  1. Bob wants to buy a guitar from Philip. They agree on a price of .2 BitCoin and through what is known as their BitCoin wallet a transaction is sent out to the BitCoin universe. A BitCoin wallet is a software client that the person trading with BitCoin uses. From the end user’s point of view, it is a lot like a Paypal type interaction.
  2. Once the transaction is out in the ether, nodes (computers) known as BitCoin miners verify the transaction using a set of established rules built into the BitCoin software they are all running. These rules verify sender and receiver public keys, timestamps, etc. Once verified, the transaction is put into a queue.
  3. Next the BitCoin miners build a new block. The goal is to make their new block the next block in the chain. This block will contain the following items:
    1. Block number – just the next number in the line
    2. Previous Hash – this is the hash value of the current last block in the chain
    3. Transactions – they will fill the block with verified transactions from the queue
    4. A timestamp
    5. The Nonce

On a side note – BitCoin is tuned so that a new block is added roughly every 10 minutes. This is a design decision made by the makers of BitCoin; it is not a requirement of blockchain.

Once they have all of that information, they pass their block into a hashing algorithm and get a hash for their new block – This is where mining gets interesting

You see, not any old hash will do. In order for the block to be added to the chain, the hash must be less than the target hash.

The target hash is established by the initial creator of the blockchain. It can be whatever they choose. In the case of BitCoin, the target is automatically readjusted every 2,016 blocks (roughly every two weeks), moving lower or higher so that blocks keep arriving about every 10 minutes.

Okay, so what is the big deal about trying to get below the target?

Let’s consider this problem using a 6 digit number. (keep in mind that sha256 uses a 64 digit hexadecimal number – much bigger).

For our purposes, a hash behaves like a random number generator: you cannot predict the output without actually running it. Sure, it looks funny in hexadecimal, but it can be converted to a base 10 number we all understand.

So, when I pass anything into my imaginary 6 digit hashing function, I can expect one of 1 million possible results (if I only consider positive numbers) – 000,000 to 999,999

Now, let’s set a target for our hash. Let’s say that in order for a hash to be under the target, it must have a first digit of 0. I know what you are thinking – big deal!  Well, actually, it kind of is. We have just gone from 1 million possible hashes to 100,000.  We have effectively made 9 out of 10 available hashes invalid. We can now only accept 000,000 – 099,999. Now, let’s make the first 3 digits of our target 0.  So now we have gone from 1 million possible hashes to 1000. 1000/1,000,000 = .001. So every time you run a hashing function, you now have a 0.1% chance of getting a valid hash.
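The arithmetic above can be checked with a few lines of Python. The function name fraction_valid is made up for this sketch; it just expresses the idea that every required leading zero keeps 1 out of every "base" possible values.

```python
def fraction_valid(base, leading_zeros):
    # Each required leading zero keeps only 1 out of `base` possible values.
    return 1 / base ** leading_zeros

# The imaginary 6-digit decimal hash from the text.
print(fraction_valid(10, 1))   # 0.1   -> 100,000 of 1,000,000 hashes
print(fraction_valid(10, 3))   # 0.001 -> 1,000 of 1,000,000 hashes

# A real SHA-256 digest is hexadecimal, so each leading zero is 16x harder.
print(1 / fraction_valid(16, 4))  # 65536.0 tries on average for '0000'
```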

Now think about our 64 digit hexadecimal number. It has a maximum value of 2^256 - 1, which is roughly 1.16 x 10^77 when converted to base 10. Each hexadecimal digit has 16 possible values, so if I made a requirement for the first 6 digits to be 0, only 1 out of every 16^6 = 16,777,216 hashes would qualify.

Every time you run your hashing function, you have about a 0.000006% chance of getting one below the target. So as you can see, in order to get a hash below the target, you will most likely have to use a brute force approach.

And that is what miners do. In fact, the current difficulty of finding a hash below the target in BitCoin has led to hundreds of thousands of nodes teaming up together to find a good hash in a brute force manner.

The way they find the hash is through the Nonce. You see, if you pass the word “dog” through my imaginary hash, you will get 123456. If you pass it to my hash 1 million times, you will get 123456 1 million times. So the way it works in blockchain is that we add the nonce to the input before hashing: dog+1 will give us 879602 and dog+2 will give us 258665. We repeat this, incrementing the nonce until we get a good hash: dog+28549 gives us 000587. The nonce that gave us that hash, 28549, is called the golden nonce.

And in the world of BitCoin, once you have a golden nonce, you have your proof of work. You can now place your block on the chain.

It is called a proof of work because other nodes on the blockchain can very quickly verify your work. They just pass your block with your golden nonce through the hash function and they will get a hash of 000587, which is below the target. When enough nodes have verified your proof of work (a consensus), your block becomes locked into the chain and can no longer be changed or removed.
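The asymmetry between mining and verifying can be sketched with real SHA-256 in place of the imaginary 6 digit hash. The function names and the 3-zero prefix are assumptions chosen so the example runs quickly.

```python
import hashlib

def block_hash(data, nonce):
    # Hash the block data together with the nonce.
    return hashlib.sha256((data + str(nonce)).encode()).hexdigest()

def verify(data, nonce, prefix):
    # Verification is a single hash, no matter how long mining took.
    return block_hash(data, nonce).startswith(prefix)

# The miner does the expensive search (a short 3-zero prefix keeps this quick).
data, prefix = "dog", "000"
golden_nonce = 0
while not verify(data, golden_nonce, prefix):
    golden_nonce += 1

# Any other node can confirm the work with one call.
print(golden_nonce, verify(data, golden_nonce, prefix))
```

The miner may need thousands of attempts to find the golden nonce, while every other node needs only one hash to check it.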

Okay, so what about the reward BitCoin miners get? Well, built into the BitCoin algorithm, each block mined is worth an ever decreasing number of BitCoin. At the time of this writing, I believe a mined block is worth 12.5 BitCoins. On top of that, the miners also get transaction fees from the people wishing to buy something with BitCoin. In the example above, Bob might offer up .1 BitCoin as a transaction fee as encouragement for some miner to put his transaction into the blockchain. Keep in mind, there is no bank here, no centralized entity. So the transaction fee is the fee you pay to help encourage total strangers to use their time and electricity to verify and move your transaction into the blockchain.

I promise to create another lesson diving deeper into transaction fees. Until then, I hope this offers up a basic understanding of how mining works.

Python: Create a Blockchain Hash Function

If you are at all like me, reading about a concept is one thing. Actually practicing it though, that helps me to actually understand it. If you have been reading my blockchain tutorial, or if you came from an outside tutorial, then you have undoubtedly read enough about cryptographic hashes.

Enough reading, let’s make one:

(If you are unfamiliar with cryptographic hashes, you can reference my tutorial on them here: Blockchain: Cryptographic Hash)

For this example, I am using the Anaconda Python 3 distribution.

Like most things in Python, creating a hash is as simple as importing a library someone has already created for us. In this case, that library is: hashlib

So our first step is to import hashlib

import hashlib

Now let us take a moment to learn the syntax required to create a cryptographic hash with hashlib. In this example, I am using the SHA 256 hashing algorithm. I am using this because it is the same algorithm used by BitCoin.

Here is the syntax used

hashlib.sha256(string.encode()).hexdigest()

To understand the syntax: we are calling the hashlib function sha256(): hashlib.sha256()

Inside the parentheses, we pass the string we want to hash. Yes, it must start out as a string for this syntax to work.

Still inside the parentheses, we call the string method .encode() to convert the string into bytes (UTF-8 by default), because sha256() only accepts bytes, not text.

Finally, I added the method .hexdigest() to have the algorithm return our hash in hexadecimal format. This format will help in understanding future lessons on blockchain mining.

So in the example below, you can see that I assigned the variable x the string ‘doggy’. I then passed x to our hash function. The output can be seen below.
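For readers who cannot see the screenshot, a minimal version of this example might look like the following. The variable name hash_value is my own choice.

```python
import hashlib

x = "doggy"
# encode() turns the string into bytes, sha256() hashes the bytes,
# and hexdigest() returns the result as a hexadecimal string.
hash_value = hashlib.sha256(x.encode()).hexdigest()

print(hash_value)
print(len(hash_value))  # 64 hex characters = 256 bits
```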

[Image: hashing the string 'doggy' and the resulting hex digest]

Now a hash can hold much more than just a simple word. Below, I have passed the Gettysburg Address to the hashing function.

(**note the ''' ''' triple quotes. Those are used in Python if your string takes up more than one line **)

[Image: hashing the full text of the Gettysburg Address]

Now I try passing a number. You will notice I get an error.

[Image: error raised when hashing the integer 8]

To avoid the error, I turn the integer 8 into a string with the str() function

[Image: hashing str(8) without an error]

Below I concatenate a string and an integer.

[Image: hashing a string concatenated with an integer]
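The last three steps can be reproduced in one short sketch. The variable names are mine, and the try/except is only there so the script keeps running after the expected error.

```python
import hashlib

# An integer has no encode() method, so hashing it directly fails.
try:
    hashlib.sha256((8).encode()).hexdigest()
except AttributeError as err:
    print("Error:", err)

# Converting the integer to a string first avoids the error.
hash_of_8 = hashlib.sha256(str(8).encode()).hexdigest()

# The same trick lets you combine a string and an integer.
hash_of_combo = hashlib.sha256(("doggy" + str(8)).encode()).hexdigest()

print(hash_of_8)
print(hash_of_combo)
```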

Last I want to show the avalanche effect of the hash function.

[Image: hash of a string beginning with an uppercase T]

By simply changing the first letter from an uppercase T to a lowercase t the hash changes completely. This is a requirement for hashing functions. If the hash did not change dramatically from a small change to the string, it would be easy to reverse engineer the hash. This is known as the avalanche effect.

[Image: hash of the same string beginning with a lowercase t]
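To try the avalanche effect yourself, here is a small sketch. The two sentences are made up (the original strings are only visible in the screenshots), and counting differing characters is just one rough way to show how much the hash changes.

```python
import hashlib

def sha256_hex(text):
    # Hash a string with SHA-256 and return the hex digest.
    return hashlib.sha256(text.encode()).hexdigest()

# Made-up sentences that differ only in the case of the first letter.
h1 = sha256_hex("The quick brown fox")
h2 = sha256_hex("the quick brown fox")

# Count how many of the 64 hex characters differ position by position.
differing = sum(1 for a, b in zip(h1, h2) if a != b)

print(h1)
print(h2)
print(differing, "of 64 characters differ")
```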