Python: Confusion Matrix

What is a confusion matrix?

A confusion matrix is a supervised machine learning evaluation tool that provides more insight into the overall effectiveness of a machine learning classifier. Unlike a simple accuracy metric, which is calculated by dividing the number of correctly predicted records by the total number of records, confusion matrices return 4 unique metrics for you to work with.

While I am not saying accuracy is always misleading, there are times, especially when working with examples of imbalanced data,  that accuracy can be all but useless.

Let’s consider credit card fraud. It is not uncommon that given a list of credit card transactions, that a fraud event might make up a little as 1 in 10,000 records. This is referred to a severely imbalanced data.  Now imaging a simple machine learning classifier running through that data and simply labeling everything as not fraudulent. When you checked the accuracy, it would come back as 99.99% accurate. Sounds great right? Except you missed the fraud event, the only reason to try to create the model in the first place.

A confusion matrix will show you more details, letting you know that you completely missed the fraud event. Instead of a single number result, a confusion matrix provides you will 4 metrics to evaluate. (note: the minority class – (in the case of fraud – the fraudulent events) – are labeled positive by confusion matrices. So a non-fraud event is a negative. This is not a judgement between the classes, only a naming convention)

TP = true positive – minority class (fraud) is correctly predicted as positive

FP = false positive – majority class (not fraud) is incorrectly predicted

FN = false negative – minority class (fraud) incorrectly predicted

TN = true negative – majority class (not fraud) correctly predicted

In matrix form:

confus

To run a confusion matrix in Python, Sklearn provides a method called confusion_matrix(y_test, y_pred)

y_test = actual results from the test data set

y_pred = predictions made by model on test data set

so in a pseudocode example:

model.fit(X,y)
y_pred = model.predict(X_test)

If this is at all confusing, refer to my Python SVM lesson where I create the training and testing set and run a confusion matrix (Python: Support Vector Machine (SVM))

To run a confusion matrix in Python, first run a model, then run predictions (as shown above) and then follow the code below:

from sklearn.metrics import confusion_matrix
confusion_matrix(y_test, y_pred)

Output looks like this:

Confu1

Now, if you want to capture the TP, TN, FP, FN into individual variables to work with, you can add the ravel() function to your confusion matrix:

TN,FP,FN,TP = confusion_matrix(y_test, y_pred).ravel()

Thank you for taking the time to read this, and good luck on your analytics journey.

Please Leave a Reply

Fill in your details below or click an icon to log in:

WordPress.com Logo

You are commenting using your WordPress.com account. Log Out /  Change )

Google photo

You are commenting using your Google account. Log Out /  Change )

Twitter picture

You are commenting using your Twitter account. Log Out /  Change )

Facebook photo

You are commenting using your Facebook account. Log Out /  Change )

Connecting to %s