Support Vector Machine (SVM):
A Support Vector Machine, or SVM, is a popular machine learning algorithm for binary classification. For those who may not know, a binary classifier is a predictive tool that returns one of two values as its result: (YES – NO), (TRUE – FALSE), (1 – 0). Think of it as a simple decision maker:
Should this applicant be accepted to college? (Yes – No)
Is this credit card transaction fraudulent? (Yes – No)
An SVM predictive model is built by feeding a labeled data set to the algorithm, making this a supervised machine learning model. Remember, when the training data contains the answer you are looking for, you are using a supervised learning model. The goal, of course, of a supervised learning model is that once built, you can feed the model new data which you do not know the answer to, and the model will give you the answer.
Brief explanation of an SVM:
An SVM is a discriminative classifier. It is actually an adaptation of a previously designed classifier called the perceptron. (The perceptron algorithm also helped to inform the development of artificial neural networks.)
The SVM works by finding the optimal hyperplane that can be used to discriminate between classes in the data set. (Classes refers to the label or “answer” column of each record: the true/false, yes/no column in a binary set.) In a two-dimensional model, the hyperplane simply becomes a line that divides the two classes of data.
The hyperplane (or line in 2 dimensions) is informed by what are known as Support Vectors. A record from the data set is converted into a vector when fed through the algorithm (this is where a basic understanding of linear algebra comes in handy). The vectors (data records) closest to the decision boundary are called Support Vectors. The classifier labels a vector according to which side of the decision boundary it falls on.
The support vectors, and where they place the decision boundary, are what tell the SVM where to put the optimal hyperplane. It is this focus on the support vectors, as opposed to the data set as a whole, that gives the SVM an advantage over a simple learner like a linear regression when dealing with complex data sets.
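To make the idea of support vectors a little more concrete, here is a minimal sketch on a made-up two-dimensional data set. The points, the variable names (X_toy, y_toy, toy_model) and the choice of a linear kernel are my own assumptions for illustration; they are not part of the diabetes example that follows.

```python
import numpy as np
from sklearn.svm import SVC

# Hypothetical toy data: two well-separated clusters in 2 dimensions.
X_toy = np.array([[1.0, 1.0], [1.0, 2.0], [2.0, 1.0],
                  [4.0, 4.0], [4.0, 5.0], [5.0, 4.0]])
y_toy = np.array([0, 0, 0, 1, 1, 1])

# A linear kernel keeps the decision boundary a straight line.
toy_model = SVC(kernel="linear")
toy_model.fit(X_toy, y_toy)

# Only the records nearest the boundary are kept as support vectors.
print(toy_model.support_vectors_)

# New points are labeled by the side of the boundary they fall on.
print(toy_model.predict([[1.5, 1.5], [4.5, 4.5]]))
```

Notice that the fitted model keeps only a subset of the training points as support vectors; the points far from the boundary do not affect where the line is drawn.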
I recommend the Anaconda distribution of Python because it comes prepackaged with the most popular data science libraries, including the ones imported below.
from sklearn.model_selection import train_test_split
from sklearn.svm import SVC
from sklearn import metrics
from sklearn.metrics import confusion_matrix
import pandas as pd
Next, let’s look at the data set. This is the Pima Indians Diabetes data set. It is a publicly available data set consisting of 768 records. Columns are as follows:
- Number of times pregnant.
- Plasma glucose concentration at 2 hours in an oral glucose tolerance test.
- Diastolic blood pressure (mm Hg).
- Triceps skinfold thickness (mm).
- 2-Hour serum insulin (mu U/ml).
- Body mass index (weight in kg/(height in m)^2).
- Diabetes pedigree function.
- Age (years).
- Class variable (0 or 1).
Data can be downloaded with the link below
Once you download the file, load it into Python (your file path will be different):
df = pd.read_excel('C:\\Users\\blars\\Documents\\pima_indians.xlsx')
Now look at the data:
Keep in mind that Class is our target. That is what we want to predict.
So let us start by separating the target class.
We use the pandas command .pop() to move the Class column into the y variable, and the remainder of the dataframe becomes X.
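The separation step might look something like the sketch below. Since it cannot load your file, it uses a tiny made-up dataframe (df_demo) in place of the real one; the feature column names and values are hypothetical, and only the Class column name comes from the text above.

```python
import pandas as pd

# Hypothetical three-row stand-in for the real data set.
df_demo = pd.DataFrame({
    "Glucose": [120, 95, 160],
    "BMI": [30.1, 24.5, 35.2],
    "Class": [1, 0, 1],
})

# pop() returns the column as a Series and removes it from the dataframe in place.
y_demo = df_demo.pop("Class")

# Whatever is left in the dataframe is the feature set.
X_demo = df_demo

print(list(X_demo.columns))  # ['Glucose', 'BMI']
print(list(y_demo))          # [1, 0, 1]
```

On the real data, the same two lines would be applied to df, giving you the X and y used in the next step.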
Let’s now split the data into training and test sets:
X_train, X_test, y_train, y_test = train_test_split(X,y, test_size=0.33)
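If you want to see how many records end up on each side of the split, here is a small sketch using placeholder arrays with the same 768 rows as the data set. The arrays themselves and the random_state argument are my additions (random_state just makes the shuffle repeatable); the original call above does not set it.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Placeholder arrays with the same number of rows as the data set (768).
X_rows = np.zeros((768, 8))
y_rows = np.arange(768) % 2

# random_state is an assumption added here so the split is reproducible.
X_tr, X_te, y_tr, y_te = train_test_split(
    X_rows, y_rows, test_size=0.33, random_state=42
)

print(len(X_tr), len(X_te))  # 514 254
```

So with test_size=0.33, the model is trained on 514 records and tested on the 254 that were held back.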
Now we will train (fit) the model. In this example I am using scikit-learn's SVC() model. There are a lot of other SVMs available to try if you would like to explore deeper.
Code for fitting the model:
model = SVC()
model.fit(X_train, y_train)
Now, using the testing subset we withheld, we will test our model:
y_pred = model.predict(X_test)
Now to see how good the model is, we will perform an accuracy test. This simply takes all the correct guesses and divides them by the total number of guesses.
Comparing y_pred (predicted values) against y_test (actual values), we get 0.7677, or about 77% accuracy, which is not bad for a model that simply uses the defaults.
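One way to run this kind of accuracy test is scikit-learn's accuracy_score function. The sketch below uses two short made-up lists (actual and predicted) so the arithmetic is easy to check by hand; in the lesson you would pass y_test and y_pred instead.

```python
from sklearn.metrics import accuracy_score

# Hypothetical labels: 4 of the 5 predictions match the actual values.
actual = [0, 1, 1, 0, 1]
predicted = [0, 1, 0, 0, 1]

# Correct guesses divided by total guesses: 4 / 5.
print(accuracy_score(actual, predicted))  # 0.8
```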
Let’s look at a confusion matrix to get a little more in-depth information:
For those not familiar with a confusion matrix, this will help you to interpret results:
First number, 151 = True Negatives: the number of 0’s (non-diabetics) correctly predicted.
Second number, 15 = False Positives: the number of 0’s (non-diabetics) falsely predicted to be a 1.
Third number, 44 = False Negatives: the number of 1’s (diabetics) falsely predicted to be a 0.
Fourth number, 44 = True Positives: the number of 1’s (diabetics) correctly predicted.
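If you would rather have those four numbers as named variables, scikit-learn's confusion_matrix can be flattened with .ravel(), which returns them in the same order as the list above for a binary problem. This is a small sketch on made-up labels (actual and predicted are hypothetical); in the lesson you would pass y_test and y_pred.

```python
from sklearn.metrics import confusion_matrix

# Hypothetical labels: 3 non-diabetics (0) followed by 4 diabetics (1).
actual = [0, 0, 0, 1, 1, 1, 1]
predicted = [0, 0, 1, 1, 1, 0, 1]

# Rows are actual classes, columns are predicted classes,
# so flattening gives TN, FP, FN, TP in that order.
tn, fp, fn, tp = confusion_matrix(actual, predicted).ravel()
print(tn, fp, fn, tp)  # 2 1 1 3

# Share of the actual 1's that the model caught.
print(tp / (tp + fn))  # 0.75
```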
So, the model correctly identified 44 of the 88 diabetics in the test data (44 true positives plus 44 false negatives), and misdiagnosed 15 of the 166 non-diabetics (151 true negatives plus 15 false positives) as diabetic.
To see a video version of this lesson, click the link here: Python: Build an SVM