Python: Confusion Matrix

What is a confusion matrix?

A confusion matrix is a supervised machine learning evaluation tool that provides more insight into the overall effectiveness of a machine learning classifier. Unlike a simple accuracy metric, which is calculated by dividing the number of correctly predicted records by the total number of records, confusion matrices return 4 unique metrics for you to work with.

While I am not saying accuracy is always misleading, there are times, especially when working with examples of imbalanced data,  that accuracy can be all but useless.

Let’s consider credit card fraud. It is not uncommon that given a list of credit card transactions, that a fraud event might make up a little as 1 in 10,000 records. This is referred to a severely imbalanced data.  Now imaging a simple machine learning classifier running through that data and simply labeling everything as not fraudulent. When you checked the accuracy, it would come back as 99.99% accurate. Sounds great right? Except you missed the fraud event, the only reason to try to create the model in the first place.

A confusion matrix will show you more details, letting you know that you completely missed the fraud event. Instead of a single number result, a confusion matrix provides you will 4 metrics to evaluate. (note: the minority class – (in the case of fraud – the fraudulent events) – are labeled positive by confusion matrices. So a non-fraud event is a negative. This is not a judgement between the classes, only a naming convention)

TP = true positive – minority class (fraud) is correctly predicted as positive

FP = false positive – majority class (not fraud) is incorrectly predicted

FN = false negative – minority class (fraud) incorrectly predicted

TN = true negative – majority class (not fraud) correctly predicted

In matrix form:


To run a confusion matrix in Python, Sklearn provides a method called confusion_matrix(y_test, y_pred)

y_test = actual results from the test data set

y_pred = predictions made by model on test data set

so in a pseudocode example:,y)
y_pred = model.predict(X_test)

If this is at all confusing, refer to my Python SVM lesson where I create the training and testing set and run a confusion matrix (Python: Support Vector Machine (SVM))

To run a confusion matrix in Python, first run a model, then run predictions (as shown above) and then follow the code below:

from sklearn.metrics import confusion_matrix
confusion_matrix(y_test, y_pred)

Output looks like this:


Now, if you want to capture the TP, TN, FP, FN into individual variables to work with, you can add the ravel() function to your confusion matrix:

TN,FP,FN,TP = confusion_matrix(y_test, y_pred).ravel()

Thank you for taking the time to read this, and good luck on your analytics journey.

Python: Support Vector Machine (SVM)

Support Vector Machine (SVM):

A Support Vector Machine, or SVM, is a popular binary classifier machine learning algorithm. For those who may not know, a binary classifier is a predictive tool that returns one of two values as the result, (YES – NO), (TRUE – FALSE), (1 – 0).  Think of it as a simple decision maker:

Should this applicant be accepted to college? (Yes – No)

Is this credit card transaction fraudulent? (Yes – No)

An SVM predictive model is built by feeding a labeled data set to the algorithm, making this a supervised machine learning model. Remember, when the training data contains the answer you are looking for, you are using a supervised learning model. The goal, of course, of a supervised learning model is that once built, you can feed the model new data which you do not know the answer to, and the model will give you the answer.

Brief explanation of an SVM:

An SVM is a discriminative classifier. It is actually an adaptation of a previously designed classifier called perceptron. (The perceptron algorithm also helped to inform the development of artificial neural networks).

The SVM works by finding the optimal hyperplane that can be used to discriminate between classes in the data set. (Classes refers to the label or “answer” column of each record.  The true/false, yes/no column in a binary set). When considering a two dimensional model, the hyperplane simply becomes a line that divides to the classes of data.

The hyperplane (or line in 2 dimensions) is informed by what are known as Support Vectors. A record from the data set is converted into a vector when fed through the algorithm (this is where a basic understanding of linear algebra comes in handy). Vectors (data records) closest to the decision boundary are called Support Vectors. It is on either side of this decision boundary that a vector is labeled by the classifier.

The focus on the support vectors and where they deem the decision boundary to be, is what informs the SVM as to where to place the optimal hyperplane. It is this focus on the support vectors as opposed to the data set as a whole, that gives SVM an advantage over a simple learner like a linear regression, when dealing with complex data sets.

Coding Exercise:

Libraries needed:



This is the main reason I recommend the Anaconda distribution of Python, because it comes prepackaged with the most popular data science libraries.

from sklearn.model_selection import train_test_split
from sklearn.svm import SVC
from sklearn.metrics import metrics
from sklearn.metrics import confusion_matrix
import pandas as pd

Next, let’s look at the data set. This is the Pima Indians Diabetes data set. It is a publicly available data set consisting of 768 records. Columns are as follows:

  1. Number of times pregnant.
  2. Plasma glucose concentration a 2 hours in an oral glucose tolerance test.
  3. Diastolic blood pressure (mm Hg).
  4. Triceps skinfold thickness (mm).
  5. 2-Hour serum insulin (mu U/ml).
  6. Body mass index (weight in kg/(height in m)^2).
  7. Diabetes pedigree function.
  8. Age (years).
  9. Class variable (0 or 1).

Data can be downloaded with the link below


Once you download the file, load it into python (you’re file path will be different)

df = pd.read_excel(‘C:\\Users\\blars\\Documents\\pima_indians.xlsx’)

now look at the data:



Now keep in mind, class is our target. That is what we want to predict.

So let us start by separating the target class.

We use the pandas command .pop() to remove the Class column to the y variable, and the remained of the dataframe is now in the X

Let’s now split the data into training and test sets:

X_train, X_test, y_train, y_test = train_test_split(X,y, test_size=0.33)

Now we will train (fit) the model. This example I am using Sklearns SVC() model for an SVM example. There are a lot of SVMs available to try if you would like to explore deeper.

Code for fitting the model:

model =SVC(), y_train)

Now using the testing subset we withheld, we will test our model

y_pred = model.predict(X_test)

Now to see how good the model is, we will perform an accuracy test.  This simply takes all the correct guess and divides them by total guesses.

As, you can seen below, we compare the y_pred (predicted values) against y_test (actual values) and we get .7677 or 77% accuracy. Which is not a bad model for simply using defaults.


Let’s look at a confusion matrix to get just a little more in-depth info


For those not familiar with a confusion matrix, this will help you to interpret results:

First number 151 = True Negatives — this would be the number of 0’s or not diabetics correctly predicted

Second number 15 = False Positives — the number of 0’s (non-diabetics) falsely predicted to be a 1

Third number 44 = False negatives — the number of 1’s (diabetics) falsely predicted to be a 0

Fourth number 44 = True Positives — the number of 1 (diabetics) correctly predicted.

So, the model correctly identified 44 out of the 59 diabetics in the test data, and misdiagnoses 44 out the 195 non diabetics in the data sample.

To see a video version of this lesson, click the link here: Python: Build an SVM

Ensemble Modeling

In the world of analytics,modeling is a general term used to refer to the use of data mining (machine learning) methods to develop predictions. If you want to know what ad a particular user is more likely to click on, or which customers are likely to leave you for a competitor, you develop a predictive model.

There are a lot of models to choose from: Regression, Decision Trees, K Nearest Neighbor, Neural Nets, etc. They all will provide you with a prediction, but some will do better than others depending on the data you are working with. While there are certain tricks and tweaks one can do to improve the accuracy of these models, it never hurts to remember the fact that there is wisdom to be found in the masses.

The Jelly Bean Jar

I am sure everyone has come across some version of this in their life: you are at a fair or school fund raising event and someone has a large see-through jar full of jelly beans (or marbles or nickles). Next to the jar are some slips of paper with the instructions to “Guess the number of jelly beans in the jar you win!”

An interesting thing about this game, and you can try this out for yourself, is that given a reasonable number of participants, more often than not, the average guess of the group will perform better than the best individual guesser. Or in other words, imagine there are 200 jelly beans in the jar and the best guesser (the winner) guesses 215. More often than not, the average of all the guesses might be something like 210 or 190. The group cancels out their over and under guessing, resulting in a better answer than anyone individual.

How Do We Get the Average in Models?

There are countless ways to do it, and researchers are constantly trying new approaches to get that extra 2% improvement over the last model. For ease of understanding though, I am going to focus on 2 very popular methods of ensemble modeling : Random Forests & Boosted Trees.


Random Forests:

Imagine you have a data set containing 50,000 records. We will start by randomly selecting 1000 records and creating a decision tree from those records. We will then put the records back into the data set and draw another 1000 records, creating another decision tree. The process is repeated over and over again for a predefined number of iterations (each time the data used is returned to the pool where it could possibly be picked again).

After all the sample decision trees have been created (let’s say we created 500 for the sake of argument), the model then takes the mean or average of all the models if you are looking at a regression or the mode of all the models if you are dealing with a classification.

For those unfamiliar with the terminology, a regression model looks for a numeric value as the answer. It could be the selling price of a house, a person’s weight, the price of a stock, etc. While a classification looks for classifying answers: yes or no, large – medium – small, fast or slow, etc.

Boosted Trees:

Another popular method of ensemble modeling is known as boosted trees. In this method, a simple (poor learner) model tree is created – usually 3-5 splits maybe. Then another small tree (3-5 splits) is built by using incorrect predictions for the first tree. This is repeated multiple times (say 50 in this example), building layers of trees, each one getting a little bit better than the one before it. All the layers are combined to make the final predictive model.


Now I know this may be an oversimplified explanation, and I will create some tutorials on actually building ensemble models, but sometimes I think just getting a feel for the concept is important.

So are ensemble models always the best? Not necessarily.

One thing you will learn when it comes to modeling is that no one method is the best. Each has their own strengths. The more complex the model, the longer it takes to run, so sometimes you will find speed outweighs desire for the added 2% accuracy bump. The secret is to be familiar with the different models, and to try them out in different scenarios. You will find that choosing the right model can be as much of an art as a science.

Simpson’s Paradox: How to Lie with Statistics

We’ve all heard the saying from Benjamin Disreali, “Lies, damn lies, and statistics.”

While statistics has proven to be of great benefit to mankind in almost every endeavor, inexperienced, sloppy, and downright unscrupulous statisticians have made some pretty wild claims. And because these wild claims are often presented as statistical fact, people in all industries – from business, to healthcare, to education -have chased these white elephants right down the rabbit hole.

Anyone who has taken even an introductory statistics course can tell you how easily statistics can be misrepresented. One of my favorite examples involves using bar charts to confuse the audience. Look at the chart below. It represents the number of games won by two teams in a season of beer league softball.


At first glance, you might think Team B won twice as many games as Team A, and that is indeed the intention of the person who made this chart. But when you look at the numbers to the left, you will see Team A won 15 games to Team B’s 20. While I am no mathematician, even I know 15 is not half of 20.

This deception was perpetrated by simply adjusting the starting point of the Y – Axis. When you reset it to 0, the chart tells a different story.


Even Honest People Can Lie by Accident

In the example above, the person creating the chart was manipulating the data on purpose to achieve a desired effect. You may look at this and say I would never deceive people like that, the truth is – you just might do it by accident.

What do I mean? Let’s take an example from an industry fraught with horrible statistics – our education system.

Below you will find a chart depicting the average math scores on a standardized test since 2000 for Happy Town, USA. You will notice the test scores are significantly lower now than they were back in 2000.


What does this mean? Are the kids getting stupider? Has teacher quality gone down? Who should be held accountable for this? Certainly those lazy tenured teachers who are only there to collect their pensions and leach off the tax payers.

I mean look at the test scores. The average score has dipped from around 90 to close to 70. Surely something in the system is failing.

Now what if I were to tell you that the chart above – while correct – does not tell the whole story. Test scores in Happy Town, USA are actually up – if you look at the data correctly.

What we are dealing with is something known in statistics as Simpson Paradox, and even some of the brightest academic minds have published research that ignored this very important concept.

What do I mean?

Let me tell you the whole story about Happy Town, USA. Happy Town was your average American middle class town. The economic make-up of this town in 2000 was 20% of the families made over $150K, 60% made between $150K and $50K, with 20% earning less than $50K a year.

In 2008, that all changed. The recession hit causing people to lose their jobs and default on their mortgages. Families moved out, housing prices fell. Due to the new lower housing prices, families from Non-So Happy Town, USA were able to afford houses in Happy Town. They moved their families there in hopes of a better education and better life for their children.

While the schools in Happy Town were better, the teachers were not miracle workers. These kids from Not So Happy Town did not have the strong educational foundation the pre-recession residents of Happy Town did. Many teachers found themselves starting almost from scratch.

No matter how hard these new kids and their teachers tried, they could never be expected to jump right in and perform as well as the pre-2008 Happy Town kids. The economic makeup of the town shifted. The under $50K’s now represent 60% of the town’s population, with 150K-50K making up only 30% and the top earners dwindling down to 10%.

So while taking an average of all the students is not a sign of someone necessarily trying to pull the wool over your eyes, it does not tell the whole story.

To see the whole story, and to unravel Simpson Paradox, you need to look at the scores across the different economic sectors of this town which has undergone drastic changes.


Looking at from the standpoint of economic sector, you will see the scores in each sector have improved. With the under $50K improving at an impressive rate. Clearly the teachers and staff at Happy Town School are doing their job and then some.

So while the person who took the average of the whole school may not have intended to lie with their statistics, a deeper dive into the numbers showed that the truth was hidden inside the aggregate.

Keep this in mind next time someone shows you falling SAT scores, crime stats, or disease rates. All of these elements are easily affected by a shift in demographics. If you don’t see the breakdown, don’t believe the hype.

Feedback Loops in Predictive Models

Predictive models are full of perilous traps for the uninitiated. With the ease of use of some modeling tools like JMP or SAS, you can literally point and click your way into a predictive model. These models will give you results. And a lot of times, the results are good. But how do you measure the goodness of the results?

I will be doing a series of lessons on model evaluation. This is one of the more difficult concepts for many to grasp, as some of it may seem subjective. In this lesson I will be covering feedback loops and showing how they can sometimes improve, and other times destroy, a model.

What is a feedback loop?

A feedback loop in modeling is where the results of the model are somehow fed back into the model (sometimes intentionally, other times not). One simple example might be an ad placement model.

Imagine you built a model determining where  on a page to place an ad based on the webpage visitor. When a visitor in group A sees an ad on the left margin, he clicks on it. This click is fed back into the model, meaning left margin placement will have more weight when selecting where to place the ad when another group A visitor comes to your page.

This is good, and in this case – intentional. The model is constantly retraining itself using a feedback loop.

When feedback loops go bad…

Gaming the system.

Build a better mousetrap.. the mice get smarter.

Imagine a predictive model  developed to determine entrance into a university. Let’s say when you initially built the model, you discovered that students who took German in high school seemed to be better students overall. Now as we all know, correlation is not causation. Perhaps this was just a blip in your data set, or maybe it was just the language most commonly offered at the better high schools. The truth is, you don’t actually know.

How can this be a problem?

Competition to get into universities (especially highly sought after universities) is fierce to say the least. There are entire industries designed to help students get past the admissions process. These industries use any insider knowledge they can glean, and may even try reverse engineering the admissions algorithm.

The result – a feedback loop

These advisers will learn that taking German greatly increases a student’s chance of admission at this imaginary university. Soon they will be advising prospective students (and their parents) who otherwise would not have any chance of being accepted into your school, to sign up for German classes. Well now you have a bunch of students, who may no longer be the best fit, making their way past your model.

What to do?

Feedback loops can be tough to anticipate, so one method to guard against them is to retrain your model every once in a while. I even suggest retooling the model (removing some factors in an attempt to determine if a rogue factor – i.e. German class, is holding too much weight in your model).

And always keep in mind that these models are just that – models. They are not fortune tellers. Their accuracy should constantly be criticized and methods questioned. Because while ad clicks or college admissions are one thing, policing and criminal sentencing algorithms run the risk of being much more harmful.

Left unchecked, the feedback loop of a predictive criminal activity model in any large city in the United States will almost always teach the computer to emulate the worst of human behavior – racism, sexism, and class discrimination.

Since minority males from poor neighborhoods dis-proportionally make up our current prison population, any model that takes race, sex, and economic status into account will inevitably determine a 19 year old black male from a poor neighborhood is a criminal. We will have then violated the basic tenant of our justice system – innocent until proven guilty.


R: Text Mining (Pre-processing)

This is part 2 of my Text Mining Lesson series. If you haven’t already, please check out part 1 that covers Term Document Matrix: R: Text Mining (Term Document Matrix)

Okay, now I promise to get to the fun stuff soon enough here, but I feel that in most tutorials I have seen online, the pre-processing of text is often glanced over. It was always (and often still is) a real sore spot for me when assumptions are made as to my knowledge level. If you are going to throw up a block of code, at least give a line or two explanation as to what the code is there for, don’t just assume I know.

I remember working my way through many tutorials where I was able to complete the task by simply copying the code, but I didn’t have a full grasp of what was happening in the middle. I this lesson, I am going to cover some of the more common text pre-processing steps used in the TM library. I am going to go into some level of detail and make some purposeful mistakes so hopefully when you are done here you will have a firm grasp on this very important step in the text mining process.

Let’s start by getting our libraries

install.packages("tm") # if not already installed


Now, let’s load our data. For this lesson we are going to use a simple vector.

wordVC <- c("I like dogs!", "A cat chased the dog.", "The dog ate a bone.", 
            "Cats make fun pets.", "The cat did what? Don't tell the dog.", 
            "A cat runs when it has too, but most dogs love running!")

Now let’s put this data into a corpus for text processing.

corpus <- (VectorSource(wordVC))
corpus <- Corpus(corpus)

Here is the output


Now for a little frustration. Let’s say you want to see what is in text document 4. You could try


But this will be your output, not really what you are looking for.


If you want to see the actual text- try this instead


Now you can see the text


As we go through the text pre-processing, we are going to use the following For Loop to examine our corpus

for (i in 1:6) print (corpus[[i]]$content)




Punctuation generally adds no value to text mining when utilizing standard numeric based data mining algorithms like clustering or classification. So it behooves us to just remove it.

To do so, the tm package has a cool function called tm_map() that we can pass arguments to, such as removePunctuation

corpus <- tm_map(corpus, content_transformer(removePunctuation))

for (i in 1:6) print (corpus[[i]]$content)

Note, you do not need the for loop, I am simply running it each time to show you the progress. 

Notice all the punctuation is gone now.



Next we are going to get rid of what are known as stopwords. Stopwords are common words such as (the, an, and, him, her). These words are so commonly used that they provide little insight as to the actual meaning of the given text.  To get rid of them, we use the following code.

corpus <- tm_map(corpus, content_transformer(removeWords), 
for (i in 1:6) print (corpus[[i]]$content)

If you look at line 2, “A cat chase  dog”, you will the word “the” has been removed. However, if you look at the next line down, you will notice “The” is still there.



Well it comes down to the fact that computers do not treat T and t as the same letter, even though they are. Capitalized letters are viewed by computers as separate entities. So “The” doesn’t match “the” found in the list of stopwords to remove.

For a full list of R stopwords, go to:

So how do we fix this?


Using tm_map with the “tolower” argument will make all the letters lowercase. If we then re-run our stopwords command, you will see all the “the” are gone

corpus <- tm_map(corpus, content_transformer(tolower))
corpus <- tm_map(corpus, content_transformer(removeWords), 
for (i in 1:6) print (corpus[[i]]$content)




Next we will stem our words. I covered this in the last lesson, but it bears repeating. What stemming does is attempt to remove variants of words. In our example, pay attention to the following words (dog, dogs, cat, cats, runs, running)

corpus <- tm_map(corpus, stemDocument)
for (i in 1:6) print (corpus[[i]]$content)

Notice the words are now (dog, cat, run)



Finally, let’s get rid of all this extra white space we have now.

corpus <- tm_map(corpus, stripWhitespace) 
for (i in 1:6) print (corpus[[i]]$content)




I didn’t use this argument with my tm_map() function today because I did not have any numbers in my text. But if I, the command would be as follows

corpus <- tm_map(corpus, content_transformer(removeNumbers))



R: Creating a Word Cloud

Word Clouds are great visualization techniques for dealing with text analytics. The idea behind them is they display the most common words in a corpus of text. The more often a word is used, the larger and darker it is.


Making a word cloud in R is relatively easy. The tm and wordcloud libraries from R’s CRAN repository is used to create one.


If you do not have either of these loaded on your machine, you will have to use the following commands


Now in order to make a word cloud, you first need a collection of words. In our example I am going to use a text file I created from the Wikipedia page on R.

You can download the text file here: rwiki

Now let’s load the data file.

text <- readLines("rWiki.txt")
> head(text)
[1] "R is a programming language and software environment 
[2] "The R language is widely used among statisticians and 
[3] "Polls, surveys of data miners, and studies of scholarly 
[4] "R is a GNU package.[9] The source code for the R 
[5] "General Public License, and pre-compiled binary versions
[6] "R is an implementation of the S programming language "

Notice each line in the text file is an individual element in the vector –  text

Now we need to move the text into a tm element called a Corpus. First we need to convert the vector text into a VectorSource.

wc <- VectorSource(text)
wc <- Corpus(wc)

Now we need to pre-process the data. Let’s start by removing punctuation from the corpus.

wc <- tm_map(wc, removePunctuation)

Next we need to set all the letters to lower case. This is because R differentiates upper and lower case letters. So “Program” and “program” would treated as 2 different words. To change that, we set everything to lowercase.

wc <- tm_map(wc, content_transformer(tolower))

Next we will remove stopwords. Stopwords are commonly used words that provide no value to the evaluation of the text. Examples of stopwords are: the, a, an, and, if, or, not, with ….

wc <- tm_map(wc, removeWords, stopwords("english"))

Finally, let’s strip away the whitespace

wc <- tm_map(wc, stripWhitespace)

Now let us make our first word cloud

The syntax is as follows – wordcloud( words = corpus, scale = physical size, max.word = number of words in cloud)

wordcloud(words = wc, scale=c(4,0.5), max.words=50)


Now we have a word cloud, let’s add some more elements to it.

random.order = False brings the most popular words to the center

wordcloud(words = wc, scale=c(4,0.5), max.words=50,random.order=FALSE)


To add a little more rotation to your word cloud use rot.per

wordcloud(words = wc, scale=c(4,0.5), max.words=50,random.order=FALSE,

Finally, lets add some color. We are going to use brewer.pal.  The syntax is brewer.pal(number of colors, color mix)

cp <- brewer.pal(7,"YlOrRd")
wordcloud(words = wc, scale=c(4,0.5), max.words=50,random.order=FALSE,
 rot.per=0.25, colors=cp)





R: Connect to Twitter with R

You can do a lot in the way of text analytics with Twitter. But in order to do so, first you need to connect with Twitter.

In order to do so, first you need to set up an account on Twitter and get an API key. Once you have created your account, go to the following website:

Once there, click My apps


Click Create New App


Give name, description (it doesn’t matter – matter anything up), for website I just used


Go to Keys and Access Tokens


Get your Consumer Key (API Key), Consumer Secret (API Secret), Access Token, and Access Token Secret

Now open R and install the TwitteR package.


Now load the library


Next we are going to use the API keys you collected to authorize our connection to the Twitter API (Twitter uses OAuth by the way)

api_key<- "insert consumer key here"
api_secret <- "insert consumer secret here"
access_token <- "insert access token here"
access_token_secret <- "insert access token secret here

Now to test to see if your connection is good


R: gsub


R is an interpreted language – meaning the code is run as it is read – kind of like a musician who plays music while reading it off the sheet(note by note). To that end, R does not perform loops as efficiently as compiled languages like C or Java. So to address this issue, R has some interesting work-arounds. One of my favorite is gsub.

Here is how gsub works. Take the sentence “Bob likes dogs”. Using gsub I can replace any element of that sentence. So I can say replace “dogs” with “cats” and the sentence would read “Bob like cats”. Kind of cool all by itself, but it is even cooler when dealing with a larger data set.

Let’s set x to a vector of 3 elements

 x <- c("the green ball", "Bob likes the dog", "Sally is the best runner
 in the group")

Now let’s run a gsub command (syntax: gsub(to be replace, what to replace with, date source)

x <- gsub("the", "a", x)

This short line of code replaces all the “the” in the vector with “a”. It does it for a vector of 1000 elements just as well as it does it for this small vector of 3 elements.


Okay, now my personal pet peeve when it comes to learning this stuff. Show me a practical approach to this. So here we go, a practical data science based use for this.

Check out this selection of tweets I pulled from Twitter. Notice the annoying RT “retweet” at the beginning of most of tweets. I want to get rid of it. When doing a sentiment analysis, knowing something is a RT does little for me.


gsub to get rid of RT


Now I have an empty space I want to get rid.


And if you are wondering how I got that Twitter data? Don’t worry, you don’t need any expensive software. I did it all with R for free and I will show you how to do it too. Stay tuned.

What is Big Data?

Big Data is Like Teenage Sex: Everyone talks about it, nobody really knows how to do it, everyone things everyone else is doing it, so everyone claims they are doing it.

– Dan Ariely

I remember when I first started developing an interest in Big Data and Analytics. One of the biggest frustrations I faced was that it seemed like everyone in the know was talking in code. They would toss words around like supervised machine learning, map reduce, hadoop, SAP HANA, in-memory, and the biggest buzz word of them all, Big Data.


So what is Big Data?

In all honesty, it is a buzzword. Big Data isn’t a single thing as much as it is a collection of technologies and concepts that surround the management and analysis of massive data sets.

What kind of data qualifies as Big Data?

The common consensus you will find in textbooks is that Big Data is concerned with the 3 V’s: Velocity, Volume, Variety.

Velocity: Velocity is not so much concerned with how fast the data gets to you. This is not something you can clock using network metrics. Instead, this is how fast data can become actionable. In the days of yore, managers would rely on monthly or quarterly reports to determine business strategy. Now these decisions are being made more dynamically. Data only 2 hours old can be viewed outdated in this new high velocity world.

Volume: Volume is the name of the game in Big Data. Think of sheer volume of data produced by a wireless telecom company: every call, every tower connection, length of calls, etc. These guys are racking up terabytes upon terabytes.

Variety:  Big Data is all about variety. As a complete 180 from the rigid structure that makes relational databases work, Big Data lives in the world of unstructured data. Big Data repositories are full of videos, pictures, free text, audio clips, and all other forms of unstructured data.

How do you store all this data?

Storing and managing all this data is one of big challenges. This is where specialized data management systems like Hadoop come into play. What makes Hadoop so great? First it is Hadoop’s ability to scale. By scale I mean, Hadoop can grow and shrink on demand.

For those unfamiliar with the back end storage methodology of standard relational databases (Oracle, DB2, SQL Server), they don’t play well across multiple computers. Instead you will find you need to invest in high end servers and plan out ahead any clusters you are going to use with a general idea of your storage needs in mind. If you build out a database solution designed to handle 10 terabytes and suddenly find yourself needing to manage 50, you are going to have some serious (and expensive) reconfiguration work ahead of you.

Hadoop on the other hand is designed to run easily across commodity hardware. Which means you can have a rack full of mid priced servers and Hadoop can provision and utilize them at will. So if you typically run at 10 terabyte database and there is a sudden need to run another 50 terabytes – (your company is on-boarding a new company) Hadoop will just grow as needed (assuming you have 50 tb worth of commodity servers available). It will also free up the space when it is done. So if the 50 terabytes were only needed for a particular job, once that job is over, Hadoop can release the storage space for other systems to use.

What about MapReduce?

MapReduce is algorithm designed to make querying or processing massive data sets possible. In a very simplified explanation MapReduce works like so:

Map – The data is broken into chunks and handed off to mappers. These mappers perform the data processing job at on their individual chunk of the data set. There are hundreds (or many many hundreds) of these mappers working in parallel, taking advantage of different processors on the racks of commodity hardware we were talking about earlier.

Reduce – The output of all of these map jobs are then passed into the reduce. This part of the algorithm puts all the pieces back together, providing the user with one result set. The entire purpose behind MapReduce is speed.

Data Analytics

Now that you have the data, you are going to want to make some sense of it. To pull information out of this mass of data requires specially designed algorithms running on high end hardware. Platforms like SAP HANA tout in memory analytics to drive up speed, while a lot of the buzz around deep learning seems to mention accessing the incredibly fast memory found in GPUs (graphical processor units).

At the root of all of this, you will still find some old familiar stand-bys. Regression is still at the top of pack in prediction methods used by most companies. Other popular machine learning algorithms like Affinity Analysis (market basket) and Clustering are also commonly used with Big Data.

What really separates Big Data analytics from regular analysis methods is that with it’s sheer volume of data, it is not as reliant on inferential statistical methods to draw conclusions.

If you think about an election in a city with 50,000 registered voters. The classic method of polling was to ask a representative sample (say 2000 voters) how they were going to vote. Using that information, you  could infer how the election was to play out (with a margin of error of course). With Big Data, we are asking all 50,000 voters. We do not need to infer anymore. We know the answer

Imagine this more “real world” application example. Think of a large manufacturing plant. A pretty common maintenance strategy in a plant like this is to have people perform periodic checks on motors, belts, fans, etc. They will take reading from gauges, write them on a clipboard and maybe the information is enter into a computer that can analyze trends to check for out of range parameters.

In the IoT(Internet of Things) world, we now have cheap, network connected sensors on all of this equipment sending out readings every second. This data is fed through algorithms designed to analyze trends. The trends can tell you a fan is starting to go bad and needs to be replaced 2 or 3 days before most humans could.


Big Data is all about correlation. And this can be a sticking point for some people. As humans, we love to look for a root cause. If we can’t find one, we often create one to satisfy our desire for one – hence the Roman volcano gods.

With Big Data, cause is not name of the game. Analyzing massive streams of data, we can find correlations that can help up make better decisions. Using the fan example above, the algorithms may pick up that if the fan begins drawing more current while maintaining the same speed that the soon the fan motor will fail. And using probabilistic algorithms, we can show you the increasing odds of it failing each minute, hour, or day you ignore it.

This we can do, with great certainty in many instances. But what we can’t do is tell you why it is happening. That is for someone else. We leave whys to the engineers and the academics. We are happy enough knowing If A Then B.


Correlation can also show us relationship we never knew existed. Like did you know Pop Tarts are hot sellers in the days leading up to a hurricane? Walmart does. They even know what flavor ( I want to say strawberry, but don’t quote me on this).

This pattern finding power can be found everywhere from Walmart check outs, to credit card fraud detection, dating sites, and even medicine.