Ensemble Modeling

In the world of analytics, modeling is a general term for using data mining (machine learning) methods to develop predictions. If you want to know which ad a particular user is most likely to click on, or which customers are likely to leave you for a competitor, you develop a predictive model.

There are a lot of models to choose from: Regression, Decision Trees, K-Nearest Neighbors, Neural Nets, etc. They will all provide you with a prediction, but some will do better than others depending on the data you are working with. While there are certain tricks and tweaks you can use to improve the accuracy of these models, it never hurts to remember that there is wisdom to be found in the masses.

The Jelly Bean Jar

I am sure everyone has come across some version of this in their life: you are at a fair or a school fundraising event and someone has a large see-through jar full of jelly beans (or marbles, or nickels). Next to the jar are some slips of paper with the instructions: “Guess the number of jelly beans in the jar and you win!”

An interesting thing about this game, and you can try this out for yourself, is that given a reasonable number of participants, more often than not the average guess of the group will beat the best individual guesser. In other words, imagine there are 200 jelly beans in the jar and the best guesser (the winner) guesses 215. More often than not, the average of all the guesses will be something like 210 or 190. The group cancels out its over- and under-guessing, resulting in a better answer than any one individual.

How Do We Get the Average in Models?

There are countless ways to do it, and researchers are constantly trying new approaches to get that extra 2% improvement over the last model. For ease of understanding, though, I am going to focus on two very popular methods of ensemble modeling: Random Forests and Boosted Trees.


Random Forests:

Imagine you have a data set containing 50,000 records. We start by randomly selecting 1,000 records and building a decision tree from them. We then put those records back into the data set and draw another 1,000 records, creating another decision tree. The process is repeated over and over for a predefined number of iterations; because the data is returned to the pool each time, a record can be picked more than once (this is sampling with replacement, also known as bootstrap sampling).

After all the sample decision trees have been created (let’s say we created 500 for the sake of argument), the ensemble takes the mean (average) of all the trees’ predictions if you are looking at a regression, or the mode (majority vote) if you are dealing with a classification.

For those unfamiliar with the terminology, a regression model predicts a numeric value: the selling price of a house, a person’s weight, the price of a stock, etc. A classification model predicts a category: yes or no, large – medium – small, fast or slow, etc.
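To make that concrete, here is a minimal random forest sketch in R. The randomForest package and the iris data set are my own choices for illustration; they are not mentioned in the post.

# install.packages("randomForest")   # if not already installed
library(randomForest)

# Classification example on R's built-in iris data set:
# 500 trees, each grown on a bootstrap sample, final answer by majority vote.
rf <- randomForest(Species ~ ., data = iris, ntree = 500)
print(rf)                  # includes the out-of-bag error estimate
predict(rf, head(iris))    # predicted class for the first few rows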

Boosted Trees:

Another popular method of ensemble modeling is known as boosted trees. In this method, a simple (weak learner) tree is created – usually just 3-5 splits. Then another small tree (3-5 splits) is built from the incorrect predictions of the first tree. This is repeated multiple times (say 50 in this example), building layers of trees, each one a little bit better than the one before it. All the layers are combined to make the final predictive model.
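For comparison, here is a minimal boosted-tree sketch in R using the gbm package. The post does not name a specific package, so treat this as one possible illustration rather than the method described above.

# install.packages("gbm")   # if not already installed
library(gbm)

# 50 shallow trees (interaction.depth = 3, i.e. small weak-learner trees),
# each one fit to the errors left over by the trees before it.
bt <- gbm(Sepal.Length ~ ., data = iris, distribution = "gaussian",
          n.trees = 50, interaction.depth = 3)
summary(bt)                              # relative influence of each predictor
head(predict(bt, iris, n.trees = 50))    # predictions from the combined layers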

Oversimplified?

Now I know this may be an oversimplified explanation, and I will create some tutorials on actually building ensemble models, but sometimes I think just getting a feel for the concept is important.

So are ensemble models always the best? Not necessarily.

One thing you will learn when it comes to modeling is that no one method is the best. Each has its own strengths. The more complex the model, the longer it takes to run, so sometimes speed outweighs the desire for that extra 2% accuracy bump. The secret is to be familiar with the different models and to try them out in different scenarios. You will find that choosing the right model can be as much an art as a science.

Simpson’s Paradox: How to Lie with Statistics

We’ve all heard the saying attributed to Benjamin Disraeli: “Lies, damned lies, and statistics.”

While statistics has proven to be of great benefit to mankind in almost every endeavor, inexperienced, sloppy, and downright unscrupulous statisticians have made some pretty wild claims. And because these wild claims are often presented as statistical fact, people in all industries – from business, to healthcare, to education – have chased these white elephants right down the rabbit hole.

Anyone who has taken even an introductory statistics course can tell you how easily statistics can be misrepresented. One of my favorite examples involves using bar charts to confuse the audience. Look at the chart below. It represents the number of games won by two teams in a season of beer league softball.

[bar chart: games won by Team A vs. Team B, with the y-axis starting above zero]

At first glance, you might think Team B won twice as many games as Team A, and that is indeed the intention of the person who made this chart. But when you look at the numbers to the left, you will see Team A won 15 games to Team B’s 20. While I am no mathematician, even I know 15 is not half of 20.

This deception was perpetrated by simply adjusting the starting point of the y-axis. When you reset it to 0, the chart tells a different story.

[the same bar chart with the y-axis starting at 0]
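If you want to see how easy this trick is to pull off, here is a quick sketch in base R (my own, not from the original post). The only difference between the misleading chart and the honest one is the y-axis limits.

wins <- c("Team A" = 15, "Team B" = 20)

# Deceptive version: clip the y-axis so it starts at 10
barplot(wins, ylim = c(10, 20), xpd = FALSE, main = "Games Won (misleading axis)")

# Honest version: let the bars start at 0
barplot(wins, ylim = c(0, 20), main = "Games Won")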

Even Honest People Can Lie by Accident

In the example above, the person creating the chart was manipulating the data on purpose to achieve a desired effect. You may look at this and say you would never deceive people like that, but the truth is – you just might do it by accident.

What do I mean? Let’s take an example from an industry fraught with horrible statistics – our education system.

Below you will find a chart depicting the average math scores on a standardized test since 2000 for Happy Town, USA. You will notice the test scores are significantly lower now than they were back in 2000.

[line chart: Happy Town’s average math scores since 2000, trending downward]

What does this mean? Are the kids getting stupider? Has teacher quality gone down? Who should be held accountable for this? Certainly those lazy tenured teachers who are only there to collect their pensions and leech off the taxpayers.

I mean look at the test scores. The average score has dipped from around 90 to close to 70. Surely something in the system is failing.

Now what if I were to tell you that the chart above – while correct – does not tell the whole story. Test scores in Happy Town, USA are actually up – if you look at the data correctly.

What we are dealing with is something known in statistics as Simpson’s Paradox, and even some of the brightest academic minds have published research that ignored this very important concept.

What do I mean?

Let me tell you the whole story about Happy Town, USA. Happy Town was your average American middle-class town. In 2000, 20% of the families made over $150K a year, 60% made between $50K and $150K, and 20% earned less than $50K.

In 2008, that all changed. The recession hit, causing people to lose their jobs and default on their mortgages. Families moved out and housing prices fell. Thanks to the new lower housing prices, families from Not So Happy Town, USA were able to afford houses in Happy Town. They moved their families there in hopes of a better education and a better life for their children.

While the schools in Happy Town were better, the teachers were not miracle workers. These kids from Not So Happy Town did not have the strong educational foundation the pre-recession residents of Happy Town did. Many teachers found themselves starting almost from scratch.

No matter how hard these new kids and their teachers tried, they could never be expected to jump right in and perform as well as the pre-2008 Happy Town kids. The economic makeup of the town shifted. The under-$50K group now represents 60% of the town’s population, with the $50K–$150K group making up only 30% and the top earners dwindling down to 10%.

So while taking an average of all the students is not a sign of someone necessarily trying to pull the wool over your eyes, it does not tell the whole story.

To see the whole story, and to unravel Simpson’s Paradox, you need to look at the scores across the different economic sectors of a town that has undergone drastic changes.

[chart: math scores broken out by income group, with every group trending upward]

Looking at it from the standpoint of economic sector, you will see the scores in each sector have improved, with the under-$50K group improving at an impressive rate. Clearly the teachers and staff at Happy Town School are doing their job and then some.

So while the person who took the average of the whole school may not have intended to lie with their statistics, a deeper dive into the numbers showed that the truth was hidden inside the aggregate.
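Here is a toy version of the math in R. The numbers are made up for illustration (not taken from the charts above): every income group’s score goes up, yet the overall average goes down, purely because the mix of students shifted.

scores_2000 <- c(high = 95, mid = 90, low = 80)      # average score per income group in 2000
mix_2000    <- c(high = 0.20, mid = 0.60, low = 0.20)

scores_now  <- c(high = 97, mid = 92, low = 84)      # every group has improved
mix_now     <- c(high = 0.10, mid = 0.30, low = 0.60)

sum(scores_2000 * mix_2000)   # overall average in 2000: 89
sum(scores_now  * mix_now)    # overall average now: 87.7 -- lower, despite better groups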

Keep this in mind next time someone shows you falling SAT scores, crime stats, or disease rates. All of these elements are easily affected by a shift in demographics. If you don’t see the breakdown, don’t believe the hype.

Feedback Loops in Predictive Models

Predictive models are full of perilous traps for the uninitiated. With the ease of use of some modeling tools like JMP or SAS, you can literally point and click your way into a predictive model. These models will give you results. And a lot of times, the results are good. But how do you measure the goodness of the results?

I will be doing a series of lessons on model evaluation. This is one of the more difficult concepts for many to grasp, as some of it may seem subjective. In this lesson I will be covering feedback loops and showing how they can sometimes improve, and other times destroy, a model.

What is a feedback loop?

A feedback loop in modeling is where the results of the model are somehow fed back into the model (sometimes intentionally, other times not). One simple example might be an ad placement model.

Imagine you built a model determining where on a page to place an ad based on the webpage visitor. When a visitor in group A sees an ad on the left margin, he clicks on it. This click is fed back into the model, meaning left-margin placement will carry more weight the next time the model picks a placement for a group A visitor.

This is good, and in this case – intentional. The model is constantly retraining itself using a feedback loop.
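As a rough illustration of the idea (this is my own toy sketch, not code from the post), you can think of the loop as nothing more than click feedback being written back into the data the model is fit from:

# Toy feedback loop: each impression and click is appended to the history,
# and the "model" (here just a click-through rate per position) is re-fit from it.
history <- data.frame(position = character(), clicked = integer())

record_feedback <- function(history, position, clicked) {
  rbind(history, data.frame(position = position, clicked = clicked))
}

choose_position <- function(history) {
  if (nrow(history) == 0) return("left")                   # default before any feedback
  ctr <- tapply(history$clicked, history$position, mean)   # click-through rate per position
  names(which.max(ctr))                                    # place the ad where CTR is highest
}

history <- record_feedback(history, "left", clicked = 1)   # a group A visitor clicks the left ad
choose_position(history)                                   # "left" now carries more weight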

When feedback loops go bad…

Gaming the system.

Build a better mousetrap… the mice get smarter.

Imagine a predictive model developed to determine entrance into a university. Let’s say when you initially built the model, you discovered that students who took German in high school seemed to be better students overall. Now, as we all know, correlation is not causation. Perhaps this was just a blip in your data set, or maybe German was simply the language most commonly offered at the better high schools. The truth is, you don’t actually know.

How can this be a problem?

Competition to get into universities (especially highly sought after universities) is fierce to say the least. There are entire industries designed to help students get past the admissions process. These industries use any insider knowledge they can glean, and may even try reverse engineering the admissions algorithm.

The result – a feedback loop

These advisers will learn that taking German greatly increases a student’s chance of admission at this imaginary university. Soon they will be advising prospective students (and their parents) who otherwise would not have any chance of being accepted into your school to sign up for German classes. Now you have a bunch of students, who may no longer be the best fit, making their way past your model.

What to do?

Feedback loops can be tough to anticipate, so one way to guard against them is to retrain your model every once in a while. I even suggest retooling the model – removing some factors in an attempt to determine whether a rogue factor (e.g. the German class) is carrying too much weight in your model.

And always keep in mind that these models are just that – models. They are not fortune tellers. Their accuracy should constantly be criticized and methods questioned. Because while ad clicks or college admissions are one thing, policing and criminal sentencing algorithms run the risk of being much more harmful.

Left unchecked, the feedback loop of a predictive criminal activity model in any large city in the United States will almost always teach the computer to emulate the worst of human behavior – racism, sexism, and class discrimination.

Since minority males from poor neighborhoods disproportionately make up our current prison population, any model that takes race, sex, and economic status into account will inevitably determine that a 19-year-old black male from a poor neighborhood is a criminal. We will have then violated the basic tenet of our justice system – innocent until proven guilty.


R: Text Mining (Pre-processing)

This is part 2 of my Text Mining Lesson series. If you haven’t already, please check out part 1 that covers Term Document Matrix: R: Text Mining (Term Document Matrix)

Okay, now I promise to get to the fun stuff soon enough, but I feel that in most tutorials I have seen online, the pre-processing of text is glossed over. It was always (and often still is) a real sore spot for me when assumptions are made as to my knowledge level. If you are going to throw up a block of code, at least give a line or two of explanation as to what the code is there for; don’t just assume I know.

I remember working my way through many tutorials where I was able to complete the task by simply copying the code, but I didn’t have a full grasp of what was happening in the middle. In this lesson, I am going to cover some of the more common text pre-processing steps used in the tm library. I am going to go into some level of detail and make some purposeful mistakes, so hopefully when you are done here you will have a firm grasp on this very important step in the text mining process.

Let’s start by getting our libraries

install.packages("tm") # if not already installed
install.packages("SnowballC")

library(tm)
library(SnowballC)

Now, let’s load our data. For this lesson we are going to use a simple vector.

wordVC <- c("I like dogs!", "A cat chased the dog.", "The dog ate a bone.", 
            "Cats make fun pets.", "The cat did what? Don't tell the dog.", 
            "A cat runs when it has too, but most dogs love running!")

Now let’s put this data into a corpus for text processing.

corpus <- VectorSource(wordVC)
corpus <- Corpus(corpus)
summary(corpus)

Here is the output

[screenshot: output of summary(corpus)]

Now for a little frustration. Let’s say you want to see what is in text document 4. You could try

inspect(corpus[4])

But this will be your output, not really what you are looking for.

[screenshot: output of inspect(corpus[4])]

If you want to see the actual text- try this instead

corpus[[4]]$content

Now you can see the text

[screenshot: the text of document 4]

As we go through the text pre-processing, we are going to use the following for loop to examine our corpus

for (i in 1:6) print (corpus[[i]]$content)

Output

[screenshot: the six documents printed by the for loop]

Punctuation

Punctuation generally adds no value to text mining when utilizing standard numeric based data mining algorithms like clustering or classification. So it behooves us to just remove it.

To do so, the tm package has a cool function called tm_map() that we can pass arguments to, such as removePunctuation

corpus <- tm_map(corpus, content_transformer(removePunctuation))

for (i in 1:6) print (corpus[[i]]$content)

Note: you do not need the for loop; I am simply running it each time to show you the progress.

Notice all the punctuation is gone now.

[screenshot: the documents with punctuation removed]

Stopwords

Next we are going to get rid of what are known as stopwords. Stopwords are common words such as (the, an, and, him, her). These words are so commonly used that they provide little insight as to the actual meaning of the given text.  To get rid of them, we use the following code.

corpus <- tm_map(corpus, content_transformer(removeWords), 
          stopwords("english"))
for (i in 1:6) print (corpus[[i]]$content)

If you look at line 2, “A cat chased  dog”, you will see the word “the” has been removed. However, if you look at the next line down, you will notice “The” is still there.

[screenshot: the documents with stopwords removed]

WHY?

Well, it comes down to the fact that computers do not treat “T” and “t” as the same letter, even though we read them that way. Capitalized letters are viewed by computers as separate characters, so the “The” here doesn’t match the “the” found in the list of stopwords to remove.

For a full list of R stopwords, go to: https://github.com/arc12/Text-Mining-Weak-Signals/wiki/Standard-set-of-english-stopwords

So how do we fix this?

tolower

Using tm_map with the tolower argument will make all the letters lowercase. If we then re-run our stopwords command, you will see all of the “the”s are gone.

corpus <- tm_map(corpus, content_transformer(tolower))
corpus <- tm_map(corpus, content_transformer(removeWords), 
          stopwords("english"))
for (i in 1:6) print (corpus[[i]]$content)

Output

[screenshot: the lowercased documents with stopwords removed]

Stemming

Next we will stem our words. I covered this in the last lesson, but it bears repeating: stemming attempts to reduce variants of words to a common root. In our example, pay attention to the following words: dog, dogs, cat, cats, runs, running.

corpus <- tm_map(corpus, stemDocument)
for (i in 1:6) print (corpus[[i]]$content)

Notice the words are now (dog, cat, run)

[screenshot: the stemmed documents]

Whitespace

Finally, let’s get rid of all this extra white space we have now.

corpus <- tm_map(corpus, stripWhitespace) 
for (i in 1:6) print (corpus[[i]]$content)

Output

[screenshot: the documents with extra whitespace stripped]

removeNumbers

I didn’t use this argument with my tm_map() function today because I did not have any numbers in my text. But if I did, the command would be as follows

corpus <- tm_map(corpus, content_transformer(removeNumbers))
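If you want to see what it does, you can also try removeNumbers() directly on a plain character string (a quick illustration of my own, using an example sentence that is not part of our corpus):

removeNumbers("The 3 dogs chased 12 cats")
# [1] "The  dogs chased  cats"   -- digits removed; stripWhitespace would clean up the gaps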


R: Creating a Word Cloud

Word clouds are a great visualization technique for dealing with text analytics. The idea behind them is that they display the most common words in a corpus of text; the more often a word is used, the larger and darker it is.

[example word cloud]

Making a word cloud in R is relatively easy. The tm and wordcloud libraries from R’s CRAN repository are used to create one.

library(tm)
library(wordcloud)

If you do not have either of these installed on your machine, you will have to use the following commands

install.packages("tm")
install.packages("wordcloud")

Now in order to make a word cloud, you first need a collection of words. In our example I am going to use a text file I created from the Wikipedia page on R.

You can download the text file here: rwiki

Now let’s load the data file.

text <- readLines("rWiki.txt")
> head(text)
[1] "R is a programming language and software environment 
[2] "The R language is widely used among statisticians and 
[3] "Polls, surveys of data miners, and studies of scholarly 
[4] "R is a GNU package.[9] The source code for the R 
[5] "General Public License, and pre-compiled binary versions
[6] "R is an implementation of the S programming language "
>

Notice each line in the text file is an individual element in the vector –  text

Now we need to move the text into a tm element called a Corpus. First we need to convert the vector text into a VectorSource.

wc <- VectorSource(text)
wc <- Corpus(wc)

Now we need to pre-process the data. Let’s start by removing punctuation from the corpus.

wc <- tm_map(wc, removePunctuation)

Next we need to set all the letters to lowercase. This is because R differentiates upper and lower case letters, so “Program” and “program” would be treated as 2 different words. To avoid that, we set everything to lowercase.

wc <- tm_map(wc, content_transformer(tolower))

Next we will remove stopwords. Stopwords are commonly used words that provide no value to the evaluation of the text. Examples of stopwords are: the, a, an, and, if, or, not, with ….

wc <- tm_map(wc, removeWords, stopwords("english"))

Finally, let’s strip away the whitespace

wc <- tm_map(wc, stripWhitespace)

Now let us make our first word cloud

The syntax is as follows: wordcloud(words = corpus, scale = physical size range, max.words = number of words in the cloud)

wordcloud(words = wc, scale=c(4,0.5), max.words=50)

[word cloud output]

Now that we have a word cloud, let’s add some more elements to it.

Setting random.order = FALSE brings the most popular words to the center

wordcloud(words = wc, scale=c(4,0.5), max.words=50,random.order=FALSE)

[word cloud with random.order = FALSE]

To add a little more rotation to your word cloud use rot.per

wordcloud(words = wc, scale=c(4,0.5), max.words=50,random.order=FALSE,
 rot.per=0.25)

Finally, let’s add some color. We are going to use brewer.pal. The syntax is brewer.pal(number of colors, palette name)

cp <- brewer.pal(7,"YlOrRd")
wordcloud(words = wc, scale=c(4,0.5), max.words=50,random.order=FALSE,
 rot.per=0.25, colors=cp)

[word cloud with rotation and the YlOrRd color palette]


R: Connect to Twitter with R

You can do a lot in the way of text analytics with Twitter. But in order to do so, first you need to connect with Twitter.

To do that, you need to set up a Twitter account and get an API key. Once you have created your account, go to the following website: https://dev.twitter.com/

Once there, click My apps


Click Create New App


Give it a name and a description (it doesn’t matter – make anything up); for the website I just used http://test.de


Go to Keys and Access Tokens


Get your Consumer Key (API Key), Consumer Secret (API Secret), Access Token, and Access Token Secret

Now open R and install the twitteR package.

install.packages("twitteR")

Now load the library

library(twitteR)

Next we are going to use the API keys you collected to authorize our connection to the Twitter API (Twitter uses OAuth by the way)

api_key<- "insert consumer key here"
api_secret <- "insert consumer secret here"
access_token <- "insert access token here"
access_token_secret <- "insert access token secret here"
setup_twitter_oauth(api_key,api_secret,access_token,access_token_secret)

Now to test to see if your connection is good

searchTwitter('analytics')
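If the connection works, you should get a list of status objects back. As a small follow-on (my own addition, not part of the original post), you can pull a larger batch and flatten it into a data frame with twListToDF(), which ships with the twitteR package:

tweets <- searchTwitter("analytics", n = 100, lang = "en")
tweets_df <- twListToDF(tweets)   # one row per tweet, with columns such as text and screenName
head(tweets_df$text)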

R: gsub

Gsub

R is an interpreted language – meaning the code is run as it is read, kind of like a musician who plays music while reading it off the sheet (note by note). To that end, R does not perform loops as efficiently as compiled languages like C or Java. To address this, R has some interesting work-arounds, and one of my favorites is gsub.

Here is how gsub works. Take the sentence “Bob likes dogs”. Using gsub I can replace any element of that sentence. So I can replace “dogs” with “cats” and the sentence would read “Bob likes cats”. Kind of cool all by itself, but it is even cooler when dealing with a larger data set.

Let’s set x to a vector of 3 elements

 x <- c("the green ball", "Bob likes the dog", "Sally is the best runner
 in the group")

Now let’s run a gsub command (syntax: gsub(pattern to replace, replacement, data source))

x <- gsub("the", "a", x)

This short line of code replaces all the “the” in the vector with “a”. It does it for a vector of 1000 elements just as well as it does it for this small vector of 3 elements.

[screenshot: the vector after the gsub call]

Okay, now for my personal pet peeve when it comes to learning this stuff: show me a practical approach. So here we go – a practical, data science based use for this.

Check out this selection of tweets I pulled from Twitter. Notice the annoying “RT” (retweet) at the beginning of most of the tweets. I want to get rid of it; when doing a sentiment analysis, knowing something is a RT does little for me.

[screenshot: tweets beginning with “RT”]

A gsub call gets rid of the “RT”, but it leaves an empty space at the start of each tweet, which I then want to get rid of as well – another job for gsub.
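Since the original code only appeared as screenshots, here is a sketch of the idea; the tweets vector below is made up for illustration and stands in for the pulled tweets.

tweets <- c("RT Analytics is the future", "RT Big data everywhere", "Learning R today")

tweets <- gsub("RT", "", tweets)   # strip the retweet marker
tweets <- gsub("^ ", "", tweets)   # strip the leading space it leaves behind
tweets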

And if you are wondering how I got that Twitter data? Don’t worry, you don’t need any expensive software. I did it all with R for free and I will show you how to do it too. Stay tuned.

What is Big Data?

Big Data is like teenage sex: everyone talks about it, nobody really knows how to do it, everyone thinks everyone else is doing it, so everyone claims they are doing it.

– Dan Ariely

I remember when I first started developing an interest in Big Data and analytics. One of the biggest frustrations I faced was that it seemed like everyone in the know was talking in code. They would toss around terms like supervised machine learning, MapReduce, Hadoop, SAP HANA, in-memory, and the biggest buzzword of them all, Big Data.


So what is Big Data?

In all honesty, it is a buzzword. Big Data isn’t a single thing as much as it is a collection of technologies and concepts that surround the management and analysis of massive data sets.

What kind of data qualifies as Big Data?

The common consensus you will find in textbooks is that Big Data is concerned with the 3 V’s: Velocity, Volume, Variety.

Velocity: Velocity is not so much concerned with how fast the data gets to you; this is not something you can clock using network metrics. Instead, it is about how fast data can become actionable. In the days of yore, managers would rely on monthly or quarterly reports to determine business strategy. Now these decisions are being made far more dynamically, and data only 2 hours old can be viewed as outdated in this new high-velocity world.

Volume: Volume is the name of the game in Big Data. Think of the sheer volume of data produced by a wireless telecom company: every call, every tower connection, the length of every call, etc. These guys are racking up terabytes upon terabytes.

Variety:  Big Data is all about variety. As a complete 180 from the rigid structure that makes relational databases work, Big Data lives in the world of unstructured data. Big Data repositories are full of videos, pictures, free text, audio clips, and all other forms of unstructured data.

How do you store all this data?

Storing and managing all this data is one of the big challenges. This is where specialized data management systems like Hadoop come into play. What makes Hadoop so great? First, there is Hadoop’s ability to scale – by scale I mean Hadoop can grow and shrink on demand.

For those unfamiliar with the back end storage methodology of standard relational databases (Oracle, DB2, SQL Server), they don’t play well across multiple computers. Instead you will find you need to invest in high end servers and plan out ahead any clusters you are going to use with a general idea of your storage needs in mind. If you build out a database solution designed to handle 10 terabytes and suddenly find yourself needing to manage 50, you are going to have some serious (and expensive) reconfiguration work ahead of you.

Hadoop, on the other hand, is designed to run easily across commodity hardware, which means you can have a rack full of mid-priced servers and Hadoop can provision and utilize them at will. So if you typically run a 10 terabyte database and there is a sudden need for another 50 terabytes (say your company is on-boarding another company), Hadoop will just grow as needed (assuming you have 50 TB worth of commodity servers available). It will also free up the space when it is done: if the 50 terabytes were only needed for a particular job, once that job is over, Hadoop can release the storage space for other systems to use.

What about MapReduce?

MapReduce is an algorithm designed to make querying or processing massive data sets possible. In a very simplified explanation, MapReduce works like so:

Map – The data is broken into chunks and handed off to mappers. These mappers perform the data processing job on their individual chunk of the data set. There are hundreds (or many, many hundreds) of these mappers working in parallel, taking advantage of the different processors on the racks of commodity hardware we were talking about earlier.

Reduce – The output of all of these map jobs is then passed into the reduce step. This part of the algorithm puts all the pieces back together, providing the user with one result set. The entire purpose behind MapReduce is speed.
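As a toy illustration of the pattern (my own sketch, in plain R on a single machine, so it only mimics the idea rather than running a real distributed job):

values <- c(5, 3, 8, 1, 9, 4, 7, 2)
chunks <- split(values, rep(1:4, each = 2))   # "map" input: the data broken into chunks

mapped  <- lapply(chunks, sum)                # map: each "mapper" sums its own chunk
reduced <- Reduce(`+`, mapped)                # reduce: stitch the partial results back together
reduced                                       # 39, the same answer as sum(values)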

Data Analytics

Now that you have the data, you are going to want to make some sense of it. To pull information out of this mass of data requires specially designed algorithms running on high-end hardware. Platforms like SAP HANA tout in-memory analytics to drive up speed, while a lot of the buzz around deep learning mentions the incredibly fast memory found in GPUs (graphics processing units).

At the root of all of this, you will still find some old familiar stand-bys. Regression is still at the top of the pack in prediction methods used by most companies. Other popular machine learning algorithms like affinity analysis (market basket) and clustering are also commonly used with Big Data.

What really separates Big Data analytics from regular analysis methods is that, with its sheer volume of data, it is not as reliant on inferential statistical methods to draw conclusions.

Think about an election in a city with 50,000 registered voters. The classic method of polling was to ask a representative sample (say 2,000 voters) how they were going to vote. Using that information, you could infer how the election was likely to play out (with a margin of error, of course). With Big Data, we are asking all 50,000 voters. We do not need to infer anymore; we know the answer.
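For a sense of scale, here is a back-of-the-envelope addition of my own, using the common 1/sqrt(n) rule of thumb for an approximate 95% margin of error:

n <- 2000
1 / sqrt(n)   # roughly 0.022, i.e. about a +/- 2% margin of error for the poll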

Or imagine this more “real world” application example. Think of a large manufacturing plant. A pretty common maintenance strategy in a plant like this is to have people perform periodic checks on motors, belts, fans, etc. They take readings from gauges, write them on a clipboard, and maybe the information is entered into a computer that can analyze trends to check for out-of-range parameters.

In the IoT(Internet of Things) world, we now have cheap, network connected sensors on all of this equipment sending out readings every second. This data is fed through algorithms designed to analyze trends. The trends can tell you a fan is starting to go bad and needs to be replaced 2 or 3 days before most humans could.

Correlation

Big Data is all about correlation, and this can be a sticking point for some people. As humans, we love to look for a root cause. If we can’t find one, we often create one to satisfy our desire for one – hence the Roman volcano gods.

With Big Data, cause is not the name of the game. Analyzing massive streams of data, we can find correlations that help us make better decisions. Using the fan example above, the algorithms may pick up that if the fan begins drawing more current while maintaining the same speed, then the fan motor will soon fail. And using probabilistic algorithms, we can show you the increasing odds of it failing for each minute, hour, or day you ignore it.

This we can do, with great certainty in many instances. But what we can’t do is tell you why it is happening. That is for someone else; we leave the whys to the engineers and the academics. We are happy enough knowing If A Then B.


Correlation can also show us relationships we never knew existed. Did you know Pop-Tarts are hot sellers in the days leading up to a hurricane? Walmart does. They even know which flavor (I want to say strawberry, but don’t quote me on that).

This pattern-finding power can be found everywhere from Walmart checkouts to credit card fraud detection, dating sites, and even medicine.

How I Found Love Using Pivot Tables

Okay, a little background is in order here. I work for a Clinical Engineering Department in a large hospital system. Our main purpose is to inspect and repair medical equipment, from MRIs and CT Scanners down to IV pumps.

Now I know the title says love, and I promise there is a love connection, just be patient.

While doing some database work on the system we use to track repairs, I decided to do a little data exploration (I don’t have a lot of hobbies). I asked myself, “What equipment is breaking down the most?” I figured this could be a valuable piece of information, so I exported six months’ worth of repair history into a CSV file.

Using Excel, I started playing with Pivot Tables. I started by checking to see what types of equipment seemed to break down the most. Turns out it was infusion pumps, not a real surprise to anyone who has ever worked in the field.

[pivot table: work orders by equipment type]

But I looked a little more closely. One hospital in my system used Brand A pumps and wanted to get rid of them out of the belief that they were unreliable. However, a quick look at the data proved otherwise.

[pivot table: pump repairs broken out by brand]

Okay, so after I unsullied the reputation of the Brand A pump, I decided “Why not look at the repair rates of individual pieces of equipment?” (I know, I live a WILD life)

Below is the list from one of my hospitals. Pay special attention to the area highlighted in red. The number of repair work orders opened for dental chairs was way beyond anything my 20-plus years of experience in the field would have led me to expect.

[pivot table: work orders by device, with dental chairs highlighted in red]

So I decided to dig a little further. Well, it turns out all 78 work orders were opened by one of our younger (24-year-old) single technicians. A quick walk up to the dental department quickly explained why the dental chairs needed so much attention. The problem was about 5’2″, long blond hair, a cute smile, and (even in scrubs) quite the little body.

So there you go. Young love revealed itself through the power of the pivot table.

8 Software Packages You Should Learn for Data Science

1. Excel

Before everyone starts booing me, Excel is a highly underrated tool. I blame it on over-familiarity. We have all seen our computer-illiterate co-worker using Excel to make up a phone list for their kid’s little league. It is hard to imagine that same tool can be a powerhouse when it comes to data.

Take some time to learn data cleaning and manipulation methods in Excel. Download the following free add-ons: Power Query, Power Pivot, and Solver. These tools offer a lot of the same functionality as a $30K license to SAS, admittedly on a smaller scale.

2. R

R is a statistical programming package. While far from user friendly – you are going to have to learn to program here – you will be hard pressed to find something that R cannot do. Even better, it is free. As part of the open source community, R is constantly being updated and libraries are being created to cover almost anything you can imagine.

3. Python

Python is another free programming language that is quite popular in the data science world. Unlike R, Python is not specifically designed for data and statistics. It is a full-fledged object-oriented programming language that has plenty of uses in the computer world.

If you spend even a little time around Data Science forums, you will see the battle of R vs Python play out over and over again. My advice – take a little time to learn the fundamentals of both.

4. MS SQL Server

I chose MS SQL Server over Oracle for 2 reasons: 1) MS SQL Server is 100 times easier to install and configure for most computer users, and 2) while Oracle’s PL/SQL is definitely a more robust language, Microsoft’s T-SQL is easier to pick up.

My one caveat here would be if your goal is to become a database designer or database engineer. In that case, I would suggest learning Oracle first, as all the fundamentals you develop there will translate easily to MS SQL Server.

Another great advantage of MS SQL Server is the developer’s edition. For around $70, you can purchase a developer’s license which gives you the whole suite of tools including: SSIS and SSRS.

5. SSIS

SSIS (SQL Server Integration Services) is a robust ETL (Extract, Transform, Load) tool built around the Microsoft Visual Studio platform. It comes included with the developer’s edition and provides a graphical interface for building ETL processes.

6. Tableau

Until further notice, Tableau is the reigning king of data visualization. Just download a copy of the free Tableau Public and you will wonder why you spent all that effort fighting with Excel charts all these years.

While Tableau’s analytics tools leave a lot to be desired, the time it will save you in data exploration will have you singing its praises.

7. Qlik

Admittedly, Qlik is focused more as an end user BI tool, but it provides robust data modeling capabilities. Also, Qlik’s interactive dashboards are not only easy to construct, but leave the competition in the dust when it comes to ease of use.

8. Hadoop

It’s Big Data’s world; we just live here. If you want to be a data professional in this century, you are going to need to become familiar with Big Data platforms. My suggestion is to download the Hortonworks Sandbox edition of Hadoop. It is free, and they provide hours’ worth of tutorials. Time spent learning Pig and Hive scripting (Big Data query languages) will be well worth your effort.

What about SAS, SPSS, Cognos, blah, blah, blah…

While these packages do have a dominant position in the world of analytics, they are not very generous about offering free or low-cost versions for people wanting to get into the profession. I wanted to fill my list with software that can be obtained for little or no cost.

If you are currently a college student, you are in a better position. Check with your college for software partnerships. Also, check with software vendors to see if they have student editions. As of this writing, I know Tableau, SAS, and Statistica offer student versions that you may want to look into.