The AI Journey – Career advice for aspiring data professionals

A past co-worker of mine by the name of Antonio Ivanovski (AI) has written a free e-book giving some great advice on how to get your foot in the door as a data professional and how to make some smart moves to really improve your job satisfaction as well as earning potential. He not only talks the talk, but walks the walk. He has managed to quadruple his salary in a little under 5 years moving from UPS to Verizon, and now (as of this writing) he is working as a Senior Data Analyst for Google.

If you are looking to get into the field or looking for advice on how to move up, his free e-book is worth a read.

You can find it here: AI with AI

You can also find him on LinkedIn. Send him a connection request and tell him you want to know all about his Macedonian Battle Llama.

SQL: Create a temporary table

Temporary tables are a great way of working on complex data requests. They are easy to create and they delete themselves after every session, so you do not have to worry about creating a big mess with a bunch of tables you need to go clean up later.

In this tutorial, I am going to use a real-world example from my work in Verizon's Cyber Security Department. This is a simplified version of the ask, and I am using completely made-up data. There is never any data from Verizon on my website; I simply discuss use cases to make learning analytics more grounded in the real world.

Below is a list of dates, PhoneNum (the phone numbers called about), and CallerNum (the number the person is calling from). While there are many legitimate reasons for someone to call customer support from another number (I drop and break my phone, so I borrow my co-worker's phone and call customer support to request a replacement), a number that calls in repeatedly about many different numbers is a red flag for someone who could be a fraudster.

If you want to play along, you can download the data set here:

I am using MySQL as my database in this example, but I will include the code for the SQL Server, Teradata, and Oracle platforms as well.

So the ask is: find CallerNums that are calling about many different PhoneNums.

While I am sure you could write a complex subquery to do this job, I'm going to show you how to use temporary tables to make this ask very simple.

For this example, I loaded the data into a table called dbtest.numberslist

Now to find out how many CallerNum are calling about multiple PhoneNum, a simple solution is to get a list of all distinct combinations of PhoneNum and CallerNum and then do a count of CallerNums from this distinct list. Since the list is distinct, a CallerNum calling in about the same PhoneNum will only appear once, so a CallerNum calling about multiple PhoneNums will appear multiple times.

So using temporary tables, I will create a table that holds the distinct call combinations

MySQL (the pattern is create temporary table <table name>, followed by the query that fills the table):

create temporary table distCalls
select distinct phonenum, callerNum from dbtest.numberslist;

select * from distCalls; -- shows what is in the table now

Now, let's see if we can find potential fraud callers by doing a count of CallerNums from the distinct temporary table.
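The screenshot of this step is not reproduced here, but the query is just a GROUP BY against the temporary table. A sketch, using the column names from above:

select callerNum, count(*) as numPhoneNums
from distCalls
group by callerNum
having count(*) > 1 -- only show callers tied to more than one PhoneNum
order by numPhoneNums desc;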

Running that count, you will see there are 4 numbers that have called about 4 distinct phone numbers during this time period. Again, this could be for legitimate reasons, but it is still something we look at when trying to find questionable activity.

MySQL Code

create temporary table distCalls
select distinct phonenum, callerNum
from dbtest.numberslist;

SQL Server

Select distinct PhoneNum, CallerNum 
into #distCalls
from dbtest.numberslist
go 

#tableName -- the # prefix indicates a temporary table in SQL Server

Teradata

Create volatile table distCalls as (
select distinct PhoneNum, CallerNum
from dbtest.numberslist)
with data
on commit preserve rows;

The with data and on commit preserve rows clauses are needed at the end if you want any data to be in your table when you go to use it.

Oracle

Create private temporary table ora$ptt_distCalls
on commit preserve definition as
select distinct PhoneNum, CallerNum
from dbtest.numberslist;

Note: Oracle (18c and later) requires private temporary table names to start with the ora$ptt_ prefix by default, and the table definition is dropped at commit unless you add on commit preserve definition.

Remember, temp tables delete themselves after each session (each time you log off the database). If you are working in the same session and need to recreate the temp table for some reason, you can always drop the table just as you would any other table object in SQL.

Where to find real data sets for data science training

Practice makes perfect. But finding good data sets to practice with can be a pain. Here is my current list of the best places to find practice data sets:

Python Web Scraping: Using Selenium to automate web interaction

This is a follow-up to how to connect to Chrome using Selenium. If you do not know how to get to a website in Chrome using Selenium, go here

To refresh, here is the code we used to open up a web page (in this case, Wikipedia's home page):
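The code from that earlier lesson is not reproduced here, but a minimal version looks something like this (assuming chromedriver is installed and on your PATH, and using the same dr variable name used throughout this lesson):

from selenium import webdriver

dr = webdriver.Chrome()               # launches Chrome; assumes chromedriver is on your PATH
dr.get('https://www.wikipedia.org')   # open Wikipedia's home (language portal) page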

If you run this code, you should find yourself on the home page for Wikipedia

Okay, so now let's learn how to interact with the page. The first thing I am going to do is select the English language version of the page. There are a few ways to go about this, but one of the easier approaches is to look at the HTML code that creates the page and use XPaths or titles to find the element you are looking for.

Right click on the link for English and click Inspect from the drop-down menu.

If you get a body link first, you might need to right click and hit inspect again

To check if you have the right element, hover your mouse over it, and it will be highlighted on the webpage

Once you have the right element, right click on it and go to Copy > Copy XPath.

Choose Copy XPath, not Copy full XPath; it makes for easier coding. Your XPath should look something like this: //*[@id="js-link-box-en"]/strong *When you go to try this, your XPath may look different. As websites are constantly updated, many of the XPaths get updated as well. Go with the one you find when you inspect the HTML code yourself.

Now we are going to use Selenium to "find" the element we want. The code is dr.find_element_by_xpath('//*[@id="js-link-box-en"]/strong') *Note the use of single quotes around the XPath; it is better to use them, as many XPaths will contain double quotes.

Once you have run that code, Selenium knows what element you are looking at, and you can interact with it. Let's "click" the link:
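The screenshot of this step is not shown here. In code it looks something like this (using the older find_element_by_xpath call this lesson uses; newer Selenium versions use find_element(By.XPATH, ...) instead):

link = dr.find_element_by_xpath('//*[@id="js-link-box-en"]/strong')  # your XPath may differ
link.click()  # follow the English link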

Note something I did in the code: I added link = before my find-element command. This assigns the element to a variable. I can now use the click() method of the Selenium element object to click on the English link.

I could have just done this: dr.find_element_by_xpath('//*[@id="js-link-box-en"]/strong').click()

But by assigning the element to a variable, a) the code is cleaner and b) the link can be reused later. Remember, it is a law of programming that you will always have to go back and fix something you haven't seen in 6 months, so make the code as clean as possible to make future you less likely to develop a drinking problem due to having to fix poorly written code.

If you run the code above, you will move to the English home page.

Let's try one more thing and type a search into the search bar:

Right click > Inspect on the search bar, then right click > Copy > Copy XPath on the highlighted selection in the HTML code.

Now that you have the XPath, let's use the find_element_by_xpath code and a new command, send_keys(), to input characters into the search box:
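A sketch of that step (the XPath here is a placeholder; paste in the one you copied from your own inspection):

search = dr.find_element_by_xpath('PASTE-YOUR-SEARCH-BAR-XPATH-HERE')  # placeholder XPath
search.send_keys('Data Science')  # type the search term into the box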

Finally, right click on the magnifying glass > Inspect > Copy > Copy XPath, and let us click on it to finish our search. (Remember to hover over the element to make sure you have the right one.)
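And a sketch of the final click (again, the XPath is a placeholder for the one you copy yourself):

button = dr.find_element_by_xpath('PASTE-YOUR-SEARCH-BUTTON-XPATH-HERE')  # placeholder XPath
button.click()  # run the search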

Now you should find yourself on the Data Science page of Wikipedia

Now remember — the XPaths I have on this page will likely be out of date by the time you try this, so make sure to inspect the elements and get the correct XPaths for this to work for you.

Python: Confusion Matrix

What is a confusion matrix?

A confusion matrix is a supervised machine learning evaluation tool that provides more insight into the overall effectiveness of a machine learning classifier. Unlike a simple accuracy metric, which is calculated by dividing the number of correctly predicted records by the total number of records, confusion matrices return 4 unique metrics for you to work with.

While I am not saying accuracy is always misleading, there are times, especially when working with imbalanced data, that accuracy can be all but useless.

Let's consider credit card fraud. It is not uncommon that, given a list of credit card transactions, a fraud event might make up as little as 1 in 10,000 records. This is referred to as severely imbalanced data. Now imagine a simple machine learning classifier running through that data and simply labeling everything as not fraudulent. When you checked the accuracy, it would come back as 99.99% accurate. Sounds great, right? Except you missed the fraud event, the only reason to create the model in the first place.

A confusion matrix will show you more details, letting you know that you completely missed the fraud event. Instead of a single number result, a confusion matrix provides you with 4 metrics to evaluate. (Note: the minority class – in the case of fraud, the fraudulent events – is labeled positive by confusion matrices, so a non-fraud event is a negative. This is not a judgment between the classes, only a naming convention.)

TP = true positive – minority class (fraud) correctly predicted as positive

FP = false positive – majority class (not fraud) incorrectly predicted as positive

FN = false negative – minority class (fraud) incorrectly predicted as negative

TN = true negative – majority class (not fraud) correctly predicted as negative

In matrix form:

                  predicted negative   predicted positive
actual negative          TN                   FP
actual positive          FN                   TP

To run a confusion matrix in Python, sklearn provides a function called confusion_matrix(y_test, y_pred), where:

y_test = actual results from the test data set

y_pred = predictions made by model on test data set

so in a pseudocode example:

model.fit(X_train, y_train)
y_pred = model.predict(X_test)

If this is at all confusing, refer to my Python SVM lesson where I create the training and testing set and run a confusion matrix (Python: Support Vector Machine (SVM))

To run a confusion matrix in Python, first run a model, then run predictions (as shown above) and then follow the code below:

from sklearn.metrics import confusion_matrix
confusion_matrix(y_test, y_pred)

The output is a 2 x 2 array of counts, laid out as in the matrix above: TN and FP on the first row, FN and TP on the second.
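If you want a self-contained toy example to see the layout, here is one with made-up labels (not data from any real model):

from sklearn.metrics import confusion_matrix

y_test = [0, 0, 1, 1, 0, 1]  # made-up actual labels
y_pred = [0, 1, 1, 0, 0, 1]  # made-up predictions
print(confusion_matrix(y_test, y_pred))
# [[2 1]
#  [1 2]]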

Now, if you want to capture the TP, TN, FP, FN into individual variables to work with, you can add the ravel() function to your confusion matrix:

TN,FP,FN,TP = confusion_matrix(y_test, y_pred).ravel()

Thank you for taking the time to read this, and good luck on your analytics journey.

Python: Support Vector Machine (SVM)

Support Vector Machine (SVM):

A Support Vector Machine, or SVM, is a popular binary classifier machine learning algorithm. For those who may not know, a binary classifier is a predictive tool that returns one of two values as the result, (YES – NO), (TRUE – FALSE), (1 – 0).  Think of it as a simple decision maker:

Should this applicant be accepted to college? (Yes – No)

Is this credit card transaction fraudulent? (Yes – No)

An SVM predictive model is built by feeding a labeled data set to the algorithm, making this a supervised machine learning model. Remember, when the training data contains the answer you are looking for, you are using a supervised learning model. The goal, of course, of a supervised learning model is that once built, you can feed the model new data which you do not know the answer to, and the model will give you the answer.

Brief explanation of an SVM:

An SVM is a discriminative classifier. It is actually an adaptation of a previously designed classifier called the perceptron. (The perceptron algorithm also helped to inform the development of artificial neural networks.)

The SVM works by finding the optimal hyperplane that can be used to discriminate between classes in the data set. (Classes refers to the label or "answer" column of each record, the true/false or yes/no column in a binary set.) When considering a two-dimensional model, the hyperplane simply becomes a line that divides the two classes of data.

The hyperplane (or line in 2 dimensions) is informed by what are known as Support Vectors. A record from the data set is converted into a vector when fed through the algorithm (this is where a basic understanding of linear algebra comes in handy). Vectors (data records) closest to the decision boundary are called Support Vectors. It is on either side of this decision boundary that a vector is labeled by the classifier.

The focus on the support vectors, and where they deem the decision boundary to be, is what informs the SVM where to place the optimal hyperplane. It is this focus on the support vectors, as opposed to the data set as a whole, that gives the SVM an advantage over a simple learner like linear regression when dealing with complex data sets.

Coding Exercise:

Libraries needed:

sklearn

pandas

This is the main reason I recommend the Anaconda distribution of Python, because it comes prepackaged with the most popular data science libraries.

from sklearn.model_selection import train_test_split
from sklearn.svm import SVC
from sklearn import metrics
from sklearn.metrics import confusion_matrix
import pandas as pd

Next, let’s look at the data set. This is the Pima Indians Diabetes data set. It is a publicly available data set consisting of 768 records. Columns are as follows:

  1. Number of times pregnant.
  2. Plasma glucose concentration at 2 hours in an oral glucose tolerance test.
  3. Diastolic blood pressure (mm Hg).
  4. Triceps skinfold thickness (mm).
  5. 2-Hour serum insulin (mu U/ml).
  6. Body mass index (weight in kg/(height in m)^2).
  7. Diabetes pedigree function.
  8. Age (years).
  9. Class variable (0 or 1).

Data can be downloaded with the link below

pima_indians

Once you download the file, load it into Python (your file path will be different):

df = pd.read_excel('C:\\Users\\blars\\Documents\\pima_indians.xlsx')

now look at the data:

df.head()

[Output of df.head(): the first five rows of the Pima Indians data set]

Now keep in mind, class is our target. That is what we want to predict.

So let us start by separating the target class.

We use the pandas command .pop() to remove the Class column into the y variable; the remainder of the dataframe is now in X.
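In code, that looks something like this (a sketch, assuming the target column is named Class as in the data description above):

y = df.pop('Class')  # pop removes the target column from df and returns it
X = df               # the remaining columns become the feature set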

Let’s now split the data into training and test sets:

X_train, X_test, y_train, y_test = train_test_split(X,y, test_size=0.33)

Now we will train (fit) the model. In this example I am using sklearn's SVC() model for the SVM. There are a lot of SVM implementations available to try if you would like to explore deeper.

Code for fitting the model:

model =SVC()
model.fit(X_train, y_train)

Now using the testing subset we withheld, we will test our model

y_pred = model.predict(X_test)

Now, to see how good the model is, we will perform an accuracy test. This simply takes all the correct guesses and divides them by the total number of guesses.

Comparing y_pred (predicted values) against y_test (actual values), we get .7677, or 77% accuracy, which is not a bad model for simply using defaults.
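The screenshot of that step is not reproduced here; a sketch of the accuracy check, using the metrics module imported earlier:

print(metrics.accuracy_score(y_test, y_pred))  # roughly 0.77 on this split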


Let’s look at a confusion matrix to get just a little more in-depth info

[[151  15]
 [ 44  44]]

For those not familiar with a confusion matrix, this will help you to interpret results:

First number, 151 = True Negatives — the number of 0's (non-diabetics) correctly predicted

Second number, 15 = False Positives — the number of 0's (non-diabetics) falsely predicted to be 1's

Third number, 44 = False Negatives — the number of 1's (diabetics) falsely predicted to be 0's

Fourth number, 44 = True Positives — the number of 1's (diabetics) correctly predicted.

So the model correctly identified 44 of the 88 diabetics in the test data, missed the other 44, and misdiagnosed 15 of the 166 non-diabetics as diabetic.

To see a video version of this lesson, click the link here: Python: Build an SVM

Ensemble Modeling

In the world of analytics, modeling is a general term used to refer to the use of data mining (machine learning) methods to develop predictions. If you want to know which ad a particular user is more likely to click on, or which customers are likely to leave you for a competitor, you develop a predictive model.

There are a lot of models to choose from: Regression, Decision Trees, K Nearest Neighbor, Neural Nets, etc. They all will provide you with a prediction, but some will do better than others depending on the data you are working with. While there are certain tricks and tweaks one can do to improve the accuracy of these models, it never hurts to remember the fact that there is wisdom to be found in the masses.

The Jelly Bean Jar

I am sure everyone has come across some version of this in their life: you are at a fair or school fundraising event and someone has a large see-through jar full of jelly beans (or marbles or nickels). Next to the jar are some slips of paper with the instructions to "Guess the number of jelly beans in the jar and you win!"

An interesting thing about this game, and you can try this out for yourself, is that given a reasonable number of participants, more often than not, the average guess of the group will perform better than the best individual guesser. Or in other words, imagine there are 200 jelly beans in the jar and the best guesser (the winner) guesses 215. More often than not, the average of all the guesses might be something like 210 or 190. The group cancels out its over and under guessing, resulting in a better answer than any one individual.

How Do We Get the Average in Models?

There are countless ways to do it, and researchers are constantly trying new approaches to get that extra 2% improvement over the last model. For ease of understanding, though, I am going to focus on 2 very popular methods of ensemble modeling: Random Forests & Boosted Trees.


Random Forests:

Imagine you have a data set containing 50,000 records. We will start by randomly selecting 1000 records and creating a decision tree from those records. We will then put the records back into the data set and draw another 1000 records, creating another decision tree. The process is repeated over and over again for a predefined number of iterations (each time the data used is returned to the pool where it could possibly be picked again).

After all the sample decision trees have been created (let's say we created 500 for the sake of argument), the model then takes the mean (average) of all the trees' predictions if you are looking at a regression, or the mode (majority vote) if you are dealing with a classification.

For those unfamiliar with the terminology, a regression model looks for a numeric value as the answer. It could be the selling price of a house, a person’s weight, the price of a stock, etc. While a classification looks for classifying answers: yes or no, large – medium – small, fast or slow, etc.
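Here is a minimal sketch of what a random forest looks like in code, using sklearn's RandomForestClassifier on made-up data (the sample size and parameter values are arbitrary, chosen just to show the idea):

from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split

# made-up data standing in for the 50,000-record example above
X, y = make_classification(n_samples=5000, n_features=10, random_state=42)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.33, random_state=42)

model = RandomForestClassifier(n_estimators=500)  # 500 trees, each built on a bootstrap sample
model.fit(X_train, y_train)
print(model.score(X_test, y_test))  # mean accuracy on the held-out data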

Boosted Trees:

Another popular method of ensemble modeling is known as boosted trees. In this method, a simple (weak learner) tree is created – usually 3-5 splits, maybe. Then another small tree (3-5 splits) is built focusing on the incorrect predictions of the first tree. This is repeated multiple times (say 50 in this example), building layers of trees, each one getting a little bit better than the one before it. All the layers are combined to make the final predictive model.
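A comparable sketch for boosted trees, again on made-up data, this time using sklearn's GradientBoostingClassifier (shallow trees built in sequence, each one focusing on the errors of the trees before it):

from sklearn.datasets import make_classification
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.model_selection import train_test_split

X, y = make_classification(n_samples=5000, n_features=10, random_state=42)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.33, random_state=42)

model = GradientBoostingClassifier(n_estimators=50, max_depth=3)  # 50 rounds of small (weak) trees
model.fit(X_train, y_train)
print(model.score(X_test, y_test))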

Oversimplified?

Now I know this may be an oversimplified explanation, and I will create some tutorials on actually building ensemble models, but sometimes I think just getting a feel for the concept is important.

So are ensemble models always the best? Not necessarily.

One thing you will learn when it comes to modeling is that no one method is the best. Each has its own strengths. The more complex the model, the longer it takes to run, so sometimes you will find speed outweighs the desire for the added 2% accuracy bump. The secret is to be familiar with the different models and to try them out in different scenarios. You will find that choosing the right model can be as much of an art as a science.

Simpson’s Paradox: How to Lie with Statistics

We've all heard the saying attributed to Benjamin Disraeli: "Lies, damn lies, and statistics."

While statistics has proven to be of great benefit to mankind in almost every endeavor, inexperienced, sloppy, and downright unscrupulous statisticians have made some pretty wild claims. And because these wild claims are often presented as statistical fact, people in all industries – from business, to healthcare, to education – have chased these white elephants right down the rabbit hole.

Anyone who has taken even an introductory statistics course can tell you how easily statistics can be misrepresented. One of my favorite examples involves using bar charts to confuse the audience. Look at the chart below. It represents the number of games won by two teams in a season of beer league softball.

[Bar chart: games won by Team A vs. Team B, with the Y-axis starting well above zero]

At first glance, you might think Team B won twice as many games as Team A, and that is indeed the intention of the person who made this chart. But when you look at the numbers to the left, you will see Team A won 15 games to Team B’s 20. While I am no mathematician, even I know 15 is not half of 20.

This deception was perpetrated by simply adjusting the starting point of the Y-axis. When you reset it to 0, the chart tells a different story.

[Bar chart: the same data with the Y-axis starting at 0]

Even Honest People Can Lie by Accident

In the example above, the person creating the chart was manipulating the data on purpose to achieve a desired effect. You may look at this and say you would never deceive people like that. The truth is, you just might do it by accident.

What do I mean? Let’s take an example from an industry fraught with horrible statistics – our education system.

Below you will find a chart depicting the average math scores on a standardized test since 2000 for Happy Town, USA. You will notice the test scores are significantly lower now than they were back in 2000.

[Line chart: average standardized math test scores in Happy Town, USA, falling from around 90 in 2000 to close to 70 today]

What does this mean? Are the kids getting stupider? Has teacher quality gone down? Who should be held accountable for this? Certainly those lazy tenured teachers who are only there to collect their pensions and leach off the tax payers.

I mean look at the test scores. The average score has dipped from around 90 to close to 70. Surely something in the system is failing.

Now what if I were to tell you that the chart above – while correct – does not tell the whole story. Test scores in Happy Town, USA are actually up – if you look at the data correctly.

What we are dealing with is something known in statistics as Simpson's Paradox, and even some of the brightest academic minds have published research that ignored this very important concept.

What do I mean?

Let me tell you the whole story about Happy Town, USA. Happy Town was your average American middle-class town. In 2000, the economic make-up of the town was: 20% of families made over $150K, 60% made between $50K and $150K, and 20% earned less than $50K a year.

In 2008, that all changed. The recession hit causing people to lose their jobs and default on their mortgages. Families moved out, housing prices fell. Due to the new lower housing prices, families from Non-So Happy Town, USA were able to afford houses in Happy Town. They moved their families there in hopes of a better education and better life for their children.

While the schools in Happy Town were better, the teachers were not miracle workers. These kids from Not So Happy Town did not have the strong educational foundation the pre-recession residents of Happy Town did. Many teachers found themselves starting almost from scratch.

No matter how hard these new kids and their teachers tried, they could never be expected to jump right in and perform as well as the pre-2008 Happy Town kids. The economic makeup of the town shifted. The under-$50K group now represents 60% of the town's population, with the $50K-$150K group making up only 30% and the top earners dwindling down to 10%.

So while taking an average of all the students is not a sign of someone necessarily trying to pull the wool over your eyes, it does not tell the whole story.

To see the whole story, and to unravel Simpson's Paradox, you need to look at the scores across the different economic sectors of this town, which has undergone drastic changes.

[Chart: average test scores broken out by economic sector, with every sector improving since 2000]

Looking at it from the standpoint of economic sector, you will see the scores in each sector have improved, with the under-$50K group improving at an impressive rate. Clearly the teachers and staff at Happy Town School are doing their job and then some.
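To see the arithmetic behind the paradox, here is a small sketch with made-up scores (these are not the actual Happy Town numbers, only an illustration of how every group can improve while the overall average falls):

# population shares and average scores by economic group (all numbers hypothetical)
shares_2000 = {'over_150k': 0.20, 'mid': 0.60, 'under_50k': 0.20}
scores_2000 = {'over_150k': 95, 'mid': 90, 'under_50k': 60}

shares_now = {'over_150k': 0.10, 'mid': 0.30, 'under_50k': 0.60}
scores_now = {'over_150k': 96, 'mid': 92, 'under_50k': 65}  # every group improved

avg_2000 = sum(shares_2000[g] * scores_2000[g] for g in shares_2000)
avg_now = sum(shares_now[g] * scores_now[g] for g in shares_now)
print(avg_2000, avg_now)  # 85.0 vs 76.2: the overall average falls anyway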

So while the person who took the average of the whole school may not have intended to lie with their statistics, a deeper dive into the numbers showed that the truth was hidden inside the aggregate.

Keep this in mind next time someone shows you falling SAT scores, crime stats, or disease rates. All of these elements are easily affected by a shift in demographics. If you don’t see the breakdown, don’t believe the hype.

Feedback Loops in Predictive Models

Predictive models are full of perilous traps for the uninitiated. With the ease of use of some modeling tools like JMP or SAS, you can literally point and click your way into a predictive model. These models will give you results. And a lot of times, the results are good. But how do you measure the goodness of the results?

I will be doing a series of lessons on model evaluation. This is one of the more difficult concepts for many to grasp, as some of it may seem subjective. In this lesson I will be covering feedback loops and showing how they can sometimes improve, and other times destroy, a model.

What is a feedback loop?

A feedback loop in modeling is where the results of the model are somehow fed back into the model (sometimes intentionally, other times not). One simple example might be an ad placement model.

Imagine you built a model determining where  on a page to place an ad based on the webpage visitor. When a visitor in group A sees an ad on the left margin, he clicks on it. This click is fed back into the model, meaning left margin placement will have more weight when selecting where to place the ad when another group A visitor comes to your page.

This is good, and in this case – intentional. The model is constantly retraining itself using a feedback loop.

When feedback loops go bad…

Gaming the system.

Build a better mousetrap.. the mice get smarter.

Imagine a predictive model  developed to determine entrance into a university. Let’s say when you initially built the model, you discovered that students who took German in high school seemed to be better students overall. Now as we all know, correlation is not causation. Perhaps this was just a blip in your data set, or maybe it was just the language most commonly offered at the better high schools. The truth is, you don’t actually know.

How can this be a problem?

Competition to get into universities (especially highly sought after universities) is fierce to say the least. There are entire industries designed to help students get past the admissions process. These industries use any insider knowledge they can glean, and may even try reverse engineering the admissions algorithm.

The result – a feedback loop

These advisers will learn that taking German greatly increases a student’s chance of admission at this imaginary university. Soon they will be advising prospective students (and their parents) who otherwise would not have any chance of being accepted into your school, to sign up for German classes. Well now you have a bunch of students, who may no longer be the best fit, making their way past your model.

What to do?

Feedback loops can be tough to anticipate, so one method to guard against them is to retrain your model every once in a while. I even suggest retooling the model (removing some factors in an attempt to determine whether a rogue factor – e.g. the German class – is holding too much weight in your model).

And always keep in mind that these models are just that – models. They are not fortune tellers. Their accuracy should constantly be criticized and methods questioned. Because while ad clicks or college admissions are one thing, policing and criminal sentencing algorithms run the risk of being much more harmful.

Left unchecked, the feedback loop of a predictive criminal activity model in any large city in the United States will almost always teach the computer to emulate the worst of human behavior – racism, sexism, and class discrimination.

Since minority males from poor neighborhoods disproportionately make up our current prison population, any model that takes race, sex, and economic status into account will inevitably determine that a 19 year old black male from a poor neighborhood is a criminal. We will have then violated the basic tenet of our justice system – innocent until proven guilty.