ETL: Extract, Transform, Load

If you ever actually intend to work with data in the real world, you will quickly come face to face with two truths:

  1. Data is messy
  2. The data you need is never in one place

While creating predictive models and showing off your brilliant math skills is infinitely more sexy, you cannot even begin to build a simple linear regression until you have all your data together and in the correct format.

If you haven't heard it before, it is a pretty common estimate that a data scientist spends 60 to 70 percent of their time working as a data "janitor". On a smaller scale, most of this data manipulation and merging (often referred to as munging) can be done manually, but in a professional environment you are going to want to automate as much of the process as possible.

ETL Tools

This is where ETL Tools come into play. Some of the more popular ones on the market are:

  • Oracle Warehouse Builder (OWB)
  • SAS Data Management
  • Informatica PowerCenter
  • SQL Server Integration Services (SSIS)

SSIS is the one I am most familiar with (also the one I use professionally) so it will be the one I will be focusing on in future lessons.

What these ETL Tools do is help you automate the data munging process.

Extract:

When discussing ETL, extract means pulling data from a source (or many sources). Consider this: you have a Customer table residing in a SQL Server database, a Sales Person table sitting in a Teradata database, and a list of purchases kept in an Excel spreadsheet.

Using an ETL tool like SSIS, you can connect to these three different data sources and query the data (Extract) into memory for you to work with.

Transform:

Transform in ETL refers to manipulating the data.

Pulling from the example above, in the SQL Server table you have

Cust_ID         Cust_NM
9081            Bob Smith

The Excel sheet says

Cust_ID         Cust_NM
CN9081          Bob Smith

In the Excel spreadsheet, someone has put a CN in front of the Cust_ID. This is a common problem you will run across when working with real-world data. Using an ETL tool, you can set up steps that will either remove the CN or add it to the other source – your prerogative.

Load:

Finally, once you have extracted the data from the different sources and transformed it so the differences in the data are reconciled, you come to the final step: loading it into a location of your choice – usually a data warehouse table.
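
To make the idea concrete outside of any particular tool, here is a minimal sketch of the same extract–transform–load pass in Python with pandas. The table contents and output file name are made up for illustration – in SSIS you would build these steps graphically rather than in code.

import pandas as pd

# Extract: in a real job these would come from SQL Server, Teradata, and Excel
sql_customers = pd.DataFrame({"Cust_ID": ["9081"], "Cust_NM": ["Bob Smith"]})
excel_customers = pd.DataFrame({"Cust_ID": ["CN9081"], "Cust_NM": ["Bob Smith"]})

# Transform: strip the stray "CN" prefix so the IDs from both sources match
excel_customers["Cust_ID"] = excel_customers["Cust_ID"].str.replace("^CN", "", regex=True)

# Load: combine the cleaned data and write it to a destination of your choice
combined = pd.concat([sql_customers, excel_customers], ignore_index=True).drop_duplicates()
combined.to_csv("customer_warehouse.csv", index=False)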

Splunk: Introduction to Real Time Data Analysis – Setting Alerts

Splunk really shows its power in the realm of real-time analysis of unstructured data. A professional implementation of Splunk involves some sort of machine-produced data stream being fed into Splunk. This could be web clicks, social media feeds, sensor readings from mechanical devices, log files, etc.

In our example we are going to be working with computer log files I pulled from a server.

Download Data File:  event

  1. Log into Splunk and select Add Data from the home screen.
  2. Select Upload.
  3. Select File.
  4. Select the event.csv file I provided.
  5. Select Next.
  6. Select Next again (since we are on a local machine, the default Input settings will do fine).
  7. Finally, hit Submit.

Now that we have the full log file loaded, let's try filtering the data down a little.

Put error* in the search bar (* is a wildcard in Splunk).

Now only errors are showing up in our data.

Now try the following in the search bar

error* terminal*

Now my log files are filtered down to TerminalServices Errors only.


Notice the top bar graph. It shows an abnormal increase in these particular errors in October.


Setting Alerts

It seems like there has been an abnormal increase in these particular errors. Wouldn't it be nice to know if these errors started firing off again?

Splunk lets us set Alerts to do just that. Above the bar graph, you will find a Save As drop-down menu – click on it and then select Alert.

Give the Alert a name.

I don't want to run this on a schedule. Instead, I clicked Real-time.

I set the Trigger Conditions to go off when more than 50 of these errors appear.

Under Add Trigger Actions, I selected Add to Triggered Alerts.

Select your Severity.

Now the Alert is saved.

If you select Alerts in the top menu, you can see the newly saved alert too.


 

 

 

 

R: Twitter Sentiment Analysis

Having a solid understanding of current public sentiment can be a great tool. When deciding if a new marketing campaign is being met warmly, or if a news release about the CEO is causing customers to get angry, people in charge of handling a company's public image need these answers fast. And in the world of social media, we can get those answers fast. One simple, yet effective, tool for testing the public waters is to run a sentiment analysis.

A sentiment analysis works like this. We take a bunch of tweets about whatever we are looking for (in this example we will be looking at President Obama). We then parse those tweets out into individual words and we count the number of positive words and compare it to the number of negative words.

Now the simplicity of this model misses out on some things. Sarcasm can easily be missed. Ex. "Oh GREAT job Obama. Thanks for tanking the country once again." Our model will count 2 positive words (great and thanks) and 1 negative word (tanking), giving us an overall score of positive 1.

There are more complex methods for dealing with the issue above, but you'll be surprised at how well the system works all by itself. Yes, we are going to misread a few tweets, but we have the ability to read thousands of tweets, so the larger volume of data negates the overall effect of the sarcastic ones.

The first thing we need to do is get a list of good and bad words. You could make up your own, but there are plenty of pre-populated lists available on the Internet for free. The one I will be using is from the University of Illinois at Chicago. You can find the list here:

http://www.cs.uic.edu/~liub/FBS/sentiment-analysis.html

Once you go to the page, click on Opinion Lexicon and then download the rar file.

You can download from the link below, but I want you to know the source in case this link breaks.

Now open the rar file and move the two text files to a folder you can work from.

Next let's make sure we have the right packages installed. For this we will need twitteR, plyr, stringr, and xlsx. If you do not have these packages installed, you can do so using the following code (just change out twitteR for whatever package you need to install).

install.packages("twitteR")

Now load the libraries

library(stringr)
library(twitteR)
library(xlsx)
library(plyr)

and connect to the Twitter API. If you do not already have a connection set up, check out my lesson on connecting to Twitter: R: Connect to Twitter with R

api_key<- "insert consumer key here"
api_secret <- "insert consumer secret here"
access_token <- "insert access token here"
access_token_secret <- "insert access token secret here"
setup_twitter_oauth(api_key,api_secret,access_token,access_token_secret)

Okay, so now remember where you stored the text files we just downloaded and set that location as your working directory (wd). Note that we use forward slashes here, even if you are on a Windows box.

setwd("C:/Users/Benjamin/Documents")
neg = scan("negative-words.txt", what="character", comment.char=";")
pos = scan("positive-words.txt", what="character", comment.char=";")

scan reads through the text files and pulls in the words, ignoring comment lines that start with ;

You should now have 2 lists of positive and negative words.

You can add words to either list using a vector operation. Below I added wtf – a popular Internet abbreviation for What the F@#$@ – to the negative words.

neg = c(neg, 'wtf')

Okay, now here is the engine that runs our analysis. I have tried to add comments explaining what the commands you may not recognize are doing. I have lessons on most features listed here, and will make more lessons on anything missing. If I were to explain this step by step, this page would be 10,000 lines long and no one would read it.

score.sentiment = function(tweets, pos.words, neg.words)
{
  require(plyr)
  require(stringr)

  scores = laply(tweets, function(tweet, pos.words, neg.words) {

    tweet = gsub('https://', '', tweet)         # removes https://
    tweet = gsub('http://', '', tweet)          # removes http://
    tweet = gsub('[^[:graph:]]', ' ', tweet)    # removes graphic characters like emoticons
    tweet = gsub('[[:punct:]]', '', tweet)      # removes punctuation
    tweet = gsub('[[:cntrl:]]', '', tweet)      # removes control characters
    tweet = gsub('\\d+', '', tweet)             # removes numbers
    tweet = str_replace_all(tweet, "[^[:graph:]]", " ")

    tweet = tolower(tweet)                      # makes all letters lowercase

    word.list = str_split(tweet, '\\s+')        # splits the tweet into a list of words
    words = unlist(word.list)                   # turns the list into a vector

    pos.matches = match(words, pos.words)       # returns matching positions for words found in the positive list
    neg.matches = match(words, neg.words)

    pos.matches = !is.na(pos.matches)           # converts matches to TRUE or FALSE
    neg.matches = !is.na(neg.matches)

    score = sum(pos.matches) - sum(neg.matches) # TRUE and FALSE are treated as 1 and 0, so they can be summed

    return(score)

  }, pos.words, neg.words)

  scores.df = data.frame(score = scores, text = tweets)

  return(scores.df)
}

Now let’s get some tweets and analyze them. Note, if your computer is slow or old, you can lower the number of tweets to process. Just change n= to a lower number like 100 or 50

tweets = searchTwitter('Obama',n=2500)
Tweets.text = laply(tweets,function(t)t$getText()) # gets text from Tweets

analysis = score.sentiment(Tweets.text, pos, neg) # calls sentiment function

Now let's look at the results. The quickest method available to us is to simply run a histogram.

hist(analysis$score)

My results show most scores clustered around 0. If 0 is completely neutral, that means most people are generally neutral about the president, and more people have positive tweets than negative ones. This is not uncommon for an outgoing president – they generally seem to get a popularity boost after the election is over.

Finally, if you want to save your results, you can export them to Excel.

write.xlsx(analysis, "myResults.xlsx")

And you will end up with an Excel file containing each tweet's text and its score.

Benford’s Law: Fraud Detection by the Numbers

For most HTM professionals, accountancy is not your primary profession. Yet between contracts, capital equipment purchases, and parts spending, many HTM managers are, in effect, running multi-million dollar corporations. With high dollar items like imaging glassware and ultrasound probes, the potential for fraud runs high. A less-than-honest employee looking to supplement his or her income can take advantage of the chaos often found in a busy shop to help themselves to some high dollar items.

How can you guard yourself against wolves hidden amongst your flock?

The answer may lie in an interesting statistical phenomenon known as Benford's Law, also called the First-Digit Law. Benford's Law has been used by CPAs and fraud investigators to detect abnormalities since the 1930s.

How does it work?

Imagine I have a bag containing 9 ping pong balls labeled 1 – 9. I ask you to close your eyes, reach into the bag, and pull out a ball. What is the probability you will pull out a 5? Probability theory tells us the answer is 1/9 or approximately 11%. That is because 1 ball out of the 9 in the bag is labeled 5. The same thing applies to any number you choose.

So now we are going to replace the bag of balls with a spreadsheet containing parts purchases over the past year. Looking at the column with purchase prices, I am going to take the first non-zero digit from each row. So if the price is $130.00, I will take 1. If the price is $67.34, I will take 6. Now, I will throw all of those first digits into my imaginary bag, have you reach in blindfolded and pull out a 5.

What do you think the probability was you would pull out a five? 11%? Actually, it is about 7.9%.

Huh?

I know what you are thinking. I have a bag of random numbers 1-9. Shouldn’t the probability of pulling any digit be 1/9 (11%)? Well not according to Benford’s Law.

Benford's Law states that first digits taken from an organic set of numbers (i.e. numbers without artificial constraints) will follow a unique, logarithmically decreasing distribution.

So the probability of pulling a 1 is 30% while the probability of a 9 is under 5%.

The distribution is calculated using the following formula:

P(d) = log10(1 + 1/d)

where the probability P of a first digit d is calculated using a logarithm. Plugging in d = 1 gives about 30.1%, while d = 9 gives about 4.6%.

While it is a mathematical law, Benford's Law is a phenomenon, meaning it is not fully understood how or why it works. The simplest explanation I can offer you involves looking at the following table demonstrating 10% yearly compounded interest on a $1 investment.

[Table: a $1 investment compounded at 10% per year, showing each year's balance and its first digit]

Notice how the first digit stays 1 for 8 years, but only 4 years at 2, and 1 year by the time it reaches 9. If you were to continue through 10’s to 100’s you would notice the same pattern repeating itself.
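
You can reproduce the table yourself with a few lines of Python (this is just a quick sketch, not part of the original table):

# $1 compounded at 10% a year: print each year's balance and its first digit
balance = 1.00
for year in range(1, 26):
    print(year, round(balance, 2), str(balance)[0])
    balance = balance * 1.10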

How can I use this?

Simply enough, take your parts purchasing history, pull the first non-zero digit from each purchase, and create a histogram of your results. If it does not look like the distribution seen above, that is a good sign something is wrong. People trying to perpetrate fraud will often try to cover their tracks. Human intervention – altering order histories, removing purchases from the ledger – will alter the organic nature of your data and, in turn, the distribution of your first digits.
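
Here is a rough sketch of what that check could look like in Python. The purchase amounts below are just a made-up stand-in for your own purchasing export.

import math
import matplotlib.pyplot as plt

purchases = [130.00, 67.34, 1450.00, 23.99, 112.50, 875.25, 19.99, 1200.00]  # stand-in data

# take the first non-zero digit of each purchase price
first_digits = [int(next(ch for ch in str(amount) if ch.isdigit() and ch != "0"))
                for amount in purchases]

# expected Benford proportions: P(d) = log10(1 + 1/d)
expected = {d: round(math.log10(1 + 1 / d), 3) for d in range(1, 10)}
print(expected)

# compare your histogram of first digits against the expected proportions
plt.hist(first_digits, bins=range(1, 11), align='left', rwidth=0.8)
plt.show()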

While not a perfect catch-all by any means, Benford's Law is used regularly by CPAs to alert them to possible fraud, bookkeeping errors, or other accounting irregularities. It is a simple tool which costs nothing to implement and should be part of the toolbox of any manager overseeing high-volume spending.

Python: An Interesting Problem with Pandas

I was writing a little tongue-in-cheek article for LinkedIn on fraud detection using frequency distributions (you can read the article here: LinkedIn). While this was a non-technical article, I wanted to use some histograms from a real data set, so I loaded a spreadsheet into Python and went to work.

While working with the data I ran into an interesting problem that had me chasing my tail for about 10 minutes before I figured it out. It is a fun little problem involving Series and Dataframes.

As always, you can download the data set here: FraudCheck1

Load the data.

import pandas as pd
df = pd.read_excel("C:\\Users\\Benjamin\\OneDrive\\Documents\\article\\python\\FraudCheck1.xlsx")
df.head()

The data is pretty simple here. We are concerned with the answer column and the CreatedBy column (which is the employee ID). What I am trying to do is see if the "answer" (a reading from an electric meter) is really random or if it has been contrived by someone trying to fake the data.

First, I want to get the readings for all the employees, so I used pop() to place the answer column into a separate list.

df1 = df.copy()   # use a copy so popping the column below does not remove it from df

y = df1.pop("answer")

Then, to make my histogram more pleasant looking, I decided to only use the last digit before the decimal. That way I will have 10 bars (0-9). (Remember, this is solely for making charts for an article, so I was not concerned with any more stringent methods of normalization.)

What I am doing below is int(y[i] % 10). Remember, % is the modulus operator – it leaves you with the remainder – and int() converts your float to an integer. So for 199.7, the remainder of 199.7/10 is 9.7, and int() cuts that down to 9.

a= []
i = 0 
while i < len(y):
     a.append(int(y[i]%10))
     i += 1
a[1:10]


Then I created my histogram.

%matplotlib inline
from matplotlib import pyplot as plt
plt.hist(a)


Now my problem

Now I want to graph only the answers from employee 619, so first I filter out all rows but the ones for employee 619.

df2 = df.query('CreatedBy == 619')
y1 =df2.pop("answer")

Then I ran my loop to turn my answers into a single digit.

And I get an error.  Why?


Well, the answer lies in the datatypes we are working with. Pandas' read_excel function creates a Dataframe.

When you pop a column from a dataframe, you end up with a Series. And remember, a Series is an indexed listing of values.

Let's look at our Series. Because the rows for employee 619 kept their original row numbers, the index jumps around – for example, from 31 straight to 62. My while loop counts by 1, so after 31 it went looking for y1[32], which doesn't exist.

Using .tolist() converts our Series to a list and now our while loop works.

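In code, the fix is just one extra line before the loop:

y2 = y1.tolist()   # convert the Series to a plain list – indexes now run 0, 1, 2, ...

a1 = []
i = 0
while i < len(y2):
    a1.append(int(y2[i]) % 10)
    i = i + 1
a1[1:10]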

And now we can build another histogram.


The Code

import pandas as pd
df = pd.read_excel("C:\\Users\\Benjamin\\OneDrive\\Documents\\article\\python\\FraudCheck1.xlsx")
df.head()

df1 = df.copy()
y = df1.pop("answer")

a= []
i = 0 
while i < len(y):
   a.append(int(y[i]%10))
   i += 1
a[1:10]

%matplotlib inline
from matplotlib import pyplot as plt

plt.hist(a)

df2 = df.query('CreatedBy == 619')
y1 =df2.pop("answer")

y2= y1.tolist()
type(y2)

a1 = []
i = 0
while i < len(y2):
    a1.append(int(y2[i])%10)
    i = i+1
a1[1:10]

plt.hist(a1)

 

SQL: Working with Date/Time Functions

SQL has a pretty extensive list of date/time functions that come in handy when writing queries. I am going to cover a handful of the most common ones today.

GETDATE()

Getdate() returns the current datetime

select getdate()


Date Parts

use AdventureWorks2012
go

SELECT *
 FROM [AdventureWorks2012].[Sales].[ShoppingCartItem]


First, look at the data returned by the query above. The date columns are in a format known as datetime in SQL (you may also hear it referred to as a time stamp). It contains both a date and a time. What if you only want the date?

Convert()

SELECT convert(date, ModifiedDate)
 FROM [AdventureWorks2012].[Sales].[ShoppingCartItem]


To select the time only

SELECT convert(time, ModifiedDate)
 FROM [AdventureWorks2012].[Sales].[ShoppingCartItem]


YEAR(), MONTH(), DAY()

SELECT year(ModifiedDate) as Year, month(modifieddate) as Month,
day(modifieddate) as day
from sales.SalesOrderDetail


You can filter by date elements too.

SELECT year(ModifiedDate) as Year, month(modifieddate) as Month,
day(modifieddate) as day
from sales.SalesOrderDetail
where day(modifieddate) like '28'


Datepart()

You can extract time elements using Datepart()

SELECT datepart(hour, modifieddate) as hour, datepart(Minute, ModifiedDate) as minute
 FROM [AdventureWorks2012].[Sales].[ShoppingCartItem]


DateAdd()

Dateadd() lets you add or subtract days, months, or years to a date. In the example below I add 10 days. The syntax is dateadd(interval, amount, date), where the interval is a datepart abbreviation such as d (day), m (month), or yy (year).

select dateadd(d,10,getdate())


DateDiff()

Datediff() gives you the difference between two dates. For example, say we want to know how many days' difference there is between the DueDate and ModifiedDate columns in the PurchaseOrderDetail table.

SELECT datediff(d,duedate, ModifiedDate)
 FROM [AdventureWorks2012].[Purchasing].[PurchaseOrderDetail]


 

 

 

R: Data Types Tutorial

In my introduction to R, I covered Lists and Vectors briefly, but I will offer a quick review of them.

Lists

Lists in R are collections of data. The data can be of any type – you can have numbers, letters, and booleans all in a single list if you so choose.

The most important thing to remember about lists is that they can store any data type, and you create them using the syntax: list(data, data, …, data)

Vectors

Vectors are a collection of like data types – for example, a numeric vector or a character vector.

syntax: c(data, data, …, data)

One thing to keep in mind with R is that, unlike most programming languages, the indexes in R start at 1.

vector[c(1,3)] — returns the 1st and 3rd elements of the vector

vector[1:3] — returns the 1st through 3rd elements

Matrices

A matrix is a two-dimensional vector. All columns must be the same length and all elements must be the same type.

You shape your matrix by providing row and column sizes (nrow= and ncol=). You can also name the columns and rows using dimnames=.

syntax: matrix(vector, nrow=, ncol=, dimnames=)

Arrays

Arrays are like matrices, but they can have more than two dimensions. In my example, I built a two-dimensional 3×4 array.

syntax: array(vector, dim=)

Factor

A factor takes a vector and aggregates like items.

When you print a factor out, it shows all the elements in the vector and provides a list of the unique elements (the levels).

nlevels() returns the number of unique elements.

summary() returns a list of the unique elements and a count of the items in each.

Dataframe

Dataframes allow you to build a table-like structure. Unlike matrices, the columns in a dataframe can contain different data types.

 

 

Python: Lists and Dictionaries

Lists and Dictionaries are kind of like super variables. They allow you to store multiple values under a single name. Say I first assign the variable 'a' a value of 6. Then, when I go to assign it a value of 7, the 7 replaces the 6.

What if I want to keep both 6 and 7?

Sure, I could use two variables (e.g., a and b), or I could use a list.

List

A list is a collection of data in Python. You create a list by assigning a variable a group of data enclosed in square brackets [].

Notice that lists are indexed, meaning you call them up using the following syntax: variable[index]

Note that indexes in Python start at 0. So in our set of 4 elements, the indexes would be 0,1,2,3. Notice when I try to use index 4, I get an out of range error.


Lists can also hold strings, or combinations of values in one list.

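Here is a small example showing lists in action (the values are just for illustration):

a = [6, 7, 3, 12]        # a list of 4 elements
print(a[0])              # 6  – indexes start at 0
print(a[3])              # 12 – the last valid index is 3
# print(a[4])            # would raise an IndexError: list index out of range

b = ["dog", "cat", 7, True]   # lists can mix strings, numbers, and booleans
print(b[1])              # cat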

Dictionaries

Think of dictionaries as lists with named indexes. This makes looking up values easier (especially when your code gets more complicated).

Each data element is paired with a named index. A dictionary is denoted through the use of curly braces {} and : separating the named index and the value.

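For example (the names and values here are made up):

ages = {"Bob": 42, "Sue": 29, "Tim": 35}   # named indexes (keys) paired with values
print(ages["Sue"])                         # 29 – look the value up by its key
ages["Ann"] = 51                           # add a new key/value pair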



How I Found Love Using Pivot Tables

Okay, a little background is in order here. I work for a Clinical Engineering Department in a large hospital system. Our main purpose is to inspect and repair medical equipment, from MRIs and CT Scanners down to IV pumps.

Now I know the title says love, and I promise there is a love connection, just be patient.

While doing some database work on the system we use to track repairs, I decided to do a little data exploration (I don't have a lot of hobbies). I asked myself, "What equipment is breaking down the most?" I figured this could be a valuable piece of information. So, I exported six months' worth of repair history into a CSV file.

Using Excel, I started playing with Pivot Tables. I started by checking to see what types of equipment seemed to break down the most. Turns out it was infusion pumps, not a real surprise to anyone who has ever worked in the field.

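For anyone who would rather script it than click through Excel, the same count can be sketched in Python with pandas. The column names and rows below are made up – they just mirror a typical repair-history export.

import pandas as pd

# a tiny stand-in for the six-month repair export
df = pd.DataFrame({
    "WorkOrderID": [1, 2, 3, 4, 5, 6],
    "DeviceType": ["Infusion Pump", "Infusion Pump", "Dental Chair",
                   "Infusion Pump", "CT Scanner", "Dental Chair"],
})

# count work orders per device type – the same thing the Excel pivot table does
counts = df.pivot_table(index="DeviceType", values="WorkOrderID", aggfunc="count")
print(counts.sort_values("WorkOrderID", ascending=False))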

But look a little more closely. One hospital in my system used Brand A pumps and wanted to get rid of them out of the belief that they were unreliable. However, a quick look at the data proved otherwise.

Okay, so after I cleared the reputation of the Brand A pump, I decided, "Why not look at the repair rates of individual pieces of equipment?" (I know, I live a WILD life.)

When I pulled up the list for one of my hospitals, one line jumped out at me: the number of repair work orders opened for dental chairs was way beyond anything my 20-plus years of experience in the field would have led me to expect.

So I decided to dig a little further. Well, it turns out all 78 work orders were opened by one of our younger (24-year-old) single technicians. A quick walk up to the dental department quickly explained why the dental chairs needed so much attention. The problem was about 5'2", long blond hair, a cute smile, and (even in scrubs) quite the little body.

So there you go. Young love revealed itself through the power of the pivot table.