Splunk: Introduction to Real Time Data Analysis – Setting Alerts

Splunk really shows its power in the realm of real-time analysis of unstructured data. A professional implementation of Splunk involves some sort of machine-produced data stream being fed into Splunk. This could be web clicks, social media feeds, sensor readings from mechanical devices, log files, etc.

In our example we are going to be working with computer log files I pulled from a server.

Download Data File:  event

Log into Splunk and select Add Data from the home screen


Select upload


Select File


Select the event.csv file I provided


Select Next


Select Next (since we are on a local machine, the default Input settings will do fine)


Finally, hit Submit


Now that we have the full log file loaded, let’s try filtering the data down a little.

Put error* in the search bar. (* is a wildcard character in Splunk)

Now only Errors are showing up in our data.


Now try the following in the search bar

error* terminal*

Now my log files are filtered down to TerminalServices Errors only.


Notice the top bar graph. It shows an abnormal increase in these particular errors in October.


Setting Alerts

There appears to have been an abnormal increase in these particular errors. Wouldn’t it be nice to know if these errors started firing off again?

Splunk lets us set Alerts to do just that. Above the bar graph, you will find a drop-down menu labeled Save As – click on it and then select Alert.


Give the Alert a name.

I don’t want to run this on a schedule. Instead I clicked Real-time

I set the Trigger Conditions to go off when more than 50 of these errors appear.


Under Add Trigger Actions, I select Add to Triggered Alerts


Select your Severity


Now the Alert is saved


If you select Alerts in the top menu, you can see the newly saved alert too.






R: Twitter Sentiment Analysis

Having a solid understanding of current public sentiment can be a great tool. When deciding if a new marketing campaign is being met warmly, or if a news release about the CEO is causing customers to get angry, people in charge of handling a company’s public image need these answers fast. And in the world of social media, we can get those answers fast. One simple, yet effective, tool for testing the public waters is to run a sentiment analysis.

A sentiment analysis works like this. We take a bunch of tweets about whatever we are looking for (in this example we will be looking at President Obama). We then parse those tweets out into individual words and we count the number of positive words and compare it to the number of negative words.

Now the simplicity of this model misses out on some things. Sarcasm can easily be missed. Ex. “Oh GREAT job Obama. Thanks for tanking the country once again.” Our model will count 2 positive words (great and thanks) and 1 negative word (tanking), giving us an overall score of positive 1.

There are more complex methods for dealing with the issue above, but you’ll be surprised at how well the simple system works all by itself. Yes, we are going to misread a few tweets, but we have the ability to read thousands of tweets, so the larger volume of data negates the overall effect of the sarcastic ones.
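The counting logic itself is simple enough to sketch in a few lines of Python (the word sets here are tiny stand-ins for the full lexicon we download below):

```python
# Toy positive/negative word lists standing in for the full opinion lexicon
pos_words = {"great", "thanks", "love"}
neg_words = {"tanking", "hate", "awful"}

def score_tweet(tweet):
    # Lowercase the tweet, split it into words, and strip basic punctuation
    words = [w.strip(".,!?") for w in tweet.lower().split()]
    positives = sum(1 for w in words if w in pos_words)
    negatives = sum(1 for w in words if w in neg_words)
    return positives - negatives

# The sarcastic example above really does score +1: great/thanks vs. tanking
print(score_tweet("Oh GREAT job Obama. Thanks for tanking the country once again"))
```

This is exactly the failure mode described above: two "positive" words outvote one negative word, sarcasm notwithstanding.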

The first thing we need to do is get a list of good and bad words. You could make your own, but there are plenty of pre-populated lists available on the Internet for free. The one I will be using is from the University of Illinois at Chicago. You can find the list here:


Once you go to the page, click on Opinion Lexicon and then download the rar file.

You can download from the link below, but I want you to know the source in case this link breaks.

Now open the rar file and move the two text files to a folder you can work from.

Next let’s make sure we have the right packages installed. For this we will need twitteR, plyr, stringr, and xlsx. If you do not have these packages installed, you can do so using the following code (just change out twitteR for whatever package you need to install):

install.packages("twitteR")

Now load the libraries:

library(twitteR)
library(plyr)
library(stringr)
library(xlsx)

and connect to the Twitter API. If you do not already have a connection set up, check out my lesson on connecting to Twitter: R: Connect to Twitter with R

api_key <- "insert consumer key here"
api_secret <- "insert consumer secret here"
access_token <- "insert access token here"
access_token_secret <- "insert access token secret here"
setup_twitter_oauth(api_key, api_secret, access_token, access_token_secret)

Okay, so now remember where you stored the text files we just downloaded and set that location as your working directory (wd). Note that we use forward slashes here, even if you are on a Windows box.

setwd("C:/your/folder/path")  # replace with the folder holding the text files

neg = scan("negative-words.txt", what="character", comment.char=";")
pos = scan("positive-words.txt", what="character", comment.char=";")

scan() reads through the text files, pulling in whitespace-separated words and ignoring comment lines that start with ;

You should now have 2 lists of positive and negative words.

You can add words to either list using a vector operation. Below I added wtf – a popular Internet abbreviation for What the F@#$@ – to the negative words.

neg = c(neg, 'wtf')

Okay, now here is the engine that runs our analysis. I have tried to comment on what certain commands you may not recognize do.  I have lessons on most features listed here, and will make more lessons on anything missing. If I were to try to explain this step by step, this page would be 10000 lines long and no one would read it.

score.sentiment = function(tweets, pos.words, neg.words)
{
scores = laply(tweets, function(tweet, pos.words, neg.words) {

tweet = gsub('https://','',tweet) # removes https://
tweet = gsub('http://','',tweet) # removes http://
tweet = gsub('[^[:graph:]]', ' ', tweet) # replaces non-printable characters (like emoticons) with spaces
tweet = gsub('[[:punct:]]', '', tweet) # removes punctuation
tweet = gsub('[[:cntrl:]]', '', tweet) # removes control characters
tweet = gsub('\\d+', '', tweet) # removes numbers
tweet = str_replace_all(tweet, "[^[:graph:]]", " ")

tweet = tolower(tweet) # makes all letters lowercase

word.list = str_split(tweet, '\\s+') # splits the tweet into a list of words
words = unlist(word.list) # turns the list into a vector
pos.matches = match(words, pos.words) # returns matching values for words from the positive list
neg.matches = match(words, neg.words)
pos.matches = !is.na(pos.matches) # converts matching values to TRUE or FALSE
neg.matches = !is.na(neg.matches)
score = sum(pos.matches) - sum(neg.matches) # TRUE and FALSE are treated as 1 and 0 so they can be added
return(score)
}, pos.words, neg.words)
scores.df = data.frame(score=scores, text=tweets)
return(scores.df)
}

Now let’s get some tweets and analyze them. Note, if your computer is slow or old, you can lower the number of tweets to process. Just change n= to a lower number like 100 or 50

tweets = searchTwitter('Obama',n=2500)
Tweets.text = laply(tweets,function(t)t$getText()) # gets text from Tweets

analysis = score.sentiment(Tweets.text, pos, neg) # calls sentiment function

Now let’s look at the results. The quickest method available to us is to simply run a histogram:

hist(analysis$score)

My results look like this


If 0 is completely neutral, most people are generally neutral about the president, and more people have positive tweets than negative ones. This is not uncommon for an outgoing president; they generally seem to get a popularity boost after the election is over.

Finally, if you want to save your results, you can export them to Excel.

write.xlsx(analysis, "myResults.xlsx")

And you will end up with a file like this


Benford’s Law: Fraud Detection by the Numbers

For most HTM professionals, accountancy is not your primary profession. Yet between contracts, capital equipment purchases, and parts spending, many HTM managers are, in effect, running multi-million dollar corporations. With high-dollar items like imaging glassware and ultrasound probes, the potential for fraud runs high. A less-than-honest employee looking to supplement his or her income can take advantage of the chaos often found in a busy shop to help themselves to some high-dollar items.

How can you guard yourself against wolves hidden amongst your flock?

The answer may lie in an interesting statistical phenomenon known as Benford’s Law, also called the First-Digit Law. Benford’s Law has been used by CPAs and fraud investigators to detect abnormalities since the 1930s.

How does it work?

Imagine I have a bag containing 9 ping pong balls labeled 1 – 9. I ask you to close your eyes, reach into the bag, and pull out a ball. What is the probability you will pull out a 5? Probability theory tells us the answer is 1/9 or approximately 11%. That is because 1 ball out of the 9 in the bag is labeled 5. The same thing applies to any number you choose.

So now we are going to replace the bag of balls with a spreadsheet containing parts purchases over the past year. Looking at the column with purchase prices, I am going to take the first non-zero digit from each row. So if the price is $130.00, I will take 1. If the price is $67.34, I will take 6. Now, I will throw all of those first digits into my imaginary bag, have you reach in blindfolded and pull out a 5.

What do you think the probability is that you would pull out a five? 11%? Actually, it is about 7.9%.


I know what you are thinking. I have a bag of random numbers 1-9. Shouldn’t the probability of pulling any digit be 1/9 (11%)? Well not according to Benford’s Law.

Benford’s Law states that first digits taken from an organic set of numbers (i.e. numbers without artificial constraints) will follow the unique distribution seen below.


So the probability of pulling a 1 is 30% while the probability of a 9 is under 5%.

The distribution is calculated using the following formula:

P(n) = log10(1 + 1/n)

where the probability P of the first digit in question, n, is calculated using a base-10 logarithm.
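If you want to check the curve yourself, the formula takes only a couple of lines to compute. Here is a quick illustration in Python (used purely for the arithmetic; any language would do):

```python
import math

# Benford's Law: the probability that a first digit equals n is log10(1 + 1/n)
benford = {n: math.log10(1 + 1 / n) for n in range(1, 10)}

for n, p in benford.items():
    print(n, f"{p:.1%}")   # 1 comes out near 30%, 9 under 5%
```

A nice sanity check: the nine probabilities sum to exactly 1, because the (n+1)/n terms inside the logarithms telescope to 10.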

While it is a mathematical law, Benford’s Law is a phenomenon, meaning it is not fully understood how or why it works. The simplest explanation I can offer you involves looking at the following table demonstrating 10% yearly compounded interest on a $1 investment.


Notice how the first digit stays 1 for 8 years, but only 4 years at 2, and 1 year by the time it reaches 9. If you were to continue through 10’s to 100’s you would notice the same pattern repeating itself.

How can I use this?

Simply enough, take your parts purchasing history, extract the first non-zero digit from each price, and create a histogram of your results. If it does not look like the distribution seen above, that is a good sign something is wrong. People trying to perpetrate fraud will often try to cover their tracks. Human intervention – altering order histories, removing purchases from the ledger – will alter the organic nature of your data and, in turn, alter the distribution of your first digits.
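That check is only a few lines of code. A sketch in Python, with made-up prices standing in for a real purchasing history:

```python
from collections import Counter

def first_nonzero_digit(price):
    # Scan the printed price and return the first digit that isn't zero
    for ch in str(price):
        if ch.isdigit() and ch != "0":
            return int(ch)
    return None

prices = [130.00, 67.34, 0.25, 1450.00, 92.10]   # hypothetical purchase prices
digits = [first_nonzero_digit(p) for p in prices]
print(Counter(digits))   # histogram these counts and compare against Benford's curve
```

On a real ledger you would feed in hundreds or thousands of prices; large counts on digits 7, 8, and 9 relative to the expected curve are what should raise an eyebrow.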

While not a perfect catch-all by any means, Benford’s Law is used regularly by CPAs to alert them to possible fraud, bookkeeping errors, or other accounting irregularities. It is a simple tool which costs nothing to implement and should be part of the toolbox of any manager overseeing high-volume spending.

Python: An Interesting Problem with Pandas

I was writing a little tongue-in-cheek article for LinkedIn on fraud detection using frequency distributions (you can read the article here: LinkedIn). While this was a non-technical article, I wanted to use some histograms from a real data set, so I loaded a spreadsheet into Python and went to work.

While working with the data I ran into an interesting problem that had me chasing my tail for about 10 minutes before I figured it out. It is a fun little problem involving Series and Dataframes.

As always, you can download the data set here: FraudCheck1

Upload the data.

import pandas as pd
df = pd.read_excel("FraudCheck1.xlsx")  # path to the downloaded FraudCheck1 file


The data is pretty simple here. We are concerned with the answer column and the CreatedBy column (which is the employee ID). What I am trying to do is see if the “answers” (readings from an electric meter) are really random or if they have been contrived by someone trying to fake the data.

First, I want to get the readings for all the employees, so I used pop() to place the answer column into a separate list.

df1 = df.copy()  # copy so pop() does not also remove the column from df

y = df1.pop("answer")


Then, to make my histogram more pleasant looking, I decided to only use the last digit before the decimal. That way I will have 10 bars (0-9). (Remember, this is solely for making charts for an article. So I was not concerned with any more stringent methods of normalization)

What I am doing below is int(199.7) % 10. Remember % is the modulus operator – it leaves you with the remainder – and int() converts your float to an integer. So 199.7 is cut to 199, and the remainder of 199 / 10 is 9.

a = []
i = 0
while i < len(y):
    a.append(int(y[i]) % 10)
    i += 1


Then I created my histogram.

%matplotlib inline
from matplotlib import pyplot as plt
plt.hist(a)


Now my problem

Now I want to graph only the answers from employee 619, so first I filter out all rows but the ones for employee 619.

df2 = df.query('CreatedBy == 619')
y1 =df2.pop("answer")

Then I ran my loop to turn my answers into a single digit.

And I get an error.  Why?


Well, the answer lies in the data types we are working with. Pandas’ read_excel function creates a DataFrame.

When you pop a column from a DataFrame, you end up with a Series. And remember, a Series is an indexed listing of values.

Let’s look at our Series. Check out the point my line is pointing to below. Notice how my index jumps from 31 to 62. My while loop counts by 1, so after 31, I went looking for y1[32] and it doesn’t exist.


Using .tolist() converts our Series to a list and now our while loop works.
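Here is a small, self-contained reproduction of the problem, with toy data standing in for the spreadsheet:

```python
import pandas as pd

# Toy data: two rows for employee 619, one for someone else
df = pd.DataFrame({"CreatedBy": [619, 500, 619],
                   "answer": [199.7, 50.1, 33.3]})

y1 = df.query('CreatedBy == 619')["answer"]
print(list(y1.index))   # [0, 2] -- the original row labels survive the filter,
                        # so y1[1] raises a KeyError

y2 = y1.tolist()        # a plain list is indexed by position again
print(y2[1])            # 33.3
```

The filtered Series keeps the labels 0 and 2 from the original DataFrame, which is exactly why a counter that walks 0, 1, 2, … blows up, and why converting to a list fixes it.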


And now we can build another histogram.


The Code

import pandas as pd
df = pd.read_excel("FraudCheck1.xlsx")

df1 = df.copy()
y = df1.pop("answer")

a = []
i = 0
while i < len(y):
    a.append(int(y[i]) % 10)
    i += 1

%matplotlib inline
from matplotlib import pyplot as plt
plt.hist(a)

df2 = df.query('CreatedBy == 619')
y1 = df2.pop("answer")

y2 = y1.tolist()

a1 = []
i = 0
while i < len(y2):
    a1.append(int(y2[i]) % 10)
    i += 1

plt.hist(a1)



SQL: Working with Date/Time Functions

SQL has a pretty extensive list of date/time functions that come in handy when writing queries. I am going to cover a handful of the most common ones today.


Getdate() returns the current datetime

select getdate()


Date Parts

use AdventureWorks2012

SELECT *
 FROM [AdventureWorks2012].[Sales].[ShoppingCartItem]


First, looking at the data above: the date columns are in a format known as datetime in SQL (you may also hear many refer to it as a timestamp). It contains a date and a time. What if you only want the date?


SELECT convert(date, ModifiedDate)
 FROM [AdventureWorks2012].[Sales].[ShoppingCartItem]


To select the time only

SELECT convert(time, ModifiedDate)
 FROM [AdventureWorks2012].[Sales].[ShoppingCartItem]



SELECT year(ModifiedDate) as Year, month(modifieddate) as Month,
day(modifieddate) as day
from sales.SalesOrderDetail


You can filter by date elements too.

SELECT year(ModifiedDate) as Year, month(modifieddate) as Month,
day(modifieddate) as day
from sales.SalesOrderDetail
where day(modifieddate) like '28'



You can extract time elements using Datepart()

SELECT datepart(hour, modifieddate) as hour, datepart(Minute, ModifiedDate) as minute
 FROM [AdventureWorks2012].[Sales].[ShoppingCartItem]



dateadd() lets you add or subtract days, months, or years to a date. In the example below I add 10 days. The syntax is dateadd(datepart, amount, date), where datepart is an interval such as d (day), m (month), or yy (year).

select dateadd(d,10,getdate())



Datediff() gives you the difference between two dates. Look at the dates in the query below:


We want to know how many days difference between DueDate and ModifiedDate

SELECT datediff(d,duedate, ModifiedDate)
 FROM [AdventureWorks2012].[Purchasing].[PurchaseOrderDetail]





R: Data Types Tutorial

In my introduction to R, I covered Lists and Vectors briefly, but I will offer a quick review of them.


Lists in R are collections of data. These data can be of any type. You can have numbers, letters, and booleans all in a single list if you so choose.

The most important things to remember about lists are that they can store any data type and that you create them using the syntax: list(data, data, …, data)



Vectors are a collection of elements of the same data type.

Here I created 2 vectors – one numeric and one character

syntax: c(data, data, …, data)


One thing to keep in mind with R is that unlike most programming languages – the indexes in R start at 1.

(1,3) — returns the 1st and 3rd elements in the vector

(1:3) — returns the 1st through 3rd elements



A matrix is a multi-dimensional vector. All columns must be the same length, and all elements must be the same type.

You shape your matrix by providing row and column sizes (nrow= and ncol=). You can also name the columns and rows using dimnames=.

syntax: matrix(vector, nrow=, ncol=, dimnames=)



Arrays allow you to build multi-dimensional structures. In the example below, I build a two-dimensional 3×4 array.



A factor takes a vector and aggregates like items.

As you can see when I print the factor out, it shows all elements in the vector and provides a list of unique elements

nlevels() returns the number of unique elements.


summary() returns a list of the unique elements and a count of items in those elements.



Data frames allow you to make a tabular data structure. Also, unlike matrices, the columns in a data frame can contain different data types.




Python: Lists and Dictionaries

Lists and dictionaries are kind of like super variables. They allow you to store multiple data under a single name. In the example below, I first assign the variable ‘a‘ a value of 6. Next, when I assign it a value of 7, the 7 replaces the 6.

What if I want to keep both 6 and 7?


Sure I could use 2 variables, (ex. a and b), or I can use a list.


A list is a collection of data in Python. You create a list by assigning a variable a group of data enclosed in square brackets [].

Notice that lists are indexed, meaning you call them up using the following syntax: variable[index]

Note that indexes in Python start at 0. So in our set of 4 elements, the indexes would be 0,1,2,3. Notice when I try to use index 4, I get an out of range error.
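For example, using a hypothetical four-element list:

```python
a = [6, 7, 8, 9]   # four elements, so the valid indexes are 0, 1, 2, and 3
print(a[0])        # 6
print(a[3])        # 9
# a[4] would raise: IndexError: list index out of range
```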


Lists can also hold strings, or combinations of values in one list.



Think of dictionaries as lists with named indexes. This makes looking up values easier (especially when your code gets more complicated).

Each data element is paired with a named index. A dictionary is denoted through the use of curly braces {}, with a : separating the named index from its value.
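A quick sketch (the keys and values here are made up):

```python
# Named indexes (keys) are separated from their values by a colon
person = {"name": "Jane", "age": 30, "is_admin": False}

print(person["name"])   # look a value up by its key instead of a position
person["age"] = 31      # assignment by key updates the value in place
print(person["age"])
```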


If you enjoyed this lesson, click LIKE below, or even better, leave me a COMMENT. 

Follow this link for more Python content: Python

How I Found Love Using Pivot Tables

Okay, a little background is in order here. I work for a Clinical Engineering Department in a large hospital system. Our main purpose is to inspect and repair medical equipment, from MRIs and CT Scanners down to IV pumps.

Now I know the title says love, and I promise there is a love connection, just be patient.

While doing some database work on the system we use to track repairs, I decided to do a little data exploration (I don’t have a lot of hobbies). I asked myself, “What equipment is breaking down the most?” I figured this could be a valuable piece of information, so I exported six months’ worth of repair history into a CSV file.

Using Excel, I started playing with Pivot Tables. I started by checking to see what types of equipment seemed to break down the most. Turns out it was infusion pumps, not a real surprise to anyone who has ever worked in the field.


But look a little more closely. One hospital in my system used Brand A and wanted to get rid of those pumps out of the belief that they were unreliable. However, a quick look at the data proved otherwise.


Okay, so after I unsullied the reputation of the Brand A pump, I decided “Why not look at the repair rates of individual pieces of equipment?” (I know, I live a WILD life)

Below is the list from one of my hospitals. Pay special attention to the area highlighted in red. The number of repair work orders opened for dental chairs was way off anything my 20-plus years of experience in the field would have led me to expect.


So I decided to dig a little further. Well, it turns out all 78 work orders were opened by one of our younger (24-year-old) single technicians. A quick walk up to the dental department quickly explained why the dental chairs needed so much attention. The problem was about 5’2″, long blond hair, a cute smile, and (even in scrubs) quite the little body.

So there you go. Young love revealed itself through the power of the pivot table.

Python: Fun with Central Tendency

Now numpy provides easy functions for finding central tendency – you can find them in my Numpy Part II lesson: Python: Numpy Part II.

But we have learned enough about Python so far that maybe it would be more fun to build our own functions. In this lesson we are going to build our own statistics library with mean, median, mode, and quantile functions.

Our Data




or even easier:
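That simpler version is presumably just sum() divided by len(). A sketch, with a stand-in data list:

```python
data = [1, 2, 3, 4, 5, 6, 7, 8, 9, 10]   # stand-in data set

def mean(x):
    # total of the elements divided by the count of elements
    return sum(x) / len(x)

print(mean(data))   # 5.5
```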



Remember with median – if your data contains an odd number of elements (n % 2 == 1), you just take the middle value. However, if your data contains an even number of elements (n % 2 == 0), then you add the two middle numbers and divide by 2.

We handle that through the use of an if statement in our function.
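A median function built around that if statement might look like this:

```python
def median(x):
    s = sorted(x)
    n = len(s)
    mid = n // 2
    if n % 2 == 1:                       # odd count: take the middle value
        return s[mid]
    return (s[mid - 1] + s[mid]) / 2     # even count: average the two middle values

print(median([5, 1, 3]))      # 3
print(median([5, 1, 3, 7]))   # (3 + 5) / 2 = 4.0
```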



For mode, we want to find the most common element in the list. For this task, I will import Counter from collections.

d.most_common() returns each unique element and the number of times it appears

d.most_common(1) returns just the most common element, as a one-item list containing an (element, count) pair.


or in function form:
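One way to write that function using Counter (a sketch, not necessarily identical to the lesson's version):

```python
from collections import Counter

def mode(x):
    d = Counter(x)
    # most_common(1) gives a one-item list: [(element, count)]
    return d.most_common(1)[0][0]

print(mode([1, 2, 2, 3, 2, 4]))   # 2
```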


In my numpy part II lesson I use a more elegant solution, but I want you to see that there is always more than one way to do something in Python.


Quantiles are cut points in a set of data. They can represent the bottom ten percent of the data, the bottom 75%, or any percentage from 0 to 100.

In my solution, I slice the output of the sorted() function. Below shows all elements up to (but not including) index 4.


quantile function:
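One possible implementation, using the sorted-and-slice idea described above (here p is a fraction between 0 and 1, and p = 1 would run past the end of the list):

```python
def quantile(x, p):
    # index of the cut point: p of the way through the sorted data
    idx = int(p * len(x))
    return sorted(x)[idx]

data = [9, 1, 8, 2, 7, 3, 6, 4, 5, 10]
print(quantile(data, 0.10))   # value below which ~10% of the data falls
print(quantile(data, 0.75))
```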

