Splunk: Install Splunk Light on Ubuntu Linux

In this lesson, I will go over how to install Splunk Light Desktop on Ubuntu Linux.

Open your browser and search for splunk light download

Click on the Splunk Light Software Free Download


Click on the Linux tab and select .deb download


If you go to the file drawer – downloads, you will see the Splunk install file.


Now search your system for Terminal


With terminal open type the following:

sudo dpkg -i (path and file name of your download file)


Once installed, go to the following folder: /opt/splunk/bin

cd /opt/splunk/bin

sudo ./splunk start --accept-license


Now you can open your browser and go to http://127.0.0.1:8000

Log in. The first time you log in, you will be asked to change your password


And now we are up and running


SAS: Bar Chart

Let’s create a bar chart in SAS. There are two ways you can go about it: Code or work with the menus provided. We will start with the menus.

Start by clicking on Tasks and Utilities


Select Graph > Bar Chart


In the new window pane that opens, we need to add Data. Select the browse button


Expand SASHELP (a collection of data sets that came with SAS)


We are going to pick SHOES


Now the red * indicates required fields. The only one bar charts require is Category variable. Hit the plus sign.


Select Product – notice how the code automatically fills in for you now.


Hit the little running man and your chart will appear in the Results


Let’s change the color. – go to Options>Bar Details>Apply bar Color and pick a new color


And now it is red (or whatever color you chose)


Go back to the Data tab and add Region to Group Variable


Notice we now have two dimensions: Product and Region. Also notice a legend is automatically generated.


Now let’s try doing this via code. Use the following code:

proc sgplot data=SASHELP.SHOES;
 vbar Product / group=Region groupdisplay=Stack name='Bar';
 yaxis grid;
run;

Note I changed groupdisplay to Stack. Here are the results



SAS: Importing .txt, .csv and .xlsx files

SAS is a data-based system. It is not much use if you can’t get data into it. In the following exercises, I will show you how to import common data files.

You can download some practice data files here: examplefiles

This is a zip file. Unzip the folder and copy the contents (three files) into a place of your choosing. Now open up SAS Studio:

Select My Folders on the left pane and select the upload files icon


Now select Choose Files


Your files should now be located under My Folders


Import Txt

If you double click on mileage.txt in the left panel, a preview of the data will pop up.


Now to import this information into SAS, we use the following code.

DATA mileage;
INFILE '/folders/myfolders/mileage.txt';
INPUT year miles;
RUN;

DATA – names your data set

INFILE – chooses your data source

INPUT – names your variables (columns)

Now click the Run icon


Now check your results. Your data is loaded.


A quick note: SAS uses a single space as the default delimiter. If your file uses a different delimiter, such as a comma, specify it with the DLM= option:

INFILE '/folders/myfolders/mileage.txt' DLM=',';

For a tab-delimited file, use the hexadecimal code for the tab character:

INFILE '/folders/myfolders/mileage.txt' DLM='09'x;

Import CSV

Now let’s import our CSV file

DATA salaryHosp;
INFILE "/folders/myfolders/salaryHosp.csv" DSD Firstobs=2;
INPUT job$ years salary;
RUN;

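DSD sets the default delimiter to a comma and treats two consecutive commas as a missing value. It also strips the quotation marks from quoted values.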

Firstobs = 2 lets SAS know to start collecting data on the second row – since the first row in our CSV file is headers.


Import Excel

Importing Excel is pretty straightforward. Use the following code and SAS will take headers from the Excel file to name your variables.

proc import out=sensors datafile="/folders/myfolders/sensors.xlsx" dbms=xlsx replace;
run;

Here are the results:




SAS: Introduction to SAS

SAS is a dominant force in the world of analytics. While some may argue it is on its way out, a quick query of current Data Scientist job listings in the NYC area shows many companies are still looking for SAS experience.

Luckily for us, SAS has a free University Edition that we can use to learn SAS.

You can Google SAS University Edition or follow the link below:


First you will need to download and install VirtualBox. You can download it from the SAS page, which has a link.


Next, download the SAS OVA package


Once downloaded, just double click on the OVA file and it should automatically install itself into VirtualBox. Note – you must have VirtualBox installed first.

Now, before we can start, you have to make a shared folder on your computer. I created the following on my PC: C:\SasUniversity\Myfolders

Now start VirtualBox: Go to SAS University Edition


Right click and click Settings


Now select Shared Folders – click the blue folder with the green plus sign on the right


Type in the path of your folder, the name of your folder and check Auto-mount and Make Permanent


Close the Settings window and double click on SAS University Edition to get it started.

Once it fully loads, you will get a screen like this


Type the path given into a browser on your machine.

In my case the path is: http://localhost:10080

Once it opens, click Start SAS Studio


Here is SAS Studio. The Left panel of the screen manages folders and other admin tasks. The right panel is where we type our code. We are going to start with a very simple task of creating a small data set of car names, # of cylinders, and # of doors.


Here is the code. I will explain it below the picture.


Now to break down the code (note the ; (semicolon) at the end of each statement)

DATA cars;  -- this names our data set
INPUT NAME$ CYLINDERS DOORS;  -- names our variables (or columns if you 
                                 will) note the $ after NAME - this 
                                 indicates NAME is
                                 a string
datalines;  -- the data to input into our set will follow below
               you may see some people use CARDS; in place of 
               datalines. The name dates back to the days of 
               punch-card input. The two statements are 
               interchangeable.
Miata 4 2
Sonata 4 4
Corvette 8 2
Mustang 8 2
Civic 4 2
Accord 4 4
Elantra 4 4
;                 -- end of datalines
RUN;              -- run command - like the go in SQL


Now click the Run icon above your code


And view the output – we have created a data set – congrats!!



R: Twitter Sentiment Analysis

Having a solid understanding of current public sentiment can be a great tool. When deciding if a new marketing campaign is being met warmly, or if a news release about the CEO is causing customers to get angry, people in charge of handling a company’s public image need these answers fast. And in the world of social media, we can get those answers fast. One simple, yet effective, tool for testing the public waters is to run a sentiment analysis.

A sentiment analysis works like this. We take a bunch of tweets about whatever we are looking for (in this example we will be looking at President Obama). We then parse those tweets out into individual words and we count the number of positive words and compare it to the number of negative words.

Now the simplicity of this model misses out on some things. Sarcasm can easily be missed. Ex. “Oh GREAT job Obama. Thanks for tanking the country once again”. Our model will count 2 positive words (Great and Thanks) and 1 negative word (tanking), giving us an overall score of positive 1.
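To make the counting concrete, here is a minimal base R sketch of the same idea applied to the sarcastic example. The tiny word lists and the score_text helper are made up for illustration; they are not the lexicon or the function used later in this lesson.

```r
# Tiny hypothetical word lists, for illustration only
pos_words <- c("great", "thanks")
neg_words <- c("tanking")

# Hypothetical helper: positive word count minus negative word count
score_text <- function(text, pos_words, neg_words) {
  clean <- tolower(gsub("[[:punct:]]", "", text))  # drop punctuation, lowercase
  words <- unlist(strsplit(clean, "\\s+"))         # split on whitespace
  sum(words %in% pos_words) - sum(words %in% neg_words)
}

tweet <- "Oh GREAT job Obama. Thanks for tanking the country once again"
score_text(tweet, pos_words, neg_words)  # 2 positive - 1 negative = 1
```

The score comes out positive even though a human reader would call the tweet negative, which is exactly the weakness described above.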

There are more complex methods for dealing with the issue above, but you’ll be surprised at how well the system works all by itself. While, yes, we are going to misread a few tweets, we have the ability to read thousands of tweets, so the larger volume of data dilutes the overall effect of the sarcastic ones.

First thing we need to do is go get a list of good and bad words. You could make your own up, but there are plenty of pre-populated lists on the Internet for free. The one I will be using is from the University of Illinois at Chicago. You can find the list here:


Once you go to the page, click on Opinion Lexicon and then download the rar file.

You can download from the link below, but I want you to know the source in case this link breaks.

Now open the rar file and move the two text files to a folder you can work from.

Next let’s make sure we have the right packages installed. For this we will need twitteR, plyr, stringr, and xlsx. If you do not have these packages installed, you can do so using the following code (just change out twitteR for whatever package you need to install).
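As a rough sketch, you could check which of these packages are missing before installing anything. The missing_packages helper below is a hypothetical name of my own; installed.packages() and install.packages() are standard R functions. The install line is left commented out because it needs an internet connection.

```r
# Packages this lesson relies on (names taken from the text above)
needed <- c("twitteR", "plyr", "stringr", "xlsx")

# Hypothetical helper: returns the packages that are not installed yet
missing_packages <- function(pkgs) {
  setdiff(pkgs, rownames(installed.packages()))
}

to_install <- missing_packages(needed)
to_install

# Uncomment to install whatever is missing (needs an internet connection)
# install.packages(to_install)
```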


Now load the libraries


and connect to the Twitter API. If you do not already have a connection set up, check out my lesson on connecting to Twitter: R: Connect to Twitter with R

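api_key <- "insert consumer key here"
api_secret <- "insert consumer secret here"
access_token <- "insert access token here"
access_token_secret <- "insert access token secret here"

# authorize the session using twitteR's OAuth helper
setup_twitter_oauth(api_key, api_secret, access_token, access_token_secret)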

Okay, so now remember where you stored the text files we just downloaded and set that location as your working directory (wd). Note that we use forward slashes here, even if you are on a Windows box.

neg = scan("negative-words.txt", what="character", comment.char=";")
pos = scan("positive-words.txt", what="character", comment.char=";")

scan looks through the text files and pulls words that start with characters and ignores comment lines that start with ;
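If you want to see what scan does with comment lines before pointing it at the real lexicon, here is a small sketch that writes a throwaway file and reads it back. The file contents are made up for illustration; the real lists are much longer.

```r
# Write a small stand-in lexicon file; lines starting with ; are comments
lexicon_file <- tempfile(fileext = ".txt")
writeLines(c("; this header line is a comment",
             "; so is this one",
             "",
             "good",
             "great",
             "happy"),
           lexicon_file)

# Same call as in the lesson, pointed at the temporary file
words <- scan(lexicon_file, what = "character", comment.char = ";")
words
```

Only the three words come back. The comment lines and the blank line are skipped.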

You should now have 2 lists of positive and negative words.

You can add words to either list using a vector operation. Below I added wtf, a popular Internet abbreviation for What the F@#$@, to the negative words

neg = c(neg, 'wtf')

Okay, now here is the engine that runs our analysis. I have commented the commands you may not recognize. I have lessons on most features listed here, and will make more lessons on anything missing. If I were to try to explain this step by step, this page would be 10000 lines long and no one would read it.

score.sentiment = function(tweets, pos.words, neg.words) {

  scores = laply(tweets, function(tweet, pos.words, neg.words) {

    tweet = gsub('https://', '', tweet) # removes https://
    tweet = gsub('http://', '', tweet) # removes http://
    tweet = gsub('[^[:graph:]]', ' ', tweet) # replaces non-printable characters (like emoticons) with a space
    tweet = gsub('[[:punct:]]', '', tweet) # removes punctuation
    tweet = gsub('[[:cntrl:]]', '', tweet) # removes control characters
    tweet = gsub('\\d+', '', tweet) # removes numbers
    tweet = str_replace_all(tweet, "[^[:graph:]]", " ")

    tweet = tolower(tweet) # makes all letters lowercase

    word.list = str_split(tweet, '\\s+') # splits the tweets by word in a list
    words = unlist(word.list) # turns the list into vector
    pos.matches = match(words, pos.words) # returns matching values for words from list
    neg.matches = match(words, neg.words)
    pos.matches = !is.na(pos.matches) # converts matching values to true or false
    neg.matches = !is.na(neg.matches)
    # true and false are treated as 1 and 0 so they can be added
    score = sum(pos.matches) - sum(neg.matches)
    return(score)
  }, pos.words, neg.words)

  scores.df = data.frame(score = scores, text = tweets)
  return(scores.df)
}

Now let’s get some tweets and analyze them. Note, if your computer is slow or old, you can lower the number of tweets to process. Just change n= to a lower number like 100 or 50

tweets = searchTwitter('Obama',n=2500)
Tweets.text = laply(tweets,function(t)t$getText()) # gets text from Tweets

analysis = score.sentiment(Tweets.text, pos, neg) # calls sentiment function

Now let’s look at the results. The quickest method available to us is to simply run a histogram
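A minimal sketch of that histogram call, using base R's hist. The scores below are placeholders made up so the snippet runs on its own; in the lesson you would pass the score column of the data frame returned by score.sentiment.

```r
# Placeholder scores, made up for illustration; in the lesson this
# would be the data frame returned by score.sentiment()
analysis <- data.frame(score = c(-2, -1, 0, 0, 0, 0, 1, 1, 2, 3))

h <- hist(analysis$score,
          main = "Sentiment scores",
          xlab = "Score (positive words minus negative words)")
```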


My results look like this


If 0 is completely neutral, most people are generally neutral about the president, and more people have positive tweets than negative ones. This is not uncommon for an outgoing president. They generally seem to get a popularity boost after the election is over.

Finally, if you want to save your results, you can export them to Excel.

write.xlsx(analysis, "myResults.xlsx")

And you will end up with a file like this


R: Decision Trees (Regression)

Decision Trees are popular supervised machine learning algorithms. You will often find the abbreviation CART when reading up on decision trees. CART stands for Classification and Regression Trees.

In this example we are going to create a Regression Tree. Meaning we are going to attempt to build a model that can predict a numeric value.

We are going to start by taking a look at the data. In this example we are going to be using the iris data set native to R.
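One quick way to look at the data, sketched here with base R functions only (iris ships with R, so nothing needs to be installed):

```r
# First few rows of the built-in iris data set
head(iris)

# Column names, types, and a preview of the values
str(iris)
```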



As you can see, our data has 5 variables – Sepal.Length, Sepal.Width, Petal.Length, Petal.Width, and Species. The first 4 variables refer to measurements of flower parts and the species identifies which species of iris this flower represents.

In the Classification example, we tried to predict the Species of flower. In this example we are going to try to predict the Sepal.Length

In order to build our decision tree, first we need to install the correct package.
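The rpart function used below comes from the rpart package. As a sketch, the following installs it only if it is missing and then loads it. rpart normally ships with R as a recommended package, so the install step is usually skipped.

```r
# Install rpart only if it is not already available, then load it
if (!requireNamespace("rpart", quietly = TRUE)) {
  install.packages("rpart")
}
library(rpart)
```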



Next we are going to create our tree. Since we want to predict Sepal.Length – that will be the first element in our fit equation.

fit <- rpart(Sepal.Length ~ Sepal.Width + Petal.Length + Petal.Width+ Species, 
 method="anova", data=iris )

Note the method in this model is anova. This means we are going to try to predict a number value. If we were doing a classifier model, the method would be class.

Now let’s plot out our model

plot(fit, uniform=TRUE, 
 main="Regression Tree for Sepal Length")
 text(fit, use.n=TRUE, cex = .6)

Note the splits are marked – like the top split is Petal.Length < 4.25

Also, at the terminating point of each branch, you see an n= value. The number following this is the number of elements from the data file that fit at the end of that branch.


While this model actually works out pretty well, one thing to look for is overfitting. A good sign of that would be having a bunch of branches terminating with n values of 1 or 2. This means the model is tuned too closely to the training data, and when run against a new set of data it will most likely produce poor predictions.

Of course we can look at some of the numbers if you are so inclined.
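One way to get at those numbers is rpart's printcp function, which prints the complexity parameter table for a fitted tree. This sketch refits the same model so it runs on its own. The xerror column comes from cross-validation, so your values may differ slightly from run to run.

```r
library(rpart)

# Same regression tree as above
fit <- rpart(Sepal.Length ~ Sepal.Width + Petal.Length + Petal.Width + Species,
             method = "anova", data = iris)

# Prints CP, nsplit, rel error, xerror and xstd for each split
printcp(fit)
```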


Notice the xerror (cross validation error) gets better with each split. That is something you want to look out for. If that number starts to creep up as the splits increase, that is a sign you may want to prune some of the branches. I will show how to do that in another lesson.

To get a better picture of the change in xerror as the splits increase, let’s look at a new visualization
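A sketch of one way to draw these charts, assuming rpart's rsq.rpart function, which plots R-squared and relative error against the number of splits for an anova tree. The par call is my own addition so the two charts sit side by side.

```r
library(rpart)

# Same regression tree as above
fit <- rpart(Sepal.Length ~ Sepal.Width + Petal.Length + Petal.Width + Species,
             method = "anova", data = iris)

par(mfrow = c(1, 2))  # two plots side by side
rsq.rpart(fit)        # R-squared and relative error vs. number of splits
```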


This produces 2 charts. The first one shows how R-squared improves as splits increase (remember R-squared gets better as it approaches 1, so this model is improving with each split).

The second chart shows how xerror decreases with each split. For models that need pruning, you would see the curve starting to go back up as the splits increase. Imagine if split 6 was higher than split 5.


Okay, so finally now that we know the model is good, let’s make a prediction.

testData <- data.frame(Species = 'setosa', Sepal.Width = 4, Petal.Length = 1.2,
                       Petal.Width = 0.3) # Petal.Width was cut off in the original line; 0.3 is an assumed value
predict(fit, testData, method = "anova")


So as you can see, based on our test data, the model predicts our Sepal.Length will be approx 5.17.


R: Decision Trees (Classification)

Decision Trees are popular supervised machine learning algorithms. You will often find the abbreviation CART when reading up on decision trees. CART stands for Classification and Regression Trees.

In this example we are going to create a Classification Tree. Meaning we are going to attempt to classify our data into one of the (three in this case) classes.

We are going to start by taking a look at the data. In this example we are going to be using the iris data set native to R.



As you can see, our data has 5 variables – Sepal.Length, Sepal.Width, Petal.Length, Petal.Width, and Species. The first 4 variables refer to measurements of flower parts and the species identifies which species of iris this flower represents. What we are going to attempt to do here is develop a predictive model that will allow us to identify the species of iris based on measurements.

The species we are trying to predict are setosa, virginica, and versicolor. These are our three classes we are trying to classify our data as.
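A quick base R check of how the three classes are distributed in the data:

```r
# Count the rows in each class
species_counts <- table(iris$Species)
species_counts
```

Each species has 50 rows, so the classes are balanced.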

In order to build our decision tree, first we need to install the correct package.



Next we are going to create our tree. Since we want to predict Species – that will be the first element in our fit equation.

fit <- rpart(Species ~ Sepal.Length + Sepal.Width + Petal.Length + Petal.Width,
 method="class", iris)

Now, let’s take a look at the tree.
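A minimal sketch for drawing the tree with base plot and text, the same approach used for rpart objects elsewhere in these lessons. The title and the cex value are my own choices.

```r
library(rpart)

# Same classification tree as above
fit <- rpart(Species ~ Sepal.Length + Sepal.Width + Petal.Length + Petal.Width,
             method = "class", data = iris)

plot(fit, uniform = TRUE, main = "Classification Tree for Iris Species")
text(fit, use.n = TRUE, cex = 0.8)  # label splits and show counts per leaf
```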


To understand what the output says: according to our model, if Petal.Length is < 2.45 then the flower is classified as setosa. If not, it goes to the next split, Petal.Width. If < 1.75 then versicolor, else virginica.


Now, we want to take a look at how good the model is.


I am not going to harp too much on the stats here, but let’s look down at the table on the bottom. The first row has a CP = 0.50. This means (approx) that the first split reduced the relative error by 0.5. You can see this in the rel error in the second row.

Now the 2nd row CP = 0.44, so the second split improved the rel error in the third row to 0.06.

Now personally, when just trying to get a quick overview of the goodness of the model, I look at the xerror (cross validation error) of the final row. 0.10 is a nice low number.


Okay, now let’s make a prediction. Start by creating some test data

testData <- data.frame(Sepal.Length = 1, Sepal.Width = 4, Petal.Length = 1.2,
                       Petal.Width = 0.3)

Now let’s predict

predict(fit, testData, type="class")

Here is the output:


As you can see, the model predicted setosa. If you look back at the tree, you will see why.
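If you would like to see how confident the tree is rather than just the winning class, predict also accepts type="prob" for classification trees. This sketch refits the model and reuses the test row from above so it runs on its own.

```r
library(rpart)

fit <- rpart(Species ~ Sepal.Length + Sepal.Width + Petal.Length + Petal.Width,
             method = "class", data = iris)

testData <- data.frame(Sepal.Length = 1, Sepal.Width = 4,
                       Petal.Length = 1.2, Petal.Width = 0.3)

pred_class <- predict(fit, testData, type = "class")  # predicted label
pred_prob  <- predict(fit, testData, type = "prob")   # class probabilities

pred_class
pred_prob
```

Because Petal.Length is below the first split, the row lands in the leaf that contains only setosa flowers.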

Let’s do one more prediction

predict (fit, newdata, type="class")

Here is the output


The model predicts 1,2,3 are virginica and 4 is versicolor.

Now go find some more data and try this out.

R: laply and ldply from plyr package

R is a great programming language when it comes to manipulating data. That is one of the reasons it is so loved by data scientists and statisticians. Being an open source project, R also has the advantage of lots of additional packages that add even more functionality to the language.

The package I am focusing on today is the plyr package. I am just going to barely dip into this package, as I am only going to cover two functions from the package (laply and ldply). I am covering these as I will be using them in a later lesson on how to perform sentiment analysis on Twitter data.

First things first though, you need to download the package.


The functions

laply takes in a list, applies a function, and exports the results into an array.

ldply takes in a list, applies a function, and exports the results into a dataframe.

The syntax for both is simple enough:

laply(list, function(x){ func.. })

We are going to do a simple example, creating a list of words and passing this list to a function that will count the characters in each string and then we will multiply the result times 2.

#create data
l = list('dog','cat','horse','donkey')

l1 = laply(l, function(x){x1=nchar(x) 
                          x1 = x1*2})

If you now check l1, it will return an array of 6,6,10,12 (disclaimer – since this is a single column array  R actually places it into a simple vector)

Now let’s try ldply, which returns a dataframe (more useful in my opinion)

l2 = ldply(l, function(x) {x1 = nchar(x)*2})
#add words to l2 dataframe
l2$word <- unlist(l)

Checking the output of l2 now returns a two column dataframe.
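For comparison, here is a rough base R equivalent that does not need plyr. Note this swaps in sapply and data.frame in place of laply and ldply; it is not the plyr code itself, just a sketch of the same result. The V1 column name mimics what ldply produces by default.

```r
# Same input list as above
l <- list('dog', 'cat', 'horse', 'donkey')

# Base R stand-in for laply: returns a plain numeric vector
doubled <- sapply(l, function(x) nchar(x) * 2)
doubled

# Base R stand-in for ldply plus the added word column
df <- data.frame(V1 = doubled, word = unlist(l))
df
```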


R: Connect to Twitter with R

You can do a lot in the way of text analytics with Twitter. But in order to do so, first you need to connect with Twitter.

In order to do so, first you need to set up an account on Twitter and get an API key. Once you have created your account, go to the following website: https://dev.twitter.com/

Once there, click My apps


Click Create New App


Give it a name and a description (it doesn’t matter, make anything up). For the website I just used http://test.de


Go to Keys and Access Tokens


Get your Consumer Key (API Key), Consumer Secret (API Secret), Access Token, and Access Token Secret

Now open R and install the twitteR package.


Now load the library


Next we are going to use the API keys you collected to authorize our connection to the Twitter API (Twitter uses OAuth by the way)

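api_key <- "insert consumer key here"
api_secret <- "insert consumer secret here"
access_token <- "insert access token here"
access_token_secret <- "insert access token secret here"

# authorize the session using twitteR's OAuth helper
setup_twitter_oauth(api_key, api_secret, access_token, access_token_secret)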

Now to test to see if your connection is good


Benford’s Law: Fraud Detection by the Numbers

For most HTM professionals, accountancy is not your primary profession. Yet between contracts, capital equipment purchases, and parts spending, many HTM managers are, in effect, running multi-million dollar corporations. With such high dollar items like imaging glassware and ultrasound probes, the potential for fraud runs high. A less than honest employee looking to supplement his or her income can take advantage of chaos often found in a busy shop to help themselves to some high dollar items.

How can you guard yourself against wolves hidden amongst your flock?

The answer may lie in an interesting statistical phenomenon known as Benford’s Law, also called the First-Digit Law. Benford’s Law has been used by CPAs and fraud investigators to detect abnormalities since the 1930s.

How does it work?

Imagine I have a bag containing 9 ping pong balls labeled 1 – 9. I ask you to close your eyes, reach into the bag, and pull out a ball. What is the probability you will pull out a 5? Probability theory tells us the answer is 1/9 or approximately 11%. That is because 1 ball out of the 9 in the bag is labeled 5. The same thing applies to any number you choose.

So now we are going to replace the bag of balls with a spreadsheet containing parts purchases over the past year. Looking at the column with purchase prices, I am going to take the first non-zero digit from each row. So if the price is $130.00, I will take 1. If the price is $67.34, I will take 6. Now, I will throw all of those first digits into my imaginary bag, have you reach in blindfolded and pull out a 5.

What do you think the probability was you would pull out a five? 11%? Actually it is 7.9%.


I know what you are thinking. I have a bag of random numbers 1-9. Shouldn’t the probability of pulling any digit be 1/9 (11%)? Well not according to Benford’s Law.

Benford’s Law states that first digits taken from an organic set of numbers (i.e. numbers without artificial constraints) will follow the unique distribution seen below.


So the probability of pulling a 1 is 30% while the probability of a 9 is under 5%.

The distribution is calculated using the following formula:


Where the probability (p) of the number in question (n) can be calculated using a logarithm.
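The standard statement of Benford’s Law is p(n) = log10(1 + 1/n) for n = 1 to 9. As a short sketch, here it is in R, with a helper name of my own choosing:

```r
# Benford's Law: probability that the first digit is n
benford <- function(n) log10(1 + 1 / n)

# Expected percentage for each first digit, 1 through 9
benford_pct <- round(100 * benford(1:9), 1)
names(benford_pct) <- 1:9
benford_pct
```

The nine probabilities add up to 1, with 1 coming in at about 30.1% and 9 at about 4.6%.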

While it is a mathematical law, Benford’s Law is a phenomenon, meaning it is not fully understood how or why it works. The simplest explanation I can offer you involves looking at the following table demonstrating 10% interest compounded yearly on a $1 investment.


Notice how the first digit stays 1 for 8 years, but only 4 years at 2, and 1 year by the time it reaches 9. If you were to continue through 10’s to 100’s you would notice the same pattern repeating itself.
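You can rebuild that table yourself with a few lines of R. This is a sketch that assumes the table runs from year 0 (the original $1) through year 24, the last year before the balance passes $10.

```r
# $1 growing at 10% per year, compounded yearly
years <- 0:24
balance <- 1.10 ^ years

# Every balance here is below 10, so the first digit is the whole-dollar part
first_digit <- floor(balance)

# How many years the balance spends on each first digit
counts <- table(first_digit)
counts
```

The balance spends 8 years starting with a 1, 4 years starting with a 2, and a single year starting with a 9.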

How can I use this?

Simply enough, take your parts purchasing history, extract the first non-zero digit of each price, and create a histogram of your results. If it does not look like the distribution seen above, that is a good sign something is wrong. People trying to perpetrate fraud will often try to cover their tracks. Human intervention, such as altering order histories or removing purchases from the ledger, will alter the organic nature of your data, and in turn, alter the distribution of your first digits.

While not the perfect catch-all by any means, Benford’s Law is used regularly by CPAs to alert them to possible fraud, bookkeeping errors, or other accounting irregularities. It is a simple tool which costs nothing to implement and should be part of the toolbox of any manager overseeing high volume spending.
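Here is one way that check could be sketched in R. The first_digit helper is hypothetical, and the prices are made up for illustration (the first two come from the example earlier in this article). A real purchase history would need hundreds of rows before the comparison means much.

```r
# Hypothetical helper: first non-zero digit of each number
first_digit <- function(x) {
  x <- abs(x[!is.na(x) & x != 0])
  # scientific notation always starts with the leading non-zero digit
  as.integer(substr(formatC(x, format = "e", digits = 6), 1, 1))
}

# Made-up prices for illustration only
prices <- c(130.00, 67.34, 1250.00, 18.99, 2.49, 940.10, 15.00, 310.75)

# Observed share of each first digit vs. the share Benford's Law expects
digits <- first_digit(prices)
observed <- as.numeric(table(factor(digits, levels = 1:9))) / length(digits)
expected <- log10(1 + 1 / (1:9))

barplot(rbind(observed, expected), beside = TRUE,
        names.arg = 1:9, legend.text = c("Observed", "Benford"),
        xlab = "First digit", ylab = "Proportion")
```

If the observed bars sit far from the Benford bars on a large data set, that is your cue to take a closer look.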