BBC News: Indonesia traffic jam: 12 die in Java gridlock during Ramadan

Indonesia traffic jam: 12 die in Java gridlock during Ramadan –

SSRS: Drill down

Let’s make the report at little more readable and interactive.

We are going to do a drill down on the names in our report. This mean when the report first opens, you will only see one row for each name. If you want to see the details of that person, you can click on a + by their name and the details for that person will pop down like an accordion.

Below your report design window, you should see another window called Row Groups

Go to the area marked Details (green arrow) and right click – Group Properties…


Click Visibility > Hide > Display can be toggled.

(Note I am selecting Name1 not Name. If you remember Name was the column we hid. Name1 is the column created when we created the Parent Group)


Now when you open your report, you will see it is in drill down format.


Blockchain: Immutable Ledger

The next concept we need to cover to properly understand blockchain is the concept of an Immutable Ledger.  To translate this term into something that looks more like English, an Immutable Ledger simply means a record that cannot be changed.

The idea behind all of this is data security and proof that the data has not been altered. Why are we so concerned here? In a blockchain application like BitCoin, we are tracking transactions of money. Imagine if you sent me an electronic funds transfer for $100. How would you feel if I hacked into your bank and changed that $100, to $100,000? (For the record, if you tried that with my account, the computer would just laugh at you. The only time you’ll see a number that big associated with me is when you are looking at my student loans LOL).

Anyway, back to the point, you want to make sure that when you set up a transfer for $100, no one can alter it. With blockchain, that is true even if you want it to be altered (you made a mistake). If you want to fix an error, you will have to add another transaction to the blockchain to correct the issue. This is actually good accounting practice, as once an entry is made into a ledger, it should never be removed or altered.

Think of this like purchasing a car. If you go to your neighbor and buy his used car for $2000. You give him the money, and he signs over a title to you. The title is proof of ownership. To ensure that your neighbor cannot just take the car back, you take the title down to the Department of Motor Vehicles and have the title registered in your name. Now you have a record of your transaction should the ownership of the vehicle ever come into question.

So how does blockchain ensure immutability of the ledger? It all resides in the concept of the hash. If hacker tries to alter anything in the block below, its hash will change. Now the hash will no long match the previous hash in the second block. So, the hacker would have to change the next block, and the block after that, etc.


And even if they were able to pull that off, remember that the blockchain resides on multiple computers. The hacker would need to make all of these changes simultaneously. When you consider the millions of nodes that make up a blockchain environment like BitCoin, you will see that would be impossible.


In the next lesson, we will be looking at peer to peer distributed networks

ETL : Extract, Transform, Load

It you ever actually intend to work with data in the real world, you will quickly come face to face with two truths:

  1. Data is messy
  2. The data you need is never in one place

While creating predictive models and showing off your brilliant math skills is infinitely more sexy, before you even begin to build a simple linear regression until you have all your data together and in the correct format.

If you haven’t heard it before, it is a pretty common estimate that a data scientist spends 60 to 70 percent of their time working as a data “janitor”. While on a smaller scale, most of this data manipulation and merging (often referred to as munging) can be done manually, in a professional environment, you are going to want to find a way to try to automate as much of the process as possible.

ETL Tools

This is where ETL Tools come into play. Some of the more popular ones on the market are:

  • Oracle Warehouse Builder (OWB)
  • SAS Data Management
  • PowerCenter Informatica
  • SQL Server Integration Services (SSIS)

SSIS is the one I am most familiar with (also the one I use professionally) so it will be the one I will be focusing on in future lessons.

What these ETL Tools do is help you automate the data munging process.


When discussing ETL, extract means pulling data from a source (or many sources). Consider this, you have Customer table residing on a SQL Server Database, a Sales Person table sitting on a Teradata Database, and a list of purchases kept on an Excel spreadsheet.

Using an SSIS tool, you can connect to these three different data sources and query the data (Extract) into memory for you to work with.


Transform in ETL refers to manipulating the data.

Pulling from the example above, in the SQL Server table you have

Cust_ID         Cust_NM
9081            Bob Smith

The Excel sheet says

Cust_ID         Cust_NM
CN9081          Bob Smith

In the Excel spreadsheet, someone has put a CN in front of the Cust_ID. This is a common problem you will run across when working with real world data.  Using an ETL tool, you can set up steps that will either remove the CN (or add it- your prerogative).


Finally, once you have extracted the data from the different sources, transformed it – so differences in the data are correct, you come to the final step: loading it into a location of your choice – usually a data warehouse table.

Splunk: Introduction to Real Time Data Analysis – Setting Alerts

Splunk really shows its power in the realm of real time analysis of unstructed data. A professional implementation of Splunk involves some sort of machine produced data stream being fed into Splunk. This could be web clicks, social media feeds, sensor readings from mechanical devices, log files, etc.

In our example we are going to be working with computer log files I pulled from a server.

Download Data File:  event

Log into Splunk and select Add Data from the home screen


Select upload


Select File


Select the event.csv file I provided


Select Next


Select Next (since we are on a local machine, the default Input settings will do fine)


Finally, hit Submit


Now we have the full log file loaded, let’s try filtering the data load down a little.

Put error* in the search bar. (* is a wild card in Splunk)

Now only Errors are showing up in our data.


Now try the following in the search bar

error* terminal*

Now my log files are filtered down to TerminalServices Errors only.


Notice the top bar graph. It shows an abnormal increase in these particular errors in October.


Setting Alerts

This seems like there has been an abnormal increase in these particular errors. Wouldn’t it be nice to know if these errors were starting to fire off again.

Splunk lets us set Alerts to do just that. Above the bar graph, you will find a drop down menu Save As – click on it and then select Alert


Give the Alert a name.

I don’t want to run this on a schedule. Instead I clicked Real-time

I set the Trigger Conditions to go off when more than 50 of these errors appear.


Under Add Trigger Actions, I select Add to Triggered Alerts


Select your Severity


Now the Alert is saved


If you select Alerts in the top menu, you can see the newly saved alert too.






R: Twitter Sentiment Analysis

Having a solid understanding of current public sentiment can be a great tool. When deciding if a new marketing campaign is being met warmly, or if a news release about the CEO is causing customers get angry, people in charge of handling a company’s public image need these answers fast. And in the world of social media, we can get those answers fast. One simple, yet effective, tool for testing the public waters is to run a sentiment analysis.

A sentiment analysis works like this. We take a bunch of tweets about whatever we are looking for (in this example we will be looking at President Obama). We then parse those tweets out into individual words and we count the number of positive words and compare it to the number of negative words.

Now the simplicity of this model misses out on some things. Sarcasm can easily missed. Ex. “Oh GREAT job Obama. Thanks for tanking the country once again”. Our model will count 2 positive words (Great and Thanks) and 1 negative word (tanking) giving us an overall score of positive 1.

There are more complex methods for dealing with the issue above, but you’ll be surprised at how good the system works all by itself. While, yes we are going to misread a few tweets, we have the ability to read thousands of tweets, so the larger volume of data negates the overall effect of the sarcastic ones.

First thing we need to do is go get a list of good and bad words. You could make your own up, but there are plenty of pre-populated lists on the Internet for free. The one I will be using is from the University of Illinois at Chicago. You can find the list here:

Once you go to the page, click on Opinion Lexicon and then download the rar file.

You can dowload from the link below, but I want you to know the source in case this link breaks.

Now open the rar file and move the two text files to a folder you can work from.

Next let’s make sure we have the right packages installed. For this we will need, TwitteR, plyr, stringr, and xlsx. If you do not  have these packages installed, you can do so using the following code. (just change out TwitteR for whatever package you need to install)


Now load the libraries


and connect to the Twitter API. If you do not already have a connection set up, check out my lesson on connecting to Twitter: R: Connect to Twitter with R

api_key<- "insert consumer key here"
api_secret <- "insert consumer secret here"
access_token <- "insert access token here"
access_token_secret <- "insert access token secret here

Okay, so now remember where you stored the text files we just downloaded and set that location as your working directory (wd). Note that we use forward slashes here, even if you are on a Windows box.

neg = scan("negative-words.txt", what="character", comment.char=";")
pos = scan("positive-words.txt", what="character", comment.char=";")

scan looks through the text files and pulls words that start with characters and ignores comment lines that start with ;

You should now have 2 lists of positive and negative words.

You can add words to either list using a  vector operation. Below I added wtf – a popular Internet abbreviation for What the F@#$@ to the negative words

neg = c(neg, 'wtf')

Okay, now here is the engine that runs our analysis. I have tried to comment on what certain commands you may not recognize do.  I have lessons on most features listed here, and will make more lessons on anything missing. If I were to try to explain this step by step, this page would be 10000 lines long and no one would read it.

score.sentiment = function(tweets, pos.words, neg.words)

scores = laply(tweets, function(tweet, pos.words, neg.words) {

tweet = gsub('https://','',tweet) # removes https://
tweet = gsub('http://','',tweet) # removes http://
tweet=gsub('[^[:graph:]]', ' ',tweet) ## removes graphic characters 
       #like emoticons 
tweet = gsub('[[:punct:]]', '', tweet) # removes punctuation 
tweet = gsub('[[:cntrl:]]', '', tweet) # removes control characters
tweet = gsub('\\d+', '', tweet) # removes numbers
tweet=str_replace_all(tweet,"[^[:graph:]]", " ") 

tweet = tolower(tweet) # makes all letters lowercase

word.list = str_split(tweet, '\\s+') # splits the tweets by word in a list
words = unlist(word.list) # turns the list into vector
pos.matches = match(words, pos.words) ## returns matching 
          #values for words from list 
neg.matches = match(words, neg.words)
pos.matches = ! ## converts matching values to true of false
neg.matches = !
score = sum(pos.matches) - sum(neg.matches) # true and false are 
                #treated as 1 and 0 so they can be added
}, pos.words, neg.words )
scores.df = data.frame(score=scores, text=tweets)

Now let’s get some tweets and analyze them. Note, if your computer is slow or old, you can lower the number of tweets to process. Just change n= to a lower number like 100 or 50

tweets = searchTwitter('Obama',n=2500)
Tweets.text = laply(tweets,function(t)t$getText()) # gets text from Tweets

analysis = score.sentiment(Tweets.text, pos, neg) # calls sentiment function

Now lets look at the results. The quickest method available to us is to simply run a histogram


My results looks like this


If 0 is completely neutral most people are generally neutral about the president and more people have positives tweets then negatives ones. This is not uncommon for an outgoing president. They generally seem to get a popularity boost after the election is over.

Finally, if you want to save your results, you can export them to excel.

write.xlsx(analysis, "myResults.xlsx")

And you will end up with a file like this


Benford’s Law: Fraud Detection by the Numbers

For most HTM professionals, accountancy is not your primary profession. Yet between contracts, capital equipment purchases, and parts spending, many HTM managers are, in effect, running multi-million dollar corporations. With such high dollar items like imaging glassware and ultrasound probes, the potential for fraud runs high. A less than honest employee looking to supplement his or her income can take advantage of chaos often found in a busy shop to help themselves to some high dollar items.

How can you guard yourself against wolves hidden amongst your flock?

The answer may lie in an interesting statistical phenomena known as Benford’s Law, also called the First-Digit Law. Benford’s law has been used by CPAs and fraud investigators to detect abnormalities since the 1930’s.

How does it work?

Imagine I have a bag containing 9 ping pong balls labeled 1 – 9. I ask you to close your eyes, reach into the bag, and pull out a ball. What is the probability you will pull out a 5? Probability theory tells us the answer is 1/9 or approximately 11%. That is because 1 ball out of the 9 in the bag is labeled 5. The same thing applies to any number you choose.

So now we are going to replace the bag of balls with a spreadsheet containing parts purchases over the past year. Looking at the column with purchase prices, I am going to take the first non-zero digit from each row. So if the price is $130.00, I will take 1. If the price is $67.34, I will take 6. Now, I will throw all of those first digits into my imaginary bag, have you reach in blindfolded and pull out a 5.

What do you think the probability was you would pull out a five? 11%? Actually it is 7.8%.


I know what you are thinking. I have a bag of random numbers 1-9. Shouldn’t the probability of pulling any digit be 1/9 (11%)? Well not according to Benford’s Law.

Benford’s Law states that first digits taken from an organic set of numbers (i.e. numbers without artificial constraints) will follow the unique distribution seen below.


So the probability of pulling a 1 is 30% while the probability of a 9 is under 5%.

The distribution is calculated using the following formula:


Where the probability (p) of the number in question (n) can be calculated using a logarithm.

While it is a mathematical law, Benford’s Law is a phenomena meaning it is not fully understood how or why it works. The simplest explanation I can offer you involves looking at the following table demonstrating a 10% yearly compounded interest on a $1 investment.


Notice how the first digit stays 1 for 8 years, but only 4 years at 2, and 1 year by the time it reaches 9. If you were to continue through 10’s to 100’s you would notice the same pattern repeating itself.

How can I use this?

Simply enough, take your parts purchasing history, truncate the first non-zero digit and create a histogram of your results. If it does not look like the distribution seen above, that is a good sign something is wrong. People trying to perpetrate fraud will often try to cover their tracks. Human intervention – altering order histories, removing purchases from ledger, will alter the organic nature of your data, and in turn, alter the distribution of your first digits.

While not the perfect catch-all by any means, Benford’s Law is used regularly by CPAs to alert them to possible fraud, book keeping errors, or other accounting irregularities. It is a simple tool which costs nothing to implement and should be part of the toolbox of any manager overseeing high volume spending.

Python: An Interesting Problem with Pandas

I was writing a little tongue and cheek article for LinkedIn on fraud detection using frequency distributions (you can read the article here: LinkedIn). While this was a non-technical article, I wanted to use some histograms from a real data set, so I uploaded a spread sheet into Python and went to work.

While working with the data I ran into an interesting problem that had me chasing my tail for about 10 minutes before I figured it out. It is a fun little problem involving Series and Dataframes.

As always, you can upload the data set here: FraudCheck1

Upload the data.

import pandas as pd
df = pd.read_excel


The data is pretty simple here. We are concerned with our answer column and the CreatedBy (which is the employee ID).  What I am trying to do is see if the “answer”  (a reading from an electric meter) are really random or if they have been contrived by someone trying to fake the data.

First, I want to get the readings for all the employees, so I used pop() to place the answer column into a separate list.

df1 = df

y = df1.pop("answer")


Then, to make my histogram more pleasant looking, I decided to only use the last digit before the decimal. That way I will have 10 bars (0-9). (Remember, this is solely for making charts for an article. So I was not concerned with any more stringent methods of normalization)

What I am doing below is int(199.7%10). Remember % is the modulus – leaves you with the remainder and int converts your float to an integers. So 199.7 is cut to 199. The 199/10 remainder = 9.

a= []
i = 0 
while i < len(y):
     i += 1


Then I created my histogram.

%matplotlib inline
from matplotlib import pyplot as plt


Now my problem

Now I want graph only the answers from employee 619, so first I filter out all rows but the ones for employee 619.

df2 = df.query('CreatedBy == 619')
y1 =df2.pop("answer")

Then I ran my loop to turn my answers into a single digit.

And I get an error.  Why?


Well the answer lies in the datatypes we are working with. Pandas read_excel function creates a Dataframe.

When you pop a column from a dataframe, you end up with a Series. And remember a series is an indexed listing of values.

Let’s look at our Series. Check out the point my line is pointing to below. Notice how my index jumps from 31 to 62. My while loop counts by 1, so after 31, I went looking for y1[32] and it doesn’t exist.


Using .tolist() converts our Series to a list and now our while loop works.


And now we can build another histogram.


The Code

import pandas as pd
df = pd.read_excel

df1 = df
y =df1.pop("answer")

a= []
i = 0 
while i < len(y):
   i += 1

%matplotlib inline
from matplotlib import pyplot as plta1 = []
i = 0
while i < len(y2):
 i = i+1


df2 = df.query('CreatedBy == 619')
y1 =df2.pop("answer")

y2= y1.tolist()

a1 = []
i = 0
while i < len(y2):
    i = i+1



SQL: Working with Date/Time Functions

SQL has a pretty extensive list of date time functions that come in handy when writing queries. I am going to cover a handful of the most common ones today.


Getdate() returns the current datetime

select getdate()


Date Parts

use AdventureWorks2012

 FROM [AdventureWorks2012].[Sales].[ShoppingCartItem]


First, looking at the data above. The data columns are in a format known as datetime in SQL (you may also hear many refer to it as a time stamp). It contains a date and time. What if you only want the date?


SELECT convert(date, ModifiedDate)
 FROM [AdventureWorks2012].[Sales].[ShoppingCartItem]


To select the time only

SELECT convert(time, ModifiedDate)
 FROM [AdventureWorks2012].[Sales].[ShoppingCartItem]



SELECT year(ModifiedDate) as Year, month(modifieddate) as Month,
day(modifieddate) as day
from sales.SalesOrderDetail


You can filter by date elements too.

SELECT year(ModifiedDate) as Year, month(modifieddate) as Month,
day(modifieddate) as day
from sales.SalesOrderDetail
where day(modifieddate) like '28'



You can extract time elements using Datepart()

SELECT datepart(hour, modifieddate) as hour, datepart(Minute, ModifiedDate) as minute
 FROM [AdventureWorks2012].[Sales].[ShoppingCartItem]



dateadd() lets you add or subtract days or months or years to a date. In the example below I add 10 days. The syntax is dateadd(interval(d, m, y), amount, date)

select dateadd(d,10,getdate())



Datediff() gives you the difference between two dates. Look at the dates in the query below:


We want to know how many days difference between DueDate and ModifiedDate

SELECT datediff(d,duedate, ModifiedDate)
 FROM [AdventureWorks2012].[Purchasing].[PurchaseOrderDetail]