Python Testing Shell

You can use the shell below to try Python code you learn in the lessons

To run Python scripts, type them out in the code editor and press the run button to execute:

To run console commands (run each line from a command prompt), click the down arrow and choose the down arrow and then >_ Console

Return to main python page: Python

Data Jobs: What does a Data Architect do?

If experience has taught me anything, it is that while companies and organizations have gotten much better at collecting data, most of this data is still just stored away in unstructured data stores. So while the data is “technically” there, it is not a whole lot of use to anyone trying to build a report or create a machine learning model. In order to get actually use out of all the data being stored, it needs to organized into some sort usable structure: Enter the Data Architect.

Data Architect is a job title I have held on multiple occasions throughout my career. My best description when explaining the job to other people is that I was kind of like a data janitor. You see, everyone takes their data and dumps it into the storage closet. So you find some billing data on the second shelf, tucked away behind employee HR records. The floor is cluttered with server logs and six years worth of expense reports stored as PDFs with completely useless names like er123.pdf.

As a data architect, it was my job to try to organize this mess and put the data into some sort of structure that lends itself to reporting or modeling. So data architects have to be well versed in data modeling, data storage, and data governance.

Data Modeling

ERD diagram showing an HR database design

Data modeling is basically taking raw data dumps and organizing them into structure that fit the needs of company. It could involve creating an HR database like above or creating a series of aggregated tables designed for reporting or dashboarding. It is the job of the data architect to best fit business needs to a data platform: be it a transactional database, a data warehouse, or perhaps a data lake.

Data Storage

Data architects also need to address data storage. While I often defer to the server crew in the IT department, as a data architect I do advise on to how and where to store data. Cloud infrastructure and cheaper faster disk storage has made a lot of data storage decisions easier, but it is good to have a working understanding of storage platforms.

Data Governance

Data governance is all about how the data is managed from a regulatory and security standpoint. It is the practice of deciding who can have access to what data. Can some data be “outside facing” versus should some data sit behind multiple firewalls in a DMZ zone.

You will often work hand in hand with Legal and Security departments when figuring out how data governance practices will be implemented.

SSRS: Drill down

Let’s make the report at little more readable and interactive.

We are going to do a drill down on the names in our report. This mean when the report first opens, you will only see one row for each name. If you want to see the details of that person, you can click on a + by their name and the details for that person will pop down like an accordion.

Below your report design window, you should see another window called Row Groups

Go to the area marked Details (green arrow) and right click – Group Properties…


Click Visibility > Hide > Display can be toggled.

(Note I am selecting Name1 not Name. If you remember Name was the column we hid. Name1 is the column created when we created the Parent Group)


Now when you open your report, you will see it is in drill down format.


Blockchain: Immutable Ledger

The next concept we need to cover to properly understand blockchain is the concept of an Immutable Ledger.  To translate this term into something that looks more like English, an Immutable Ledger simply means a record that cannot be changed.

The idea behind all of this is data security and proof that the data has not been altered. Why are we so concerned here? In a blockchain application like BitCoin, we are tracking transactions of money. Imagine if you sent me an electronic funds transfer for $100. How would you feel if I hacked into your bank and changed that $100, to $100,000? (For the record, if you tried that with my account, the computer would just laugh at you. The only time you’ll see a number that big associated with me is when you are looking at my student loans LOL).

Anyway, back to the point, you want to make sure that when you set up a transfer for $100, no one can alter it. With blockchain, that is true even if you want it to be altered (you made a mistake). If you want to fix an error, you will have to add another transaction to the blockchain to correct the issue. This is actually good accounting practice, as once an entry is made into a ledger, it should never be removed or altered.

Think of this like purchasing a car. If you go to your neighbor and buy his used car for $2000. You give him the money, and he signs over a title to you. The title is proof of ownership. To ensure that your neighbor cannot just take the car back, you take the title down to the Department of Motor Vehicles and have the title registered in your name. Now you have a record of your transaction should the ownership of the vehicle ever come into question.

So how does blockchain ensure immutability of the ledger? It all resides in the concept of the hash. If hacker tries to alter anything in the block below, its hash will change. Now the hash will no long match the previous hash in the second block. So, the hacker would have to change the next block, and the block after that, etc.


And even if they were able to pull that off, remember that the blockchain resides on multiple computers. The hacker would need to make all of these changes simultaneously. When you consider the millions of nodes that make up a blockchain environment like BitCoin, you will see that would be impossible.


In the next lesson, we will be looking at peer to peer distributed networks

ETL : Extract, Transform, Load

It you ever actually intend to work with data in the real world, you will quickly come face to face with two truths:

  1. Data is messy
  2. The data you need is never in one place

While creating predictive models and showing off your brilliant math skills is infinitely more sexy, before you even begin to build a simple linear regression until you have all your data together and in the correct format.

If you haven’t heard it before, it is a pretty common estimate that a data scientist spends 60 to 70 percent of their time working as a data “janitor”. While on a smaller scale, most of this data manipulation and merging (often referred to as munging) can be done manually, in a professional environment, you are going to want to find a way to try to automate as much of the process as possible.

ETL Tools

This is where ETL Tools come into play. Some of the more popular ones on the market are:

  • Oracle Warehouse Builder (OWB)
  • SAS Data Management
  • PowerCenter Informatica
  • SQL Server Integration Services (SSIS)

SSIS is the one I am most familiar with (also the one I use professionally) so it will be the one I will be focusing on in future lessons.

What these ETL Tools do is help you automate the data munging process.


When discussing ETL, extract means pulling data from a source (or many sources). Consider this, you have Customer table residing on a SQL Server Database, a Sales Person table sitting on a Teradata Database, and a list of purchases kept on an Excel spreadsheet.

Using an SSIS tool, you can connect to these three different data sources and query the data (Extract) into memory for you to work with.


Transform in ETL refers to manipulating the data.

Pulling from the example above, in the SQL Server table you have

Cust_ID         Cust_NM
9081            Bob Smith

The Excel sheet says

Cust_ID         Cust_NM
CN9081          Bob Smith

In the Excel spreadsheet, someone has put a CN in front of the Cust_ID. This is a common problem you will run across when working with real world data.  Using an ETL tool, you can set up steps that will either remove the CN (or add it- your prerogative).


Finally, once you have extracted the data from the different sources, transformed it – so differences in the data are correct, you come to the final step: loading it into a location of your choice – usually a data warehouse table.

Splunk: Introduction to Real Time Data Analysis – Setting Alerts

Splunk really shows its power in the realm of real time analysis of unstructed data. A professional implementation of Splunk involves some sort of machine produced data stream being fed into Splunk. This could be web clicks, social media feeds, sensor readings from mechanical devices, log files, etc.

In our example we are going to be working with computer log files I pulled from a server.

Download Data File:  event

Log into Splunk and select Add Data from the home screen


Select upload


Select File


Select the event.csv file I provided


Select Next


Select Next (since we are on a local machine, the default Input settings will do fine)


Finally, hit Submit


Now we have the full log file loaded, let’s try filtering the data load down a little.

Put error* in the search bar. (* is a wild card in Splunk)

Now only Errors are showing up in our data.


Now try the following in the search bar

error* terminal*

Now my log files are filtered down to TerminalServices Errors only.


Notice the top bar graph. It shows an abnormal increase in these particular errors in October.


Setting Alerts

This seems like there has been an abnormal increase in these particular errors. Wouldn’t it be nice to know if these errors were starting to fire off again.

Splunk lets us set Alerts to do just that. Above the bar graph, you will find a drop down menu Save As – click on it and then select Alert


Give the Alert a name.

I don’t want to run this on a schedule. Instead I clicked Real-time

I set the Trigger Conditions to go off when more than 50 of these errors appear.


Under Add Trigger Actions, I select Add to Triggered Alerts


Select your Severity


Now the Alert is saved


If you select Alerts in the top menu, you can see the newly saved alert too.






R: Twitter Sentiment Analysis

Having a solid understanding of current public sentiment can be a great tool. When deciding if a new marketing campaign is being met warmly, or if a news release about the CEO is causing customers get angry, people in charge of handling a company’s public image need these answers fast. And in the world of social media, we can get those answers fast. One simple, yet effective, tool for testing the public waters is to run a sentiment analysis.

A sentiment analysis works like this. We take a bunch of tweets about whatever we are looking for (in this example we will be looking at President Obama). We then parse those tweets out into individual words and we count the number of positive words and compare it to the number of negative words.

Now the simplicity of this model misses out on some things. Sarcasm can easily missed. Ex. “Oh GREAT job Obama. Thanks for tanking the country once again”. Our model will count 2 positive words (Great and Thanks) and 1 negative word (tanking) giving us an overall score of positive 1.

There are more complex methods for dealing with the issue above, but you’ll be surprised at how good the system works all by itself. While, yes we are going to misread a few tweets, we have the ability to read thousands of tweets, so the larger volume of data negates the overall effect of the sarcastic ones.

First thing we need to do is go get a list of good and bad words. You could make your own up, but there are plenty of pre-populated lists on the Internet for free. The one I will be using is from the University of Illinois at Chicago. You can find the list here:

Once you go to the page, click on Opinion Lexicon and then download the rar file.

You can dowload from the link below, but I want you to know the source in case this link breaks.

Now open the rar file and move the two text files to a folder you can work from.

Next let’s make sure we have the right packages installed. For this we will need, TwitteR, plyr, stringr, and xlsx. If you do not  have these packages installed, you can do so using the following code. (just change out TwitteR for whatever package you need to install)


Now load the libraries


and connect to the Twitter API. If you do not already have a connection set up, check out my lesson on connecting to Twitter: R: Connect to Twitter with R

api_key<- "insert consumer key here"
api_secret <- "insert consumer secret here"
access_token <- "insert access token here"
access_token_secret <- "insert access token secret here

Okay, so now remember where you stored the text files we just downloaded and set that location as your working directory (wd). Note that we use forward slashes here, even if you are on a Windows box.

neg = scan("negative-words.txt", what="character", comment.char=";")
pos = scan("positive-words.txt", what="character", comment.char=";")

scan looks through the text files and pulls words that start with characters and ignores comment lines that start with ;

You should now have 2 lists of positive and negative words.

You can add words to either list using a  vector operation. Below I added wtf – a popular Internet abbreviation for What the F@#$@ to the negative words

neg = c(neg, 'wtf')

Okay, now here is the engine that runs our analysis. I have tried to comment on what certain commands you may not recognize do.  I have lessons on most features listed here, and will make more lessons on anything missing. If I were to try to explain this step by step, this page would be 10000 lines long and no one would read it.

score.sentiment = function(tweets, pos.words, neg.words)

scores = laply(tweets, function(tweet, pos.words, neg.words) {

tweet = gsub('https://','',tweet) # removes https://
tweet = gsub('http://','',tweet) # removes http://
tweet=gsub('[^[:graph:]]', ' ',tweet) ## removes graphic characters 
       #like emoticons 
tweet = gsub('[[:punct:]]', '', tweet) # removes punctuation 
tweet = gsub('[[:cntrl:]]', '', tweet) # removes control characters
tweet = gsub('\\d+', '', tweet) # removes numbers
tweet=str_replace_all(tweet,"[^[:graph:]]", " ") 

tweet = tolower(tweet) # makes all letters lowercase

word.list = str_split(tweet, '\\s+') # splits the tweets by word in a list
words = unlist(word.list) # turns the list into vector
pos.matches = match(words, pos.words) ## returns matching 
          #values for words from list 
neg.matches = match(words, neg.words)
pos.matches = ! ## converts matching values to true of false
neg.matches = !
score = sum(pos.matches) - sum(neg.matches) # true and false are 
                #treated as 1 and 0 so they can be added
}, pos.words, neg.words )
scores.df = data.frame(score=scores, text=tweets)

Now let’s get some tweets and analyze them. Note, if your computer is slow or old, you can lower the number of tweets to process. Just change n= to a lower number like 100 or 50

tweets = searchTwitter('Obama',n=2500)
Tweets.text = laply(tweets,function(t)t$getText()) # gets text from Tweets

analysis = score.sentiment(Tweets.text, pos, neg) # calls sentiment function

Now lets look at the results. The quickest method available to us is to simply run a histogram


My results looks like this


If 0 is completely neutral most people are generally neutral about the president and more people have positives tweets then negatives ones. This is not uncommon for an outgoing president. They generally seem to get a popularity boost after the election is over.

Finally, if you want to save your results, you can export them to excel.

write.xlsx(analysis, "myResults.xlsx")

And you will end up with a file like this


Benford’s Law: Fraud Detection by the Numbers

For most HTM professionals, accountancy is not your primary profession. Yet between contracts, capital equipment purchases, and parts spending, many HTM managers are, in effect, running multi-million dollar corporations. With such high dollar items like imaging glassware and ultrasound probes, the potential for fraud runs high. A less than honest employee looking to supplement his or her income can take advantage of chaos often found in a busy shop to help themselves to some high dollar items.

How can you guard yourself against wolves hidden amongst your flock?

The answer may lie in an interesting statistical phenomena known as Benford’s Law, also called the First-Digit Law. Benford’s law has been used by CPAs and fraud investigators to detect abnormalities since the 1930’s.

How does it work?

Imagine I have a bag containing 9 ping pong balls labeled 1 – 9. I ask you to close your eyes, reach into the bag, and pull out a ball. What is the probability you will pull out a 5? Probability theory tells us the answer is 1/9 or approximately 11%. That is because 1 ball out of the 9 in the bag is labeled 5. The same thing applies to any number you choose.

So now we are going to replace the bag of balls with a spreadsheet containing parts purchases over the past year. Looking at the column with purchase prices, I am going to take the first non-zero digit from each row. So if the price is $130.00, I will take 1. If the price is $67.34, I will take 6. Now, I will throw all of those first digits into my imaginary bag, have you reach in blindfolded and pull out a 5.

What do you think the probability was you would pull out a five? 11%? Actually it is 7.8%.


I know what you are thinking. I have a bag of random numbers 1-9. Shouldn’t the probability of pulling any digit be 1/9 (11%)? Well not according to Benford’s Law.

Benford’s Law states that first digits taken from an organic set of numbers (i.e. numbers without artificial constraints) will follow the unique distribution seen below.


So the probability of pulling a 1 is 30% while the probability of a 9 is under 5%.

The distribution is calculated using the following formula:


Where the probability (p) of the number in question (n) can be calculated using a logarithm.

While it is a mathematical law, Benford’s Law is a phenomena meaning it is not fully understood how or why it works. The simplest explanation I can offer you involves looking at the following table demonstrating a 10% yearly compounded interest on a $1 investment.


Notice how the first digit stays 1 for 8 years, but only 4 years at 2, and 1 year by the time it reaches 9. If you were to continue through 10’s to 100’s you would notice the same pattern repeating itself.

How can I use this?

Simply enough, take your parts purchasing history, truncate the first non-zero digit and create a histogram of your results. If it does not look like the distribution seen above, that is a good sign something is wrong. People trying to perpetrate fraud will often try to cover their tracks. Human intervention – altering order histories, removing purchases from ledger, will alter the organic nature of your data, and in turn, alter the distribution of your first digits.

While not the perfect catch-all by any means, Benford’s Law is used regularly by CPAs to alert them to possible fraud, book keeping errors, or other accounting irregularities. It is a simple tool which costs nothing to implement and should be part of the toolbox of any manager overseeing high volume spending.