QLIK: Build a Dashboard Part 2 – Formulas and Set Analysis

This is part two of my Qlik Sense tutorial on how to build a dashboard. In this part I will be covering some more advanced topics, namely formulas and set analysis.

If you want to play along, but have not taken part 1 yet, here is a link to part 1: QLIK: Build a Dashboard Part 1

Formulas

Let’s start with formulas. Open up your Dashboard and hit the edit button to get to the editor

qlikd2p1.jpg

Now click on the Sum(Scheduled) KPI we built. Next, on the right side bar, click Data > the arrow by Sum(Scheduled)

qlikd2p2.jpg

Now click on the orange fx – this will launch the expression editor

qlikd2p3.jpg

This is the expression editor. It is currently summing up the values of the Scheduled column. Scheduled is a binary with 1 meaning scheduled inspection and 0 not scheduled. So summing up this column will tell me how many scheduled inspections I have in my data set.

qlikd2p4.jpg

Let’s alter the expression a little. Add ‘/100‘ to the end. This will divide our expression by 100

qlikd2p5.jpg

Here is the result

qlikd2p6.jpg

Now, let go back into our expression editor. Notice the menus on the right. The aggregation menu tells us the options available for aggregation. Let’s use Count

qlikd2p7.jpg

Delete the current expression and start typing Count(WorkOrderKey), you should notice a helper box popping up asking if this is the field you are looking for.

qlikd2p8.jpg

Notice the colors in your expression. Blue is a key word while gold is a field name from your data

qlikd2p9.jpg

Hit apply, now you have a count of all work orders in your data

qlikd2p10

Got back to the expression editor one more time, and type the following:

Count(WorkOrderKey) – Sum(Scheduled)

This will give us the total work orders – number of scheduled work orders. The result being total unscheduled work orders.

qlikd2p11.jpg

Set Analysis

Set analysis is Qlik’s way of localizing filters. If I were to apply a filter of Scheduled = True to my dashboard, all charts and graphs in my dashboard would be filtered to only show Scheduled work orders. But what if I didn’t want that. What if I only want to filter my bar chart by Scheduled work orders? That is where set analysis comes in.

The syntax for set analysis is {<filtering expression>}

Bar Chart Part 1

Lets do one, and you will better understand. Click on the bar chart > Measures>fx to open expression editor

qlikd2p12

Now inside the brackets of the expression add the following: {$<Scheduled={1}>}

Your final expression should look like: Count({$<Scheduled={1}>}WorkOrderKey)

  • Count( – count number of elements inside brackets
  • {  – start of set analysis
  • $ – this means I  am going to filter on the data currently on the dashboard. So if I have Hospital A checked on my filter list, I am only going to look at data from Hospital A. If I wanted to use all data, regardless of active filters on the page, I would replace my $ with a 1
  • <> – our set analysis lives inside the carrots
  • Scheduled = {1} – filter my data to only show rows with Scheduled = 1 ** note that I have placed 1 inside {}. This is a syntax requirement. The argument must be in {}
  • } – close out set analysis
  • WorkOrderKey – the element I am going to Count
  • ) – close Count

End result. Only counting Scheduled Work Orders

qlikd2p13.jpg

Bar Chart Part 2 – date filtering

Now we are going to add one more element, and I promise, this will be the most difficult thing we tackle here. And it actually looks a lot worse than it actually is.

Go back into your expression editor and type this

Count({$<Scheduled={1},RequestedOn.autoCalendar.Month={“>=$(=date(max(RequestedOn.autoCalendar.Month)-2))”}>}WorkOrderKey)

Looks horrible right? Well, let’s break it down.

Count({$<Scheduled={1},RequestedOn.autoCalendar.Month={“>=$(=date(max(RequestedOn.autoCalendar.Month)-2))”}>}WorkOrderKey)

Everything in blue was covered in the last example, so we can get rid of that.

Now we just have this.

RequestedOn.autoCalendar.Month={“>=$(=date(max(RequestedOn.autoCalendar.Month)-2))”}

autoCalendar is a feature Qlik adds to your data so you can easily work with parts of a date field. RequestedOn.autoCalendar.Month is simply the month from the field RequestedOn. So 9/28/15 = September.  To simplify the explanation, I will be replacing RequestedOn.autoCalendar.Month with Month

Month = {“>=$(=date(max(Month)-2))”}

  • Month =  — filter by Month
  • { — remember 1 had to be in {} for schedule. The argument must be in {}
  • “>=  — >= means greater than or equal to. We need the ” because otherwise Qlik would think > meant end of set analysis
  • $ – again this means only filter active data currently on the dashboard
  • (=date(max(Month) – (=date(Max()) – these are built in functions that find the max value in the data set from field given – Month and return the value as a date value not a string
  • -2 — go back 2 month from max date

So, all the big ridiculously looking expression says is – Count all scheduled work orders in the data set that have a Requested On date >= to October    ( December – 2 Months)

Still on the bar chart, now click Add under Measures

qlikd2p15.jpg

Now click fx to enter expression editor

qlikd2p16.jpg

The the following:

Count({$<Scheduled={1}, Closed={0},RequestedOn.autoCalendar.Month={“>=$(=date(max(RequestedOn.autoCalendar.Month)-2))”}>}WorkOrderKey)

I have highlighted the only new code added. Closed = {0} means the work order is still open. So I have added a second measure that only counts open scheduled work orders

So now if you look at your graph, you have 2 bars per month.The blue is our first expression and pink is our second. So we now have all Scheduled Work Order and Open Scheduled work orders for each month for easy comparision

qlikd2p17.jpg

Gauge – How many pumps are in the shop?

Grab gauge from the left bar and drag it to the bottom of your sheet.

qlikd2p18.jpg

Click Add Dimension > Fx to go to expression editor

qlikd2p19.jpg

Now type the following:

count({$<ClassCode={‘PUMPINFU’}, Closed = {0}, Scheduled = {0}>}WorkOrderKey)/count(distinct{$<ClassCode={‘PUMPINFU’}>}EquipmentKey)

Again, don’t let it intimidate you, let’s break it down.

First, notice that what we have here is 2 counting expressions – one divided by the other

count({$<ClassCode={‘PUMPINFU’}, Closed = {0}, Scheduled = {0}>}WorkOrderKey)/count(distinct{$<ClassCode={‘PUMPINFU’}>}EquipmentKey)

The first expression simply states – Count WorkOrderKeys of all rows where Closed is 0 and Scheduled is 0 and the ClassCode = PUMPINFU – this means count all rows with open unscheduled work orders where the piece of equipment was an infusion pump.  (** PUMPINFU is just the class code we use in our database **)

The second expression says count all rows contain infusion pumps. I added the distinct key word so that any individual piece of equipment that appears more than once in my list won’t be counted twice by accident.

So, my expression is – Number of Open Infusion Pump Work Orders / Number of Infusion Pumps

Now, lets make it look nice.

On the right bar, set Number formatting to Number

qlikd2p20

Chance Formatting to % with 1 decimal place

qlikd2p21.jpg

Now click Appearance, set our max Range Limit to 0.3 (30%), and click use segments

qlikd2p22.jpg

Click Add limit, use scroll bar or just type in 0.1

qlikd2p24

Hit Add limit again, set it for 0.2

qlikd2p25.jpg

Click on any of your boxes now and you will get color options

qlikd2p26.jpg

Now I did tradition green, yellow, red, but you can use any color combination you want.

qlikd2p27.jpg

You now have all the information you need to finish the dashboard on your own – which is really the best way to learn. Go back, look at my examples. The set analysis really is the hardest part to worry about for this dashboard. If you get stuck, Qlik has a great user community that can answer most questions. Or, you can leave a question in the comments or email me at Analytics4all@yahoo.com

 

 

 

 

 

QLIK: Build a Dashboard Part 1

qlikDash

In this two part lesson, we are going to build the above dashboard using QLIK Sense. If you don’t already have QLIK Sense, you can download a free trial here: Download

You can download the data set I am working with here: WO

Create new app

Once you have QLIK Sense up and running, click Create new app

qlikDP1

Name you new app. I am calling mine DASH

qlikDP2.jpg

 

Click Open App

Next click Add data

qlikSense4

Now select Excel files – I created my real dashboard using the SQL connector, however it is easier to load the Excel files. So I am using Excel in this lesson. I will show how to load SQL into QLIK in another lesson.

qlikSense5

Navigate to the folder you saved the downloaded files to and select WO.xlsx

qlikDP3

Note – you can uncheck columns you do not want to use from your data, but I have already removed all unwanted columns for this exercise.

qlikDP4

click Load data and finish on the bottom

qlikDP5

Once data loads, click Edit the sheet

qlikDP6.jpg

Now we have a blank sheet to work with

qlikDP7

Build the Dashboard

Click on Bar chart and drag it into the upper right corner

You will now be asked for a Dimension and a Measure. If you are not familiar with these terms in data, a dimension is like a category while a measure is something you can add, count, etc.

qlikDP8.jpg

For Dimension, we will select RequestedOn > Month

qlikDP9

For Measure, select WorkOrderKey

qlikDP10.jpgNow select Count

qlikDP11

Now resize your bar chart by dragging it by the corner.

qlikDP12.jpg

Next, drag a KPI over from the options in the left column. Place it to the right of your Bar Chart

As a measure, select Scheduled > SUM

qlikDP14

Now select Filter pane from the left column and place it next to your KPI. For dimension, select FacilityCode

qlikdp15.jpg Now, let’s drag over a Line Chart. Place it below your bar chart. Set Dimension to RequestedON> Date. Set your measure to WorkOrderKey> Count

qlikdp16

Now, let’s add a Pie Chart to the right of our line chart. Set Dimension to DepartmentCode and Measure to WorkOrderKey > Count

qlikdp17.jpg

Now lets clean it up. Click in the upper right corners of your charts that say Click to add title. Give your charts some titles.

 

 

Once you have done that, click the Done button in the upper right corner.

qlikdp20

Now play with your new dashboard.

qlikdp21.jpg

In the next lesson, I will show how to add more versatility to your dashboard through used of formulas and Set Analysis.

 

QLIK: QLIK Sense Dashboard

This article here is more of a show and tell than a how to lesson. I wanted to demonstrate a really cool product that I have been demoing for work called QLIK Sense. I work for a Clinical Engineering Department at a multi-site hospital system. While the main thrust of my job is data related, the department as a whole is responsible for the maintenance of medical equipment, everything from IV pumps to MRI machines. Anyone who has ever worked into a busy maintenance/service environment knows how difficult it can be to optimize labor so that proper resources are allocated to the right jobs.

One great tool that can help is an operations dashboard. This gives everyone in the shop a single point of reference for what is happening at any given time. Using QLIK Sense, I was able quickly to connect my work order database and within less than an hour, I had the following dashboard created.

qlikDash.jpg

Closer Look at the Elements

Across the top I have, Scheduled Inspections (Total vs Still Open), Repairs Calls Opened Today, Repair Calls Closed Today, and Open High Risk Equipment Repair Calls

qlikDash8.jpg

Next I have a trend line showing work orders created vs closed over time. Next to that, I have a pie chart showing what departments the majority of my repair calls are originating from.

qlikDash9.jpg

On the bottom, I have a gauge that shows the number of IV Pumps in the shop for repair vs the total number of pumps in Inventory. The goal is to keep that number below 20%. So as the needle starts edging towards red, we know to move resources onto pumps.

Finally, I have open repair work orders by employee.

qlikDash10.jpg

Cool Features

QLIK Sense provides more than just a dashboard. It is totally interactive. You can use basic filters like the one below:

qlikDash11.jpg

Or you can simple click on any element with the dashboard to filter the data.

Notice that after selecting Monitor Technicians from the Department Pie Chart, my Repair Work Orders by Employee chart has automatically filtered down to only the employees with work orders open in that department.

qlikDash3

With the trend line graph, I can use the “lasso” to select a range I want to filter by.

qlikDash1.jpg

Of course you can simply “fly over” element in the dashboard to get information as well.

qlikDash5.jpg

Now I have an effective BI tool

I know a lot of people out there are still living in the world of static reporting. Getting near real time information up in front of the decision makers can allow them to make near real time decisions. The days of  basing decisions off of a static report run daily or even (gasp.. ) weekly are over. We need to start deploying tools like this dashboard so we can manage less by what we think is happening and more by what we know is happening.

 

Excel: Adjusting Named Ranges

Okay, so you created a spreadsheet. To make it easier for your fellow coworkers to use, you implemented range names. And it all works great, that is until you need to change the actual range.

Look at this example. I had created range call Class which spanned across 16 cells representing 16 classes being offered. It worked great until my boss informs me that my school will now be offering 20 classes.

rangeAdj4.jpg

How do you change the range to include classes 17-20?

In the Formulas tab of the Ribbon, you will find an icon called Name Manager 

rangeAdj.jpg

Highlight the named range you are working with  – (in our case – Class)

Double click

rangeAdj1.jpg

Click on the little box to the right of the Refers to: box

rangeAdj2.jpg

Now highlight the new range

rangeAdj3

Click OK and OK. Now your range should contain your new cells.

 

 

 

 

PIG: Use GRUNT to Access PIG from the Command Line

I have already shown you how to use PIG from the Ambari web interface. Now I will show you how to use GRUNT to run PIG Scripts from the command line.

First things first. Before you can use the command terminal, you need to log into Hadoop via the VMware terminal. Click on your running VM and hit Alt-F5

Log in using root and hadoop as your password. You will then be asked to create a new password.

PigGrunt.jpg

Now go to http://127.0.0.1:8888 and click View Advanced Options

hortonworks8

Here you will see instructions for accessing Hadoop via SSH Client (command line access).

Also, in case someone out there doesn’t know. localhost and 127.0.0.1 mean the same thing and are interchangeable.

PigGrunt1.jpg

Putty

On of the best ways to access SSH from Windows is through a free program called Putty. You can find a link for it here:putty

Once you have it downloaded, click on the exe file and fill in the information as seen below. IP or Hostname(127.0.0.1 or localhost) and Port 2222. Make sure SSH is checked. If you want, you can save these setting like I did. I named mine HadoopLearning.

Next, click Open

PigGrunt2.jpg

Log in using root and your new password

PigGrunt3.jpg

Okay, now for a quick explanation of what will seem confusing to some. You are currently logged into a Linux computer (CentOS to be exact) that came prepacked with Hadoop already installed. So, in order to interact with Hadoop we need to use one of the two following commands first (hdfs dfs or hadoop fs) – either one works. It is just a matter of personal choice. I like hdfs dfs, but feel free to use hadoop fs

hdfs dfs -ls /

This command gives me the listing (ls) of the root hadoop folder.

PigGrunt4.jpg

I know my files are in user/maria_dev so let’s look in there now.

hdfs dfs -ls /user/maria_dev

You can see I have some csv files in this folder we can work with.

PigGrunt5.jpg

Now we are in a working terminal and have data, the next step is to start PIG. Now pay close attention, this part is very difficult. Go to your command terminal and type: pig

You will see the terminal working and you should end up with a prompt that says grunt>

PigGrunt6.jpg

grunt>

PigGrunt7.jpg

uP = LOAD '/user/maria_dev/UltrasoundPrice.csv' USING PigStorage(',') as (Model, Price);

dump uP;

You will see the MapReduce job running.

PigGrunt8.jpg

And here are your results.

PigGrunt9.jpg

 

PIG: Working With Data on Hadoop

Pig – or more officially Pig Latin – is a scripting language used to abstract over Map Reduce. What that means in English is that you can write relatively simple Pig scripts and they are translated in the background to Map Reduce – a much less user friendly language.

Pig can be used to perform basic data analysis. It is also commonly used to perform ETL (extract transform load) jobs. For those not familiar with ETL, when raw data comes into the system, it is often unstructured and in need of some basic cleaning before it can be analyzed or stored into a database structure like HBase. The ETL process handles all the organization and cleaning to give the data some level of structure.

Let’s open up Hadoop and and try our hand as some scripting.

Exercise File Download: PigExerciseFiles

In my Intro to Hadoop lesson I uploaded KMeans2.csv to /user/maria_dev : Hadoop: Introduction to Hadoop using Pig

Go to that folder upload the UltrasoundPrice.csv price.

Hit Upload

Hortonworks13

Select your file and confirm the Upload by hitting Upload again.

Hortonworks15

Now you should have both CSV files loaded into HDFS

Pig1.jpg

Now, let’s go to the PIG editor

hortonworks18

Click New Script. Let’s call this one Pig2

hortonworks19

Let’s start by loading Kmeans2.csv

kmeansData = LOAD '/user/maria_dev/KMeans2.csv' USING PigStorage(',') AS (
 ID, Model, WO, Labor, Cost, Department);

dump kmeansData;

Hit Execute

Pig2.jpg

Here are the results

Pig3.jpg

Now to load UltrasoundPrice.csv into PigStorage

kmeansData = LOAD '/user/maria_dev/KMeans2.csv' USING PigStorage(',') AS (
 ID, Model, WO, Labor, Cost, Department);

ultraPrice = LOAD '/user/maria_dev/UltrasoundPrice.csv' USING PigStorage(',') AS (
 Model, Price);

dump ultraPrice;

Pig3.jpg

JOIN

Now let us join our two data sets (combine the two into one). It is really simple in PIG. We are going to join on the Model name of course (since model is the column that contains matching values in both files).

kmeansData = LOAD '/user/maria_dev/KMeans2.csv' USING PigStorage(',') AS (
ID, Model, WO, Labor, Cost, Department);

ultraPrice = LOAD '/user/maria_dev/UltrasoundPrice.csv' USING PigStorage(',') AS (
Model, Price);

joined = JOIN kmeansData by Model, ultraPrice by Model;

dump joined;

Pig4.jpg

Choose Columns to Store

What if you don’t want all of the columns. In PIG, you can use the commands FOREACH and GENERATE to create a new data set with only the columns you choose.

So if DataSetA contains columns A, B, C, D and you only want A and C

FOREACH DataSetA Generate A, C;

In our example, I only want Model, WO, Department, and Price

kmeansData = LOAD '/user/maria_dev/KMeans2.csv' USING PigStorage(',') AS (
ID, Model, WO, Labor, Cost, Department);

ultraPrice = LOAD '/user/maria_dev/UltrasoundPrice.csv' USING PigStorage(',') AS (
Model, Price);

joined = JOIN kmeansData by Model, ultraPrice by Model;

pickCols = FOREACH joined GENERATE kmeansData::Model, WO, Department, Price;

dump pickCols;

 

Notice something weird in the code above? kmeansData::Model? This is due to the fact that since we joined two data sets with matching column names, we need to tell PIG which of the two Models we want. I just choose kmeansData by the way. I could have used ultraPrice instead.

Anyway, here are the results

Pig5

Filtering

Now lets say I only want to see rows where the WO column (which stands for Work Order Count by the way) is greater than 5.

kmeansData = LOAD '/user/maria_dev/KMeans2.csv' USING PigStorage(',') AS (
ID, Model, WO, Labor, Cost, Department);

ultraPrice = LOAD '/user/maria_dev/UltrasoundPrice.csv' USING PigStorage(',') AS (
Model, Price);

joined = JOIN kmeansData by Model, ultraPrice by Model;

pickCols = FOREACH joined GENERATE kmeansData::Model, WO, Department, Price;

filterCols = FILTER pickCols BY WO > 5;

dump filterCols;

Pig6

Save Results to a File

Dumping your results to a screen is fine if you just running some ad-hoc analysis, but if you want to use this data later, you can save it to a file.

kmeansData = LOAD '/user/maria_dev/KMeans2.csv' USING PigStorage(',') AS (
ID, Model, WO, Labor, Cost, Department);

ultraPrice = LOAD '/user/maria_dev/UltrasoundPrice.csv' USING PigStorage(',') AS (
Model, Price);

joined = JOIN kmeansData by Model, ultraPrice by Model;

pickCols = FOREACH joined GENERATE kmeansData::Model, WO, Department, Price;

filterCols = FILTER pickCols BY WO > 5;

STORE filterCols INTO 'filterResults' USING PigStorage(',');

Note that we tell PigStorage to use ‘,’ as the separator in the file. We could have used \t (tabs) or * or other separators if we had wanted to.

Now the first thing you notice is that your do not get any Results, just a Log file. This is okay, nothing went wrong. Notice the nice green bar above.

Pig7.jpg

To see your results, go to HDFS (square ICON on the top bar of the screen)

Go to /user/maria_dev/ and you will notice your new file is actually a new folder. Click on it.

Pig8.jpg

You will have two files, _Success ( a log) and your data: part-r-00000 (your name may differ)

Pig9.jpg

Double click on part-r-00000 to preview your new file

Pig10.jpg

 

Hadoop: Introduction to Hadoop using Pig

We live in the era of Big Data. Whether you aspire to work as Data Scientist, or you are just curious as to what all the hype is about, this introductory lesson on Hadoop using Pig will help to shed some light on the topic.

First off, what is Hadoop?

Hadoop is the Apache open source version of Google’s MapReduce. At its heart, Hadoop is a database management system, not to different in purpose than database systems you are already familiar with like Oracle and SQL Server. But  Hadoop is a data management system designed to work with massive data sets.

When working with Oracle, DB2, SQL Server, etc, if you want to handle massive data sets, you need a monster computer. And when a need for more resources pops up, you have no choice but to scale up vertically. That means, you need to add more memory, more hard drives, more processors to your computer –  an expensive and failure prone approach.

Hadoop, on the other hand, allows for horizontal scaling. You do not need to buy a massive database server. Instead Hadoop can run across multiple commodity( ie cheaper) servers, leveraging the collective memory and processor power of the group. This is in part due to the MapReduce algorithm.

MapReduce works like this. Imagine you have 100 documents and you want to know what the most common word in the all the documents is. Instead off looking at all 100 documents, MapReduce Maps (creates) 100 separate jobs where each document is scanned by a separate computer. Then once each document has been scanned, the results of all 100 scans are fed into the Reducer which aggregates the 100 results into 1 single result.

Enough Theory, Example time

I am going to be using The Hortonworks Sandbox. There are few approaches to setting this up – VMWare, Virtal Box, or 30 day free trail on Azure. Azure is without a doubt the simplest set up (it is literally like downloading an app on your phone), but after 30 days they want money and I am cheap. So I will be using Virtual Box.

You can download Virtual Box here: Virtual Box

You are also going to need to download the Hortonworks Sandbox: Sandbox

Make sure you download to appropriate version. I picked the VirtualBox Install

hortonworks1.jpg

hortonworks2

Install VirtualBox and open the VM VirtualBox Manager

Go to File > Import Appliance…

hortonworks3.jpg

Add the ova image you downloaded from Hortonworks and hit Next

hortonworks4.jpg

Keep the default setting and hit Import. The import can take anywhere from 5 minutes to 1/2 hour based on the speed of your machine.

hortonworks5.jpg

Once the import is complete, start up Hortonworks. Just double click on the instance you want. (Note I have instances loaded on my machine, most people will only have one).

hortonworks6

Once it finishes loading, your screen should look like this.

Open up a browser and go to http://127.0.0.1:8888/  – refer to your screen for the correct address as it may be different.

hortonworks7.jpg

The following screen should appear.

hortonworks8.jpg

Follow the URL and use username and password provided (note your’s may differ from my screen shot)

hortonworks9.jpg

You will come to a screen like this. This is the Ambari environment – part of the Hortonworks Sandbox.  You may notice some red alert icons on the screen – these are common and generally will not cause you any problems – at least not as far as what we are doing here.

hortonworks10.jpg

Next, you are going to need to download this CSV file if you want to play along: KMeans2

Now, go to the square box icon in the upper right corner and select HDFS Files (HDFS stands for Hadoop File System in case you were wondering)

Hortonworks11

Now go to user

Hortonworks12.jpg

I then select maria_dev since that is who I am logged in as

Hortonworks12.jpg

Hit the Upload button

Hortonworks13

Hit Browse and search for the Kmeans2 file

Hortonworks14.jpg

Confirm your Upload

Hortonworks15.jpg

Now your file is uploaded to Hadoop. Notice the permissions column. If you have any Linux experience, this should look familiar. Hortonwork’s Sandbox is packaged with CentOS Linux.

Hortonworks16.jpg

You can double click on the file and get a file preview

Hortonworks17.jpg

PIG Script

Okay, so we have built a Hadoop Sandbox and loaded uploaded some data. How do we get to it?

One method for accessing data in Hadoop is PIG. It is a language similar to SQL. Let’s take a look at some quick examples.

First let’s get to our PIG editor. Go to the square icon on the top bar and select Pig.

hortonworks18.jpg

Select New Script and give it a name. I named mine Pig1

hortonworks19

Below is the first example of our script. I’ll explain below

dataLoad = LOAD '/user/maria_dev/KMeans2.csv' USING PigStorage(',') AS (
ID, Model, WO, Labor, Cost, Department);

dump dataLoad;
  • dataLoad – just a variable. This is the variable I will be dumping my data into
  • LOAD’/user/maria_dev/KMeans2.csv’ – this tells PIG to load the data file we uploaded
  • USING PigStorage(‘,’) – This means we will be storing the data in a comma separated format
  • AS(ID, Model,…etc; – This names columns
  • dump dataLoad; – dumps the results of my query to the screen.

Note all lines need to end in a semicolon ; 

Before you hit Execute, note that PIG is not like SQL. PIG executes as a batch, so nothing will appear on the screen until the entire batch is finished processing

hortonworks20

And we have success…

hortonworks21.jpg

One more bit of code. I will save more in depth PIG for future lessons. This is really just a quick glimpse of Hadoop and how it works.

In this example below, we are  using the LIMIT command to limit our results to the first 10.

dataLoad = LOAD '/user/maria_dev/KMeans2.csv' USING PigStorage(',') AS (
ID, Model, WO, Labor, Cost, Department);

dataTop10 = LIMIT dataLoad 10;

dump dataLoad;

hortonworks22.jpg

And here are our results.

hortonworks23.jpg

 

Python: Pivot Tables with Pandas

Pandas provides Python with lots of advanced data management tools. Being able to create pivots tables is one of the cooler tools.

You can download the data set I will be working with here: KMeans1

First, let’s upload our data into Python

import pandas as pd
df = pd.read_excel("C:\Users\Benjamin\Documents\KMeans1.xlsx")
df.head()

pandasPivotTable.jpg

Let’s create our first pivot table.

The syntax is = pd.pivot_table(dataframe, index = Columns you want to group by, values = Columns you want to aggregate, aggfunc = type of aggregation)

pd.pivot_table(df, index='Department', values = 'AVG Labor', aggfunc = 'sum')

pandasPivotTable1.jpg

We can group by more than one column

pd.pivot_table(df, index=['Department','Model'], values = 'AVG Labor',  
aggfunc= 'mean')

pandasPivotTable2.jpg

You can also have multiple value columns

 pd.pivot_table(df, index='Department', values = ['AVG Labor','Labor Cost'],
  aggfunc= 'sum')

pandasPivotTable3.jpg

Now, one catch is, what if you want to have different aggregate functions. What if I want to get the mean of AVG Labor, but I want the sum of Labor Cost columns.

In this case we are going to use the groupby().aggregate()

import numpy as np
df.groupby('Department').aggregate({'AVG Labor':np.mean, 'Labor Cost': np.sum})

pandasPivotTable4.jpg

CPLEX: An introduction to Linear Programming with OPL

IBM’s CPLEX is a commercial quality Optimization product. It can handle 10’s of 1000’s of variables as well as massive data sets. As with most commercial software it is very expensive, but they do offer a limited edition you can download to get some experience working with it.

You can find the free downloads here: CPLEX Download

In this example, we will be using CPLEX OPL (optimization programming language) to solve a very simple linear problem. Check out the already solved solution below:

As you can see, we are deciding which of 2 types of furniture, Table and Chairs. Our goal is maximize our profit, however we only have 300 units of material and 110 units of labor available.

phpsimplex1

Let’s break it down into some basic formulas now.

First, what are the two main elements we have here? Furniture (Table, Chair) and Resources (Material and Labor). All other elements are related to either Furniture or Resources in some way.

How Many? is related to Furniture. To make this easier to follow we will shorten How Many? to HM

Profit per Unit is also related to furniture – let’s call it P

So Total Profit = HM[Table]*P[Table] + HM[Chair]+P[Chair]

In a bit more formal format: cplexFurn.jpg

Coding in CPLEX

Now let’s code this in CPLEX. First we need to create a new OPL Project

cplexFurn1.jpg

Name your project and make sure to check Create Model and Create Data

cplexFurn2.jpg

You will note you now have two new files, a .mod and .dat

  • .mod = model file
  • .dat = data file

cplexFurn3.jpg

We are going to start by building our model file.

The model file consists of 4 parts

  • Variable declaration
  • Decision variable declaration
  • Objective
  • Constraints

Variable Declaration

{string} furniture = ...;   
{string} resources = ...;
int P[furniture] = ...;
int Avail[resources] = ...;
int ResValue[resources][furniture] = ...;

First- the =…; means the variable will find its values in the .dat file. We will get there soon.

{string} furniture = …;   — blue box
{string} resources = …;  — red box

cplexFurn4.jpg

int P[furniture] = …;  – red box
int Avail[resources] = …;  –  green box
int ResValues[resources][furniture] = …;  – purple box

cplexFurn5

The [] after the variable name tells which string the variable is related to. If you having trouble seeing what I am talking about, look at the images below.

cplexFurn6.jpg

cplexFurn7

cplexFurn8.jpg

I know this is confusing, but hopefully it will clear up as we move forward.

Decision variable declaration

dvar int HM[furniture];

Our decision variable (dvar) is how many units to build (HM). This is of course related to furniture. As the picture below shows:

cplexFurn9.jpg

Objective:

maximize 
      sum(i in furniture)
                  HM[i]*P[i];

Our object is to maximize profit. The equation above is simply the OPL way of writing:

HM[Table]*P[Table] + HM[Chair]+P[Chair]

sum(i in furniture) is like a for loop where the results of each iteration are summed up. The two values passed through i would be Table and Chair.

Constraints:

subject to {

      forall(i in resources)
               ct1:
                   sum(j in furniture)
                         ResValue[i][j]*HM[j]<=Avail[i];

}

Our final chunk of code in our model (.mod) file is our Constraints. Constraints are started by using the code: subject to {}

Forall(i in resources) – is just a for loop iterating through (materials, labor)

ct1: – just names your constraint – an issue in bigger problems where you might have many constraints.

To clarify I will walk you through the sum(j in furniture) loop.

  1. ResValue[Materials][Table]*HM[Table]<=Avail[Materials]
  2. ResValue[Materials][Chair]*HM[Chair]<=Avail[Materials]
  3. ResValue[Labor][Table]*HM[Table]<=Avail[Labor]
  4. ResValue[Labor][Chair]*HM[Chair]<=Avail[Labor]

Here is what the code looks like in CPLEX

cplexFurn10.jpg

Now, let’s fill in our .dat (data file)

The code should be pretty self explanatory, note the use of {} for our string variables

furniture = {"Table","Chair"};
resources = {"Wood", "Labor"};
P = [6,8];
Avail = [300,110];
ResValue = [[30,20],[5,10]];

Here is the code in CPLEX

cplexFurn11.jpg

Run the code

To run the code, we first need to create a running configuration. Right click on Run Configurations > New > Run Configuration

cplexFurn12.jpg

Name your run configuration

cplexFurn13.jpg

Now, select you .mod and .dat files one at a time and drag them into your new running config. When you get the pop up warning, just click okay. Repeat this task for both files

cplexFurn14.jpg

Here is what your screen should look like.

cplexFurn15.jpg

Right click your running config and click Run this

cplexFurn16.jpg

Click Okay to save your mod and dat file

cplexFurn17.jpg

And now we have our results. 96 is out total profit with 4 Tables and 9 Chairs being built. Just like our Solver solution above.

cplexFurn18.jpg

The Code

.mod file

{string} furniture = ...; 
{string} resources = ...;
int P[furniture] = ...;
int Avail[resources] = ...;
int ResValue[resources][furniture] = ...;

dvar int HM[furniture];

maximize 
 sum(i in furniture)
    P[i]*HM[i];
 
subject to {

forall(i in resources)
 ct1:
 sum(j in furniture)
 ResValue[i,j]*HM[j]<=Avail[i];

}

.dat file

furniture = {"Table","Chair"};
resources = {"Wood", "Labor"};
P = [6,8];
Avail = [300,110];
ResValue = [[30,20],[5,10]];

 

 

Python: An Interesting Problem with Pandas

I was writing a little tongue and cheek article for LinkedIn on fraud detection using frequency distributions (you can read the article here: LinkedIn). While this was a non-technical article, I wanted to use some histograms from a real data set, so I uploaded a spread sheet into Python and went to work.

While working with the data I ran into an interesting problem that had me chasing my tail for about 10 minutes before I figured it out. It is a fun little problem involving Series and Dataframes.

As always, you can upload the data set here: FraudCheck1

Upload the data.

import pandas as pd
df = pd.read_excel
("C:\\Users\\Benjamin\\OneDrive\\Documents\\article\\python\\FraudCheck1.xlsx")
df.head()

pandasProb.jpg

The data is pretty simple here. We are concerned with our answer column and the CreatedBy (which is the employee ID).  What I am trying to do is see if the “answer”  (a reading from an electric meter) are really random or if they have been contrived by someone trying to fake the data.

First, I want to get the readings for all the employees, so I used pop() to place the answer column into a separate list.

df1 = df

y = df1.pop("answer")

pandasProb1.jpg

Then, to make my histogram more pleasant looking, I decided to only use the last digit before the decimal. That way I will have 10 bars (0-9). (Remember, this is solely for making charts for an article. So I was not concerned with any more stringent methods of normalization)

What I am doing below is int(199.7%10). Remember % is the modulus – leaves you with the remainder and int converts your float to an integers. So 199.7 is cut to 199. The 199/10 remainder = 9.

a= []
i = 0 
while i < len(y):
     a.append(int(y[i]%10))
     i += 1
a[1:10]

pandasProb2

Then I created my histogram.

%matplotlib inline
from matplotlib import pyplot as plt
plt.hist(a)

pandasProb3.jpg

Now my problem

Now I want graph only the answers from employee 619, so first I filter out all rows but the ones for employee 619.

df2 = df.query('CreatedBy == 619')
y1 =df2.pop("answer")

Then I ran my loop to turn my answers into a single digit.

And I get an error.  Why?

pandasProb4.jpg

Well the answer lies in the datatypes we are working with. Pandas read_excel function creates a Dataframe.

When you pop a column from a dataframe, you end up with a Series. And remember a series is an indexed listing of values.

Let’s look at our Series. Check out the point my line is pointing to below. Notice how my index jumps from 31 to 62. My while loop counts by 1, so after 31, I went looking for y1[32] and it doesn’t exist.

pandasProb5.jpg

Using .tolist() converts our Series to a list and now our while loop works.

pandasProb6.jpg

And now we can build another histogram.

pandasProb7.jpg

The Code

import pandas as pd
df = pd.read_excel
("C:\\Users\\Benjamin\\OneDrive\\Documents\\article\\python\\FraudCheck1.xlsx")
df.head()

df1 = df
y =df1.pop("answer")

a= []
i = 0 
while i < len(y):
   a.append(int(y[i]%10))
   i += 1
a[1:10]

%matplotlib inline
from matplotlib import pyplot as plta1 = []
i = 0
while i < len(y2):
 a1.append(int(y2[i])%10)
 i = i+1
a1[1:10]

plt.hist(a)

df2 = df.query('CreatedBy == 619')
y1 =df2.pop("answer")

y2= y1.tolist()
type(y2)

a1 = []
i = 0
while i < len(y2):
    a1.append(int(y2[i])%10)
    i = i+1
a1[1:10]

plt.hist(a1)