PIG: Working With Data on Hadoop

Pig – or more officially Pig Latin – is a scripting language used to abstract over Map Reduce. What that means in English is that you can write relatively simple Pig scripts and they are translated in the background to Map Reduce – a much less user friendly language.

Pig can be used to perform basic data analysis. It is also commonly used to perform ETL (extract transform load) jobs. For those not familiar with ETL, when raw data comes into the system, it is often unstructured and in need of some basic cleaning before it can be analyzed or stored into a database structure like HBase. The ETL process handles all the organization and cleaning to give the data some level of structure.

Let’s open up Hadoop and and try our hand as some scripting.

Exercise File Download: PigExerciseFiles

In my Intro to Hadoop lesson I uploaded KMeans2.csv to /user/maria_dev : Hadoop: Introduction to Hadoop using Pig

Go to that folder upload the UltrasoundPrice.csv price.

Hit Upload

Hortonworks13

Select your file and confirm the Upload by hitting Upload again.

Hortonworks15

Now you should have both CSV files loaded into HDFS

Pig1.jpg

Now, let’s go to the PIG editor

hortonworks18

Click New Script. Let’s call this one Pig2

hortonworks19

Let’s start by loading Kmeans2.csv

kmeansData = LOAD '/user/maria_dev/KMeans2.csv' USING PigStorage(',') AS (
 ID, Model, WO, Labor, Cost, Department);

dump kmeansData;

Hit Execute

Pig2.jpg

Here are the results

Pig3.jpg

Now to load UltrasoundPrice.csv into PigStorage

kmeansData = LOAD '/user/maria_dev/KMeans2.csv' USING PigStorage(',') AS (
 ID, Model, WO, Labor, Cost, Department);

ultraPrice = LOAD '/user/maria_dev/UltrasoundPrice.csv' USING PigStorage(',') AS (
 Model, Price);

dump ultraPrice;

Pig3.jpg

JOIN

Now let us join our two data sets (combine the two into one). It is really simple in PIG. We are going to join on the Model name of course (since model is the column that contains matching values in both files).

kmeansData = LOAD '/user/maria_dev/KMeans2.csv' USING PigStorage(',') AS (
ID, Model, WO, Labor, Cost, Department);

ultraPrice = LOAD '/user/maria_dev/UltrasoundPrice.csv' USING PigStorage(',') AS (
Model, Price);

joined = JOIN kmeansData by Model, ultraPrice by Model;

dump joined;

Pig4.jpg

Choose Columns to Store

What if you don’t want all of the columns. In PIG, you can use the commands FOREACH and GENERATE to create a new data set with only the columns you choose.

So if DataSetA contains columns A, B, C, D and you only want A and C

FOREACH DataSetA Generate A, C;

In our example, I only want Model, WO, Department, and Price

kmeansData = LOAD '/user/maria_dev/KMeans2.csv' USING PigStorage(',') AS (
ID, Model, WO, Labor, Cost, Department);

ultraPrice = LOAD '/user/maria_dev/UltrasoundPrice.csv' USING PigStorage(',') AS (
Model, Price);

joined = JOIN kmeansData by Model, ultraPrice by Model;

pickCols = FOREACH joined GENERATE kmeansData::Model, WO, Department, Price;

dump pickCols;

 

Notice something weird in the code above? kmeansData::Model? This is due to the fact that since we joined two data sets with matching column names, we need to tell PIG which of the two Models we want. I just choose kmeansData by the way. I could have used ultraPrice instead.

Anyway, here are the results

Pig5

Filtering

Now lets say I only want to see rows where the WO column (which stands for Work Order Count by the way) is greater than 5.

kmeansData = LOAD '/user/maria_dev/KMeans2.csv' USING PigStorage(',') AS (
ID, Model, WO, Labor, Cost, Department);

ultraPrice = LOAD '/user/maria_dev/UltrasoundPrice.csv' USING PigStorage(',') AS (
Model, Price);

joined = JOIN kmeansData by Model, ultraPrice by Model;

pickCols = FOREACH joined GENERATE kmeansData::Model, WO, Department, Price;

filterCols = FILTER pickCols BY WO > 5;

dump filterCols;

Pig6

Save Results to a File

Dumping your results to a screen is fine if you just running some ad-hoc analysis, but if you want to use this data later, you can save it to a file.

kmeansData = LOAD '/user/maria_dev/KMeans2.csv' USING PigStorage(',') AS (
ID, Model, WO, Labor, Cost, Department);

ultraPrice = LOAD '/user/maria_dev/UltrasoundPrice.csv' USING PigStorage(',') AS (
Model, Price);

joined = JOIN kmeansData by Model, ultraPrice by Model;

pickCols = FOREACH joined GENERATE kmeansData::Model, WO, Department, Price;

filterCols = FILTER pickCols BY WO > 5;

STORE filterCols INTO 'filterResults' USING PigStorage(',');

Note that we tell PigStorage to use ‘,’ as the separator in the file. We could have used \t (tabs) or * or other separators if we had wanted to.

Now the first thing you notice is that your do not get any Results, just a Log file. This is okay, nothing went wrong. Notice the nice green bar above.

Pig7.jpg

To see your results, go to HDFS (square ICON on the top bar of the screen)

Go to /user/maria_dev/ and you will notice your new file is actually a new folder. Click on it.

Pig8.jpg

You will have two files, _Success ( a log) and your data: part-r-00000 (your name may differ)

Pig9.jpg

Double click on part-r-00000 to preview your new file

Pig10.jpg