I have already shown you how to use PIG from the Ambari web interface. Now I will show you how to use Grunt, PIG's interactive shell, to run PIG scripts from the command line.
First things first. Before you can use the command terminal, you need to log into Hadoop via the VMware terminal. Click on your running VM and hit Alt-F5.
Log in using root and hadoop as your password. You will then be asked to create a new password.
Here you will see instructions for accessing Hadoop via SSH Client (command line access).
Also, in case someone out there doesn't know: localhost and 127.0.0.1 mean the same thing and are interchangeable.
PuTTY
One of the best ways to access SSH from Windows is through a free program called PuTTY. You can find a link for it here: PuTTY
Once you have it downloaded, click on the exe file and fill in the information as seen below: IP or Hostname (127.0.0.1 or localhost) and Port 2222. Make sure SSH is checked. If you want, you can save these settings like I did. I named mine HadoopLearning.
Next, click Open
Log in using root and your new password
Okay, now for a quick explanation of what will seem confusing to some. You are currently logged into a Linux computer (CentOS, to be exact) that came prepackaged with Hadoop already installed. So, in order to interact with Hadoop's file system (HDFS), we need to start our commands with either hdfs dfs or hadoop fs. Either one works; it is just a matter of personal preference. I like hdfs dfs, but feel free to use hadoop fs.
hdfs dfs -ls /
This command gives me a listing (ls) of the HDFS root folder.
I know my files are in /user/maria_dev, so let's look in there now.
hdfs dfs -ls /user/maria_dev
You can see I have some CSV files in this folder that we can work with.
Now that we are in a working terminal and have data, the next step is to start PIG. Pay close attention, because this part is very difficult. Go to your command terminal and type: pig
You will see the terminal working and you should end up with a prompt that says grunt>
grunt>
uP = LOAD '/user/maria_dev/UltrasoundPrice.csv' USING PigStorage(',') as (Model, Price);
dump uP;
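Once the dump finishes, you are returned to the grunt> prompt and can keep working with the alias interactively. Here is a quick sketch of what that might look like; these particular follow-up commands are my own example, not part of the original walkthrough:
grunt> DESCRIBE uP;
grunt> models = FOREACH uP GENERATE Model;
grunt> dump models;
When you are done, type quit to leave Grunt and return to the regular shell.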
Pig (or, more officially, Pig Latin) is a scripting language used to abstract over MapReduce. What that means in English is that you can write relatively simple Pig scripts and they are translated in the background into MapReduce jobs, which are much less user friendly to write by hand.
Pig can be used to perform basic data analysis. It is also commonly used to perform ETL (extract, transform, load) jobs. For those not familiar with ETL: when raw data comes into the system, it is often unstructured and in need of some basic cleaning before it can be analyzed or stored into a database structure like HBase. The ETL process handles the organizing and cleaning needed to give the data some level of structure.
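Just to give a flavor of what a tiny ETL step can look like in PIG, here is a sketch. The file name raw_equipment.csv and its columns are made up for illustration; they are not part of this lesson's data.
raw = LOAD '/user/maria_dev/raw_equipment.csv' USING PigStorage(',') AS (Model:chararray, Dept:chararray, Cost:float);
cleaned = FOREACH raw GENERATE TRIM(Model) AS Model, UPPER(Dept) AS Dept, Cost;  -- tidy up the text fields
good = FILTER cleaned BY Model != '' AND Cost IS NOT NULL;  -- drop rows missing key values
STORE good INTO 'cleaned_equipment' USING PigStorage(',');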
Let's open up Hadoop and try our hand at some scripting.
Go to that folder and upload the UltrasoundPrice.csv file.
Hit Upload
Select your file and confirm the Upload by hitting Upload again.
Now you should have both CSV files loaded into HDFS
Now, let’s go to the PIG editor
Click New Script. Let’s call this one Pig2
Let's start by loading KMeans2.csv
kmeansData = LOAD '/user/maria_dev/KMeans2.csv' USING PigStorage(',') AS (
ID, Model, WO, Labor, Cost, Department);
dump kmeansData;
Hit Execute
Here are the results
Now let's load UltrasoundPrice.csv as well, using PigStorage
kmeansData = LOAD '/user/maria_dev/KMeans2.csv' USING PigStorage(',') AS (
ID, Model, WO, Labor, Cost, Department);
ultraPrice = LOAD '/user/maria_dev/UltrasoundPrice.csv' USING PigStorage(',') AS (
Model, Price);
dump ultraPrice;
JOIN
Now let us join our two data sets (combine the two into one). It is really simple in PIG. We are going to join on the Model name, of course, since Model is the column that contains matching values in both files.
kmeansData = LOAD '/user/maria_dev/KMeans2.csv' USING PigStorage(',') AS (
ID, Model, WO, Labor, Cost, Department);
ultraPrice = LOAD '/user/maria_dev/UltrasoundPrice.csv' USING PigStorage(',') AS (
Model, Price);
joined = JOIN kmeansData by Model, ultraPrice by Model;
dump joined;
Choose Columns to Store
What if you don't want all of the columns? In PIG, you can use the FOREACH and GENERATE commands to create a new data set with only the columns you choose.
So if DataSetA contains columns A, B, C, D and you only want A and C (the result gets assigned to a new alias; I am calling it newDataSet here):
newDataSet = FOREACH DataSetA GENERATE A, C;
In our example, I only want Model, WO, Department, and Price
kmeansData = LOAD '/user/maria_dev/KMeans2.csv' USING PigStorage(',') AS (
ID, Model, WO, Labor, Cost, Department);
ultraPrice = LOAD '/user/maria_dev/UltrasoundPrice.csv' USING PigStorage(',') AS (
Model, Price);
joined = JOIN kmeansData by Model, ultraPrice by Model;
pickCols = FOREACH joined GENERATE kmeansData::Model, WO, Department, Price;
dump pickCols;
Notice something weird in the code above? kmeansData::Model? Since we joined two data sets with matching column names, we need to tell PIG which of the two Model columns we want. I just chose kmeansData, by the way; I could have used ultraPrice instead.
Anyway, here are the results
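By the way, if carrying the kmeansData:: prefix around feels clunky, you can rename the column with AS right inside the FOREACH so later statements can simply refer to Model. This is just an optional variation on the script above, not something the rest of the lesson depends on:
pickCols = FOREACH joined GENERATE kmeansData::Model AS Model, WO, Department, Price;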
Filtering
Now let's say I only want to see rows where the WO column (which stands for Work Order count, by the way) is greater than 5.
kmeansData = LOAD '/user/maria_dev/KMeans2.csv' USING PigStorage(',') AS (
ID, Model, WO, Labor, Cost, Department);
ultraPrice = LOAD '/user/maria_dev/UltrasoundPrice.csv' USING PigStorage(',') AS (
Model, Price);
joined = JOIN kmeansData by Model, ultraPrice by Model;
pickCols = FOREACH joined GENERATE kmeansData::Model, WO, Department, Price;
filterCols = FILTER pickCols BY WO > 5;
dump filterCols;
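One more note on FILTER: conditions can be combined with AND, OR, and NOT. A quick sketch, where the department value 'Radiology' is just a made-up example and not necessarily in the data:
filterRadiology = FILTER pickCols BY WO > 5 AND Department == 'Radiology';
dump filterRadiology;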
Save Results to a File
Dumping your results to the screen is fine if you are just running some ad-hoc analysis, but if you want to use this data later, you can save it to a file.
kmeansData = LOAD '/user/maria_dev/KMeans2.csv' USING PigStorage(',') AS (
ID, Model, WO, Labor, Cost, Department);
ultraPrice = LOAD '/user/maria_dev/UltrasoundPrice.csv' USING PigStorage(',') AS (
Model, Price);
joined = JOIN kmeansData by Model, ultraPrice by Model;
pickCols = FOREACH joined GENERATE kmeansData::Model, WO, Department, Price;
filterCols = FILTER pickCols BY WO > 5;
STORE filterCols INTO 'filterResults' USING PigStorage(',');
Note that we tell PigStorage to use ‘,’ as the separator in the file. We could have used \t (tabs) or * or other separators if we had wanted to.
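For example, the same STORE written with a tab separator would look something like this (filterResultsTab is just a folder name I made up):
STORE filterCols INTO 'filterResultsTab' USING PigStorage('\t');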
Now, the first thing you will notice is that you do not get any Results, just a Log file. This is okay; nothing went wrong. Notice the nice green bar above.
To see your results, go to HDFS (the square icon on the top bar of the screen)
Go to /user/maria_dev/ and you will notice your new file is actually a new folder. Click on it.
You will have two files: _SUCCESS (an empty marker file that just indicates the job succeeded) and your data: part-r-00000 (your file name may differ)
Double click on part-r-00000 to preview your new file
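If you want to keep working with the saved data in a later script, you can load the folder right back into PIG. A sketch, assuming the job ran as maria_dev so the relative path 'filterResults' landed under /user/maria_dev (PIG reads every part file in the folder):
savedResults = LOAD '/user/maria_dev/filterResults' USING PigStorage(',') AS (Model, WO, Department, Price);
dump savedResults;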
We live in the era of Big Data. Whether you aspire to work as a Data Scientist, or you are just curious about what all the hype is about, this introductory lesson on Hadoop using Pig will help shed some light on the topic.
First off, what is Hadoop?
Hadoop is the Apache open source version of Google's MapReduce. At its heart, Hadoop is a data management system, not too different in purpose from database systems you are already familiar with, like Oracle and SQL Server. But Hadoop is designed to work with massive data sets.
When working with Oracle, DB2, SQL Server, etc., if you want to handle massive data sets, you need a monster computer. And when a need for more resources pops up, you have no choice but to scale up vertically. That means you need to add more memory, more hard drives, and more processors to your computer, which is an expensive and failure-prone approach.
Hadoop, on the other hand, allows for horizontal scaling. You do not need to buy a massive database server. Instead, Hadoop can run across multiple commodity (i.e., cheaper) servers, leveraging the collective memory and processing power of the group. This is in part due to the MapReduce algorithm.
MapReduce works like this. Imagine you have 100 documents and you want to know what the most common word across all the documents is. Instead of looking at all 100 documents on one machine, MapReduce maps out 100 separate tasks, where each document is scanned by a separate computer. Then, once each document has been scanned, the results of all 100 scans are fed into the Reducer, which aggregates the 100 results into one single result.
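To make that concrete, here is roughly what the classic word-count job looks like when written as a PIG script (PIG translates it into MapReduce behind the scenes). The input path /user/maria_dev/docs is hypothetical:
lines = LOAD '/user/maria_dev/docs' USING TextLoader() AS (line:chararray);
words = FOREACH lines GENERATE FLATTEN(TOKENIZE(line)) AS word;  -- split each line into individual words
grouped = GROUP words BY word;                                   -- collect identical words together
counts = FOREACH grouped GENERATE group AS word, COUNT(words) AS total;  -- count how often each word appears
ordered = ORDER counts BY total DESC;
topWord = LIMIT ordered 1;                                       -- keep only the most common word
dump topWord;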
Enough Theory, Example time
I am going to be using the Hortonworks Sandbox. There are a few approaches to setting this up: VMware, VirtualBox, or a 30-day free trial on Azure. Azure is without a doubt the simplest setup (it is literally like downloading an app on your phone), but after 30 days they want money and I am cheap. So I will be using VirtualBox.
You are also going to need to download the Hortonworks Sandbox: Sandbox
Make sure you download the appropriate version. I picked the VirtualBox install
Install VirtualBox and open the VM VirtualBox Manager
Go to File > Import Appliance…
Add the ova image you downloaded from Hortonworks and hit Next
Keep the default settings and hit Import. The import can take anywhere from 5 minutes to half an hour, depending on the speed of your machine.
Once the import is complete, start up Hortonworks. Just double click on the instance you want. (Note: I have multiple instances loaded on my machine; most people will only have one.)
Once it finishes loading, your screen should look like this.
Open up a browser and go to http://127.0.0.1:8888/ – refer to your screen for the correct address as it may be different.
The following screen should appear.
Follow the URL and use the username and password provided (note: yours may differ from my screenshot)
You will come to a screen like this. This is the Ambari environment – part of the Hortonworks Sandbox. You may notice some red alert icons on the screen – these are common and generally will not cause you any problems – at least not as far as what we are doing here.
Next, you are going to need to download this CSV file if you want to play along: KMeans2
Now, go to the square box icon in the upper right corner and select HDFS Files (HDFS stands for Hadoop Distributed File System, in case you were wondering)
Now go to user
I then select maria_dev since that is who I am logged in as
Hit the Upload button
Hit Browse and search for the Kmeans2 file
Confirm your Upload
Now your file is uploaded to Hadoop. Notice the permissions column. If you have any Linux experience, this should look familiar. The Hortonworks Sandbox is packaged with CentOS Linux.
You can double click on the file and get a file preview
PIG Script
Okay, so we have built a Hadoop Sandbox and uploaded some data. How do we get to it?
One method for accessing data in Hadoop is PIG. It is a scripting language similar in spirit to SQL. Let's take a look at some quick examples.
First let’s get to our PIG editor. Go to the square icon on the top bar and select Pig.
Select New Script and give it a name. I named mine Pig1
Below is the first example of our script. I’ll explain below
dataLoad = LOAD '/user/maria_dev/KMeans2.csv' USING PigStorage(',') AS (
ID, Model, WO, Labor, Cost, Department);
dump dataLoad;
dataLoad – just a variable (an alias, in PIG terms). This is the variable I will be dumping my data into.
LOAD '/user/maria_dev/KMeans2.csv' – this tells PIG to load the data file we uploaded.
USING PigStorage(',') – this tells PIG the file is comma separated, so it knows how to split each line into fields.
AS (ID, Model, …) – this names the columns (you can also declare types here; see the sketch below).
dump dataLoad; – dumps the results of my query to the screen.
Note that every statement needs to end in a semicolon (;).
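As a side note, you can declare a type for each column in the AS clause, which keeps PIG from having to guess later. Here is a sketch; the types are my own guesses about this file rather than anything confirmed in the lesson:
dataLoad = LOAD '/user/maria_dev/KMeans2.csv' USING PigStorage(',') AS (
ID:int, Model:chararray, WO:int, Labor:float, Cost:float, Department:chararray);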
Before you hit Execute, note that PIG is not like SQL. PIG executes as a batch, so nothing will appear on the screen until the entire batch has finished processing
And we have success…
One more bit of code. I will save more in depth PIG for future lessons. This is really just a quick glimpse of Hadoop and how it works.
In the example below, we are using the LIMIT command to limit our results to the first 10 rows.
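Here is a minimal sketch of what that script could look like, reusing the same load as before:
dataLoad = LOAD '/user/maria_dev/KMeans2.csv' USING PigStorage(',') AS (
ID, Model, WO, Labor, Cost, Department);
firstTen = LIMIT dataLoad 10;
dump firstTen;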