Hadoop: Introduction to Hadoop using Pig

We live in the era of Big Data. Whether you aspire to work as a Data Scientist, or you are just curious about what all the hype is about, this introductory lesson on Hadoop using Pig will help shed some light on the topic.

First off, what is Hadoop?

Hadoop is the Apache open source version of Google’s MapReduce. At its heart, Hadoop is a data management system, not too different in purpose from database systems you are already familiar with, like Oracle and SQL Server. But Hadoop is designed to work with massive data sets.

When working with Oracle, DB2, SQL Server, etc., if you want to handle massive data sets, you need a monster computer. And when the need for more resources pops up, you have no choice but to scale up vertically. That means adding more memory, more hard drives, and more processors to your computer – an expensive and failure-prone approach.

Hadoop, on the other hand, allows for horizontal scaling. You do not need to buy a massive database server. Instead, Hadoop can run across multiple commodity (i.e., cheaper) servers, leveraging the collective memory and processing power of the group. This is in part due to the MapReduce algorithm.

MapReduce works like this. Imagine you have 100 documents and you want to know what the most common word across all the documents is. Instead of scanning all 100 documents on one machine, MapReduce Maps (creates) 100 separate jobs so each document is scanned by a separate computer. Then, once each document has been scanned, the results of all 100 scans are fed into the Reducer, which aggregates the 100 results into a single result.
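To make this concrete, here is a minimal word-count sketch written in Pig Latin, the language we will use later in this lesson; Pig translates a script like this into MapReduce jobs under the hood. The file path and field names here are placeholders for illustration only.

-- load one line of text per record (hypothetical file)
lines = LOAD '/user/maria_dev/docs.txt' AS (line:chararray);
-- "Map" step: break each line into individual words
words = FOREACH lines GENERATE FLATTEN(TOKENIZE(line)) AS word;
-- "Reduce" step: group identical words and count each group
grouped = GROUP words BY word;
counts = FOREACH grouped GENERATE group AS word, COUNT(words) AS total;
-- most common word first
sorted = ORDER counts BY total DESC;
dump sorted;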

Enough Theory, Example time

I am going to be using the Hortonworks Sandbox. There are a few approaches to setting this up – VMware, VirtualBox, or a 30-day free trial on Azure. Azure is without a doubt the simplest setup (it is literally like downloading an app on your phone), but after 30 days they want money and I am cheap. So I will be using VirtualBox.

You can download VirtualBox here: VirtualBox

You are also going to need to download the Hortonworks Sandbox: Sandbox

Make sure you download the appropriate version. I picked the VirtualBox install.

hortonworks1.jpg

hortonworks2

Install VirtualBox and open the VM VirtualBox Manager

Go to File > Import Appliance…

hortonworks3.jpg

Add the ova image you downloaded from Hortonworks and hit Next

hortonworks4.jpg

Keep the default settings and hit Import. The import can take anywhere from five minutes to half an hour, depending on the speed of your machine.

hortonworks5.jpg

Once the import is complete, start up Hortonworks. Just double-click on the instance you want. (Note: I have more than one instance loaded on my machine; most people will only have one.)

hortonworks6

Once it finishes loading, your screen should look like this.

Open up a browser and go to http://127.0.0.1:8888/  – refer to your screen for the correct address as it may be different.

hortonworks7.jpg

The following screen should appear.

hortonworks8.jpg

Follow the URL and use the username and password provided (note that yours may differ from my screenshot).

hortonworks9.jpg

You will come to a screen like this. This is the Ambari environment – part of the Hortonworks Sandbox. You may notice some red alert icons on the screen – these are common and generally will not cause you any problems, at least not for what we are doing here.

hortonworks10.jpg

Next, you are going to need to download this CSV file if you want to play along: KMeans2

Now, go to the square box icon in the upper right corner and select HDFS Files (HDFS stands for Hadoop Distributed File System, in case you were wondering).

Hortonworks11

Now go to user

Hortonworks12.jpg

I then select maria_dev since that is who I am logged in as

Hortonworks12.jpg

Hit the Upload button

Hortonworks13

Hit Browse and search for the KMeans2 file

Hortonworks14.jpg

Confirm your Upload

Hortonworks15.jpg

Now your file is uploaded to Hadoop. Notice the permissions column. If you have any Linux experience, this should look familiar. The Hortonworks Sandbox is packaged with CentOS Linux.

Hortonworks16.jpg

You can double-click on the file to get a file preview.

Hortonworks17.jpg

PIG Script

Okay, so we have built a Hadoop Sandbox and uploaded some data. How do we get to it?

One method for accessing data in Hadoop is Pig, a scripting language that serves a similar purpose to SQL. Let’s take a look at some quick examples.

First let’s get to our PIG editor. Go to the square icon on the top bar and select Pig.

hortonworks18.jpg

Select New Script and give it a name. I named mine Pig1

hortonworks19

Below is the first example of our script. I’ll explain it line by line below.

dataLoad = LOAD '/user/maria_dev/KMeans2.csv' USING PigStorage(',') AS (
ID, Model, WO, Labor, Cost, Department);

dump dataLoad;
  • dataLoad – just a variable. This is the variable I will be loading my data into
  • LOAD '/user/maria_dev/KMeans2.csv' – this tells Pig to load the data file we uploaded
  • USING PigStorage(',') – this tells Pig the file is comma separated, so each line is split on commas
  • AS (ID, Model, …etc) – this names the columns (a typed variant is sketched just after these notes)
  • dump dataLoad; – dumps the results of my query to the screen.

Note that every statement needs to end in a semicolon (;).
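By default, Pig treats untyped columns as bytearrays. If you want numeric columns to behave as numbers, you can add types to the schema in the AS clause. Here is a minimal sketch, assuming Labor and Cost are numeric columns in KMeans2.csv (I have not verified the actual column types):

-- same LOAD as above, but with an explicit type for each column (assumed types)
dataLoad = LOAD '/user/maria_dev/KMeans2.csv' USING PigStorage(',') AS (
ID:chararray, Model:chararray, WO:chararray,
Labor:float, Cost:float, Department:chararray);

dump dataLoad;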

Before you hit Execute, note that Pig is not interactive like SQL. Pig executes as a batch, so nothing will appear on the screen until the entire batch has finished processing.

hortonworks20

And we have success…

hortonworks21.jpg

One more bit of code. I will save more in-depth Pig for future lessons. This is really just a quick glimpse of Hadoop and how it works.

In the example below, we use the LIMIT command to limit our results to the first 10 rows.

dataLoad = LOAD '/user/maria_dev/KMeans2.csv' USING PigStorage(',') AS (
ID, Model, WO, Labor, Cost, Department);

dataTop10 = LIMIT dataLoad 10;

dump dataTop10;

hortonworks22.jpg

And here are our results.

hortonworks23.jpg
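As a small preview of those future lessons, LIMIT becomes more useful when combined with ORDER. Below is a hedged sketch (assuming Cost is a numeric column in KMeans2.csv) that returns the ten most expensive records:

dataLoad = LOAD '/user/maria_dev/KMeans2.csv' USING PigStorage(',') AS (
ID, Model, WO, Labor, Cost:float, Department);

-- sort by Cost, highest first, then keep the first 10 rows
byCost = ORDER dataLoad BY Cost DESC;
dataTop10 = LIMIT byCost 10;

dump dataTop10;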
