SSIS Lesson 3 – Derived Column

SSIS Lesson 3 – Derived Column

In this lesson, we are going to take the first basic step towards building a data warehouse. We are going to create a historical table from the table we built in Lesson 2. If you didn’t do Lesson 2, you can find the SQL needed to build the table at the bottom of this lesson.

If you remember the table we built in Lesson 2, we had a list of students and the work teams they were assigned to. In lesson 2, this list was designed to change every week, and we only wanted to know what the current week data was.  However, now Principal Skinner has asked if we can keep a listing of all the weeks so he can go back and see how many times someone was assigned to one of the work teams.

The table below is our starting table.

l3

What we want to do is copy the table’s contents into a new historical table and mark it somehow so we know what week each kid was on each team. We are going to do this by creating a new column called INSERT_DT which will record the date the data was added to the historical table.

First things first, let’s build our historical table in SQL Server

CREATE TABLE [dbo].[STUDENT_TEAM_HIST](
       [STUDENT_NM] [varchar](50) NULL,
       [STUDENT_AGE] [int] NULL,
       [STUDENT_TEAM] [varchar](50) NULL,
       [INSERT_DT] [datetime] NULL
)

Now go into SSIS and go to your Training Project

Create a new Package, I called mine Lesson_3

2018-03-16_14-18-13.png

Drag a Data Flow object onto your design window and click on it

2018-03-16_15-03-23.png

Now drag an OLE DB Source box over and click on it

2018-03-16_14-22-51.png

Select dbo.Student_Name from the drop down and click OK

2018-03-16_14-22-36.png

Now we are going to use a new object – Derived Column, drag it over and connect the Blue arrow from the bottom of the OLE DB Source box to it

Click on Derived Column: Purple Arrow (far left) – Name your new column, I named mine INSERT_DT

Green Arrow (Middle) : in the Expression column, type GETDATE()

Yellow Arrow (far right) : once you tab out of the Expression column, you will see your Data Type changed itself to be a timestamp

Red Arrow (Bottom) : hit Okay

2018-03-16_14-25-18

What we just did was create a new column called INSERT_DT which will use the GETDATE() method to get the current date and time. Next we are going to use this column we just created.

Grab and OLE DB Destination and drag it over. Connect it with the outbound arrow from the Derived Column box

2018-03-16_14-41-01

Select the new dbo.STUDENT_TEAM_HIST table we created and next click on Mappings

2018-03-16_14-41-31

You will see you now have four columns feeding into your new destination table. If you go back and look at your source table, you will see you still only have 3 columns there. The fourth column is the one we just derived.

Hit okay, now we are ready to run the package

2018-03-16_14-42-16.png

Right click on Package and hit Execute Package

2018-03-16_14-42-41.png

It should run. Notice the little comments telling you how many rows moved.

2018-03-16_14-43-16.png

Go to SQL Server and check your new table. You will see the INSERT_DT

2018-03-16_14-43-59

Okay, now to see the historical data in action

Let’s change Micah’s team from Red to Blue in our STUDENT_TEAM table

update [dbo].[STUDENT_TEAM]
set STUDENT_TEAM = 'Blue'
where STUDENT_NM = 'Micah'

select *
from dbo.STUDENT_TEAM

2018-03-16_14-45-27.png

Go back to SSIS, hit the red box at the top to end debugging mode

2018-03-16_15-16-29.png

Right click and execute the package again

Now run your query again. Note the two different teams Micah was on are listed. Along with the timestamps so you can tell when he was on which team.

2018-03-16_14-46-29

SQL Query to build table from lesson 2

CREATE TABLE [dbo].[STUDENT_TEAM](
       [STUDENT_NM] [varchar](50) NULL,
       [STUDENT_AGE] [int] NULL,
       [STUDENT_TEAM] [varchar](50) NULL
)

insert into dbo.STUDENT_TEAM
       (STUDENT_NM, STUDENT_AGE, STUDENT_TEAM)
VALUES
       ('Bob', 15, 'Red'),
       ('Claire', 17, 'Blue'),
       ('Chris', 17, 'Blue'),
       ('Candice', 14, 'Red'),
       ('Holly', 16, 'Blue'),
       ('Micah',15,'Red'),
       ('Stephanie', 16, 'Blue'),
       ('Joshua', 14, 'Red'),
       ('Sherrie', 18, 'Blue')

SSIS: Lesson 2 Import data from CSV into database

SSIS: Lesson 2

Import data from CSV into database

In the first lesson we exported data from a table in our SQL Server database to a CSV file. Today we are going to do just the opposite, we are going to bring data from a CSV file into our database.

I like to try to provide close to real world scenarios when I make my tutorials. Obviously in the simpler intro lessons, that can be difficult. That being said, for this lesson, imagine you manage a database for a school. Every week 9 students are selected to be on either the Red Team (Hall Safety Monitors) or the Blue Team (Lunch Room Aids). Every Friday, the Vice Principal Smith picks the student names at random and puts them into a CSV file that is emailed out to all the teachers. You point out that if you could just get the names into your database, you could put the information on the school’s main Intranet page, so the teachers don’t need to download and open the same CSV file 300 times a week.

So our task is to upload the new weekly CSV file into the database every Friday at the end of the school day.

Here is this week’s CSV file:

Lesson_2

First, before heading into SSIS, we need to create a destination table in our database for the CSV file. Here is the file we want to import into our database, you’ll note it has three columns, two columns are strings and one is an integer:

2018-03-12_12-28-28

So go into SSMS (SQL Server Management Studio) and run the following query to create the table:

CREATE TABLE STUDENT_TEAM (
 STUDENT_NM varchar(50),
 STUDENT_AGE int,
 STUDENT_TEAM varchar(50))
GO

This code will create an empty table with 3 columns named STUDENT_NM, STUDENT_AGE, STUDENT_TEAM.

Now you can minimize SSMS and open up Visual Studios Data Tools

Once open, go into your Training Project and create a new SSIS Package. I’m naming this one Lesson_2

2018-03-12_12-22-19

Now grab an Execute SQL Task from the SSIS Toolbox and drag it to the Design window.

Notice the red arrow at the bottom, there is already a Connection set up for the Sandbox database. This is because when we made this connection in Lesson_1, we made it by clicking on the Connection Managers in the Solution Explorer window.

2018-03-12_12-24-46.png

1

This creates a (Project) connection – one that can be seen in every package. If you only want your connection to exist inside a single package (this becomes important as the number of packages you have grows, otherwise you’d have thousands of connections in Solution Explorer) – you can right click anywhere inside the Connection Managers box at bottom of the design window.

2018-03-12_12-32-41.png

But this particular package will work fine just using the existing connection, we can move on without creating a new one.

Double Click on your Execute SQL Task icon in the design window.

In the window that pops up, you can rename your Task, provide a description of the task if you would like. This does come in handy when you have to go back to old package. It makes it easier to understand what is going on.

Down at the green arrow in the picture, you’ll have to selection a Connection from the drop down. Luckily were currently only have one to choose from.

2018-03-12_12-38-43

Next go to SQL Statement and click on the little ‘…’ box that appears in the right corner.

2018-03-12_12-41-59

In the pop up window, this is where you enter your SQL Query. In this case were going to delete all data from table in this box. The reason for this step is so that last week’s students won’t still be in the table on Monday morning. We only want the current students in our database.

2018-03-12_14-18-53

Now click Okay on this window and click OKAY again to get back to the Design Window

Next drag over a Data Flow Task. Connect the Create Table task by dragging the arrow over from the Create Table task. Note the direction of the arrow, this is the order in which tasks will run.

2018-03-12_12-51-28

Now open Data Flow Task. In the SSIS Toolbox, go to Other Sources and drag a Flat File Source over to the Design Window.

Double click on your new Flat File Source box and a new window will open. Click on New… next to Flat File Connection Manager

2018-03-12_12-59-37.png

On the new window, click Browse

2018-03-12_13-01-20

Browse to your CSV File. I created a folder called SSIS Training to make it easy to store all my files.

2018-03-12_13-02-21

If you click on Columns in the left window, you can see what the CSV file contains.

2018-03-12_13-09-10

Finally, before existing this window, Click back on General and make sure Column names in the first data row is checked.

2018-03-12_13-10-32

Now go back to the Design window and drag an OLE DB Destination box onto the design window. Connect the blue arrow from the Flat File Source to the OLE DB Destination

2018-03-12_13-13-52.png

Now double click on OLE DB Destination

Select STUDENT_TEAM from Name of table or the view

2018-03-12_14-20-06_1

Now Click on Mappings

2018-03-12_14-20-06.png

Everything should line up nicely since we named the table columns to match the CSV file.

2018-03-12_14-26-33.png

Now click okay to go back to the Design Window, Click on Control Flow in the upper left to get back to the main page of the package

2018-03-12_14-27-44.png

Right click the package and click Execute Package

2018-03-12_14-28-06.png

If all goes well, you should end up with two green checks

2018-03-12_14-33-37

If this happens, check to make sure your CSV file is not currently open, that can cause errors. If so, close the file and try executing the package again

2018-03-12_14-31-00

Finally, check the table in your SQL Server to see if it populated

2018-03-12_14-35-30.png

If you want to see how it would work the next week, go into the CSV file and change the data. When you run the package again, the names in the SQL table will change too.

 

SSIS: Lesson 1 – Export to CSV

SSIS Tutorial Lesson 1

Exporting Data to a CSV file

One of the main purposes of SSIS is moving data from one location to another with ease. In this lesson we are going to export data from our SQL Server database to a CSV file.

If you want to follow along, you can use the following code to create the data table I will be working with.  I loaded this into a database I created for practice called Sandbox.

CREATE TABLE employee_id (        
             emp_nm nvarchar(30) not null, 
            emp_id nvarchar(8),
             b_emp_id nvarchar(8)  
     PRIMARY KEY(emp_nm) );
INSERT INTO employee_id       
       (emp_nm, emp_id)
VALUES       
       ('Bob', 'A1234567'),
       ('Lisa', 'A1234568')
INSERT INTO employee_id       
        (emp_nm, b_emp_id)
VALUES       
        ('Priyanka', 'B1234567');

We will start by opening the Training_Project we created in the intro lesson (SSIS Tutorial: Introduction )and creating a new package.

2018-03-08_10-00-21

I renamed my new package Lesson_1

2018-03-08_10-01-27

Now we need to make a  new connection so that SSIS knows what our data source will be communicating with. To do so, go to Solution Explorer, right click on Connection Manager and select New Connection Manager

2018-03-08_10-06-41

Since we will be connecting to a SQL Server, select OLE DB from the list below

2018-03-08_10-07-17

Since there are no existing connections to pick from, choose New

2018-03-08_10-07-35

Okay, starting from the top select Native OLE DB\SQL Server Native Client 11.0  (note you might have 10.0 – that will work as well)

Since my SQL Server is locally installed on my machine, I am using Localhost as Server Name, otherwise provide the server name here.

Select your database from the drop down, again my database is Sandbox

Finally, hit Test Connection, you should get a connection successful message box.

Click Okay

2018-03-08_10-10-11.png

You’ll see your new connection in the box, now click Okay

2018-03-08_10-10-55.png

Now at the bottom of your Design Window, you’ll see your new connection in the box labeled Connection Managers

2018-03-08_10-11-15

So next, we need to go to the SSIS Toolbox and drag a Data Flow Task over to the designer

2018-03-08_10-14-02

Once in the designer, click on the Data Flow Task box, this will bring you to the Data Flow window

2018-03-08_10-14-21

Data Flow is used whenever you need to move data between disparate systems. Since we will be moving data from SQL Server to a CSV file, we need to use a Data Flow Task

You should note that the SSIS Toolbox has changed, offering up new Data Flow related tools.

Scroll down to Other Sources and drag OLE DB Source to the design box

2018-03-08_10-14-51

Double click on the new object

2018-03-08_10-15-15

Make sure your Connection manager you just created in the top drop down. Leave Data access mode at Table or view and select dbo.employee_id as your table

2018-03-08_10-15-52

If you click Preview in the bottom left, you will get a pop up of the data in the table

2018-03-08_10-16-08

If you click Columns in the upper left, you will see the columns that will be exported from the table. You can change the Output Column names if you want to use a different name in your output file.

We will skip over Error Output for now, as that is easily an entire lesson all its own.

2018-03-08_10-16-28

Now go back to the SSIS toolbox and under Other Destinations, click on Flat File Destination and drag it over to the design window.

2018-03-08_10-16-54.png

Drag the blue arrow from the OLE DB source box to the Flat File Destination Box

2018-03-08_10-17-18

It should look like this when done

2018-03-08_10-17-39

Now click on Flat File Destination

Since we don’t have a current Flat File Connection, we will need to click on New to create one.

2018-03-08_10-18-04

Select Delimited and click OK

2018-03-08_10-18-42.png

Find a folder you want the file to end up in. Select CSV files from the bottom right drop down, and name your file.  Click OK  (Note, use a name of a file that does not currently exist. This will create the file)

2018-03-08_10-22-56

Check the box: Column names in first data row

2018-03-08_10-23-41.png

If you click on Columns in the upper left, you will see the names of your header columns.

Click Okay to go back to the Flat File Destination Window

2018-03-08_10-24-14.png

If you click on Mappings, you will see you have 3 columns from your source going to three columns in what will be the new CSV file.  In future lessons I will show how you can use this to match up different named columns.

Click Okay to return to the design window

2018-03-08_10-24-35

Go to Solution Explorer, right click on the package and click Execute Package

2018-03-08_10-25-11.png

You should get green check marks to indicate success. This is a very small package, so they probably turned green instantly. In larger jobs, the green check will first be a yellow circle to indicate progress. It turns green when that step is complete.

Note on the connecting line between Source and Destination that you get a read out of how many rows were processed.

2018-03-08_10-25-34

Go to the folder you chose as your destination, you will see your new CSV file there

2018-03-08_10-26-36.png

Open the file, you will see your data has been exported into your CSV file

2018-03-08_10-27-35.png

SQL: Check if table exists

To check if a table exists in SQL Server, you can use the INFORMATION_SCHEMA.TABLES table.

Running the following code, produces the results below:

USE SANDBOX 
GO;

Select *
from INFORMATION_SCHEMA.TABLES
GO;

1

You can use this table with an IF THEN clause do determine how your query responds whether or not a table exists.

IF EXISTS (SELECT * FROM INFORMATION_SCHEMA.TABLES
           WHERE TABLE_NAME = N'employee_id')
BEGIN
  PRINT 'Yes'
END

2

IF EXISTS (SELECT * FROM INFORMATION_SCHEMA.TABLES
           WHERE TABLE_NAME = N'employee_ids')
BEGIN
  PRINT 'Yes'
END

ELSE

BEGIN
  PRINT 'No'
End

3

One of the more common uses I find for this when I need to create a table in a script. I want to make sure a table with same name doesn’t already exist, or my query will fail. So I write a query like the one below.

IF EXISTS (SELECT * FROM INFORMATION_SCHEMA.TABLES
           WHERE TABLE_NAME = N'employee_id')
BEGIN
  drop table employee_id
END

ELSE

BEGIN
 CREATE TABLE employee_id (
        emp_nm nvarchar(30) not null,
             emp_id nvarchar(8),
             b_emp_id nvarchar(8)
             PRIMARY KEY(emp_nm) );

INSERT INTO employee_id
       (emp_nm, emp_id)
VALUES
       ('Bob', 'A1234567'),
       ('Lisa', 'A1234568')

INSERT INTO employee_id
       (emp_nm, b_emp_id)
VALUES
       ('Priyanka', 'B1234567');
End

SQL: Coalesce

Coalesce is a simple but useful SQL function I use quite often. It returns the first non-null value in a list of values.

A common real world application for this function is when you are trying to join data from multiple sources. If you work for a company known for acquiring other companies, your HR tables are bound to be full of all kinds of inconsistencies.

In this example below, you will see a snip-it of an HR table created by merging tables from two companies. The first two employees have an emp_id and the last one has a b_emp_id. The last one, Priyanka, is an employee from a newly acquired company, so her emp_id is from the HR table of the newly acquired company.

c1

So if you just want to merge the 2 emp_id columns into one, you can use the coalesce() function.

The syntax for coalesce is: coalesce (col1,col2,….)  as alias

Coalesce takes the first non-null. In the example below, we check emp_id first, if it has a value, we use that. If emp_id is null, we look for the value in b_emp_id.

select emp_nm,
coalesce (emp_id, b_emp_id) as emp_id
from employee_id

c2

If you want to try this for yourself, here is the code to build the table out for you.

CREATE TABLE employee_id (
        emp_nm nvarchar(30) not null,
             emp_id nvarchar(8),
             b_emp_id nvarchar(8)
             PRIMARY KEY(emp_nm) );
INSERT INTO employee_id
       (emp_nm, emp_id)
VALUES
       ('Bob', 'A1234567'),
       ('Lisa', 'A1234568')
INSERT INTO employee_id
       (emp_nm, b_emp_id)
VALUES
       ('Priyanka', 'B1234567');

 

SSIS Tutorial: Introduction

SSIS: Introduction

SQL Server Integration Services is the ETL tool for the Microsoft SQL Server platform. SSIS allows you to take data from various sources (from Excel files, to text files, to other databases, etc), and bring it all together.

If you are new to concept of ETL, SSIS is great place to start. Click here to learn about ETL

If you are well versed in another ETL platform, SSIS is a relatively easy system to get up to speed on.

SSIS comes as part of SQL Server Data Tools, which you should be able to install with your SQL Server installation software. You will need SQL Server Standard, Developer or above editions to run SQL Server Data Tools. SQL Server Express does not support Data Tools.

For some unknown reason, once it is installed, you will not find a program called SSIS. Maybe the engineers at Microsoft think this is funny, but in order to run SSIS you will need to look for the following program instead.

2018-02-26_14-26-37

I’m using 2015, but for most of what I am doing here, any version should be compatible.

When you launch data tools, you will notice it run in Visual Studios

2018-03-06_10-37-28

To start an SSIS job, you will either need to open an existing project, or create a new one. In this example, I will create a new one.

File ->New -> Project  (or Ctrl+Shift+N)
2018-03-06_10-37-54

Inside the Business Intelligence Templates, select Integration Services. I always just select Integrations Service Project, I’m not a big fan of the Wizard

2018-03-06_10-38-37

Next step: Name your project

2018-03-06_10-40-08

Now you are in SSIS. Here are the 3 main windows you will be starting with.

From right to left:

SSIS Toolbox

2018-03-06_10-43-40

Package Designer

2018-03-06_10-44-00

Solution Explorer

2018-03-06_10-44-19

Packages

Inside solutions, packages are the collections of jobs or scripts found inside a project. It is inside the package that you will build out your ETL job.

For our first lesson, we are just going to build a simple package. Your new solution should have opened with a new package when it opened. If it is, right click on the green arrow to rename it, if not, right click on the red arrow to create a new package.

 

2018-03-06_10-44-19_1

I renamed my package First_Package, you can name your package whatever you choose.

This first package will simply just display a pop up message. In the SSIS Toolbox, go to Script Task and drag it into the package designer window.

2018-03-06_13-09-46

2018-03-06_13-25-47

Double click on the Script Task Box in the Design window

Note in my example, I have C# set as the scripting language. The other default option is Visual Basic. If you are more comfortable with that, feel free to use it. I prefer C# mainly because I spent more time working with it.

You don’t need to know any C# for this tutorial. This is literally a single line of code assignment. I will cover more C# in the future.

Click Edit Script… to continue

2018-03-06_13-26-45

Note, this step can take a minute or so for the script editor to appear. Don’t panic if your computer appears locked up.

Once the script editor opens, don’t panic by all the code you see. Luckily Microsoft has done most of the ground work for us. We only need to scroll down until you see:

public void Main()

Now place your cursor below //TODO: Add your code here

The code you need to type for this script is:

MessageBox.Show(“This is my first SSIS Package”);

2018-03-06_15-12-02

Now I know you will be looking for a save button. But again, our friends at Microsoft might have been drinking when they coded this. Instead, just click the upper right X to close out the whole window –

I know – why would they do it that way? how much effort would a save and close button have cost them? I don’t know. It just is what it is. Just click the X and move on with your life.

2018-03-06_15-13-33

Now click the OK button – again I guess Save was too much to type

2018-03-06_15-14-08

Now right click on your package (green arrow) and click Execute Package

2018-03-06_10-44-19_1

Your message will pop up in a Message box window

2018-03-06_15-17-49

Click OK on the messagebox and Click the red square to end the package execution.

2018-03-06_15-18-09

Congrats, you have just built and executed your first SSIS Package.

 

XML Parsing: Advanced SQL

If you want to play along with the lesson, use the following code to create the table I will be using:

CREATE TABLE employee_lang (
        emp_nm nvarchar(30) not null,
             lang nvarchar(255),
             PRIMARY KEY(emp_nm) );
INSERT INTO employee_lang
       (emp_nm, lang)
VALUES
       ('Bob', 'Python, R, Java'),
       ('Lisa', 'R, Java, Ruby, JavaScript'),
       ('Priyanka', 'SQL, Python');

 

XML Parsing

XML parsing is a SQL method for separating string data found in a single field in a table. Look at the table below:

1

This table has two columns, emp_nm (employee name), lang (programming languages the employee is proficient in). Notice that the column lang has multiple values for each record. While this is easily human readable, if you want it to more machine usable (think Pivot tables or R statistical analysis), you are going to want your data to look more this this:

2

Notice now the table has the same two columns, but the lang column now only has 1 word per record. So the question is, how do you do this using SQL?

XML parsing

The code used to “parse” out the data in the lang column is below:

SELECT emp_nm
,SUBSTRING(LTRIM(RTRIM(m.n.value('.[1]','varchar(8000)'))),1,75) AS lang
FROM
(
SELECT
emp_nm
,CAST('<XMLRoot><RowData>' + REPLACE([lang],',','</RowData><RowData>') + '</RowData></XMLRoot>' AS XML) AS x
FROM employee_lang
)t
CROSS APPLY x.nodes('/XMLRoot/RowData')m(n)

Let’s break it down a little first.

SELECT
emp_nm
,CAST('<XMLRoot><RowData>' + REPLACE([lang],',','</RowData><RowData>') + '</RowData></XMLRoot>' AS XML) AS x
FROM  employee_lang

What we are doing with the code about is using a CAST and REPLACE functions to convert the elements in column lang into a XML line.  See results below:

3

For Bob, this is the result of the CAST/REPLACE code on the lang column

<XMLRoot><RowData>Python</RowData><RowData> R</RowData><RowData> Java</RowData></XMLRoot>

This is how the code works from the inside out.

REPLACE()

select REPLACE (‘SQL ROCKS’, ‘S’, ‘!’)

If you run the above code, it will return !QL ROCK!

Replace is saying — everywhere an S is in the string, replace it with a !

CAST()

The cast function is casting the string as an XML data point. This is needed for the next section, when we unpack the XLM string

Cross Apply / Value

SELECT emp_nm
,m.n.value('.[1]','varchar(8000)')
FROM
(
SELECT
emp_nm
,CAST('<XMLRoot><RowData>' + REPLACE([lang],',','</RowData><RowData>') + '</RowData></XMLRoot>' AS XML) AS x
FROM employee_lang
)t
CROSS APPLY x.nodes('/XMLRoot/RowData')m(n)

So without going into way too much detail, you can find pages dedicated to Cross Apply and Value(), I’ll give you the quick breakdown.

First notice we aliased our CAST() statement as x.  So if you look at the CROSS APPLY, you will see we are asking to look at x.nodes. Had we aliased our cast y, we would be looking at y.nodes.

Now look at m(n) at the end of the line. This is like an array or list in programming languages. Keep that in mind for the next step.

Inside the x.nodes() is ‘/XMLRoot/RowData’ , this is telling us to assign to m everything following /XMLRoot and to iterate n by everything following /RowData, so for Bob:

m(n=1) = Python

m(n=2) = R

m(n=3) = Java

Now we pass that array m(n) to our Value() method.  Hence m.n.value().  Note m and n were just letters I picked, you can use others.

Inside m.n.value(‘.[1]’,’varchar(8000)’) was used as it should pretty much cover any size string you may have to deal with.

So the final iteration simply add as SUBSTRING to clean it up and get rid of white space

4

ETL : Extract, Transform, Load

It you ever actually intend to work with data in the real world, you will quickly come face to face with two truths:

  1. Data is messy
  2. The data you need is never in one place

While creating predictive models and showing off your brilliant math skills is infinitely more sexy, before you even begin to build a simple linear regression until you have all your data together and in the correct format.

If you haven’t heard it before, it is a pretty common estimate that a data scientist spends 60 to 70 percent of their time working as a data “janitor”. While on a smaller scale, most of this data manipulation and merging (often referred to as munging) can be done manually, in a professional environment, you are going to want to find a way to try to automate as much of the process as possible.

ETL Tools

This is where ETL Tools come into play. Some of the more popular ones on the market are:

  • Oracle Warehouse Builder (OWB)
  • SAS Data Management
  • PowerCenter Informatica
  • SQL Server Integration Services (SSIS)

SSIS is the one I am most familiar with (also the one I use professionally) so it will be the one I will be focusing on in future lessons.

What these ETL Tools do is help you automate the data munging process.

Extract:

When discussing ETL, extract means pulling data from a source (or many sources). Consider this, you have Customer table residing on a SQL Server Database, a Sales Person table sitting on a Teradata Database, and a list of purchases kept on an Excel spreadsheet.

Using an SSIS tool, you can connect to these three different data sources and query the data (Extract) into memory for you to work with.

Transform:

Transform in ETL refers to manipulating the data.

Pulling from the example above, in the SQL Server table you have

Cust_ID         Cust_NM
9081            Bob Smith

The Excel sheet says

Cust_ID         Cust_NM
CN9081          Bob Smith

In the Excel spreadsheet, someone has put a CN in front of the Cust_ID. This is a common problem you will run across when working with real world data.  Using an ETL tool, you can set up steps that will either remove the CN (or add it- your prerogative).

Load:

Finally, once you have extracted the data from the different sources, transformed it – so differences in the data are correct, you come to the final step: loading it into a location of your choice – usually a data warehouse table.

Ensemble Modeling

In the world of analytics,modeling is a general term used to refer to the use of data mining (machine learning) methods to develop predictions. If you want to know what ad a particular user is more likely to click on, or which customers are likely to leave you for a competitor, you develop a predictive model.

There are a lot of models to choose from: Regression, Decision Trees, K Nearest Neighbor, Neural Nets, etc. They all will provide you with a prediction, but some will do better than others depending on the data you are working with. While there are certain tricks and tweaks one can do to improve the accuracy of these models, it never hurts to remember the fact that there is wisdom to be found in the masses.

The Jelly Bean Jar

I am sure everyone has come across some version of this in their life: you are at a fair or school fund raising event and someone has a large see-through jar full of jelly beans (or marbles or nickles). Next to the jar are some slips of paper with the instructions to “Guess the number of jelly beans in the jar you win!”

An interesting thing about this game, and you can try this out for yourself, is that given a reasonable number of participants, more often than not, the average guess of the group will perform better than the best individual guesser. Or in other words, imagine there are 200 jelly beans in the jar and the best guesser (the winner) guesses 215. More often than not, the average of all the guesses might be something like 210 or 190. The group cancels out their over and under guessing, resulting in a better answer than anyone individual.

How Do We Get the Average in Models?

There are countless ways to do it, and researchers are constantly trying new approaches to get that extra 2% improvement over the last model. For ease of understanding though, I am going to focus on 2 very popular methods of ensemble modeling : Random Forests & Boosted Trees.

2016-11-22_22-05-26

Random Forests:

Imagine you have a data set containing 50,000 records. We will start by randomly selecting 1000 records and creating a decision tree from those records. We will then put the records back into the data set and draw another 1000 records, creating another decision tree. The process is repeated over and over again for a predefined number of iterations (each time the data used is returned to the pool where it could possibly be picked again).

After all the sample decision trees have been created (let’s say we created 500 for the sake of argument), the model then takes the mean or average of all the models if you are looking at a regression or the mode of all the models if you are dealing with a classification.

For those unfamiliar with the terminology, a regression model looks for a numeric value as the answer. It could be the selling price of a house, a person’s weight, the price of a stock, etc. While a classification looks for classifying answers: yes or no, large – medium – small, fast or slow, etc.

Boosted Trees:

Another popular method of ensemble modeling is known as boosted trees. In this method, a simple (poor learner) model tree is created – usually 3-5 splits maybe. Then another small tree (3-5 splits) is built by using incorrect predictions for the first tree. This is repeated multiple times (say 50 in this example), building layers of trees, each one getting a little bit better than the one before it. All the layers are combined to make the final predictive model.

Oversimplified?

Now I know this may be an oversimplified explanation, and I will create some tutorials on actually building ensemble models, but sometimes I think just getting a feel for the concept is important.

So are ensemble models always the best? Not necessarily.

One thing you will learn when it comes to modeling is that no one method is the best. Each has their own strengths. The more complex the model, the longer it takes to run, so sometimes you will find speed outweighs desire for the added 2% accuracy bump. The secret is to be familiar with the different models, and to try them out in different scenarios. You will find that choosing the right model can be as much of an art as a science.

The Monty Hall Problem

The Monty Hall problem is an interesting exercise in conditional probability. It focuses on a 1970’s American television show called Let’s Make a Deal hosted by television personality Monty Hall.

The game would end with a contestant being shown 3 doors. Behind one of those doors, there was a prize. Behind the other 2, a goat. Monty Hall would ask the contestant to pick a door. He would then open the one of the two remaining doors showing the contestant a goat. Now with 2 doors remaining, he would ask the contestant if they wanted to change their selection. The question was – should they?

Let’s look at it. In the beginning of the game, we have 3 closed doors. We know 1 of these 3 doors has a prize behind it, the other 2 do not. So the probability of picking the right door at this point is 1/3.

monty.jpg

Let’s say you chose door number 1. Once you chose, Monty Hall opens door number 3 and shows you the prize is not behind that door. He then asks if you would like to choose door number 2 instead.

So what do you think the probability the prize is behind door number 2? You have 2 doors left, so you might think it would 1/2. Well, you would be wrong.

monty1.jpg

The reality is, the probability that the prize is behind door number 2 is actually 2/3.

Don’t believe me? Don’t worry, you aren’t alone. When Marilyn vos Savant wrote about this in her newspaper column, outraged mathematics professors from all over wrote her letters protesting her faulty logic. Her response was simple, play the game and see what results you come up with.

The trick to this neat little math problem is the unique constraints we are faced with. We start with 3 doors, from which we must pick 1. Giving us a the already agreed upon probability of 1/3.

Next, one of the two remaining options is removed the equations leaving only the door you have picked and 1 other door. Common sense would tell you that you now have a 50 – 50 situation where either door gives you the same odds of winning.

But common senses fails you. If you stick to your original choice, you will have a 1/3 probability of winning… but if you change your choice to the remaining door, your probability now shifts to 2/3.

Let’s see it in action.

Below is the layout of the three possible outcomes of the game.

2017-01-12_19-39-35.jpg

For the sake of argument, let us say that we have decided to choose Door 1 to start the game.

2017-01-12_19-41-10.jpg

After we have made that choice, one of the remaining doors with a goat behind it will be revealed.

2017-01-12_19-44-47.jpg

So, now we are left with the option to stick with door 1 or try the remaining door.

If we stick to our original door, we will win 1 out of the three possible games.

2017-01-12_19-47-16.jpg

If, however, you change your door after the reveal, you will win 2 times out of the 3 possible games.

2017-01-12_19-47-38.jpg

 

So what is the take away here?

If you ever find yourself on Let’s Make a Deal, always change your door after the reveal.

More importantly, if you are working with data for a living, make sure you have some healthy respect for the math behind statistics and probability. The Monty Hall Problem is, at its very heart, a conditional probability problem. Issues like this and Simpons’ Paradox can lead you to develop what appears to be logically sound models that are sure to fail once they come face to face with reality.

There is a lot of great software packages out there that make building machine learning models as simple as point and click. But without someone with some healthy skepticism as well as a firm handle on the math, many of these models are sure to fail.