It you ever actually intend to work with data in the real world, you will quickly come face to face with two truths:
- Data is messy
- The data you need is never in one place
While creating predictive models and showing off your brilliant math skills is infinitely more sexy, before you even begin to build a simple linear regression until you have all your data together and in the correct format.
If you haven’t heard it before, it is a pretty common estimate that a data scientist spends 60 to 70 percent of their time working as a data “janitor”. While on a smaller scale, most of this data manipulation and merging (often referred to as munging) can be done manually, in a professional environment, you are going to want to find a way to try to automate as much of the process as possible.
This is where ETL Tools come into play. Some of the more popular ones on the market are:
- Oracle Warehouse Builder (OWB)
- SAS Data Management
- PowerCenter Informatica
- SQL Server Integration Services (SSIS)
SSIS is the one I am most familiar with (also the one I use professionally) so it will be the one I will be focusing on in future lessons.
What these ETL Tools do is help you automate the data munging process.
When discussing ETL, extract means pulling data from a source (or many sources). Consider this, you have Customer table residing on a SQL Server Database, a Sales Person table sitting on a Teradata Database, and a list of purchases kept on an Excel spreadsheet.
Using an SSIS tool, you can connect to these three different data sources and query the data (Extract) into memory for you to work with.
Transform in ETL refers to manipulating the data.
Pulling from the example above, in the SQL Server table you have
9081 Bob Smith
The Excel sheet says
CN9081 Bob Smith
In the Excel spreadsheet, someone has put a CN in front of the Cust_ID. This is a common problem you will run across when working with real world data. Using an ETL tool, you can set up steps that will either remove the CN (or add it- your prerogative).
Finally, once you have extracted the data from the different sources, transformed it – so differences in the data are correct, you come to the final step: loading it into a location of your choice – usually a data warehouse table.