Power Query: Merge Data Sets

For this tutorial, you will need to have Power Query installed. If you are running Office 2016, Power Query should already be available. For Excel 2010 and 2013, here is a link to the download: Power Query

Here is a link to the practice file for this Lesson: Data Cleaning Power Query

If you open the Excel file, you will see 3 sheets with 3 tables (Work Order, Vehicles, Prices). We want to populate a single sheet combining the three tables into one data set.

datacleanPowerQuery

Open up a New Excel Workbook (A new file. Do not try working from the practice file). From the Ribbon bar select Power Query > From File > From Excel

datacleanpowerquery1

Select the practice file: dataCleaningPQuery.xlsx and when the Navigator pops up, select Work Orders

datacleanpowerquery2

Select Load. Once the work order loads, repeat the process, this time loading Prices.

**Note if the Query Editor window pops up, just click Close and Load. We will be working in the Query Editor window later. 

datacleanpowerquery4

Now go to the sheet with the Work Order table on it. Power Query > Merge

datacleanpowerQuery6

A) The top drop down box should already be populated. If not, select Work Orders for top and then select Prices for the bottom.

B) Now highlight the Service Columns in both tables.

C) Leave the Join Kind at Left Outer.

A Left Outer Join works as seen below. The left table is displayed in full and the the right table adds data based on columns that match up against data in the left table.

datacleanPquery

Note the text below “The selection has matched 29 out of the first 29 rows.” This is because if you look at the two columns we selected, they both have matching text. It is through this matching that we are able to line up the two tables into one.

Click OK and the Query Editor window will pop up.

Select the dual arrow icon in the NewColumn header and deselect Service from the list.

datacleanpowerquery8

Right click the NewColumn and Rename it Price. Click on it and hold down the mouse button. Now drag it so that price sits in between Service and Mech.

Hit Close & Load and your new merged table should look like this:

datacleanpowerquery9

Ok, now to load the Vehicle sheet.

Power Query > From File > From Excel. Select excel practice file and select Vehicles. DO NOT HIT LOAD this time. Instead, select Edit.

datacleanpowerquery10

If you look at the original data set, the Lic Plate column in the Work Order sheet and Licence Plate in Vehicles are the two column we need to match up. Unfortunately, they currently do not match. In the second table, the licence plates are preceded by the letters Lic. We need to remove this.

Right click on the Licence Plate> Replace Values

datacleanpowerquery11

Type Lic  into Value to  Find and leave Replace with blank. **Note — the Value to Find is “Lic ” with a space after it. Make sure you add the space.

datacleanpowerquery12

Click OK, the Licence Plate column will remove the Lic. Click Close & Load

Now it is time to merge. Go back to the Work Order sheet and click Merge

Set the top table to Work Orders. Set the bottom to Vehicles. Highlight Lic Plates and Licence Plate columns. Set Join to Left Outer Join and click okay.

datacleanpowerquery13

Click the double arrows on the new column and un-check column 1.

Rename the new columns Make and Model. Highlight both column and move them in between Lic Plate and Service. Click Close&Load

datacleanpowerquery15

Congratulations. You have now successfully merged 3 tables into 1 using Power Query. Now you can go forth and analyze the data.

 

 

 

 

Advertisements

SQL: Getting Full Days Worth of Data

From a real work example:

I get a weekly data file of patient alarm data. This data is collected continuously throughout the day. However, the data must be exported manually and is usually done in the middle of the day. So the first and last data samples in my set could be taken at 14:41 one week apart. Since I only want to have full days worth of day, I like to truncate the data at midnight – deleting any partial days from the data set.

Instead of having to manually look at the data and determining when to cut off my data, I wrote the following simple script.

SQLDate

This script takes advantage of SQL’s variable functionality. Notice you need to declare your variables ahead of time. One thing to keep in mind is that variables only exist within their scripting block. Meaning once you are done executing this block of code, the variables are dropped from memory and no longer available to further code blocks.

I set the variables value using an inline Select statement. The nested function in the Select statement reads like this:

 convert(date, (dateadd(dd,1,min(dateField))))

It is actually a combination of the following functions – by order of operation:

Dateadd()

dateadd(dd,1,min(dateField)) = add 1 day(dd) to the minimum date (min(dateField)) in the table ** the dateadd function always starts at 12:00 AM when adding or subtracting a day

Convert()

Convert the value from Dateadd() above to the Date format.

Finally:

Delete from Table1
where dateField not between @start and @end
go

This final block of coat deletes every record in NOT in between the dates represented by the @Start and @End variables

SQL Code

declare @start date
declare @end date

set @start = (select convert(date,(dateadd(dd,1,min(dateField)))) from Table1)
set @end = (select convert(date,(dateadd(dd,0,max(dateField)))) from Table1)

delete from Table1
where date not between @start and @end
go

 

 

 

 

Automating Data Preparation with SQL Server

Download the Excel file here: Data Cleaning with SQL
Download the SQL file here: Data Cleaning with SQL
**Zip Folder contains the SQL file as both an SQL file and Txt for easy
viewing by people who don't have SQL Server installed 

Data preparation and cleaning is not a lot of fun. Data preparation is especially irritating when you have to repeat the task over and over again. While no one has come up with a data cleaning magic wand that I am aware of, there are some ways to automate the dreaded process.

While there are many software offerings out there that can help you with data cleaning, some are very expensive and have a high learning curve. So in this article I am going to focus on SQL Server. I chose SQL Server for the following reasons:

  • SQL programming is vital skill to have if you want to work in data
  • SQL Server readily available. Many  companies have it. If you don’t have access to a current SQL Server, you can download the Express version for free.

The Problem

You work for a university and scheduling software is sorely out of date. The reporting capabilities are limited to 4 pre-canned reports. There is no money for an upgrade, but your boss is asking you for some reports that are not available from the system.

So you go to extract some data from the outdated software, but the program only lets you extract the following 4 columns (Start Date, Class, Professor, Classroom).

Let’s look at the data

Here is the data sample you pulled.

sqlcleaning

To effectively create the reports your boss is asking for, you need to add a few more data points.

sqlclean1

Now, while these columns can be added manually, keep in mind a class schedule at a even a small university is at least 100 classes. And when you consider that you will have to repeat this process for every new semester, finding a way to automate this process will save you a lot of time in the end.

Step One – Get the data into SQL Server

First, let’s create a new database to work with.

Open up SQL Server Management Studio >  Object Explorer > right click  > Databases > New Database

sqlclean2

In the New Database window, give your new database the name University and click OK

sqlclean3

Right click on your new database > Tasks > Import Data

sqlclean4

On the Welcome to SQL Server Data Import  and Export Wizard click Next

On Choose a Data Source, select Microsoft Excel from the drop down. Then browse to your downloaded Excel file: dataCleanWSQL.xlsx

sqlclean5.jpg

Select SQL Server Native Client from the drop down. If your are working on the same machine as your SQL Server instance, you can just put a period “.” in Server Name. Otherwise you would need the server name. Database should be pre-filled with University.

sqlclean6

Select Copy data from one or more tables or views and click Next

sqlclean7

Take note of Destination name ([dbo].[Sheet1$]) and click Next

sqlclean8

Click Next and Finish. You will watch the process. Note how many rows transferred. This should match up to your source data.

sqlclean9

Back in the Object Browser go to University> Tables > dbo.Sheet1$. Right click and Select Top 1000 Rows.

sqlclean10

Here is the output, showing the contents of your imported data

sqlclean11

Step 2- Write the SQL Script

Load the SQL script by going to File>Open>File and choose DataCleanwSQL.SQL

While this is not an SQL tutorial, I will give a brief overview of the script file

**note GO in SQL indicates a complete block of code. This code will be completed before the script continues one to the next block of code. While it is not always needed, it is good coding practice

The first block of code: use University, simply tells SQL Server which database we are working with.

The next block: alter table, lets you make changes to an existing time. In this case, we are adding 3 column for out new data fields to go in (day, department, building).

sqlclean12

The next block is an update block. In this block we set the value of our new column day by datename(dw, [start date]). This converts the datetime stamp from [start date] to a weekday name.

The next block is a Case Statement. This works like a nested If – Then -Else statement in many languages. In this block we are stating When the value of Class column is like ‘Calc%’ ( % is a wildcard in SQL) Then set the value of the department column to ‘Math’

sqlclean13

The next code block is another Case statement. This one is using a between statement though, stating that When the Classroom column value is between 100 and 199 then set the value of Building to A

Finally
Select *
from dbo.sheet1$
displays the contents of the table.

sqlclean14

Now we have our data the way we want it. You connect you BI tool to SQL Server to work with the data, you can export it as a CSV file, or you can just copy an paste it into Excel for further analysis.

sqlclean154

Step 3: Repeat as needed

The great advantage the method above is that now we have a script we can run again over and over. So the next semester, when you download the new class listing, you don’t have to spend a lot of time on data prep. You simply have to import it into SQL Server and run your script.