Data Science Interview Questions: Parts of a SQL Query

A common question I have seen good people get tripped up on. Remember KISS – Keep it Simple Stupid.

Don’t let the question trick you into overthinking. For a SQL query you need a SELECT command (what do we want), a FROM (what table is it in) and if you want to go for the extra credit point, throw in a WHERE (its our filter)

Return to questions page

SQL: Create a temporary table

Temporary tables are a great way of working on complex data requests. They are easy to create and they delete themselves after every session, so you do not have to worry about creating a big mess with a bunch of tables you need to go clean up later.

In this tutorial, I am going to use a real world example from my work in Verizon’s Cyber Security Department. This is a simplified version of ask, and I am using completely made up data. There is no data from Verizon on my website every. I simply discuss use cases to make learning analytics more grounded in the real world

Below is a list of dates, PhoneNum: phone numbers called about, and the CallerNum: the number the person is calling from. While there are many legitimate reasons for someone to call customer support from another number (I drop and break my phone so I borrow my co-workers phone and call customer support to request a replacement), a number that calls in repeatedly about many different numbers is a red flag of someone that could be a fraudster.

If you want to play along, you can download the data set here:

I am using MySql in this example as my database, but I will include the code for SQL Server, Teradata, and Oracle platforms as well.

So the ask is find CallerNum that is calling about many different PhoneNum

While I am sure you can make a complex subquery to do this job, but I’m going to show you how to use temporary tables to make this ask very simple:

As you can see above, I loaded the data into a table called dbtest.numberslist

Now to find out how many CallerNum are calling about multiple PhoneNum, a simple solution is to get a list of all distinct combinations of PhoneNum and CallerNum and then do a count of CallerNums from this distinct list. Since the list is distinct, a CallerNum calling in about the same PhoneNum will only appear once, so a CallerNum calling about multiple PhoneNums will appear multiple times.

So using temporary tables, I will create a table that holds the distinct call combinations

MySql (code is create temporary table <table name> then query to fill table

create temporary table distCalls
select distinct phonenum, callerNum from dbtest.numberslist;
Select * from distCalls -- shows what is in the table now

Now, lets see if we can find potential fraud callers, let us do a count of callerNum from the distinct temporary table

As you can see above, there are 4 numbers that have called about 4 distinct phone numbers during this time period. Again, this could be for legitimate reasons, but this is still something we look at when trying to find questionable activity.

MySQL Code

create temporary table distCalls select distinct phonenum, callerNum

from dbtest.numberslist;

SQL Server

Select distinct PhoneNum, CallerNum 
into #distCalls
from dbtest.numberslist
go 

#tableName -- indicated temporary tables in SQL Server

Teradata

Create volatile table distCalls as (
select distinct PhoneNum, CallerNum
from dbtest.numberslist)
with data
on commit preserve rows;

with data and on commit preserve rows are needed at the end if you want any data to be in your table when you go to use it

Oracle

Create private temporary table distCalls as
select distinct PhoneNum, CallerNum
from dbtest.numberslist; 

Remember, temp tables delete themselves after each session (each time you log off the database). If you are working in the same session and need to recreate the temp table for some reason, you can always drop the table just as you would any other table object in SQL.

SQL Server: Importing Excel File to SQL Server

 

Working with data inside a database has many advantages to working with data in an Excel spreadsheet. Luckily SQL Server makes it relatively easy to import data from an Excel File into the database.

We will be using the Excel files below:

JobDataSet

JobDesc

Let’s start by opening SSMS (SQL Server Management Studio)

Next, let’s create a database to hold these files. You don’t need to create a new database to import data, but I am building this tutorial as part of a series on SSRS, so I am building a new database for that purpose.

To create a new Database, right click on Databases on the upper left and click New Database…

2018-04-09_8-55-16.png

Now name your new database, we will just accept the defaults

2018-04-09_8-59-56

Now go your newly created database, right click, and go to Tasks

2018-04-09_9-00-48

From the Tasks sub-menu, select Import Data

2018-04-09_8-57-20.png

The import Wizard will open, simply click Next

2018-04-09_9-01-12.png

Next, select Excel from the drop down

2018-04-09_9-01-41.png

Next, click Browse

2018-04-09_9-02-13

Select your file

2018-04-09_9-05-24.png

Make sure First row has column names is checked and click Next

2018-04-09_9-07-04.png

On the next screen select SQL Server Native Client 11.0 (If you don’t have 11.0 – 10.0 should work)

2018-04-09_9-11-31.png

Make sure the database you want is selected and click next

2018-04-09_9-11-52.png

In this example, we are going to use Copy data from one or more tables or views

2018-04-09_9-12-29.png

Make sure to name the table you want to create in SQL Server (red arrow)

2018-04-09_9-13-23

If you click Preview you can get a look at what the new table will be loaded with

2018-04-09_9-15-34.png

Click Okay on the preview window and click Next on the Import Wizard

Leave the default Run Immediately checked and click next

2018-04-09_9-16-07.png

Review info on the next window and click Finish

2018-04-09_9-16-31

The package will run

Note the blue lettering will let you see how many rows transfer from the Excel file to SQL Server

2018-04-09_9-16-53.png

If you check your database, you will see your tables. (I loaded both spreadsheets in to the database for the upcoming SSRS tutorial)

2018-04-09_9-19-58.png

Finally, run a select * on your new table to see the data you transferred into SQL Server

2018-04-09_9-20-35.png

SQL: User Defined Functions

Functions in SQL work just like functions in most programming languages. If you aren’t familiar with a function, you should know that you are already using them without even knowing it.

Consider this for example:

Select Count(*)
From Table

COUNT() is a function. When you pass it rows from your query, it counts the rows and returns a value in the form of an integer. But COUNT() is a built in function, meaning it came as part of SQL Server, you did not have to create it.

SQL Server allows you the option of creating User Defined Functions, meaning functions you develop yourself. These are handy when you find yourself handling repeated tasks, such as date formatting or string manipulation. Instead of having to repeatedly code a complex command, you can just build a function once and then call on it whenever needed.

Let’s start with a basic example:

2018-04-03_14-23-36.png

Here I created a function called ADD_UP that accepts 2 numbers and outputs the SUM of the two numbers ( yes I know this already available as the built in function SUM(), but I want to start nice and easy)

Lets start by discussing the syntax. The basic syntax for creating a function is as follows:

CREATE FUNCTION name (@var data-type)
RETURNS data-type
AS
BEGIN
   RETURNS (some type of action)
END

In my example we are naming the function ADD_UP and supplying two integer variables: @NUM1 and @NUM2

CREATE FUNCTION ADD_UP (@NUM1 INT, @NUM2 INT)

Then we define the data-type our function will return. In this case, since we are adding 2 integers, our function will return an INT

RETURNS INT

Next we wrap out function in

AS
BEGIN
ENDS

Finally, we perform an action

RETURNS (@NUM1+ @NUM2)

Finally, when you want to call the Function, just use it in a select statement.

(**Note, user defined functions require you to use the schema prefix. Since I just used the default dbo schema, this example uses dbo.ADD_UP)

select dbo.ADD_UP(2,3) as ADDED

and as you see, we get 5 as our answer.

2018-04-03_14-23-36

Now, let’s try something different. Here we are going to work with a date function. In this example I built a function called MNTH that accepts one variable @DT – a date data-type and returns an Integer representing the month of the date passed to it.

Again, all I am really doing is duplicating the built-in function MONTH(), but I wanted to show different data-types

2018-04-03_14-32-54.png

(** getdate() is a built-in function that returns the current date. I ran this SQL on 4/2/2018, so it returns 4 as a result)

Now finally here is an example of how you might use a function in real life. Let’s say you have lots of reports that call for your date to be represented in MM-YYYY format. Instead of having to repeatedly type a complex date formatting, you can build it into a User Defined Function and pass a regular date to the function.

2018-04-03_14-34-36

If you are not familiar with cast(concat(month(@DT),’-‘,year(@DT))as varchar(8))) statement, I’ll break it down here:

Let’s go from the inside out:

concat is a string function meaning to concatenate or “string together” – so

concat(month(@DT),’-‘,year(@DT))

concat(4, ‘-‘, 2018)

4-2018

Cast allows us to convert the output of the concat statement into a string (varchar) data-type

cast(4-2018 as varchar(8)) = ‘4-2018’

Finally, if you want to find your functions after you create them, they are located under your database -> Programmability -> Functions

In this case, I only built Scalar-valued Functions, I’ll cover the other types in future lessons.

2018-04-03_14-39-21.png

 

SQL Server: Linked Servers

SQL Server Linked Servers feature allows you to query data from other databases (even non SQL Server databases) from within SQL Server.

While I believe large data transfer jobs are best handled using an ETL solution like SSIS, sometimes all you want is a just a couple hundred rows of data from another database and setting up an ETL jobs might be overkill.

In this example, I am going to set up a link to a Teradata Data Warehouse.  This will allow me to query data from the warehouse and use it in my SQL Server environment.

(Note that you can connect to any database, such as another SQL Server or Oracle. I am just using Teradata as an example)

First go into SSMS and connect to your SQL Server

Now, in Object Explorer go to Server Objects -> Linked Servers

2018-03-29_12-44-42

Right click on Linked Servers and select New Linked Server…

2018-03-29_12-45-49

For a Teradata connection, I am choosing Microsoft OLE DB Provider for ODBC Drivers

(on a side note, you will need to have the ODBC driver for Teradata installed for this to work)

2018-03-29_12-46-25

Linked Server = just whatever name you want to give your linked server – I chose EDW (short for Enterprise Data Warehouse)

Product Name = again free text, I just wrote Teradata to let people know what kind of product you are connecting too.

Data Souce: This is the server name you are trying to connect to

2018-03-29_12-48-05

Next, click Security tab.

I want to use my own credentials to connect to Teradata so I clicked Be Made Using This Security Context: ( the green arrow)

Type in your Username and password. Click Ok

2018-03-29_12-48-34

Now go to create a new query in SSMS (SQL Server Management Studio)

We will be using a function called OPENQUERY. The syntax is as follows

Select * from OPENQUERY(LINKEDSERVER, QUERY)

2018-03-29_12-50-47

SELECT * FROM OPENQUERY(EDW, 'SELECT USER_NM, LAST_NM, FIRST_NM FROM EMPLOYEE_TABLE')

We can use the data returned from Teradata in SQL Server by simply putting it in a table or a temp table as shown below

SELECT * into ##TempTable FROM OPENQUERY(EDW, 'SELECT USER_NM, LAST_NM, FIRST_NM FROM EMPLOYEE_TABLE')

We can also pass variables to our Teradata query. My preferred method is simply using a stored procedure.

2018-03-29_12-54-32

CREATE PROCEDURE [LINKED_SERVER_QUERY]
       @USER VARCHAR(8) = NULL
AS

declare @query varchar(8000) = 'Select * from OPENQUERY(EDW, ''SELECT USER_NM, LAST_NM, FIRST_NM FROM EMPLOYEE_TABLE WHERE USER_NM =  ''''' + @USER + ''''''')' 
exec(@query)

GO

exec [LINKED_SERVER_QUERY] @USER = 'blarson'

SQL: Coalesce

Coalesce is a simple but useful SQL function I use quite often. It returns the first non-null value in a list of values.

A common real world application for this function is when you are trying to join data from multiple sources. If you work for a company known for acquiring other companies, your HR tables are bound to be full of all kinds of inconsistencies.

In this example below, you will see a snip-it of an HR table created by merging tables from two companies. The first two employees have an emp_id and the last one has a b_emp_id. The last one, Priyanka, is an employee from a newly acquired company, so her emp_id is from the HR table of the newly acquired company.

c1

So if you just want to merge the 2 emp_id columns into one, you can use the coalesce() function.

The syntax for coalesce is: coalesce (col1,col2,….)  as alias

Coalesce takes the first non-null. In the example below, we check emp_id first, if it has a value, we use that. If emp_id is null, we look for the value in b_emp_id.

select emp_nm,
coalesce (emp_id, b_emp_id) as emp_id
from employee_id

c2

If you want to try this for yourself, here is the code to build the table out for you.

CREATE TABLE employee_id (
        emp_nm nvarchar(30) not null,
             emp_id nvarchar(8),
             b_emp_id nvarchar(8)
             PRIMARY KEY(emp_nm) );
INSERT INTO employee_id
       (emp_nm, emp_id)
VALUES
       ('Bob', 'A1234567'),
       ('Lisa', 'A1234568')
INSERT INTO employee_id
       (emp_nm, b_emp_id)
VALUES
       ('Priyanka', 'B1234567');

 

SSIS Tutorial: Introduction

SSIS: Introduction

SQL Server Integration Services is the ETL tool for the Microsoft SQL Server platform. SSIS allows you to take data from various sources (from Excel files, to text files, to other databases, etc), and bring it all together.

If you are new to concept of ETL, SSIS is great place to start. Click here to learn about ETL

If you are well versed in another ETL platform, SSIS is a relatively easy system to get up to speed on.

SSIS comes as part of SQL Server Data Tools, which you should be able to install with your SQL Server installation software. You will need SQL Server Standard, Developer or above editions to run SQL Server Data Tools. SQL Server Express does not support Data Tools.

For some unknown reason, once it is installed, you will not find a program called SSIS. Maybe the engineers at Microsoft think this is funny, but in order to run SSIS you will need to look for the following program instead.

2018-02-26_14-26-37

I’m using 2015, but for most of what I am doing here, any version should be compatible.

When you launch data tools, you will notice it run in Visual Studios

2018-03-06_10-37-28

To start an SSIS job, you will either need to open an existing project, or create a new one. In this example, I will create a new one.

File ->New -> Project  (or Ctrl+Shift+N)
2018-03-06_10-37-54

Inside the Business Intelligence Templates, select Integration Services. I always just select Integrations Service Project, I’m not a big fan of the Wizard

2018-03-06_10-38-37

Next step: Name your project

2018-03-06_10-40-08

Now you are in SSIS. Here are the 3 main windows you will be starting with.

From right to left:

SSIS Toolbox

2018-03-06_10-43-40

Package Designer

2018-03-06_10-44-00

Solution Explorer

2018-03-06_10-44-19

Packages

Inside solutions, packages are the collections of jobs or scripts found inside a project. It is inside the package that you will build out your ETL job.

For our first lesson, we are just going to build a simple package. Your new solution should have opened with a new package when it opened. If it is, right click on the green arrow to rename it, if not, right click on the red arrow to create a new package.

 

2018-03-06_10-44-19_1

I renamed my package First_Package, you can name your package whatever you choose.

This first package will simply just display a pop up message. In the SSIS Toolbox, go to Script Task and drag it into the package designer window.

2018-03-06_13-09-46

2018-03-06_13-25-47

Double click on the Script Task Box in the Design window

Note in my example, I have C# set as the scripting language. The other default option is Visual Basic. If you are more comfortable with that, feel free to use it. I prefer C# mainly because I spent more time working with it.

You don’t need to know any C# for this tutorial. This is literally a single line of code assignment. I will cover more C# in the future.

Click Edit Script… to continue

2018-03-06_13-26-45

Note, this step can take a minute or so for the script editor to appear. Don’t panic if your computer appears locked up.

Once the script editor opens, don’t panic by all the code you see. Luckily Microsoft has done most of the ground work for us. We only need to scroll down until you see:

public void Main()

Now place your cursor below //TODO: Add your code here

The code you need to type for this script is:

MessageBox.Show(“This is my first SSIS Package”);

2018-03-06_15-12-02

Now I know you will be looking for a save button. But again, our friends at Microsoft might have been drinking when they coded this. Instead, just click the upper right X to close out the whole window –

I know – why would they do it that way? how much effort would a save and close button have cost them? I don’t know. It just is what it is. Just click the X and move on with your life.

2018-03-06_15-13-33

Now click the OK button – again I guess Save was too much to type

2018-03-06_15-14-08

Now right click on your package (green arrow) and click Execute Package

2018-03-06_10-44-19_1

Your message will pop up in a Message box window

2018-03-06_15-17-49

Click OK on the messagebox and Click the red square to end the package execution.

2018-03-06_15-18-09

Congrats, you have just built and executed your first SSIS Package.

 

XML Parsing: Advanced SQL

If you want to play along with the lesson, use the following code to create the table I will be using:

CREATE TABLE employee_lang (
        emp_nm nvarchar(30) not null,
             lang nvarchar(255),
             PRIMARY KEY(emp_nm) );
INSERT INTO employee_lang
       (emp_nm, lang)
VALUES
       ('Bob', 'Python, R, Java'),
       ('Lisa', 'R, Java, Ruby, JavaScript'),
       ('Priyanka', 'SQL, Python');

 

XML Parsing

XML parsing is a SQL method for separating string data found in a single field in a table. Look at the table below:

1

This table has two columns, emp_nm (employee name), lang (programming languages the employee is proficient in). Notice that the column lang has multiple values for each record. While this is easily human readable, if you want it to more machine usable (think Pivot tables or R statistical analysis), you are going to want your data to look more this this:

2

Notice now the table has the same two columns, but the lang column now only has 1 word per record. So the question is, how do you do this using SQL?

XML parsing

The code used to “parse” out the data in the lang column is below:

SELECT emp_nm
,SUBSTRING(LTRIM(RTRIM(m.n.value('.[1]','varchar(8000)'))),1,75) AS lang
FROM
(
SELECT
emp_nm
,CAST('<XMLRoot><RowData>' + REPLACE([lang],',','</RowData><RowData>') + '</RowData></XMLRoot>' AS XML) AS x
FROM employee_lang
)t
CROSS APPLY x.nodes('/XMLRoot/RowData')m(n)

Let’s break it down a little first.

SELECT
emp_nm
,CAST('<XMLRoot><RowData>' + REPLACE([lang],',','</RowData><RowData>') + '</RowData></XMLRoot>' AS XML) AS x
FROM  employee_lang

What we are doing with the code about is using a CAST and REPLACE functions to convert the elements in column lang into a XML line.  See results below:

3

For Bob, this is the result of the CAST/REPLACE code on the lang column

<XMLRoot><RowData>Python</RowData><RowData> R</RowData><RowData> Java</RowData></XMLRoot>

This is how the code works from the inside out.

REPLACE()

select REPLACE (‘SQL ROCKS’, ‘S’, ‘!’)

If you run the above code, it will return !QL ROCK!

Replace is saying — everywhere an S is in the string, replace it with a !

CAST()

The cast function is casting the string as an XML data point. This is needed for the next section, when we unpack the XLM string

Cross Apply / Value

SELECT emp_nm
,m.n.value('.[1]','varchar(8000)')
FROM
(
SELECT
emp_nm
,CAST('<XMLRoot><RowData>' + REPLACE([lang],',','</RowData><RowData>') + '</RowData></XMLRoot>' AS XML) AS x
FROM employee_lang
)t
CROSS APPLY x.nodes('/XMLRoot/RowData')m(n)

So without going into way too much detail, you can find pages dedicated to Cross Apply and Value(), I’ll give you the quick breakdown.

First notice we aliased our CAST() statement as x.  So if you look at the CROSS APPLY, you will see we are asking to look at x.nodes. Had we aliased our cast y, we would be looking at y.nodes.

Now look at m(n) at the end of the line. This is like an array or list in programming languages. Keep that in mind for the next step.

Inside the x.nodes() is ‘/XMLRoot/RowData’ , this is telling us to assign to m everything following /XMLRoot and to iterate n by everything following /RowData, so for Bob:

m(n=1) = Python

m(n=2) = R

m(n=3) = Java

Now we pass that array m(n) to our Value() method.  Hence m.n.value().  Note m and n were just letters I picked, you can use others.

Inside m.n.value(‘.[1]’,’varchar(8000)’) was used as it should pretty much cover any size string you may have to deal with.

So the final iteration simply add as SUBSTRING to clean it up and get rid of white space

4

SQL: 4 Types of Joins

SQL Joins come in 4 major types. In T-SQL – Microsoft’s SQL language, unless otherwise specified, a Join defaults to Inner Join. The other three are Right Outer Join, Left Outer Join, and Full Outer Join.

Inner Join

Inner joins match up data that exists in both tables. In the example below, only A and C found in the result table since B,D don’t exist in TableB and E,F don’t exist in TableA.

Syntax

Select *
from TableA as A Join TableB as B
on A.ID = B.ID

innerjoin

Full Outer Join

A Full Outer Join puts all elements in both tables together. Where information is missing, Nulls will appear

Select *
from TableA as A Full Outer Join TableB as B
on A.ID = B.ID

fullouterjoin

Left Outer Join

Left Outer Join takes all data from the Left Table and joins up matching data from the Right table. Left and Right is determined based on the table’s position in the Where clause

Select *
from TableA as A Left Outer Join TableB as B
on A.ID = B.ID

leftouterjoin.jpg

Right Outer Join

Right Outer Join is the exact opposite of the Left Outer Join, with all elements from the Right Table being used and only matching elements from the Left Table.

Select *
from TableA as A Right Outer Join TableB as B
on A.ID = B.ID

rightouterjoin.jpg

SQL: Intro to Joins

More often than not, the information you need does not all reside in one table. To query information from more than one table, use the Join command.

In our example, we are going to be using AdventureWorks2012. If you want to follow along, but do not have SQL Server or AdventureWorks2012 installed, click in the following links:

We are going to be using the following Tables for our Join

Notice they both have a matching field – BusinessEntityID – This field will be important when creating a join.

sqlJoin7

We can call up both tables individually through separate SQL Queries

sqljoin9

However, to combine the two tables into a single result, you will need a Join.

Inner Join

SQL uses Inner Joins by default. In this lesson, I am only going to focus on inner joins. Inner joins work by focusing on columns with matching information. In the example below, the columns with matching information are the Name columns.

The combined result only contains rows who have a match in both source tables. Notice that Sally and Sarah do not appear in the result table.

sqljoin10.jpg

The syntax is as follows:

Select *
from Table A join Table B
on TableA.Match = TableB.Match

sqljoin11.jpg

In our example, the matching column is BusinessEntityID in both tables.

To clean the code up a little bit, we can assign aliases to our tables using the “as” keyword. We are using E and S as our aliases. We can now use those aliases in our “on” clause.

sqlJoin12

And, just as in a regular select statement, we can choose which columns we want. You do however, have to identify which table the columns you are requesting are from. Notice how I did this using the E and S aliases.

sqljoin13.jpg