Data Jobs: What does a Data Analyst Do?

Data Analysts get a bad rap. With the advent of the Data Scientist, Data Analysts are often viewed as Data Scientists lite, but I don’t think that is a fair picture. The truth is, there is a lot of overlap between the two fields. I will dive deeper into what a Data Scientist is in a future article, but for now just know that, in my opinion, the definition of Data Scientist as a job is still a bit fuzzy, and I think the title may eventually be broken into a few different titles to better define the differences.

Data Analyst

So what does a Data Analyst do?

A lot, actually. You could put ten data analysts into a room and you would get ten different answers to this question. So the best I can do here is make sweeping generalizations. As the old saying goes, “Your results may vary.”

In general, data analysts perform statistical analysis, create reports, run ad-hoc queries against data warehouses, create data visualizations, create and maintain dashboards, perform data mining, and create machine learning models (yes, ML is not only for data scientists). Their assignments are business driven. A data analyst is either embedded with a business unit (financial planning, fraud, risk management, cyber security, etc.) or working in a centralized reporting/analytics team. They use their skills to provide reporting and analytics for the business.

Tools used by Data Analysts

  • SQL – MySQL, SQL Server, Oracle, Teradata, Postgres – whether simply querying a data warehouse or creating and managing a local data mart, data analysts need to be advanced SQL programmers
  • Visualization tools – Tableau, Qlik, Power BI, Excel – analysts use these tools to create visualizations and dashboards
  • Python/R – Data analysts should be familiar with languages like Python or R to help manage data and perform statistical analysis or build machine learning models
  • Spreadsheets – Excel, Google Sheets, Smartsheet – used to create reports, with pivot tables used to analyze the data
  • ETL tools – SSIS, Alteryx, Talend, Knime – these tools are designed to move data to and from databases, CSV files, and spreadsheets. Until the data is in a usable format, analysis cannot be performed.

Educational Requirements

Typically a data analyst position will ask for a bachelor’s degree, preferably in computer science, statistics, database management, or even business. While the barrier to entry for a data analyst job is generally not as high as for a data scientist job, that does not mean you cannot make a meaningful and well-paid career as a data analyst. Also, the demand for data professionals keeps going up, and it most likely will for the foreseeable future.

SQL: Create a Table

Create a table in MySQL and SQL Server

Create table <table name> (
           <column 1> data type,
           <column 2> data type )

As an example, let’s create an employee table with 3 columns, ID, Name, and Date of Birth

create table Employee(
   Id int,
   Emp_NM varchar(255),
   DOB date);

** Remember that most SQL databases are not case sensitive unless you set them up to be

You may need to refresh your database explorer view, but after running this command, you should have a new table

If you need a reference table for the types of data types available, check out this page: data types
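If you want to try the create table statement without a MySQL or SQL Server instance handy, here is a minimal sketch using Python’s built-in sqlite3 module. SQLite is only a stand-in here (an assumption on my part, not part of the original setup), and its handling of data types is looser than the other databases, so treat this as a quick sanity check and nothing more.

```python
import sqlite3

# In-memory SQLite database, used only as a stand-in for MySQL / SQL Server.
conn = sqlite3.connect(":memory:")

# Same statement as in the post; SQLite accepts these type names as written.
conn.execute("""
    create table Employee(
        Id int,
        Emp_NM varchar(255),
        DOB date)
""")

# List the column names of the new table (index 1 of each row is the name).
columns = [row[1] for row in conn.execute("PRAGMA table_info(Employee)")]
print(columns)

# The table was created as Employee, but EMPLOYEE works too:
# SQLite treats keywords and table names as case insensitive.
row_count = conn.execute("SELECT COUNT(*) FROM EMPLOYEE").fetchone()[0]
print(row_count)
```

The table is empty at this point, so the count simply confirms the table exists and can be queried.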

SQL: What is DDL and DML?

You might have heard of DDL and DML and been confused. Are they part of SQL or are they their own language? Actually Yes and No…

If you look at it as a purist computer programmer or an academic, then you will probably say they are all different languages, and you would not be wrong. However, someone like me views them as subsets of the commands used in the SQL language.

DDL: Data Definition Language is a set of commands used to create, modify, or drop databases, tables, views, indexes, schemas, and users.

DML: Data Manipulation Language is a set of commands used to add data to a table, move data around, read data, update data, or delete data.
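To make the split concrete, here is a small sketch that runs a few DDL and DML commands in order. It uses Python’s built-in sqlite3 module as a stand-in database, and the sample names and dates are made up for illustration.

```python
import sqlite3

conn = sqlite3.connect(":memory:")

# DDL: define the structure (create a table).
conn.execute("CREATE TABLE Employee (Id INT, Emp_NM VARCHAR(255), DOB DATE)")

# DML: add, update, delete, and read the data inside that structure.
conn.execute("INSERT INTO Employee VALUES (1, 'Sally', '1990-01-15')")
conn.execute("INSERT INTO Employee VALUES (2, 'Tom', '1985-07-04')")
conn.execute("UPDATE Employee SET Emp_NM = 'Thomas' WHERE Id = 2")
conn.execute("DELETE FROM Employee WHERE Id = 1")
remaining = conn.execute("SELECT Id, Emp_NM FROM Employee").fetchall()
print(remaining)

# DDL again: drop the table, removing the structure itself.
conn.execute("DROP TABLE Employee")
tables = conn.execute(
    "SELECT name FROM sqlite_master WHERE type = 'table'"
).fetchall()
print(tables)
```

A rough rule of thumb: if the command changes what tables or columns exist, it is DDL; if it changes or reads the rows inside them, it is DML.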

R: Connecting to SQL Server Database

You can query data from a SQL Server database directly from R using the RODBC package.

install.packages("RODBC")

First you need to form a connection

library(RODBC)
##connection string
cn <- odbcDriverConnect(connection="Driver={SQL Server Native Client 11.0};server=localhost; database=SSRSTraining;trusted_connection=yes;")

We use the odbcDriverConnect() function. Inside we pass a connection = value

Driver = {SQL Server Native Client 11.0}; – this depends on which SQL Server ODBC driver is installed on your machine

server=localhost; – I used localhost because the SQL Server was on the same computer I was working with. Otherwise, pass the server name

database=SSRSTraining; – name of the database I want to work with

trusted_connection=yes; – this means I am able to pass my Windows credentials.

If you don’t have a trusted connection, pass the user ID and password like this

uid = userName; pwd = Password;

Note each parameter is separated by a semicolon
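The structure of the connection string is easy to get wrong by hand, so here is a small sketch that assembles one from its parts. It is written in Python purely to show the key=value; pattern. The helper name build_connection_string is hypothetical, and userName / Password are the same placeholders used above.

```python
def build_connection_string(params):
    """Join key=value pairs, ending each one with a semicolon."""
    return "".join(f"{key}={value};" for key, value in params.items())


# Windows (trusted) authentication, matching the example in this post.
trusted = build_connection_string({
    "Driver": "{SQL Server Native Client 11.0}",
    "server": "localhost",
    "database": "SSRSTraining",
    "trusted_connection": "yes",
})
print(trusted)

# SQL Server login instead of a trusted connection (placeholder credentials).
with_login = build_connection_string({
    "Driver": "{SQL Server Native Client 11.0}",
    "server": "localhost",
    "database": "SSRSTraining",
    "uid": "userName",
    "pwd": "Password",
})
print(with_login)
```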

Query the database

> ##passes query to SQL Server
> df <- sqlQuery(cn, "select * FROM [SSRSTraining].[dbo].[JobDataSet]")
> head(df)

    Name              Job Hours Complete
1  Sally Predictive Model     1        n
2 Philip      Maintanence    10        n
3    Tom    Ad-hoc Report    12        y
4    Bob             SSRS     3        y
5 Philip         Tableau      7        n
6    Tom         Tableau      9        n

Using sqlQuery() – pass the connection object (cn) and enclose your query in double quotes (" ")

SSRS: Drill down

Let’s make the report a little more readable and interactive.

We are going to do a drill down on the names in our report. This means that when the report first opens, you will only see one row for each name. If you want to see the details for a person, you can click on the + by their name and the details will drop down like an accordion.

Below your report design window, you should see another window called Row Groups

Go to the area marked Details (green arrow) and right click – Group Properties…

1

Click Visibility > Hide > Display can be toggled.

(Note I am selecting Name1 not Name. If you remember Name was the column we hid. Name1 is the column created when we created the Parent Group)

2.png

Now when you open your report, you will see it is in drill down format.

3.png

SSRS: Grouping

In our last lesson we made the basic report seen below:

2018-04-06_10-25-55

**Note, you can follow the link below to the first lesson:

SSRS: Introduction: 1st Report

Grouping

Our data is currently in table form, but is otherwise still nothing more than a raw data dump. Let’s make our report a little nicer with some grouping. Right click on the data row (not header) in your table. Mouse over to Add Group and select Parent Group…

2018-04-06_10-34-35.png

In the new window, select [Name] from the drop down.

2018-04-06_10-35-12

Check the Add group header and Add group footer boxes. Now click OK

2018-04-06_10-35-27.png

Now a group has been added to your report.

2018-04-06_10-36-00

If you click on preview, you will now see the table is grouped by Names

2018-04-06_10-36-27.png

But you will notice we now have 2 columns showing the Name, one – our new grouping column and the other – the original column. This is redundant. To get rid of it, go back to Design, right click on the second name column and select Column Visibility…

2018-04-06_10-36-52

When the new window opens up, click Hide.

2018-04-06_10-37-11.png

Now when you look at the report, you will see the second Name column is hidden.

2018-04-06_10-38-16.png

Total:

Next, let’s set up a total for Hours spent on each job, summed per person. To do so, right click on the Hours text box and select Add Total

2018-04-06_10-39-04

Now when we go back to preview, we will see a total in the group footer for each person

2018-04-06_10-39-36
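If it helps to see what the group footer is calculating, here is a rough sketch of the same idea in Python. It uses only the six JobDataSet rows printed in the R example on this page, so the totals will not match the full report, which has more rows.

```python
from collections import defaultdict

# The six sample rows: (Name, Job, Hours, Complete).
rows = [
    ("Sally", "Predictive Model", 1, "n"),
    ("Philip", "Maintanence", 10, "n"),
    ("Tom", "Ad-hoc Report", 12, "y"),
    ("Bob", "SSRS", 3, "y"),
    ("Philip", "Tableau", 7, "n"),
    ("Tom", "Tableau", 9, "n"),
]

# Group by Name and add up Hours, like the total in each group footer.
hours_by_name = defaultdict(int)
for name, _job, hours, _complete in rows:
    hours_by_name[name] += hours

totals = dict(hours_by_name)
print(totals)
```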

Average:

Now what if we wanted an average instead? Right click on the textbox that says [Sum(Hours)] and select Expression

2018-04-06_10-40-15

You can just type =Avg(Fields!Hours.Value) in the expression builder box, but if you don’t know the code, you can use the information in the boxes below it. As you can see in the example below, if you go to Common Functions > Aggregate you will see the code for lots of functions like average, count, and standard deviation.

2018-04-06_10-41-25

Now when you go to preview, you will see an average.

But now we have a new problem. If you are trying to average something like work hours, odds are you will not need to go out to 10 decimal places. So a number like 8.272727272727 is pretty much ridiculous for a report like this.

2018-04-06_10-42-19.png

To fix this, right click the text box and open Text Box Properties. Then go to Number > Number and set the Decimal Places to 2

2018-04-06_10-43-26.png

So if you look at it again, you will see you only have 2 decimal places now.

2018-04-06_10-46-41
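One thing worth keeping in mind is the difference between rounding a value and formatting it for display. As far as I know, the Number setting above only changes how the value is shown, not the value itself. The sketch below shows the same distinction in Python; the figures (91 hours over 11 jobs) are hypothetical, picked only because they reproduce the long decimal mentioned above.

```python
# Hypothetical figures chosen to reproduce the 8.2727... average.
total_hours = 91
job_count = 11

average = total_hours / job_count
print(average)

# Rounding changes the stored value itself.
rounded = round(average, 2)

# Formatting only changes how the value is displayed; average is untouched.
displayed = f"{average:.2f}"
print(rounded, displayed)
```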

 

 

SSRS: Introduction: 1st Report

SSRS stands for SQL Server Reporting Services. This is Microsoft’s BI reporting tool integrated into their SQL Server platform. SSRS allows you to create, deploy, and manage reports from a server platform.

SSRS comes as part of the SQL Server suite. It is not available as part of Express, but if you buy the Developer edition, you will get SSRS (as well as SSIS and SSAS). You may need to download and install it separately. You’ll find it under the title SQL Server Data Tools (SSDT).

SSDT download link

In this tutorial, I will walk you through creating your first SSRS report. In this example, I created two SQL Server tables from the following Excel files:

JobDataSet

JobDesc

If you aren’t sure how to upload an Excel file to SQL Server, you can go to my tutorial on how to do it. The SQL Server upload tutorial was actually created to support the SSRS lessons.

SQL Server: Importing Excel File to SQL Server

To create a new SSRS project, open Visual Studio (the platform SSDT runs on) and go to File->New->Project

2018-04-06_8-56-18

Select Reporting Services -> Report Server Project

Name your project. I typically leave Create directory for solution checked.

2018-04-06_8-58-27.png

Your new “solution” will open up. I still haven’t really figured out why MS changes the name from Project when creating and opening it to Solution once you are working on it. I am sure someone, somewhere had a reason for it. That person is probably retired now and won’t return any email requests as to why he decided on the wording choice.  We’ll just have to chalk it up to another one of life’s mysteries. Like why is Regis Philbin famous?

But I digress…

Now that our “solution” is open, we will see 3 subfolders: Shared Data Sources, Shared Datasets, and Reports

2018-04-06_9-01-16

To create a new report, right click on Reports -> Add -> New Item.  Don’t click Add New Report unless you want to meet the world’s most unhelpful Wizard. I call him Gandalf the Drunk…

2018-04-06_9-01-47

After clicking New Item, click Report and name your report.

2018-04-06_9-02-32

Now your new report will open up.

2018-04-06_9-33-25.png

Data Source

So, before we can actually report on anything, we are going to need data. And in order to get data, we are going to need a Data Source. Now you will notice you have Shared Data Sources on the right and Data Sources on the left. Shared Data Sources, once established, can be used in all reports you create. If you create an “embedded” data source in the report, you will only be able to use it in that report.

For this example, we will make a shared data source.

Go to the right under Solution Explorer and right click Shared Data Sources. This time you can click Add New Data Source.

2018-04-06_9-33-48

This wizard is just Gandalf the Tipsy. While I harbor a general dislike for most wizards, this one isn’t completely useless at least.

First, name your data source something you will remember later.

Select Type ->Microsoft SQL Server

Click Edit

2018-04-06_9-37-25

Copy and paste your server name in the 2nd box. In this example, my SQL Server is locally installed on my computer, so I just used localhost as my server name. Next select the database you want to work with from the drop down. I created a database called SSRSTraining for this example

2018-04-06_9-40-23.png

Hit Test Connection. You should get a success message.

2018-04-06_9-40-42

Click OK and you’ll now see your Data Source in the Solution Explorer

2018-04-06_9-41-33.png

Now go to the left and right click on Data Source for your report. Select Add New Data Source

2018-04-06_9-43-10

Name your data source and click on the Use shared data source reference radio button.

Pick your data source from the drop down. There should only be one to choose from

2018-04-06_9-43-36

Datasets:

Now click okay, go back to the left and right click on Dataset.

2018-04-06_9-44-02

Select New Data Set

Name the Data Set

Select Use a dataset embedded in my report. This is generally how I do things: Data Sources are usually reusable, while Datasets tend to be designed for specific reports, so I leave them embedded.

Select your Data source from the drop down

For Query type we are using Text in this example

I am using a simple select all statement from the dbo.JobDataSet table I created

2018-04-06_9-47-16.png

If you click on fields in the upper right, you’ll now see the columns that will be feeding in from the query. You can rename the columns if you wish

2018-04-06_9-48-00.png

For now, let’s just click OK. You will now see your data set expanded on the left side of your screen.

2018-04-06_9-48-46

Now to the far left, you should see the word Toolbox, click on that. This is a list of the tools we can work with within SSRS. Let’s start with a table

2018-04-06_9-49-09

Click on the table and drag it into the design window in the middle of the screen

2018-04-06_9-49-27.png

Now you can simply drag and drop columns from your dataset into your new table.

2018-04-06_9-49-53.png

By default, a table comes with 3 columns. You can add columns to this table by dragging a field over to the end of the table (note you will see a blue bar indicator letting you know your mouse is in the right spot. The column will populate to the right of the blue bar).

2018-04-06_9-50-24

This will add a new column to the end of your table. You can also use this method to insert a column in between existing table columns

2018-04-06_9-50-39

Now click on Preview button above your table

2018-04-06_10-23-01

When you do, you will get to see your table complete with data. Notice how the job column is too small for the job description.

2018-04-06_9-51-14

Formatting

To fix the job column, let’s go back to our Design screen. Do this by clicking on the Design tab in the upper left.

2018-04-06_10-23-01

Now hover your mouse over the table until you get the double arrow icon seen below

2018-04-06_10-23-28

Once you have that, simply click and drag the column over to make it wider

2018-04-06_10-24-09

Since we are in the design window anyway, let us do a little more formatting. Click on the gray box to the left of the header row to highlight the entire row. Now we can do things like bold the font or change the background color

2018-04-06_10-24-49.png

Go back to the preview window to check out your results.

2018-04-06_10-25-55

There you have your very first SSRS Report from top to bottom.

SQL Server: Importing Excel File to SQL Server

Working with data inside a database has many advantages over working with data in an Excel spreadsheet. Luckily, SQL Server makes it relatively easy to import data from an Excel file into the database.

We will be using the Excel files below:

JobDataSet

JobDesc

Let’s start by opening SSMS (SQL Server Management Studio)

Next, let’s create a database to hold these files. You don’t need to create a new database to import data, but I am building this tutorial as part of a series on SSRS, so I am building a new database for that purpose.

To create a new Database, right click on Databases on the upper left and click New Database…

2018-04-09_8-55-16.png

Now name your new database. We will just accept the defaults

2018-04-09_8-59-56

Now go to your newly created database, right click, and go to Tasks

2018-04-09_9-00-48

From the Tasks sub-menu, select Import Data

2018-04-09_8-57-20.png

The Import Wizard will open. Simply click Next

2018-04-09_9-01-12.png

Next, select Excel from the drop down

2018-04-09_9-01-41.png

Next, click Browse

2018-04-09_9-02-13

Select your file

2018-04-09_9-05-24.png

Make sure First row has column names is checked and click Next

2018-04-09_9-07-04.png

On the next screen select SQL Server Native Client 11.0 (If you don’t have 11.0 – 10.0 should work)

2018-04-09_9-11-31.png

Make sure the database you want is selected and click next

2018-04-09_9-11-52.png

In this example, we are going to use Copy data from one or more tables or views

2018-04-09_9-12-29.png

Make sure to name the table you want to create in SQL Server (red arrow)

2018-04-09_9-13-23

If you click Preview you can get a look at what the new table will be loaded with

2018-04-09_9-15-34.png

Click Okay on the preview window and click Next on the Import Wizard

Leave the default Run Immediately checked and click next

2018-04-09_9-16-07.png

Review info on the next window and click Finish

2018-04-09_9-16-31

The package will run

Note the blue lettering will let you see how many rows transfer from the Excel file to SQL Server

2018-04-09_9-16-53.png

If you check your database, you will see your tables. (I loaded both spreadsheets into the database for the upcoming SSRS tutorial)

2018-04-09_9-19-58.png

Finally, run a select * on your new table to see the data you transferred into SQL Server

2018-04-09_9-20-35.png
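For comparison, the same load-then-check idea can be scripted. The sketch below is not the SSMS wizard and makes several assumptions: it uses Python’s built-in csv and sqlite3 modules, SQLite stands in for SQL Server, and the spreadsheet is represented by a few CSV lines taken from the JobDataSet sample rows shown elsewhere on this page.

```python
import csv
import io
import sqlite3

# Stand-in for the spreadsheet: three JobDataSet sample rows as CSV text.
csv_text = """Name,Job,Hours,Complete
Sally,Predictive Model,1,n
Philip,Maintanence,10,n
Tom,Ad-hoc Report,12,y
"""

reader = csv.reader(io.StringIO(csv_text))
header = next(reader)  # same idea as "First row has column names"
rows = [(name, job, int(hours), complete)
        for name, job, hours, complete in reader]

conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE JobDataSet (Name TEXT, Job TEXT, Hours INTEGER, Complete TEXT)"
)
conn.executemany("INSERT INTO JobDataSet VALUES (?, ?, ?, ?)", rows)

# Check how many rows transferred, like the row count the wizard reports.
rows_loaded = conn.execute("SELECT COUNT(*) FROM JobDataSet").fetchone()[0]
print(header, rows_loaded)
```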

SQL: User Defined Functions

Functions in SQL work just like functions in most programming languages. If you aren’t familiar with a function, you should know that you are already using them without even knowing it.

Consider this for example:

Select Count(*)
From Table

COUNT() is a function. When you pass it rows from your query, it counts the rows and returns a value in the form of an integer. But COUNT() is a built-in function, meaning it came as part of SQL Server. You did not have to create it.

SQL Server allows you the option of creating User Defined Functions, meaning functions you develop yourself. These are handy when you find yourself handling repeated tasks, such as date formatting or string manipulation. Instead of having to repeatedly code a complex command, you can just build a function once and then call on it whenever needed.

Let’s start with a basic example:

2018-04-03_14-23-36.png

Here I created a function called ADD_UP that accepts 2 numbers and outputs the sum of the two numbers (yes, I know you can already do this with the built-in + operator, but I want to start nice and easy)

Let’s start by discussing the syntax. The basic syntax for creating a function is as follows:

CREATE FUNCTION name (@var data-type)
RETURNS data-type
AS
BEGIN
   RETURN (some type of action)
END

In my example we are naming the function ADD_UP and supplying two integer variables: @NUM1 and @NUM2

CREATE FUNCTION ADD_UP (@NUM1 INT, @NUM2 INT)

Then we define the data-type our function will return. In this case, since we are adding 2 integers, our function will return an INT

RETURNS INT

Next we wrap our function in

AS
BEGIN
END

Finally, we perform an action

RETURN (@NUM1 + @NUM2)

When you want to call the function, just use it in a select statement.

(**Note, user defined functions require you to use the schema prefix. Since I just used the default dbo schema, this example uses dbo.ADD_UP)

select dbo.ADD_UP(2,3) as ADDED

and as you see, we get 5 as our answer.

2018-04-03_14-23-36
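If you don’t have SQL Server available, you can get a feel for user defined functions with SQLite, which lets you register a Python function and call it from SQL. This is only a sketch of the concept, not T-SQL: create_function is part of Python’s sqlite3 module, and SQLite does not need the dbo schema prefix.

```python
import sqlite3


def add_up(num1, num2):
    """Python counterpart of ADD_UP: return the sum of two numbers."""
    return num1 + num2


conn = sqlite3.connect(":memory:")

# Register the Python function under the SQL name ADD_UP, taking 2 arguments.
conn.create_function("ADD_UP", 2, add_up)

# Call it from a select statement, just like the SQL Server example.
added = conn.execute("select ADD_UP(2, 3) as ADDED").fetchone()[0]
print(added)
```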

Now, let’s try something different. Here we are going to work with a date function. In this example I built a function called MNTH that accepts one variable @DT – a date data-type and returns an Integer representing the month of the date passed to it.

Again, all I am really doing is duplicating the built-in function MONTH(), but I wanted to show different data-types

2018-04-03_14-32-54.png

(** getdate() is a built-in function that returns the current date. I ran this SQL on 4/2/2018, so it returns 4 as a result)

Now finally here is an example of how you might use a function in real life. Let’s say you have lots of reports that call for your date to be represented in MM-YYYY format. Instead of having to repeatedly type a complex date formatting, you can build it into a User Defined Function and pass a regular date to the function.

2018-04-03_14-34-36

If you are not familiar with the cast(concat(month(@DT),'-',year(@DT)) as varchar(8)) statement, I’ll break it down here:

Let’s go from the inside out:

concat is a string function meaning to concatenate or “string together” – so

concat(month(@DT),'-',year(@DT))

concat(4, '-', 2018)

4-2018

Cast allows us to convert the output of the concat statement into a string (varchar) data-type

cast('4-2018' as varchar(8)) = '4-2018'
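Here is the same month-year logic sketched in Python, mostly to point out one detail: concatenating the month number gives 4-2018, not 04-2018. If your reports need a true two-digit MM-YYYY, the month has to be zero padded. The function names below are hypothetical.

```python
from datetime import date


def month_year(dt):
    """Mirror concat(month, '-', year): the month is not zero padded."""
    return f"{dt.month}-{dt.year}"


def month_year_padded(dt):
    """Two-digit month, for reports that need a strict MM-YYYY format."""
    return dt.strftime("%m-%Y")


sample = date(2018, 4, 2)
print(month_year(sample))
print(month_year_padded(sample))
```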

Finally, if you want to find your functions after you create them, they are located under your database -> Programmability -> Functions

In this case, I only built Scalar-valued Functions. I’ll cover the other types in future lessons.

2018-04-03_14-39-21.png