Data Jobs: What does a Data Analyst Do?

Data Analysts get a bad wrap. With the advent of the Data Scientist, Data Analysts are often viewed as Data Scientists lite, however I feel that is not the honest case. Truth is, there is a lot of overlap between the two fields. I will dive deeper into what a Data Scientist is in a future article, but just know my opinion is the definition of Data Scientist as a job is still a bit fuzzy and I think the job title may eventually be broken into a few different titles to better define the differences.

Data Analyst

So what does a Data Analyst do?

A lot actually. You could put 10 data analysts into a room and you would get ten different answers to this question. So the best I can do here is make sweeping generalities. As the old saying goes “Your results may vary”

In general, data analysts perform statistical analysis, create reporting, run ad-hoc queries from data warehouses, create data visualizations, create and maintain dashboards, perform data mining, and create machine learning models (yes, ML is not only for data scientists). Their assignments are business driven. A data analysts is either embedded with a business unit (financial planning, fraud, risk management, cyber security, etc.) or working in a centralized reporting/analytics team. They use their skills to provide reporting and analytics for the business.

Tools used by Data Analysts

  • SQL – MySql, SQL Server, Oracle, Teradata, Postgres – whether simply querying a data warehouse or creating and managing a local data mart, data analysts need to be advanced SQL programmers
  • Visualization tools – Tableau, Qlik, Power BI, Excel, analysts use these tools to create visualizations and dashboards
  • Python/R – Data analysts should be familiar with languages like Python or R to help manage data and perform statistical analysis or build machine learning models
  • Spreadsheets – Excel, Google Sheets, Smart Sheets are used to create reports, and pivot tables used to analyze the data
  • ETL tools – SSIS, Alteryx, Talend, Knime, these tools are design to move data to and from databases, CSV files, and spreadsheets. Until the data is in a usable format, analysis cannot be performed.

Educational Requirements

Typically a data analyst position will ask for a bachelors degrees, preferably in computer science, statistics, database management or even business. While the barrier to entry for a data analyst job is generally not as high as a data scientist, that does not mean you cannot make a meaningful and well paid career as a data analyst. Also, the demand for data professionals seems to keep going up and up and it most likely will for the foreseeable future.

SSRS: Introdution: 1rst Report

SSRS stands for Sql Server Reporting Service. This is Microsoft’s BI reporting tool integrated into their Sql Server platform. SSRS allows you to create, deploy, and manage reports from a server platform.

SSRS comes as part of the SQL Server suite. It is not available as part of Express, but if you buy the developers edition, you will get SSRS (as well as SSIS and SSAS). You may need to download and install it separately. You’ll find it under the title, SQL Server Data Tools (SSDT).

SSDT download link

In this tutorial, I will walk you through creating your first SSRS report. In this example, I created two SQL Server tables from the following Excel files:

JobDataSet

JobDesc

If you aren’t sure how to upload an Excel file to SQL Server, you can go to my tutorial on how to do it. The SQL Server upload tutorial was actually created to support the SSRS lessons.

SQL Server: Importing Excel File to SQL Server

To create a new SSRS project, open Visual Studios (the platform SSDT runs on) and go to File->New->Project

2018-04-06_8-56-18

Select Reporting Services -> Report Server Project

Name your project. I typically leave Create directory for solution checked.

2018-04-06_8-58-27.png

Your new “solution” will open up. I still haven’t really figured out why MS changes the name from Project when creating and opening it to Solution once you are working on it. I am sure someone, somewhere had a reason for it. That person is probably retired now and won’t return any email requests as to why he decided on the wording choice.  We’ll just have to chalk it up to another one of life’s mysteries. Like why is Regis Philbin famous?

But I digress…

Now our “solution” is open, we will see 3 sub folders. Shared Data Sources, Shared Datasets and Reports

2018-04-06_9-01-16

To create a new report, right click on Reports -> Add -> New Item.  Don’t click Add New Report unless you want to meet the world’s most unhelpful Wizard. I call him Gandalf the Drunk…

2018-04-06_9-01-47

After clicking add new item, click Report and name your report.

2018-04-06_9-02-32

Now your new report will open up.

2018-04-06_9-33-25.png

Data Source

So, before we can actually report on anything, we are going to need data. And in order to get data, we are going to need a Data Source. Now you will notice you have Shared Data Sources on the right and Data Sources on the left. Shared Data Sources, once established, can be used in all reports you create. If you create an “embedded” data source in the report, you will only be able to use it in that report.

For this example, we will make a shared data source.

Go to the right under Solution Explorer and right click Shared Data Sources. This time you can click Add New Data Source.

2018-04-06_9-33-48

This wizard is just Gandolf the Tipsy. While I harbor a general dislike for most wizards, this one isn’t completely useless at least.

First name your data set something you will remember later.

Select Type ->Microsoft SQL Server

Click Edit

2018-04-06_9-37-25

Copy and paste your server name in the 2nd box. In this example, my SQL Server is locally installed on my computer, so I just used localhost as my server name. Next select the Database you want to work with from the drop down. I created a database call SSRSTraining for this example

2018-04-06_9-40-23.png

Hit Test Connection, you should get a success message.

2018-04-06_9-40-42

Click Okay, you’ll new see your Data Source in the Solution Explorer

2018-04-06_9-41-33.png

Now go to the left and right click on Data Source for your report. Select Add New Data Source

2018-04-06_9-43-10

Name your data source and click on the Use shared data source reference radio button.

Pick you data source from the drop down. There should only be one to choose from

2018-04-06_9-43-36

Datasets:

Now click okay, go back to the left and right click on Dataset.

2018-04-06_9-44-02

Select New Data Set,

Name the Data Set

Select Use a dataset embedded in my report. This is generally how I do things, as Data Sources are usually reusable, Datasets are more designed for specific reports, so I leave them embedded.

Select your Data source from the drop down

For Query type we are using Text in this example

I am using a simple select all statement from the dbo.JobDataSet table I created

2018-04-06_9-47-16.png

If you click on fields in the upper right, you’ll now see the columns that will be feeding in from the query. You can rename the columns if you wish

2018-04-06_9-48-00.png

For now, let’s just click Okay, now you will see your data set expanded on the left side of your screen.

2018-04-06_9-48-46

Now to the far left, you should see the word Toolbox, click on that. This is a list of the tools we can work with within SSRS. Let’s start with a table

2018-04-06_9-49-09

Click on the table and drag it into the design window in the middle of the screen

2018-04-06_9-49-27.png

Now you can simply drag and drop columns from your dataset into your new table.

2018-04-06_9-49-53.png

By default, a table comes with 3 columns. You can add columns to this table by dragging a field over to the end of the table (note you will see a blue bar indicator letting you know your mouse is in the right spot. The column will populate to the right of the blue bar).

2018-04-06_9-50-24

This will add a new column to the end of your table. You can also use this method to insert a column in between existing table columns

2018-04-06_9-50-39

Now click on Preview button above your table

2018-04-06_10-23-01

When you do, you will get to see your table complete with data. Notice how the job column is too small for the job description.

2018-04-06_9-51-14

Step 5: Formatting

To fix the job column, let’s go back to our Design screen. Do this by clicking on the Design tab in the upper left.

2018-04-06_10-23-01

Now hover your mouse over the table until you get the double arrow icon seen below

2018-04-06_10-23-28

Once you have that, simply click and drag the column over to make it wider

2018-04-06_10-24-09

Since we are in the design window anyway, let us do a little more formatting. Click on the gray box to the left the header row to highlight the entire row. Now we can do things like Bold the font or change the background color

2018-04-06_10-24-49.png

Go back to the preview window to check out your results.

2018-04-06_10-25-55

There you have your very first SSRS Report from top to bottom.

R Shiny: Introduction

Shiny is an R Studio application that allows you to how host your R code on a website. This is a great method for creating a dashboard or deploying a predictive model. The best part is, it is free. There are, however, paid services associated with R Studio and Shiny if you want to get professional support and such.

In this lesson, we are going to learn to build a Shiny server application step by step.

Let’s start at the beginning. First off, to use Shiny, you need to have R Studio installed on your computer. Once you have that, you will need to install the Shiny package

install.packages(‘shiny’)

Next, we are going to learn about the three parts of a Shiny server application.

Below is the basic framework of a Shiny app:

library(shiny)

ui <- fluidPage("Test")
server <- function(input, output) {}
shinyApp(ui = ui, server = server)

Let’s break this down:

ui: ui is the user interface. This is the code that makes up the web page the user will launch and interact with. Anyone familiar with HTML, this is the basic markup. This is where you set the wording, font, colors, what kind of object will be on the page, etc

server: server is the brains of the operation. This is where you load and manipulate any data being used in your page. Again, if you are a web developer, think Java Script here.

shinyApp: this is the code that launches your App.

If you run the code above, you will see something like this

2018-03-19_11-06-29

Note the IP address in the upper left. If you copy this into a browser of your choice, it will run there too.

2018-03-19_11-07-28.png

You will see the same address in the R Console. As long as the R Shiny server is running, you will see this Listening command. For those not familiar, the address 127.0.0.1 is called a pingback address. This is the internal address of the NIC card on your machine. This means the server is running on the machine you are on. The :4654 is the port my computer grabbed. This can be different for you. The easiest way for me to explain the port without going too deep into network theory is: think of it like a TV channel. From your computer, you could technically be running hundreds of these R Shiny servers. The port (TV channel) is what tells the computer which one of these web pages you want to visit.

If you happen to be on a network, you can go to another computer and type in your host computers network IP address (you can get it by going to a cmd console and typing ipconfig). It will be the address I have the arrows pointing to in my example

2018-03-19_13-33-33.png

Then, using my example, type http://10.171.46.91:4654 into the second computer’s browser and you should see your Shiny website.

Step 2

Let’s see if we can make our website more interesting, let’s start easy, we will format the text.

First close out your shiny window to end the session.

Change your ui line to match the one below.

ui <- fluidPage(h1(“Test”))

Now run your code again

library(shiny)
ui <- fluidPage(h1(“Test”))
server <- function(input, output) {}
shinyApp(ui = ui, server = server)

2018-03-19_13-38-26

Step 3

Okay, let’s make it a little more interesting by adding some detail to our page. We will do so by using panel methods in our fluidPage()

library(shiny)

ui <- fluidPage(
  titlePanel("Title Panel"),
     sidebarPanel("Sidebar Panel"),
      mainPanel("Main Panel")
    )
server <- function(input, output) {}
shinyApp(ui = ui, server = server)

Notice your page is now segmented into a title bar, a sidebar and a main panel

2018-03-19_13-52-55.png

Okay, now let’s actually add some R data magic to this.

library(shiny)
library(datasets)
mpgData <- mtcars

ui <- fluidPage(
  titlePanel("Mpg vs Cylinder Plot"),
     sidebarPanel("Sidebar Panel"),
      mainPanel(  plotOutput("mpgPlot"))
    )
server <- function(input, output) {
  output$mpgPlot <- renderPlot({
    boxplot(mpg~cyl, data= mpgData)})
}
shinyApp(ui = ui, server = server)

Let’s break down the new code. First thing first, we are loading the built in datasets library and grabbing the mtcars data set from it.

library(datasets)

mpgData <- mtcars

Next we are going to skip down to the server section next as this is where we work with our data:

First thing you will notice is function(input, output) – this is a standard call that allows you to pass information back and forth through the server code. In our example, we are only using the output right now.

We create a variable called output$mpgPlot to which we pass the shiny method renderPlot.  Inside that method, we are simply creating a simple boxplot comparing mpg to cyl using the mtcars data set that we labeled mpgData

server <- function(input, output) {
  output$mpgPlot <- renderPlot({
    boxplot(mpg~cyl, data= mpgData)}) 
}

The boxplot we rendered as output$mpgPlot is now passed to our ui

Next, in the ui I renamed the title panel and in the mainPanel, I added the command plotOutput(“mpgData”). That will allow me to post the boxplot I created in the server to the ui web page.

ui <- fluidPage(
  titlePanel("Mpg vs Cylinder Plot"),
     sidebarPanel("Sidebar Panel"),
      mainPanel(  plotOutput("mpgPlot"))
    )

If you run the code, it should look like this:

2018-03-19_14-59-52

Step 4

Now let’s add a little interactivity to our web page. We are going to give the user the option to view the boxplots with or without the outliers taken into effect.

library(shiny)
library(datasets)
mpgData <- mtcars

ui <- fluidPage(
  titlePanel("Mpg vs Cylinder Plot"),
     sidebarPanel(checkboxInput("outliers", "Show outliers", FALSE)),
      mainPanel( 
                      plotOutput("mpgPlot"))
    )
server <- function(input, output) {
  output$mpgPlot <- renderPlot({
    boxplot(mpg~cyl, data= mpgData,
            outline = input$outliers)})
}
shinyApp(ui = ui, server = server)

This involves only two changes to our code

First we will add a checkbox to our sidebar, in the ui we will alter the sidebarPanel line to look like this

sidebarPanel(checkboxInput(“outliers”, “Show outliers”, FALSE)),

The breakdown of the syntax here is

checkboxinput(variable name, display text, default )– our default is set to FALSE so our box will be unchecked by default

 

Next, we will alter our boxplot to take into account the checkbox input

boxplot(mpg~cyl, data= mpgData,
            outline = input$outliers)

— notice the variable name is input$outliers  It is a composite of the variable from our checkboxinput method and “input” from out function(input, output) function that opens up our server.

Now run your code

Unchecked

2018-03-19_15-05-46.png

Checked

2018-03-19_15-05-29.png

Splunk: Overview of functionality

Splunk is great at analyzing large sets of unstructured data quickly and easily. To simulate working with a stream of data, I am using a log file I pulled from a server. You can download it here: event

Load this into Splunk through data upload button. You can just hit next through the remaining upload screens. The default setting work just fine for us.

2016-11-30_08-57-31

With out data loaded, we can search through the data using key words and wildcards

2016-11-30_09-03-51

But for something cool, click on the Patterns tab. This shows interesting patterns Splunk has sniffed out on its own.

2016-11-30_09-13-33

Click on any of the Interesting Fields Splunk created from your data to see some interest quick stats.

2016-11-30_09-16-40.jpg

You can dig deeper into the Stats by picking one of the canned reports that appear in the pop-up window.

2016-11-30_09-30-07

You can run visualizations from these reports.

2016-11-30_09-30-28

Of course you can change the visualization type from a  simple menu

2016-11-30_10-15-25

The Statistics tab shows numeric stats based on the report/query you are working with.

2016-11-30_09-30-58

If while interacting you find search you want to keep or share it, you can save the settings in it as a Report

2016-11-30_09-18-05.jpg

When you save the report, Splunk gives you a default option to add a Time Range Picker that will let you pick a time range next time you run the report.

2016-11-30_09-18-20

I personally recommend playing with Splunk to get a better feel for it. What is so cool about it is how quickly and easily you can produce actionable analytics.

 

Python: Working with Lists

Making a list in Python is simple: x = [1,2,3,4]

Now let us see how we can work with them.

Index

First thing to understand is that a list is indexed. Each item in the list is given a number that tells you its position in the list.

It is important to note that Python is a 0 index language. This means indexes begin at 0 not 1. So the first item in the list, is found by calling: x[0]

pythonlist

Note that x[4] returns an error. Since there are 4 items in the list, the indexes go 0,1,2,3. Index 4 is out of range.

Another interesting point to understand about indexes is that you can use a negative index. x[-2]  returns Duck

Try your code in our online Python console: 

Index Range

Use a “:” to return a range of items from a list: x[start:end]. If you leave out the start or end number, the index starts or ends at the start or end of the list

pythonlist1

Update Values in a List

If you want to change a value in a list, just assign it a new value like you would with a regular variable.

pythonlist2

Append()

If you want to add an item to then end of a list, you can use the Append() function

pythonlist3.jpg

Del

To delete a item from a list, use the Del command.

pythonlist4

Remove()

Remove works like Del, except instead of using index values, Remove() uses the values stored in the list.

pythonlist5.jpg

Try your code in our online Python console: 

Pop() and Insert()

Pop() simply returns the last item from a list

Insert() lets you add a item to a list, but it also lets you choose what position in the list to add it.

pythonlist6.jpg

Len()

Len() returns a count of the number of items in a list.

pythonlist7.jpg


Try your code in our online Python console: 

If you enjoyed this lesson, click LIKE below, or even better, leave me a COMMENT. 

Last Lesson: Python Lists and Dictionaries

Next Lesson: Working with Dictionaries

Back to Python Course: Course

Python: Intro to Graphs

Visualizations are big part of analytics. You will need to produce visually engaging graphics for presentations, reports, and dashboards. You will also make graphs for your own use in data discovery and analysis. As bonus, unlike data cleaning, data viz can be pretty fun.

Matplotlab and Pyplot

Matplotlab is a module you can import into Python that will help you to build some basic graphs and charts. Pyplot is part of Matplotlab and the part we will be using in the following example.

**If you are using the Anaconda Python distribution, Matplotlab is already installed. If not, you may need to download it from another source.

Line Graph

Syntax

  • %matplotlib inline – this code allows you to view your graphs inside jupyter notebooks
  • from matplotlib import pyplot as plt – here we import pyplot from matplotlib into our program (note, we only want pyplot not all the functions in matplotlib).Adding “as plt” gives us a shorter alias to work with
  • age and height lines – fill our lists with age and height information for an individual
  • plt.plot(age, height, color = ‘blue’) – here we tell Python to plot age against height and to color our line blue
  • plt.show() – prints out our graph

pythonGraphs

Bar Chart

For this example, we will make a bar charting showing ages of 4 people.

Syntax

  • You should understand the first few lines from the first example
  • count_names = [i for i,_ in enumerate(name)]  – since the name list is a list of strings, we cannot really graph that onto a chart. We need a way to convert these strings into numbers.

Wait? What does for i,_ mean? Let’s jump to the next code sample

pythonGraphs1

While you don’t see it when making the list, a Python list is technically a list of tuples (index number, element). So if instead of i,_ we asked for both elements in the tuple, (i,j) we would get the following.

pythonGraphs2

So by iterating by for i,_ we only return the first element in the tuple (the index)

** notice we are using a list comprehension. If you are unfamiliar with list comprehensions, check out my earlier post: Python: List Comprehension

Let’s clean up our bar chart a little now.

  • plt.ylabel(‘Age’) – label the y-axis Age
  • plt.title(‘Age of People’) – give the graph a title
  • plt.xticks([i+0.5 for i,_ in enumerate(name)], name) – this label function is using a list comprehension to first chose the position on the X-axis, and name provides the person’s name for the label.

pythonGraphs3.jpg


If you enjoyed this lesson, click LIKE below, or even better, leave me a COMMENT

Follow this link for more Python content: Python

MS SQL Server: Installation

When it comes to SQL Server for personal use – (i.e. training, education) Microsoft offers two options:

  • SQL Server Express: This option is free, but is surprisingly robust. This is the version I cut my teeth on.
  • SQL Server Developer Edition – which you can find online at around the $60.00 price point. This edition is worth the money as it comes with everything  (SSIS, SSRS, etc.) Unfortunately you will need to purchase from a 3rd party vendor because as of the time of this writing, Microsoft is not currently selling it.

For this tutorial, we will focus on Express:

Here is the link for SQL Server Express: Download Link

The screen below explains the options available. You will want – at minimum – to install SQL Server Express with Tools. I would recommend SQL Server Express with Advanced Services.

Select the download arrow my red arrow is pointing to and you will be asked to sign in with your Microsoft Account. (If you don’t have one, do not worry, they are free – sign up for it).

sqlinstall

 

On the next screen, choose your version and language. Make sure you choose 64 or 32bit. If you don’t know what version Windows you are running. Go to your Control Panel > Admin Tools > System Configuration You will find your system version there.

sqlinstall1

Once the file downloads, just run it like a regular MSI. The only thing you really may not have seen before is asking for authentication method. I always go with Mixed Mode (just make sure to remember the SA password you put in).

If you are having trouble, here is a good You Tube Tutorial for a step by step installation: Link to Video

Okay, now we have SQL Server installed, let’s use it.

Starting SQL Server Management Studio

SQL Server Management Studio is how you will interact with SQL Server.

To open it up, hit Windows Key+R. In the run box type SSMS.

sqlinstall2.jpg

Since you are trying to connect to the SQL Server you just installed on your computer, enter localhost as Server name. Choose Windows Authentication and hit connect.

sqlinstall3.jpg

Let’s Make a Database

Once SQL Server Management Studio opens, go to the Object Explorer and right click Databases > New Database

sqlinstall4.jpg

Let us name the new database SQLIntro and hit okay. Don’t worry about other options at this point.

sqlinstall5

From the top bar, select New Query

sqlinstall6

Copy and Paste the following script into the Query window that just opened up. Don’t worry about the code at this point. Just know:

  1. Use SQLIntro – tells us to use the database we just created
  2. Create table — we are creating a table called FirstTable with 2 columns Name and Age
  3. We are inserting data (names and ages of 4 people) into our new table

Copy the script below:


Use SQLIntro
go

Create table FirstTable
(Name nvarchar(255),
Age int)
go

Insert into FirstTable (Name, Age)
values (‘Ben’, 40), (‘Chris’, 45), (‘Patrick’, 32), (‘Jordan’, 18)
go


After you paste the data, hit Execute!

sqlinstall7

Let’s See Our Data

Go to Object Explorer:

  1. Databases — note you will not have as many databases as you see here. This is my SQL Server I have been working with for the past 3 years.
  2. SQLIntro
  3. Tables
  4. Right Click dbo.FirstTable
  5. Select Top 1000 Rows

sqlinstall8.jpg

And there you have it, the contents of your first table in your first SQL Server Database

sqlinstall9.jpg


If you enjoyed this lesson, click LIKE below, or even better, leave me a COMMENT. 

Follow this link for more SQL content: SQL

Tableau: Free and Reduced School Lunch Program

This visualization depicts Free, Reduced, and Full priced lunches served by the the National School Lunch Program (NSLP) in United States Public and Private Non-Profit Schools.

school

The sharp rise Free lunches, in conjunction with the sharp decline in Full priced lunches since 2008 hints that the effects of the 2008 economic recession are still being felt.

school1

Visualization can be found at: Link to Tableau Worksheet

**Data taken from Data.gov: Link to Data File

Linear Regression using Excel

Link to video on Linear Regression using Excel

Regression Analysis is still the most popular method used in Predictive Analytics. The main reason is that it works. It is well known and understood. With its different flavors, regression analysis covers a width swath of problems. Another great reason to use it, is that regression tools are easy to find.

Today we are going to use Excel to tackle a simple regression problem. I have uploaded a spreadsheet to this page. If you would like to follow along with the exercise, please download it from the link below:

Excel File Download:  Linear Regression Example File 1

What is Linear Regression?

Linear Regression is a method of statistical modeling where the value of a dependent variable based can be found calculated based on the value of one or more independent variables. The general idea, as seen in the picture below, is finding a line of best fit through the data. Using that line, you can then predict the value of Y given X.

linear1

I am not going to go too deep into the math here. I highly the Khan Academy video posted below if you are looking to brush up on your statistics.

Khan Academy – Linear Regression

Lets Start by Looking at the Data

If you download the Excel file at the top of the page, you will find 2 columns labeled Years and Salary. This example data set shows us the years of service and salary of 39 employees for an imaginary company.

linear2

What we are going to attempt to do is to develop a model using Linear Regression that will allow us to predict the salary of an employee given their years of service.

Step 1: Build a Scatter Plot

The first thing we want to do is build a scatter plot. Excel makes this simple enough. Just highlight all of your data > select the Insert Tab from the Ribbon > Select Scatter from Charts:

linear3.jpg

What you will get should look something like this:

linear4

We have a scatter chart with Salary on the Y Axis and Years on the X Axis. **Excel scatter charts set the left most column of the data set to the X Axis by default.

Before we move on, I want to take a moment to look at the scatter plot. Do you see a pattern? Can you see where you might be able to draw a line through the data?

I am not trying to just fill space here. I am asking a serious question. Because the answer is sometimes you will not see a pattern. Sometimes the scattering of data will be so random that there will no need to go forward with a linear regression. Learning to look for patterns in data visualizations is skill worth developing.

linear5

In this example there is a general pattern, or more accurately, we see what looks like Positive Correlation. We call it positive because it appears that as X increases so does Y. So now that our scatter chart has passed the visual test, it is time perform our regression.

Trend Line

Performing a simple linear regression in Excel is ridiculously easy. Simply click on your scatter plot > from the Ribbon select Chart Tools – Design > Add Chart Element > Trendline > Linear

linear6.jpg

Your trendline appears on your chart. I personally find the line a little hard to see as is, so I am going to format it a bit.

linear7

Start by double clicking on the trendline and the Format Trendline window will open on the right.

I made the following changes:

Line: — Color: Red  — Width: 3pt  — Dash type: Solid Line

Trendline Options — Select Display Equation on chart and Display R-squared value on chart

linear8

Alright, that line is much easier to read. Now let us talk about the numbers in the circle. Now I know I said I was not going to get too deep into the math, but I feel I can’t do this subject justice without at least a cursory explanation of what is going on.

linear9

What exactly did Excel do when it added the trendline? Technically it performed a statistical function known as Ordinary Least Squares.  What does that mean? Well if you wanted to attempt this by hand, one approach you could take would be to start by drawing a line that looked best to you. You would then measure the Residuals (the distance from the actual data points and line you drew)

linear10

You then repeat the process (picking a new line and measuring residuals) until you find the line that results in the lowest overall residual.Once you have it, you get the equation for your line:  y = 1357.9x+50974 (Luckily for us Excel makes the process a lot easier)

Now a quick refresher on the line formula: Y= mX + b (where m = Slope and b = Y-Intercept). This equation is what you would use to make predictions. In our equation a person with 0 years in service would have a salary of 50974: Y = 1357.9(0) + 50974 — Y= 50974. And each year of service would add 1357.90 to the salary.

Before we go start using your equation to start making predictions, we still need to discuss the R² you see below your line equation. I won’t bore you with how R² is calculated. You don’t really need to know how it is calculated to use linear regression, but you do need to know how to read it.

The simplest explanation I can give you for R² is that a value of 1 means perfect fit – every point in your data matches up to your line. 0 on the other hand, means your line doesn’t match anything. Our R² is 0.4423, which really is not that great. I generally prefer to aim for a R² value above 0.6.

How can we improve our R² value? My preference would be to get more data. We currently only have 39 tuples. More data could improve our accuracy. If more data is not available though, you can look at your outliers as Linear Regression can be greatly affected by outliers. Unfortunately outliers are often tricky to deal with. A person with 1 year of service making 100,000 a year would definitely be an outlier, but it is not an impossibility. If this employee is a highly experienced individual who just transferred from another company, it is totally feasible they could be earning 100,000.

The hard truth is, considering only the data we have, we cannot rightfully develop a reliable model. This happens more often than you might think. That is okay though, we will chalk this up as a learning experience and move on.

 

Analytics: An Introduction

So exactly what is Analytics? Everyone is talking about it. Colleges and Universities are scrambling to develop programs in it. But what exactly does it mean?

Definition

The the definition I like the best is this:

Analytics: Discovering and communicating meaningful patterns in data.

Analytics are traditionally broken down into the following catagories:

  • Descriptive Analytics: Most people are familiar with this form. So familiar in fact, they probably do not refer to it as analytics. This is looking at past and current data to describe what is going on. Most standard business reporting falls into this category.
  • Predictive Analytics: This is using available data to help predict future events or to provide best guess answers to fill in gaps in data. Using predictive analytics, you can predict how much a house will sell for or what items you should stock near the registers based on current conditions (example: Walmart discovered Pop-Tarts tend to sell well during hurricanes).
  • Prescriptive Analytics: This is the cutting edge of analytics. Prescriptive analytics not only makes predictions about future events, but it utilizes decision making algorithms to determine how to respond to the events. Prescriptive analytics engines could, using the Pop Tarts example above, automatically reroute the shipment of Pop Tarts to stores in hurricane affected areas without any human intervention.

It should be noted that most companies today are still spending most of their time in the descriptive analytics world. That is not necessarily a bad thing. Being able to get the right information in front of a decision maker, in a format that is easily digestible, is a talent all within itself. 

Components

Analytics is not a 1 step process. It is actually a series of steps, often performed in an iterative manner. And just as each business problem is unique, so are the steps to the analytics process used to find the solution.

While the statement above is 100% percent true, I find it very unsatisfying. This is the kind of information I would find when I first developed an interest in analytics. So while I cannot give you a one size fits all answer, I feel that I at least owe you a better explanation than that.

For me, perhaps the best way to understand analytics, is to look at some of the more common tasks performed.

  • Data Management: While designing, building, and maintaining databases and data warehouses may not typically fall under the responsibility of an analytics professional, having a general understanding of how they work is none the less important. Databases and data warehouses are where most businesses keep their data. If you want to be taken seriously as a data professional, you need to have a fundamental understanding of how data is stored and how to query the stored data. (Example Technologies: Hadoop, SQL Server, Oracle)
  • Data Modeling: Data modeling is organizing data into logical structures so that is can be understood and manipulated by a machine. As a simple exercise, make a quick spreadsheet for sales amounts for  5 salespeople across 4 quarters. When you are done, look at the table you created. You have just modeled data. (Example Technologies: Excel, SQL Server, Oracle, Visio)
  • Data Cleaning: While this may not be the sexiest part of the job, it is the part you will spend the most time on. 60-80% of your time will be spent in this phase of the job. And while there are some third party software applications out there that can help ease the pain (Alteryx comes immediately to mind), they are expensive and not every boss will be willing to spring for it. My suggestion is to put sometime aside to become very familiar with Excel. I do 90% of my data cleaning work in Excel and MS SQL Server. (Example Technologies: Excel, SQL Server, Oracle, Alteryx)
  • Data Mining (Machine Learning): Now this is the cool stuff everyone is talking about. Data mining or machine learning, whichever you prefer to call it,  is the Artificial Intelligence (AI) portion of analytics. Data mining is difficult to provide a simple explanation for, but I will try anyway: In traditional programming, the programmer provides explicit instructions to the computer as to how to perform a task. With data mining, data sets are fed through an algorithm. The computer then determines the best way to solve the problem based on the data provided. 

 To help make this a little clearer, how about you try your hand at being the machine.

spam

Look at the pattern above. Without me providing you with any more information,                  you should be able to determine, that two blue squares in a row = SPAM. This is, at                 the most fundamental level, how data mining works. It pours over data and finds                   patterns. Knowing this pattern, if you were now shown only the first three columns               you would be able to predict whether the last column would be red or green.(Example Technologies: R, Python, SAS, XLMiner)

  • Data Visualization: DataViz is fun. It is the real show stopper in the data world. Visualizations make the patterns pop off the page. There are a lot of great programs out there for data visualization. (Again, do not discount Excel — it has some great DataViz features). Now DataViz should rightfully be broken into two separate categories. The first is Exploratory. This is visualizations used by the data professional to help analyze and understand the data. The second is Production. This the finished product that ends up on reports and dashboards for the business users to see. (Example Technologies: Excel, Tableau, R, SAS)
  • Optimization and Simulation: How often is there truly only one solution for a problem? Reality is sometimes the hardest part isn’t coming up with a solution to a problem, but deciding which solution to use. Building optimization models and running simulations helps to provide decision makers with quantitative data as to which solutions will be most effective. (Example Technologies: CPLEX, SAS, Solver)

So I have to learn all of this…

That depends – If your goal to is be a Data Scientist, then yes, you need to learn everything mentioned above and then some (I hope you love Statistics). However, if you are a business user just trying to add analytic skill to your toolbox, my recommendation is to focus your efforts on becoming efficient in data cleaning. In the real world, when trying to put a report together, you often are given data from multiple sources and you have to cobble it together to make sense of it. Learning some data cleaning skills can save you hours on tasks like that.

Once you have workable data, take some time to learn some visualization techniques. An eye popping chart will always garner more attention than pages of numeric columns. Also, take a little time to learn some data mining skills. No one is expecting you to write the complex algorithms the PhD’s at Stanford and MIT are kicking out, but there actually are some pretty user friendly data mining programs out there that help you cull some real insight out of your data.

However you decide to go about it, Analytics is a fascinating, fast growing field. It truly is a 21st century skill. Here at Analytics4All.org, the philosophy is that everyone should develop some analytical talent. Computers were once the sole territory of the science geeks of the world and now they are in everyone’s pockets and purses. Analytics and data driven decision making should also be a accessible to all.