R: ggplot – Histograms

On May 30, 2016May 30, 2016 By Ben Larson Ph.D.In R, VizualizationLeave a comment

Let’s make a Histogram using ggplot

First step, import the ggplot2 library

library(ggplot2)

Now let’s look at our data. In this example I am using chickwts from R Data

head(chickwts)

str(chickwts)

As you can see above, this data contains 2 columns weight ( numeric) and feed(Factor with 6 levels)

Histogram

In a histogram, we don’t need to worry about assigning a Y axis, the Y axis is the frequency count of our X variable.

Let’s set our data and x axis in ggplot. Let’s also assign it to variable.

pl <- ggplot(data=chickwts, aes(x=weight))

Now we can use the assigned variable in conjunction with our geom()

pl + geom_histogram(binwidth=10)

In our geom, I set a binwidth. This tells us how wide to make our bars. Setting binwidth to 10 makes each bar 10 units wide.

histogram2

Let’s give it some color

pl + geom_histogram(binwidth=10, fill="blue")

Now, let’s set our colors based on our 2nd column “feed”. Note how a legend is automatically generated.

pl + geom_histogram(binwidth=10, aes(fill=feed))

For the finishing touch, let’s add some black bordering around our boxes.

pl + geom_histogram(binwidth=10, aes(fill=feed), color="black")

The Code

library(ggplot2)

head(chickwts)

#-- look at structure
str(chickwts)

#set data and x axis value
pl <- ggplot(data=chickwts, aes(x=weight))

#create histogram, binwidth 10
pl + geom_histogram(binwidth=10)

#change bars to blue
pl + geom_histogram(binwidth=10, fill="blue")

#set color of bars by feed column values 
pl + geom_histogram(binwidth=10, aes(fill=feed))

#add black border around boxes
pl + geom_histogram(binwidth=10, aes(fill=feed), color="black")

Excel Solver: Optimization Models: Linear Programming 1

On May 29, 2016May 29, 2016 By Ben Larson Ph.D.In Optimization Modeling3 Comments

Let’s build a more complex model. Here is our problem:

You own a cabinet company and you are currently making 2 types of cabinets this month: wall and base. The wall cabinet sells for $300 and the base sells for $450. The wall unit cost $150 in labor to build, while the base cabinet costs $225.

The cabinets are built from a combination of plywood (which cost $11 per sq foot) and oak (which costs $15 per sq foot). The wall unit needs 5 sq ft of plywood and 1 sq ft of oak. The base unit needs 6 sq ft of plywood and 2 sq ft of oak.

You bought your materials ahead of time to get the best rate. You have 10000 sq ft of plywood and 3000 sq ft of oak available. Finally, based on previous sales records, you estimate the most you will be able to sell is 600 wall cabinets and 1200 base cabinets this month.

So you want to know – how many of each cabinet should I build to maximize my profit?

Let’s Model It

Okay, I know. This just gave you horrible flash backs to math class and word problems. But the truth is, this is what your math class was trying to prepare you for. We are going to use math to create a model that mimics the real world and solves a problem.

Open up Excel and let’s start.

***Note: I suggest you try building this model yourself first. If you need help, just follow my steps below, but only follow me as needed.

Let’s start by inputting some of the information we know:

Known Info

Prefilled Excel for Known Info: cabinet1

Price of materials:

Amount of Available Materials

lp2

Now add what we know about building the cabinets

Finally, input your build limits

lp4

Here is our model so far

Changing Variable

In this model, our changing variable is how many of each cabinet we will build.

lp6

Objective

Next, let’s set up out Objective. In our model, the object will be total profit.

Calculations

Now you need to add the calculations that will make your model work.

Excel File with Calculations: cabinet_with_equations

First let’s calculate material use. I will use SUMPRODUCT() to do this. In the example below, I am multiplying the number of cabinet types built(changing cells F9:G9) by the material requirements (plywood B7:C7)

Then repeat for oak.

lp8

Now calculate profit for each cabinet.

Here is how I did it. Selling Price – Labor Cost – (plywood cost per sq ft * sq ft used + oak price per sq ft * ft used)

I repeated for Floor cabinets

Finally I am calculating the final profit. This is done by using Sumproduct() again. This time it is sumproduct(Profit, Actual Build)

Run Solver:

Click on Data and Solver in Ribbon up top.

Now set the Objective to you Profit Cell (A15), and the Change Variables to your Build cells (PINK – F9:G9). Click Max and set the Solving Method to Simplex LP

lp11

Constraints

Hit Add to the right of the Constraints window and add your constraints

lp12

Looking at your spreadsheet, your constraints below state

plywood used <= plywood available
oak used <= oak available
Wall built <= max Wall Build
Floor built <= max Floor Build

Now Hit Solve

Check Keep Solver Solution:

Hit OK

Solution

Now you have a solution. According to Solver, you should build 560 Wall and 1200 Floor cabinets to maximize your profit.

R: Intro to ggplot()

On May 28, 2016May 28, 2016 By Ben Larson Ph.D.In RLeave a comment

ggplot() is a powerful graphing tool in R. While it is more complex to use than qplot(), its added complexity comes with advantages.

Don’t worry about the complexity, we are going to step into it slowly.

Let’s start by getting a data set. I am going to choose airquality from the R data sets package. You can find a list of data sets from that package here: Link

ggplotIntro

We are going to manipulate some of this data, so first let us set the dataframe to a new variable.

Next, take a look at the structure – str() – we have 6 variables and 153 rows. The first thing I notice is that Month is an int – I don’t want that. I would rather have Month be a factor (a categorical variable)

Use factor() to set Month to a factor. Now look at str() readout again. There are 5 levels (months) represented in our data set

ggplotIntro3

ggplot()

here is the basic syntax – ggplot(data, aes(x value, y value))+geom_[type of plot]

ggplot(data = airQ,  aes(x = Wind, y = Temp)) + geom_point()

ggplotIntro4

Now let’s add some color to the chart. Inside aes() – which stands for aesthetics – we are going to add color = Month

 ggplot(data = airQ, aes(x = Wind, y = Temp, color = Month)) + geom_point()

ggplotIntro5

Now add some size. Inside aes() add size = Ozone.

ggplot(data = airQ, aes(x = Wind, y = Temp, color = Month, size = Ozone)) + geom_point()

ggplotIntro6

The Code

library(ggplot2)

head(airquality)

#-- assign data to a new variable
airQ <- airquality

#-- look at structure
str(airquality)

#-- set month as a factor
airQ$Month <- factor(airQ$Month)
str(airQ)

#-- plot wind vs temp - scatter plot
 ggplot(data = airQ, aes(x = Wind, y = Temp)) + geom_point()
 
#-- plot add factor (Month) as a color
 ggplot(data = airQ, aes(x = Wind, y = Temp, color = Month)) + geom_point()
 
#-- set Ozone as a size element
ggplot(data = airQ, aes(x = Wind, y = Temp, color = Month, size = Ozone)) + geom_point()

Excel: Charts, Customizing Which Columns You Want to Chart

On May 27, 2016May 27, 2016 By Ben Larson Ph.D.In excelLeave a comment

Here is the data set if you want to play along: ExcelCharts2

While Excel does a pretty good job of auto-formatting charts based on your data, sometimes a little customization work is in order. Take this data below for example:

What if I want to create a bar chart of the average readings?

Well first we need to make an “Average” column

I really don’t need all the decimal places. Hit the decimal reduce button in the picture below until you have a column full of integers.

Chart Method 1

Just like you would normally, select all the data and go to Insert > Column Chart

Right Click on your new chart and Select Data…

I don’t want my readings on the X axis, I want my Sensor Names, so click Switch Row/Column in the new window.

The X Axis is now A,B,C,D

Now, we only want to see the Average column, so uncheck Reading 1,2,3

Here is our chart

chart10

Chart Method 2

Don’t select any data yet. Just go to the Insert and Select Column Chart

chart11

Click Select Data now from the Ribbon (if you don’t see this, click anywhere on your new blank chart and this should appear in the Ribbon)

chart12

Click Add in the low left box.

In Series Name, either type a name or select the column header

In Series values: put put your data rows from the Average column

Now we have data loaded into our chart, let’s change the data labels for our X Axis. In the Horizontal(Category) Axis Labels – click Edit

Highlight A,B,C,D from the Sensor column

chart17

Now you see the numbers have been replaced by A,B,C,D

chart18

Here is our chart

Excel: Randomly Select Names from a List

On May 27, 2016June 10, 2022 By Ben Larson Ph.D.In excel4 Comments

One of the main elements of statistical analysis is making sure your data is a good representation of the population. The best way to do that is to make sure your sampled data is truly randomly selected.

I am going to show a quick method in Excel where we can select random names from a list (our population) of names. We will be using the following Excel functions to do so (if you are not familiar with them, don’t worry, I will walk you through it):

index()
randbetween()

We will also be creating a Named Range.

The List

Let’s start with the list. I have 14 names, and I want to randomly select 5 names from this list.

Index()

Index is a method for using the looking up elements in Excel using the indexes (row numbers, column numbers). Note the index is always a number, so when dealing with columns, don’t be tricked by the column letters assigned.

Here I am looking for the 5th index in my list of names

=index(list,index)

My result is Steve

You can also work horizontally. Here I am looking for index 3 from my horizontal list.

Result is d

Now watch this. I am asking for the 4th index in this list. What do you think it will be? The 4th row is Tyrone.

Nope, when running INDEX() the index starts at 1 – starting at the first element in the selected list. If you start counting with Gary, Karen is our 4th person down.

RANDBETWEEN()

RANDBETWEEN accepts 2 arguments, high and low numbers in a range. It then returns a random integer from with the range.

RANDBETWEEN(1,20) will return an integer from 1 to 20 (including 1 and 20)

Named Range

Now before me move forward, when formulas start to get larger (as they are about to) I look for ways to simplify the syntax. One trick would be to name our range.

Highlight all the names in your list, then go up to the name box and enter Names

Now combine RANDBETWEEN() and INDEX()

now try this: =INDEX(Names, RANDBETWEEN(1,14))

Hit enter and drag the formula down 5 cells

Now you have 5 random names selected from your list.

SQL: Reindex a Database

On May 25, 2016August 12, 2022 By Ben Larson Ph.D.In MS SQL ServerLeave a comment

Indexes in databases work kind of like an index in a book. Instead of having to look at each page in a book for something, you can just go to the index – find your topic in an alphabetized list, and go to the page number indicated. Databases use indexes so they do not have to look at every single row in a table that could contain hundreds of thousands up to billions of rows.

The problem is, with all the constant reading and writing to a live database, the indexes quickly become fragmented as they try to keep up will all the new data coming and going. After a while this fragmentation can start to have an effect on your database’s performance.

Rebuild an Index

***Don’t attempt this on a production database while it is in use. Make sure the database is not being used before trying anything in this lesson.

The act of defragging an index in SQL Server is known as rebuilding. You can do it which just a few mouse clicks.

First, let’s find our indexes. You can find Indexes nested under tables in the Object Explorer. I won’t go in depth on Clustered vs Non-Clustered, only know the each table can only have 1 Clustered Index. You can think of it as the master index for that table if it helps.

Right click on an index and go to Properties

Select Fragmentation from the Select a page window. Note my index is 66.67% fragmented.

Click out of that window and right click on your index again. This time click Rebuild

reindex1

Click Okay and the window and your Index will be rebuilt. Simple enough.

Rebuild All Indexes in a Table

If you want to rebuild all the indexes in a table, you can click on the Index folder and click Rebuild All

Then click okay

Or you can use the follow SQL Code

Reindex Code

DBCC stands for Database Console Commands. It is a list of useful tools you can use to administer a SQL Server. The syntax below is as follows: DBCC DBREINDEX(TABLE NAME, Index you want to rebuild (‘ ‘ = all indexes) , fillfactor)

DBCC DBREINDEX([HumanResources].[Employee],' ',90)

A quick note on fill factor. A fill factor of 0 or 100 tells SQL to fill every index page completely – leave no extra room. If the data was stagnant that could work, but when data is constantly being written and deleted, the indexes need room for correction. That is why you will often see 80 or 90 used as a fill factor. It gives a little wiggle room for the real life functionality of the database.

Reindex All the Tables

If you want to Reindex all the tables in a database, you can do it using a Cursor and While loop. If you do not know cursors in SQL, check out my previous lesson on cursors: SQL: Learn to use Cursors – List table names

The only new elements you will notice here is that I am combining TABLE_SCHEMA+’.’+TABLE_NAME. I give an example below to show you how it works.

Note, this query takes a few seconds (or minutes depending on speed of machine and database size) to run. You will not see anything until the query is completed.

use AdventureWorks2012
go

declare @tableName nvarchar(255)

declare myCursor CURSOR FOR
select TABLE_SCHEMA+'.'+TABLE_NAME
from INFORMATION_SCHEMA.TABLES
where TABLE_TYPE = 'base table'
open myCursor
Fetch next from myCursor into @tableName
While @@FETCH_STATUS = 0
Begin
print 'Working on: '+@tableName
DBCC DBREINDEX(@TableName,' ',90)
Fetch next from myCursor into @tableName
end
close myCursor
Deallocate myCursor

TABLE_SCHEMA+’.’+TABLE_NAME

First result set, schema and table name are different columns. Second result set has them concatenated with a . in between.

SQL: Learn to use Cursors – List table names

On May 25, 2016May 25, 2016 By Ben Larson Ph.D.In MS SQL Server1 Comment

In this lesson we are going learn to use cursors to print a list of main tables in a database.

As always, I am using Adventure2012 as my database. If you do not have it, and you would like to play along, please refer to the following lessons to get up and running:

First let me show you where to go to find a list tables. SQL Server maintains a list of tables in a table called Information_Schema.Tables. Running the code below will give you results seen below.

Use AdventureWorks2012
go

SELECT * FROM INFORMATION_SCHEMA.TABLES
go

The last column, TABLE_TYPE lets you know what kind of table you are working with: BASE TABLE is the tables you generally refer to as Tables in your database. The other type you will see in this database is VIEW.

Cursor

In SQL, cursors are a way of letting your step through a result set one row at a time. While the mere mention of cursors can cause many DBA’s and database programmers to twitch and convulse, I still think you should learn about them to have a well rounded SQL education. Also, cursors have come in handy for me more times than I can count.

Look at the code below:

declare @tableName nvarchar(255)

select @tableName = TABLE_NAME
from INFORMATION_SCHEMA.TABLES

print @tableName

First I declare a variable “@tableName” and set it with a select statement. I assign @tableName the values in column TABLE_NAME from INFORMATION_SCHEMA.TABLES.

However when I run print @tableName, you will see the variable only holds the last value. Each previous value was replaced by the one after it.

I want to print each value in the column. So we going to use a cursor

— indicates a REM comment

declare @tableName varchar(255)
declare myCursor CURSOR -- name your cursor
FOR                     -- set cursor to a result set
select TABLE_NAME
from INFORMATION_SCHEMA.TABLES
where TABLE_TYPE = 'base table'
open myCursor                    -- get our cursor ready to use
Fetch next from myCursor into @tableName -- Fetch gets first  
                                         -- row in cursor results
While @@FETCH_STATUS = 0         --@@FETCH_STATUS is 0 while you still
                                 -- have rows in your cursor set. It changes
                                 -- to -1 when your cursor hits the final row
Begin                  -- Begin while loop
print @tableName       -- prints table name from row cursor is currently on
Fetch next from myCursor into @tableName  -- go to next row
end                   -- end While loop
close myCursor        -- close cursor for use
Deallocate myCursor   -- delete values from cursor

R: Intro to qplot()

On May 24, 2016May 24, 2016 By Ben Larson Ph.D.In R2 Comments

qplot() stands for Quick Plot. It is an easy to use plotting tool for R. In order to use qplot(), you need to have ggplot2 installed. To see if you have it, just type the following:

library(ggplot2)

If you get an error, you will need to install the package. To install package:

install.packages("ggplot2")

library(ggplot2)

Okay, now let’s get some data. R comes will a great collection of data sets you can work with. You can search Google for R data sets. A great source is also the following webpage: https://stat.ethz.ch/R-manual/R-devel/library/datasets/html/00Index.html

I have decided for this exercise to use the ChickWeight data set listed below.

To access the dataset in R, simply use the name.

As you can see, the data set contains 4 columns, weight, Time, Chick, Diet

head(ChickWeight)

Using qplot(), we can get right down to plotting. The basic syntax is: qplot(data, x value)

qplot(data=ChickWeight, x = weight)

Notice qplot() automatically labeled our axis. It even created automatic bins for us.

Let’s add a y value

qplot(data=ChickWeight, x = weight, y = Time)

qplot3

Let’s add Diets to the mix now.

qplot(data=ChickWeight, x= weight, y = Time, color =Diet)

What jumps out at me right away is the fact the Diet 1 always seems to trail in weight, while Diet 4 seems to lead in weight gain in the earlier weeks, but it is overtaken by 2 and 3 starting around week 12.

Now let us add a size element to our graph

qplot(data=ChickWeight, x= Time, y = Chick,size=weight, color =Diet)

Now we have Chick’s on the y axis, Time on the X, the size of the dot correlates to weight and the color is diet.

The first thing I notice now is that our data is skewed. We have a lot more chicks on diet 1 than on the other 3 diets. That could effect how we read this data.

And finally, one more just for fun.

qplot(data=ChickWeight, x= weight, y = Chick, color =Diet)

Qlik Sense: A first Impression

On May 24, 2016May 24, 2016 By Ben Larson Ph.D.In Qlik, VizualizationLeave a comment

So I am trying out Qlik Sense as a possible BI option for work. I have some limited experience (through grad school) working with another one of their products Qlik View. I will say right from the get go, Qlik Sense is much more user friendly.

Opening it up, I get the following splash screen

Qlik then asks if I would like to Create a New App.

I click Create A New APP and I get a window to name it.

It says New app created, and asks if I want to open my app

Now I am brought to a page to add my data. I have two options, one a more GUI driven method, and the other that will allow me to create my data model using Qlik’s script (very similar to SQL).

I am going to go with the GUI method for now.

I choose Excel files

qlikSense5

Next I select a file QlikTEst.xlsx that I put together for this test drive.

My file appears in preview mode. Notice I am able to select both sheets to load and unselect columns I don’t want.

I click Prepare Data at the bottom and I am brought to this screen. Notice the red triangle warning on the bottom of the screen. It says I have possible associations.

I click the warning and I now have a recommendation for an association to make between sheets. My Equipmentkey column matches up(53%) on both sheets.

qlikSens9

Just double click the Recommendations box and your pages are now joined

qlikSens11

Now hit Load data

Qlik now brings you to a work sheet.

I want to create a bar chart, so I drag one in from the menu column.

Now I pick my Month as my dimension

And I choose WorkOrderNo as my measure

Now I choose an Aggregate of Count()

And I have a bar chart.

Now, let’s add a filter – drag it in from the menu column

qlikSens20

Select FacilityCode as my dimension

I can adjust the filter size by clicking on these points

Now hit done at the top of the screen

And I now have a live interactive bar chart.

My Thoughts on Qlik Sense

Qlik Sense is insanely easy to get up an running. You really don’t need to know much more than how to drag and drop. However, if you are like me and enjoy getting under the hood, Qlik does allow you custom code your data models and formulas.

My only complaint, and this could be because I am new and don’t know a work around, is having to hit Done to leave the edit screen before I can test the interactivity of my charts.

SQL: Working with Date/Time Functions

On May 24, 2016May 24, 2016 By Ben Larson Ph.D.In MS SQL Server, UncategorizedLeave a comment

SQL has a pretty extensive list of date time functions that come in handy when writing queries. I am going to cover a handful of the most common ones today.

GETDATE()

Getdate() returns the current datetime

select getdate()

sqlDate6

Date Parts

use AdventureWorks2012
go

SELECT *
 FROM [AdventureWorks2012].[Sales].[ShoppingCartItem]

First, looking at the data above. The data columns are in a format known as datetime in SQL (you may also hear many refer to it as a time stamp). It contains a date and time. What if you only want the date?

Convert()

SELECT convert(date, ModifiedDate)
 FROM [AdventureWorks2012].[Sales].[ShoppingCartItem]

To select the time only

SELECT convert(time, ModifiedDate)
 FROM [AdventureWorks2012].[Sales].[ShoppingCartItem]

YEAR(), MONTH(), DAY

SELECT year(ModifiedDate) as Year, month(modifieddate) as Month,
day(modifieddate) as day
from sales.SalesOrderDetail

sqlDate3

You can filter by date elements too.

SELECT year(ModifiedDate) as Year, month(modifieddate) as Month,
day(modifieddate) as day
from sales.SalesOrderDetail
where day(modifieddate) like '28'

Datepart()

You can extract time elements using Datepart()

SELECT datepart(hour, modifieddate) as hour, datepart(Minute, ModifiedDate) as minute
 FROM [AdventureWorks2012].[Sales].[ShoppingCartItem]

DateAdd()

dateadd() lets you add or subtract days or months or years to a date. In the example below I add 10 days. The syntax is dateadd(interval(d, m, y), amount, date)

select dateadd(d,10,getdate())

DateDiff()

Datediff() gives you the difference between two dates. Look at the dates in the query below:

We want to know how many days difference between DueDate and ModifiedDate

SELECT datediff(d,duedate, ModifiedDate)
 FROM [AdventureWorks2012].[Purchasing].[PurchaseOrderDetail]

	Anonymous on Python: Accessing a SQL databa…
	Anonymous on Top 7 skills a Data Analyst ha…
	lovingfox4e1d0e653e on Data Jobs: What does a Data An…
	Anonymous on Top 7 skills a Data Analyst ha…
	Anonymous on Python Web Scraping / Automati…

	Anonymous on Python: Accessing a SQL databa…
	Anonymous on Top 7 skills a Data Analyst ha…
	lovingfox4e1d0e653e on Data Jobs: What does a Data An…
	Anonymous on Top 7 skills a Data Analyst ha…
	Anonymous on Python Web Scraping / Automati…

	Anonymous on Python: Accessing a SQL databa…
	Anonymous on Top 7 skills a Data Analyst ha…
	lovingfox4e1d0e653e on Data Jobs: What does a Data An…
	Anonymous on Top 7 skills a Data Analyst ha…
	Anonymous on Python Web Scraping / Automati…

	Anonymous on Python: Accessing a SQL databa…
	Anonymous on Top 7 skills a Data Analyst ha…
	lovingfox4e1d0e653e on Data Jobs: What does a Data An…
	Anonymous on Top 7 skills a Data Analyst ha…
	Anonymous on Python Web Scraping / Automati…

	Anonymous on Python: Accessing a SQL databa…
	Anonymous on Top 7 skills a Data Analyst ha…
	lovingfox4e1d0e653e on Data Jobs: What does a Data An…
	Anonymous on Top 7 skills a Data Analyst ha…
	Anonymous on Python Web Scraping / Automati…