# Python: An Interesting Problem with Pandas

I was writing a little tongue and cheek article for LinkedIn on fraud detection using frequency distributions (you can read the article here: LinkedIn). While this was a non-technical article, I wanted to use some histograms from a real data set, so I uploaded a spread sheet into Python and went to work.

While working with the data I ran into an interesting problem that had me chasing my tail for about 10 minutes before I figured it out. It is a fun little problem involving Series and Dataframes.

As always, you can upload the data set here: FraudCheck1

```import pandas as pd
("C:\\Users\\Benjamin\\OneDrive\\Documents\\article\\python\\FraudCheck1.xlsx") The data is pretty simple here. We are concerned with our answer column and the CreatedBy (which is the employee ID).  What I am trying to do is see if the “answer”  (a reading from an electric meter) are really random or if they have been contrived by someone trying to fake the data.

First, I want to get the readings for all the employees, so I used pop() to place the answer column into a separate list.

```df1 = df Then, to make my histogram more pleasant looking, I decided to only use the last digit before the decimal. That way I will have 10 bars (0-9). (Remember, this is solely for making charts for an article. So I was not concerned with any more stringent methods of normalization)

What I am doing below is int(199.7%10). Remember % is the modulus – leaves you with the remainder and int converts your float to an integers. So 199.7 is cut to 199. The 199/10 remainder = 9.

```a= []
i = 0
while i < len(y):
a.append(int(y[i]%10))
i += 1
a[1:10]``` Then I created my histogram.

```%matplotlib inline
from matplotlib import pyplot as plt
plt.hist(a)``` ## Now my problem

Now I want graph only the answers from employee 619, so first I filter out all rows but the ones for employee 619.

```df2 = df.query('CreatedBy == 619')

Then I ran my loop to turn my answers into a single digit.

And I get an error.  Why? Well the answer lies in the datatypes we are working with. Pandas read_excel function creates a Dataframe.

When you pop a column from a dataframe, you end up with a Series. And remember a series is an indexed listing of values.

Let’s look at our Series. Check out the point my line is pointing to below. Notice how my index jumps from 31 to 62. My while loop counts by 1, so after 31, I went looking for y1 and it doesn’t exist. Using .tolist() converts our Series to a list and now our while loop works. And now we can build another histogram. ## The Code

```import pandas as pd
("C:\\Users\\Benjamin\\OneDrive\\Documents\\article\\python\\FraudCheck1.xlsx")

df1 = df

a= []
i = 0
while i < len(y):
a.append(int(y[i]%10))
i += 1
a[1:10]

%matplotlib inline
from matplotlib import pyplot as plta1 = []
i = 0
while i < len(y2):
a1.append(int(y2[i])%10)
i = i+1
a1[1:10]

plt.hist(a)

df2 = df.query('CreatedBy == 619')

y2= y1.tolist()
type(y2)

a1 = []
i = 0
while i < len(y2):
a1.append(int(y2[i])%10)
i = i+1
a1[1:10]

plt.hist(a1)

```

# R: Intro to qplot()

qplot() stands for Quick Plot. It is an easy to use plotting tool for R. In order to use qplot(), you need to have ggplot2 installed. To see if you have it, just type the following:

`library(ggplot2)`

If you get an error, you will need to install the package. To install package:

```install.packages("ggplot2")

library(ggplot2)```

Okay, now let’s get some data. R comes will a great collection of data sets you can work with. You can search Google for R data sets. A great source is also the following webpage: https://stat.ethz.ch/R-manual/R-devel/library/datasets/html/00Index.html

I have decided for this exercise to use the ChickWeight data set listed below. To access the dataset in R, simply use the name.

As you can see, the data set contains 4 columns, weight, Time, Chick, Diet

`head(ChickWeight)` Using qplot(), we can get right down to plotting. The basic syntax is: qplot(data, x value)

`qplot(data=ChickWeight, x = weight)`

Notice qplot() automatically labeled our axis. It even created automatic bins for us. `qplot(data=ChickWeight, x = weight, y = Time)` Let’s add Diets to the mix now.

`qplot(data=ChickWeight, x= weight, y = Time, color =Diet)`

What jumps out at me right away is the fact the Diet 1 always seems to trail in weight, while Diet 4 seems to lead in weight gain in the earlier weeks, but it is overtaken by 2 and 3 starting around week 12. Now let us add a size element to our graph

`qplot(data=ChickWeight, x= Time, y = Chick,size=weight, color =Diet)`

Now we have Chick’s on the y axis, Time on the X, the size of the dot correlates to weight and the color is diet.

The first thing I notice now is that our data is skewed. We have a lot more chicks on diet 1 than on the other 3 diets. That could effect how we read this data. And finally, one more just for fun.

`qplot(data=ChickWeight, x= weight, y = Chick, color =Diet)` # Qlik Sense: A first Impression

So I am trying out Qlik Sense as a possible BI option for work. I have some limited experience (through grad school) working with another one of their products Qlik View. I will say right from the get go, Qlik Sense is much more user friendly.

Opening it up, I get the following splash screen Qlik then asks if I would like to Create a New App. I click Create A New APP and I get a window to name it. It says New app created, and asks if I want to open my app Now I am brought to a page to add my data. I have two options, one a more GUI driven method, and the other that will allow me to create my data model using Qlik’s script (very similar to SQL).

I am going to go with the GUI method for now. I choose Excel files Next I select a file QlikTEst.xlsx that I put together for this test drive. My file appears in preview mode. Notice I am able to select both sheets to load and unselect columns I don’t want. I click Prepare Data at the bottom and I am brought to this screen. Notice the red triangle warning on the bottom of the screen. It says I have possible associations. I click the warning and I now have a recommendation for an association to make between sheets. My Equipmentkey column matches up(53%) on both sheets. Just double click the Recommendations box and your pages are now joined  Qlik now brings you to a work sheet. I want to create a bar chart, so I drag one in from the menu column. Now I pick my Month as my dimension And I choose WorkOrderNo as my measure Now I choose an Aggregate of Count() And I have a bar chart. Now, let’s add a filter – drag it in from the menu column Select FacilityCode as my dimension I can adjust the filter size by clicking on these points Now hit done at the top of the screen And I now have a live interactive bar chart. ### My Thoughts on Qlik Sense

Qlik Sense is insanely easy to get up an running. You really don’t need to know much more than how to drag and drop. However, if you are like me and enjoy getting under the hood, Qlik does allow you custom code your data models and formulas.

My only complaint, and this could be because I am new and don’t know a work around, is having to hit Done to leave the edit screen before I can test the interactivity of my charts.

# R: Graphing with matplot()

matplot() is a R function that you can use to make easy graphs in R.

Let’s start by creating a data set.

We are going to create a 5 x 10 matrix representing 10 quiz and test grades for 5 students

```x <- sample(50:100,50, T)
x
# convert to a 5x10 matrix
A <- matrix(x,5,10)
A``` Let’s make name our rows and columns

```#name and assignments vectors
names <- c("Sara", "Jill", "Jared", "Kim", "Don")
assignments <- c("Quiz 1", "Quiz 2", "Test 1", "Quiz 3", "Quiz 4", "Midterm", "Quiz 5", "Quiz 6", "Test 2","Final")
#assign labels to columns and rows
row.names(A) <- names
colnames(A) <- assignments
A``` ### Let’s graph it

Start simple

`matplot(A)` Note our X axis goes to 5 – we are graphing students against grades.

What if I want to graph assignments against grades? Simple, transpose the matrix

`matplot(t(A))` Okay, but the numbers are distracting. Let’s do something about that.

`matplot(t(A), type="b")` I don’t like the numbers

```# replace numbers with shapes
matplot(t(A), type="b", pch=15:19)```

pch 15:19 lets us rotate through 4 shapes (15 – 19) – try different numbers on your own I am adding a color element next. You will see the need for this in our next section.

` matplot(t(A), type="b", pch=15:18, col=c(1:5))`

This has no current effect on our graph

## Legend

```legend("bottomleft", inset=0.01, legend=names, col=c(1:5),pch=15:19,
bg= ("white"), horiz=F)```

okay, here is the syntax: legend(location, padding, legend data, colors, symbols, background, horizontal vs vertical layout)

This is why I added the color element to our matplot, so the colors in the legend would match up to the colors in my graph. # Excel: Scatter Plots

Our data set contains 3 sets of readings for 4 sensors (A,B,C,D) ## Scatter Plot

Highlight Columns A and B – From Ribbon > Insert>Scatter Here is a close up of the Scatter Plot icon Here is our plotting of Reading 1 To add the Reading 2 column to the plot, right click on the chart area and Select Data  • Select the Heading from Column C for Series Name
• A2:A5 as Series X values
• C2:C5 as Series Y values  Now, double click on the Y-Axis and go to the Format Axis box that will appear on your right.

Select the Bar Chart Icon, and change the Axis Bounds to minimum of 40 and max of 100. This will help make the spacing between dots more pronounced. Generally altering a Y Axis away from 0 is considered bad taste as it tends to over-pronounce differences between elements. Now click on X-Axis and make the changes below. Next, go to Add Chart Element in the upper right corner. Legend>Right Here is our scatter plot Now wait a minute. I know what you are thinking, why is my X Axis 1 – 4 and not A,B,C,D

This is a flaw in Excel. There are some third party packages you can install that will allow you to rename the X Axis in a XY Scatter plot, but not with Excel in its default state.

You can however change the chart type to a Line Chart.

Go to Ribbon > Change Chart Type

Select Line Now your X-Axis is properly labeled # Python: Line Graph

Let’s build on what we learned in : Python: Intro to Graphs

First, import pyplot from matplotlib

Remember %matplotlib inline let’s you see your graphs in the jupyter notebooks ## Line 1

Note I am using a list comprehension to fill the x axis. Now, let us do some formatting:

• ‘r:’    –   red dotted line
• label = ‘up’   –  adds label to the line
• py.legend(loc=9)  – adds a legend to the chart ## Line 2

For line 2, I want to create a reverse curve:

• d = u  – list u is copied to d
• d.reverse() – reverses values in d Plot the line.

• ‘g-.’ – green dashed line Combine the two lines ## Line 3

Here I use another list comprehension  and zip to make our last like. This adds each element of list u and d in order.

• ‘b-‘ – blue solid line If you enjoyed this lesson, click LIKE below, or even better, leave me a COMMENT.

# Python: Intro to Graphs

Visualizations are big part of analytics. You will need to produce visually engaging graphics for presentations, reports, and dashboards. You will also make graphs for your own use in data discovery and analysis. As bonus, unlike data cleaning, data viz can be pretty fun.

## Matplotlab and Pyplot

Matplotlab is a module you can import into Python that will help you to build some basic graphs and charts. Pyplot is part of Matplotlab and the part we will be using in the following example.

**If you are using the Anaconda Python distribution, Matplotlab is already installed. If not, you may need to download it from another source.

## Line Graph

Syntax

• %matplotlib inline – this code allows you to view your graphs inside jupyter notebooks
• from matplotlib import pyplot as plt – here we import pyplot from matplotlib into our program (note, we only want pyplot not all the functions in matplotlib).Adding “as plt” gives us a shorter alias to work with
• age and height lines – fill our lists with age and height information for an individual
• plt.plot(age, height, color = ‘blue’) – here we tell Python to plot age against height and to color our line blue
• plt.show() – prints out our graph ## Bar Chart

For this example, we will make a bar charting showing ages of 4 people.

Syntax

• You should understand the first few lines from the first example
• count_names = [i for i,_ in enumerate(name)]  – since the name list is a list of strings, we cannot really graph that onto a chart. We need a way to convert these strings into numbers.

Wait? What does for i,_ mean? Let’s jump to the next code sample While you don’t see it when making the list, a Python list is technically a list of tuples (index number, element). So if instead of i,_ we asked for both elements in the tuple, (i,j) we would get the following. So by iterating by for i,_ we only return the first element in the tuple (the index)

** notice we are using a list comprehension. If you are unfamiliar with list comprehensions, check out my earlier post: Python: List Comprehension

Let’s clean up our bar chart a little now.

• plt.ylabel(‘Age’) – label the y-axis Age
• plt.title(‘Age of People’) – give the graph a title
• plt.xticks([i+0.5 for i,_ in enumerate(name)], name) – this label function is using a list comprehension to first chose the position on the X-axis, and name provides the person’s name for the label. If you enjoyed this lesson, click LIKE below, or even better, leave me a COMMENT # Intro to Tableau: Line Chart: 3 or More Measures

## Note:

If you do not currently have Tableau, you can download a free version at: https://public.tableau.com/s/

# Line Chart: 3 or More Measures

This lesson is a continuation of an earlier lesson. If you are already familiar with Tableau, feel free to continue on. Otherwise, check out my first Tableau lesson: Line and Bar Charts

If you want to add 3 or more measures to a line chart, you need to take a different approach than in regular charts.

## Import the Data

If you are continuing on from the Line and Bar Charts lesson, you can skip this step, your data is already loaded. ## Create a New Worksheet

Click the New Worksheet icon found on the bottom of your screen. Drag Year to Columns and Measure Values to Rows 1. Get rid of Sum(Number of Records) by dragging it back into Measures
2. While holding down Ctrl drag Measure Names from the Dimensions slot to Color And there you have it. 3 Measures in one chart.  # Intro to Tableau: Dual Axis Charts

## Note:

If you do not currently have Tableau, you can download a free version at: https://public.tableau.com/s/

# Dual Axis Charts

This lesson is a continuation of an earlier lesson. If you are already familiar with Tableau, feel free to continue on. Otherwise, check out my first Tableau lesson: Line and Bar Charts

## Import the Data

If you are continuing on from the Line and Bar Charts lesson, you can skip this step, your data is already loaded. ## Create a New Worksheet

Click the New Worksheet icon found on the bottom of your screen. Drag Year from Dimensions and Free from Measures into Columns and Rows respectively. You should now have a line chart. (if not, refer to Lesson 1 for troubleshooting tips) Now, drag Full Price into Rows. You should now notice you have two graphs. Free up top and Full Price on the bottom. Now you could just stop there. You do have both Measures graphed. But this really isn’t the best way analyze this data. It is hard to do a good comparison this way.

## Dual Axis

For better analysis, we are going to create a Dual Axis Chart.

Right click on the Y Axis of the bottom chart and select Dual axis Now you have both measures on one graph. If you look closely at the Left and Right Y-Axis’s, you will notice they are not the same. This could skew how someone would interpret this data.

To fix this, right click on the Right Y-Axis and select Synchronize axis Finally, since both of your Y-Axis match up, you don’t need them both. Right click on the Right Axis again and uncheck Show header.

Previous > Lesson 1

Next > 3 or More Measures # Intro to Tableau: Line and Bar Charts

## Note:

If you do not currently have Tableau, you can download a free version at: https://public.tableau.com/s/

# Lines and Bar Charts

## Start Tableau

When you start up Tableau, the first thing you need to do is select a data source. In this case, select Excel and choose the file the you downloaded above (schoolLunch.xls)

Once loaded, the Data Source Page will open up.

a. Data Source File

b. Shows Sheets in the file (there is only one sheet in this particular file)

c. Shows data. ## The Data

In this example, we are looking at the number of kids receiving Free, Reduced Priced, and Full Price lunches at American public schools from 1971 to 2015.

## Line Chart

Start by making a new sheet ### A Quick Note About Dimensions and Measures:

Notice on the new sheet that the columns from your imported Excel sheet have been placed into two boxes on the left of the screen: Dimension and Measures. Think of Dimensions as Factors or Labels. While Measures are columns you would perform calculations against (adding, averaging, etc).

Drag Year from Dimensions and Free from Measures into Columns and Rows respectively. The line graph should appear automatically. If not, follow the next steps:

First, make sure your Row variable says SUM(Free)this means we are summing up all numbers in the Free column— If it doesn’t, hover over the measure until a small downward arrow appears. Then go to Measure and select Sum. If you don’t have a line chart, go to Marks and select Line from the drop down menu ## Bar Chart

Now, go to Marks again and select Bar. Your chart will change over to a bar chart. Try a few of the other options like Area and Shape. Next Lesson: Dual Axis Charts