Simpson’s Paradox: How to Lie with Statistics

We’ve all heard the saying from Benjamin Disreali, “Lies, damn lies, and statistics.”

While statistics has proven to be of great benefit to mankind in almost every endeavor, inexperienced, sloppy, and downright unscrupulous statisticians have made some pretty wild claims. And because these wild claims are often presented as statistical fact, people in all industries – from business, to healthcare, to education -have chased these white elephants right down the rabbit hole.

Anyone who has taken even an introductory statistics course can tell you how easily statistics can be misrepresented. One of my favorite examples involves using bar charts to confuse the audience. Look at the chart below. It represents the number of games won by two teams in a season of beer league softball.

2017-01-10_10-28-21

At first glance, you might think Team B won twice as many games as Team A, and that is indeed the intention of the person who made this chart. But when you look at the numbers to the left, you will see Team A won 15 games to Team B’s 20. While I am no mathematician, even I know 15 is not half of 20.

This deception was perpetrated by simply adjusting the starting point of the Y – Axis. When you reset it to 0, the chart tells a different story.

2017-01-10_10-31-54.jpg

Even Honest People Can Lie by Accident

In the example above, the person creating the chart was manipulating the data on purpose to achieve a desired effect. You may look at this and say I would never deceive people like that, the truth is – you just might do it by accident.

What do I mean? Let’s take an example from an industry fraught with horrible statistics – our education system.

Below you will find a chart depicting the average math scores on a standardized test since 2000 for Happy Town, USA. You will notice the test scores are significantly lower now than they were back in 2000.

2017-01-10_10-40-48

What does this mean? Are the kids getting stupider? Has teacher quality gone down? Who should be held accountable for this? Certainly those lazy tenured teachers who are only there to collect their pensions and leach off the tax payers.

I mean look at the test scores. The average score has dipped from around 90 to close to 70. Surely something in the system is failing.

Now what if I were to tell you that the chart above – while correct – does not tell the whole story. Test scores in Happy Town, USA are actually up – if you look at the data correctly.

What we are dealing with is something known in statistics as Simpson Paradox, and even some of the brightest academic minds have published research that ignored this very important concept.

What do I mean?

Let me tell you the whole story about Happy Town, USA. Happy Town was your average American middle class town. The economic make-up of this town in 2000 was 20% of the families made over $150K, 60% made between $150K and $50K, with 20% earning less than $50K a year.

In 2008, that all changed. The recession hit causing people to lose their jobs and default on their mortgages. Families moved out, housing prices fell. Due to the new lower housing prices, families from Non-So Happy Town, USA were able to afford houses in Happy Town. They moved their families there in hopes of a better education and better life for their children.

While the schools in Happy Town were better, the teachers were not miracle workers. These kids from Not So Happy Town did not have the strong educational foundation the pre-recession residents of Happy Town did. Many teachers found themselves starting almost from scratch.

No matter how hard these new kids and their teachers tried, they could never be expected to jump right in and perform as well as the pre-2008 Happy Town kids. The economic makeup of the town shifted. The under $50K’s now represent 60% of the town’s population, with 150K-50K making up only 30% and the top earners dwindling down to 10%.

So while taking an average of all the students is not a sign of someone necessarily trying to pull the wool over your eyes, it does not tell the whole story.

To see the whole story, and to unravel Simpson Paradox, you need to look at the scores across the different economic sectors of this town which has undergone drastic changes.

2017-01-10_11-26-34

Looking at from the standpoint of economic sector, you will see the scores in each sector have improved. With the under $50K improving at an impressive rate. Clearly the teachers and staff at Happy Town School are doing their job and then some.

So while the person who took the average of the whole school may not have intended to lie with their statistics, a deeper dive into the numbers showed that the truth was hidden inside the aggregate.

Keep this in mind next time someone shows you falling SAT scores, crime stats, or disease rates. All of these elements are easily affected by a shift in demographics. If you don’t see the breakdown, don’t believe the hype.

Factor Analysis: Picking the Right Variables

Factor Analysis, what is it?

In layman’s terms, it means choosing which factors (variables) in a data set you should use for your model. Consider the following data set:

2017-01-03_19-43-48.jpg

In the above example, the columns (highlighted in light orange) would be our Factors. It can be very tempting, especially for new data science students, to want to include as many factors as possible. In fact, as you add more factors to a model, you will see many classic statistical markers for model goodness increase. This can give you a false sense of trust in the model.

The problem is, with too many poorly chosen factors, you model is almost guaranteed to under perform. To avoid this issue, try approaching a new model with the idea of minimizing factors, only using the factors that drive the greatest impact.

It may seem overwhelming at first. I mean where do you start? Looking at the list above, what do you get rid of? Well, for those who really love a little self torture, there are entire statistics textbooks dedicated to factor analysis. For the rest of us, consider some of the following concepts. While not an exhaustive list, these should get you started in the right direction.

Collinearity

In terms of regression analysis, collinearity concerns itself with factors that have strong correlations with each other. In my example above, think Height and Weight. In general, as Height increases so does Weight. You would expect a 6’4 senior to easily outweigh a 4’11 freshman. So as one factor (Height) increases or decreases (Weight) follows in kind. Correlations can also be negative with one factor decreasing as another  factors increases or visa versa.

2017-01-03_19-57-58.jpg

The problem with these factors is that when used in a model, they tend to amplify their effect. So the model is skewed placing too much weight on what is essentially a single factor.

So what do you do about it?

Simply enough, in cases like this. You pick one. Height or Weight will do. In more complex models you can use mathematical techniques like Singular Value Decomposition (SVD), but I won’t cover that in this lesson.

I am also not going to cover any of the methods for detecting collinearity in this lesson, I will be covering those in further lessons. But it should be noted that a lot of times domain knowledge is really all you need to spot it. It doesn’t take a doctor to realize that taller people are generally heavier.

But wait…

I know what you are thinking, what about the 250 lb 5’1 kid or the 120 lb 6’2 kid? Well if you have enough of these outliers in your data and you feel that being over or under weight is an important variable to consider, I would recommend using a proxy. In this case, you could substitute BMI (body mass index – a calculation based on height and weight) to replace both height and weight as factors.

Stepwise

Stepwise regression is a method for determining which factors provide value to the model. The way it works (in the most basic definition I can offer) is you run your regression model with all your factors, removing the weakest factor each time (based on statistical evaluation methods like R^2 values and P values). This is done repeatedly until only high value factors are left in the model.

 

NEXT….Not “technically Factor Analysis” but can be useful in removing bad factors from your model.

 

Binning or Categorizing Data

Let’s say, looking at the data example above, our data covered all grades from 1-12. What if you want to look a kids in a two year period. You would want to bin the data into equal groups of 2: 1-2,3-4,5-6,7-8,9-10,11-12. You can now analyze the data in these blocks.

What if you wanted to measure the effectiveness of certain schools in the system. You might be wise to categorize the data. What that means is, we will take grades 1-6 and place them in one category (elementary), 7-8 in another(middle school), 9-12(high school).

Incomplete Data

Imagine a factor called household income. This is a field that very likely may not be readily answered by parents. If there are only a few missing fields, some algorithms won’t be too affected, but if there are a lot, say 5%, you need to do something about it.

What are you options?

You could perform a simple mean or median replacement for all missing values, or try to calculate a best guess based on other factors. You could delete records missing this value. Or, as I often do, just toss this factor away. Most likely any value this adds to your model is going to be questionable at best. Don’t fall for the Big Data more is always better trap. Sometimes simplicity wins out in the end.

Outliers and Erroneous Data

Outliers can really skew you model, but even worse, erroneous data can make you model absolutely worthless. Look out for outliers, question strange looking data. Unless you can come up with a real good reason why these should stay in your model, I say chuck the records containing them.

 

 

 

 

Excel: Data Transformations

Data Transformations

In statistics, data transformations are used, for lack of better words, to even out the numbers. While there are lots of reasons to perform transformations in the world of Statistics, in Data Science and Analytics, our main concern is to bring distributions to the normal form (the classic bell curve).

normal-curve

As with many concepts, sometimes it is easy to see by doing. So let’s jump into Excel, you can download the file to play along here: transform1

First things first – Install Analysis Toolkit

In order to created histograms in Excel (which we will need for this exercise), you need to install the Analysis Toolkit.

Click on the green “File” in the upper left corner of Excel, then click options.

transform

Click Add-Ins > select Analysis TookPak

transform1.jpg

Hit OK and now lets look at our data.

The Data

We have two columns –

UNITSF =10,000 records of house square footage

Bins = list of Bin sizes we will use for the histogram

transform3

First we will create a histogram to look at the data.

To do so, go to the Data tab in the Ribbon bar. Then on the far right select Data Analysis

tranform4.png

Select Histogram

transform5.png

For Input Range – select number values from the UNITSF column.

Bin Range – select number values from Bin column

transform6

Now you get a Frequency Distribution. Let’s visualize it to see better.

transform7.png

In the Ribbon Bar, select Insert>Recommended Charts

transform8

Select the Frequency bar chart

transform9.png

Look at this Frequency distribution. Notice how it bunched to the left. In Stats – they call this Right-Skewed since the long tail trails out to the right.

transform10.png

The problem is, many popular data mining (machine-learning) algorithms can be easily thrown off by data like this. The algorithms (like regressions, decision trees, boosted forests, naive bayes, etc.) are designed to work better with normal distributions.

So to normalize this distribution, we have to transform the data. What that means in practice is to apply a mathematical operation to all the data in the column. Operations like: logs, square roots, cube roots, squaring, cubing, etc.

In our example, the natural log does a very good job of transforming our data.

To do so, lets start by applying the natural log to the first element in our data.

=ln(A2)

transform11

Now hover over the little green box in the bottom corner until your cursor turns into a black plus sign. Double click and the natural log will be applied to all elements.

transform12

Build a new bin column ranging from 3 – 11 by .5

transform13.png

Now lets build our frequency table. Data> Data Analysis > Histogram

Make sure you select the B column for Input and E column for Bin Range

transform14.png

Now go to your new distribution and graph it like we did before.

Notice how the new distribution looks a lot more like the class bell curve.

tranform15.png

 

 

 

Python: Central Limit Theorem

The Central Limit Theorem is one of core principles of probability and statistics. So much so, that a good portion of inferential statistical testing is built around it. What the Central Limit Theorem states is that, given a data set – let’s say of 100 elements (See below) if I were to take a random sampling of 10 data points from this sample and take the average (arithmetic mean) of this sample and plot the result on a histogram, given enough samples my histogram would approach what is known as a normal bell curve.

In plain English

  • Take a random sample from your data
  • Take the average of your sample
  • Plot your sample on a histogram
  • Repeat 1000 times
  • You will have what looks like a normal distribution bell curve when you are done.

hist4

For those who don’t know what a normal distribution bell  curve looks like, here is an example. I created it using numpy’s normal method

hist5.jpg

If you don’t believe me, or want to see a more graphical demonstration – here is a link to a simulation that helps a lot of people to grasp this concept: link

Okay, I have bell curve, who cares?

The normal distribution of (Gaussian Distribution – named after the mathematician Carl Gauss) is an amazing statistical tool. This is the powerhouse behind inferential statistics.

The Central Limit Theorem tells me (under certain circumstances), no matter what my population distribution looks like, if I take enough means of sample sets, my sample distribution will approach a normal bell curve.

Once I have a normal bell curve, I now know something very powerful.

Known as the 68,95,99 rule, I know that 68% of my sample is going to be within one standard deviation of the mean. 95% will be within 2 standard deviations and 99.7% within 3.

hist.jpg

So let’s apply this to something tangible. Let’s say I took random sampling of heights for adult men in the United States. I may get something like this (warning, this data is completely made up – do not even cite this graph as anything but bad art work)

hist6.jpg

But reading this graph, I can see that 68% of men are between 65 and 70 inches tall. While less than 0.15% of men are shorter than 55 inches or taller than 80 inches.

Now, there are plenty of resources online if you want to dig deeper into the math. However, if you just want to take my word for it and move forward, this is what you need to take away from this lesson:

p value

As we move into statistical testing like Linear Regression, you will see that we are focus on a p value. And generally, we want to keep that p value under 0.5. The purple box below shows a p value of 0.5 – with 0.25 on either side of the curve. A finding with a p value that low basically states that there is only a 0.5% chance that the results of whatever test you are running are a result of random chance. In other words, your results are 99% repeatable and your test demonstrates statistical significance.

hist7.jpg

Statistics: Range, Variance, and Standard Deviation

Measuring the spread can be a useful tool when analyzing a set of numbers. Three common measures of spread of range, variance, and standard deviation.

Here is the data set we will be working with: [2,4,6,7,8,10,15,18,22,26]

Range

Range is the simplest of the three measures. To find the range, all you need to do is subtract the smallest number in the set from the largest number

range = large-small

range = 26-2 = 24

Variance

Variance is created by taking the average of the squared difference between each value in the set minus the mean. We square the differences so that values above and below the mean do not cancel each other out.

Let’s find the mean:

statSpread

If you haven’t seen the x with the line over it before, this is referred to as x bar and it is used to represent the mean.

To find the variance you take the first number in your set, subtract the mean and square the result. You repeat that for each number in your list. Finally you add up all the results and divide by n (the number of items in your list)

ex – ((2-12)^2 + (4-12)^2 +….+(26-12)^2) / 10

statSpread1

variance = 58.56

Now I know what you are thinking, how can the average distance from the mean be 58.56 when the furthest point from the mean (26) is only 14? This is because we are squaring the differences. To get a number more in line with the data set, we have another measure called the standard deviation.

Standard Deviation

The standard deviation returns a value more in-line with what you would expect based on your data. To find the standard deviation – simply take the square root of the variance.

std dev = 7.65

Population vs Sample

The equations above work great if you have the entire population. What I mean by that is, if your data contains all the  data in the set. Using our data, imagine if the numbers were ages of children in a large family. If there are 10 kids in the family, then I have all the ages, so I am dealing with the population.

However, if we instead have sampled 10 random ages from all the kids in a large extended family where the total number of kids is 90. In this case, since we are looking at 10 out of 90, we are not dealing with the population, but the sample.

When working with the sample, you need to make an adjustment to your variance and standard deviation equations. The change is simple. Instead of dividing by n you will now divide by n-1.  This offset makes up for the fact you do not have all the data.

statSpread2