Factor Analysis, what is it?
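In layman’s terms, as the phrase is used in this lesson, it means choosing which factors (variables) in a data set you should use for your model. You will also see this called feature selection or variable selection. Consider the following data set: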
In the above example, the columns (highlighted in light orange) would be our factors. It can be very tempting, especially for new data science students, to want to include as many factors as possible. In fact, as you add more factors to a model, you will see many classic statistical markers of model goodness improve (R^2, for example, never goes down when you add a factor). This can give you a false sense of trust in the model.
The problem is that with too many poorly chosen factors, your model is almost guaranteed to underperform, especially on data it has not seen before. To avoid this issue, try approaching a new model with the idea of minimizing factors, using only the factors that drive the greatest impact.
It may seem overwhelming at first. I mean, where do you start? Looking at the list above, what do you get rid of? Well, for those who really love a little self-torture, there are entire statistics textbooks dedicated to factor analysis. For the rest of us, consider some of the following concepts. While not an exhaustive list, these should get you started in the right direction.
Collinearity
In regression analysis, collinearity refers to factors that are strongly correlated with each other. In my example above, think Height and Weight. In general, as Height increases, so does Weight. You would expect a 6’4” senior to easily outweigh a 4’11” freshman. So as one factor (Height) increases or decreases, the other (Weight) follows in kind. Correlations can also be negative, with one factor decreasing as another factor increases, or vice versa.
The problem with these factors is that when both are used in a model, they tend to amplify each other’s effect. The model ends up skewed, placing too much weight on what is essentially a single factor.
So what do you do about it?
Simply enough, in cases like this, you pick one. Height or Weight will do. In more complex models you can use mathematical techniques like Singular Value Decomposition (SVD), but I won’t cover that in this lesson.
I am also not going to cover any of the methods for detecting collinearity in this lesson; I will cover those in later lessons. But it should be noted that a lot of the time, domain knowledge is really all you need to spot it. It doesn’t take a doctor to realize that taller people are generally heavier.
I know what you are thinking: what about the 250 lb, 5’1” kid or the 120 lb, 6’2” kid? Well, if you have enough of these outliers in your data and you feel that being overweight or underweight is an important variable to consider, I would recommend using a proxy. In this case, you could use BMI (body mass index, a calculation based on height and weight) to replace both Height and Weight as factors.
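To make the proxy idea concrete, here is a minimal sketch that collapses Height and Weight into a single BMI value. It assumes weight in pounds and height in inches and uses the standard imperial form of the formula (703 × lb / in²). The function names are made up for illustration, and the two students are the ones from the paragraph above.

```python
def to_inches(feet, inches):
    """Convert a height given as feet and inches into total inches."""
    return feet * 12 + inches


def bmi_imperial(weight_lb, height_in):
    """Body mass index from pounds and inches: 703 * lb / in^2."""
    if height_in <= 0:
        raise ValueError("height must be positive")
    return 703 * weight_lb / height_in ** 2


# The two outlier students from the text: one BMI value replaces two factors.
heavy_short = bmi_imperial(250, to_inches(5, 1))   # about 47.2
light_tall = bmi_imperial(120, to_inches(6, 2))    # about 15.4
print(round(heavy_short, 1), round(light_tall, 1))
```

The model now sees one number per student that captures the over- or underweight signal, instead of two factors that mostly move together.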
Stepwise Regression
Stepwise regression is a method for determining which factors provide value to the model. The way it works (in the most basic definition I can offer) is that you run your regression model with all your factors, then remove the weakest factor (based on statistical measures such as R^2 and p-values) and run it again. This is repeated until only high-value factors are left in the model. Strictly speaking, this describes the backward elimination flavor of stepwise regression; other variants start with no factors and add them one at a time.
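The loop described above can be sketched in plain Python. This is only an illustration under stated assumptions, not a production routine: it uses adjusted R^2 as the sole criterion (real tools usually look at p-values as well), it fits ordinary least squares by solving the normal equations directly, and the data set is invented. The column names hours_studied, shoe_size, and test_score are hypothetical, and shoe_size was deliberately constructed to carry no information about the score.

```python
def solve(a, b):
    """Solve a*x = b by Gaussian elimination with partial pivoting."""
    n = len(b)
    m = [row[:] + [b[i]] for i, row in enumerate(a)]  # augmented matrix
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(m[r][col]))
        m[col], m[pivot] = m[pivot], m[col]
        for r in range(col + 1, n):
            f = m[r][col] / m[col][col]
            for c in range(col, n + 1):
                m[r][c] -= f * m[col][c]
    x = [0.0] * n
    for r in range(n - 1, -1, -1):  # back substitution
        s = sum(m[r][c] * x[c] for c in range(r + 1, n))
        x[r] = (m[r][n] - s) / m[r][r]
    return x


def adjusted_r2(columns, y):
    """Fit y ~ intercept + columns by least squares; return adjusted R^2."""
    n, k = len(y), len(columns)
    rows = [[1.0] + [col[i] for col in columns] for i in range(n)]
    p = k + 1
    xtx = [[sum(r[i] * r[j] for r in rows) for j in range(p)] for i in range(p)]
    xty = [sum(r[i] * yi for r, yi in zip(rows, y)) for i in range(p)]
    beta = solve(xtx, xty)
    mean_y = sum(y) / n
    ss_res = sum((yi - sum(b * v for b, v in zip(beta, r))) ** 2
                 for r, yi in zip(rows, y))
    ss_tot = sum((yi - mean_y) ** 2 for yi in y)
    r2 = 1 - ss_res / ss_tot
    return 1 - (1 - r2) * (n - 1) / (n - k - 1)


def backward_eliminate(data, target):
    """Drop one factor at a time while that does not lower adjusted R^2."""
    kept = [name for name in data if name != target]
    y = data[target]
    best = adjusted_r2([data[f] for f in kept], y)
    while len(kept) > 1:
        # Score every model that has exactly one factor removed.
        trials = [(adjusted_r2([data[f] for f in kept if f != drop], y), drop)
                  for drop in kept]
        score, drop = max(trials)
        if score < best:
            break  # every possible removal hurts the model, so stop
        kept.remove(drop)
        best = score
    return kept


# Invented data: the score depends on hours studied; shoe size is pure noise.
data = {
    "hours_studied": [1, 2, 3, 4, 5, 6, 7, 8],
    "shoe_size": [7, 8, 6, 9, 8, 7, 9, 6],
    "test_score": [51, 59, 69, 81, 91, 99, 109, 121],
}
print(backward_eliminate(data, "test_score"))  # ['hours_studied']
```

Adjusted R^2 is used here instead of plain R^2 because plain R^2 can never improve when a factor is removed, so it would never tell you to drop anything.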
Next up: a few techniques that are not technically factor analysis but can be useful in removing bad factors from your model.
Binning or Categorizing Data
Let’s say, looking at the data example above, our data covered all grades from 1-12. What if you want to look at kids in two-year periods? You would bin the data into equal groups of 2: 1-2, 3-4, 5-6, 7-8, 9-10, 11-12. You can now analyze the data in these blocks.
What if you wanted to measure the effectiveness of certain schools in the system? You might be wise to categorize the data. That means we take grades 1-6 and place them in one category (elementary), 7-8 in another (middle school), and 9-12 in a third (high school).
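Both ideas, equal-width bins and named categories, take only a few lines of code. The sketch below assumes grades are whole numbers from 1 to 12; the function names and the sample list of grades are made up for illustration.

```python
def two_year_bin(grade):
    """Return the two-grade bin label for a grade from 1 to 12, e.g. 3 -> '3-4'."""
    if not 1 <= grade <= 12:
        raise ValueError(f"grade out of range: {grade}")
    low = 2 * ((grade - 1) // 2) + 1  # lower edge of the bin: 1, 3, 5, 7, 9, or 11
    return f"{low}-{low + 1}"


def school_level(grade):
    """Map a grade to the school category described in the text."""
    if not 1 <= grade <= 12:
        raise ValueError(f"grade out of range: {grade}")
    if grade <= 6:
        return "elementary"
    if grade <= 8:
        return "middle school"
    return "high school"


grades = [1, 4, 7, 8, 9, 12]
bins = [two_year_bin(g) for g in grades]
levels = [school_level(g) for g in grades]
print(bins)    # ['1-2', '3-4', '7-8', '7-8', '9-10', '11-12']
print(levels)
```

Bins are equal in width, while categories follow whatever boundaries make sense for the question you are asking, which is why the second function is a set of explicit rules and not a formula.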
Missing Data
Imagine a factor called household income. This is a field that parents may well leave unanswered. If only a few values are missing, some algorithms won’t be too affected, but if there are a lot, say 5%, you need to do something about it.
What are your options?
You could perform a simple mean or median replacement for all missing values, or try to calculate a best guess based on other factors. You could delete the records missing this value. Or, as I often do, you could just toss this factor away. Most likely, any value it adds to your model is going to be questionable at best. Don’t fall for the Big Data “more is always better” trap. Sometimes simplicity wins out in the end.
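Here is a rough sketch of the first two options, plus a check of how much is actually missing. It assumes missing answers are stored as None; the records, the field name household_income, and the function names are all invented for illustration.

```python
from statistics import median

# Hypothetical records; None means the parent left the field blank.
records = [
    {"student": "A", "household_income": 52000},
    {"student": "B", "household_income": None},
    {"student": "C", "household_income": 61000},
    {"student": "D", "household_income": 48000},
    {"student": "E", "household_income": None},
    {"student": "F", "household_income": 75000},
]


def missing_fraction(rows, field):
    """Share of records in which the field is missing."""
    return sum(r[field] is None for r in rows) / len(rows)


def impute_median(rows, field):
    """Return copies of the records with missing values replaced by the median."""
    fill = median(r[field] for r in rows if r[field] is not None)
    return [{**r, field: fill if r[field] is None else r[field]} for r in rows]


def drop_missing(rows, field):
    """Return only the records where the field was answered."""
    return [r for r in rows if r[field] is not None]


print(missing_fraction(records, "household_income"))
filled = impute_median(records, "household_income")
complete = drop_missing(records, "household_income")
print(filled[1]["household_income"], len(complete))  # 56500.0 4
```

The third option, tossing the factor, needs no code at all: you simply leave the column out of the model.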
Outliers and Erroneous Data
Outliers can really skew your model, but even worse, erroneous data can make your model absolutely worthless. Look out for outliers, and question strange-looking data. Unless you can come up with a really good reason why these should stay in your model, I say chuck the records containing them.
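This lesson does not prescribe a way to find outliers, so take the following as one common rule of thumb and not the method: flag anything more than 1.5 times the interquartile range (IQR) below the first quartile or above the third. The heights below are invented, with one value (720) standing in for a typo of 72.0, and the function names are hypothetical.

```python
from statistics import quantiles


def iqr_fences(values, k=1.5):
    """Return the (low, high) cutoffs of the k * IQR rule."""
    q1, _, q3 = quantiles(values, n=4)  # first quartile, median, third quartile
    spread = q3 - q1
    return q1 - k * spread, q3 + k * spread


def split_outliers(values, k=1.5):
    """Split values into those inside the fences and those flagged for review."""
    low, high = iqr_fences(values, k)
    kept = [v for v in values if low <= v <= high]
    flagged = [v for v in values if v < low or v > high]
    return kept, flagged


# Invented student heights in inches; 720 is almost certainly a data entry error.
heights_in = [60, 62, 63, 64, 65, 66, 67, 68, 70, 720]
kept, flagged = split_outliers(heights_in)
print(flagged)  # [720]
```

A rule like this only tells you which records to question. Whether a flagged value is an error to chuck or a real, unusual student is still a judgment call.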