Excel: Data Transformations

Data Transformations

In statistics, data transformations are used, for lack of better words, to even out the numbers. While there are lots of reasons to perform transformations in the world of Statistics, in Data Science and Analytics, our main concern is to bring distributions to the normal form (the classic bell curve).

normal-curve

As with many concepts, sometimes it is easy to see by doing. So let’s jump into Excel, you can download the file to play along here: transform1

First things first – Install Analysis Toolkit

In order to created histograms in Excel (which we will need for this exercise), you need to install the Analysis Toolkit.

Click on the green “File” in the upper left corner of Excel, then click options.

transform

Click Add-Ins > select Analysis TookPak

transform1.jpg

Hit OK and now lets look at our data.

The Data

We have two columns –

UNITSF =10,000 records of house square footage

Bins = list of Bin sizes we will use for the histogram

transform3

First we will create a histogram to look at the data.

To do so, go to the Data tab in the Ribbon bar. Then on the far right select Data Analysis

tranform4.png

Select Histogram

transform5.png

For Input Range – select number values from the UNITSF column.

Bin Range – select number values from Bin column

transform6

Now you get a Frequency Distribution. Let’s visualize it to see better.

transform7.png

In the Ribbon Bar, select Insert>Recommended Charts

transform8

Select the Frequency bar chart

transform9.png

Look at this Frequency distribution. Notice how it bunched to the left. In Stats – they call this Right-Skewed since the long tail trails out to the right.

transform10.png

The problem is, many popular data mining (machine-learning) algorithms can be easily thrown off by data like this. The algorithms (like regressions, decision trees, boosted forests, naive bayes, etc.) are designed to work better with normal distributions.

So to normalize this distribution, we have to transform the data. What that means in practice is to apply a mathematical operation to all the data in the column. Operations like: logs, square roots, cube roots, squaring, cubing, etc.

In our example, the natural log does a very good job of transforming our data.

To do so, lets start by applying the natural log to the first element in our data.

=ln(A2)

transform11

Now hover over the little green box in the bottom corner until your cursor turns into a black plus sign. Double click and the natural log will be applied to all elements.

transform12

Build a new bin column ranging from 3 – 11 by .5

transform13.png

Now lets build our frequency table. Data> Data Analysis > Histogram

Make sure you select the B column for Input and E column for Bin Range

transform14.png

Now go to your new distribution and graph it like we did before.

Notice how the new distribution looks a lot more like the class bell curve.

tranform15.png

 

 

 

Data Modeling

Being a data professional requires a variety of skills. While you don’t have too look far to find heated debates on what skills are most valuable: R vs Python, SQL vs Pig Script. But to me, before you can make use of any of these skills, you need to have a grasp on data modeling.

What is Data Modeling?

Simply put…Data modeling is finding a way to represent objects or concepts through the use of data elements. Up until recently, these elements were limited to a few data types: text, numeric, boolean, date and time. Newer elements, such as spatial, temporal, video and image files have found their way into data bases, but the job of the data modeler remains the same.

Now there are hundreds of books out there on data modeling, and having read a few, I can attest that the field of data modeling is vast. Luckily for us, we only have to dip our toes in.

Why is Data Modeling Important?

Don’t let training data sets fool you. In the real world, the majority of your time will be spent extracting data from multiple sources, merging it into 1 or more sets, and cleaning and transforming the data for analysis. If you do not understand the underlying principles of how data is stored, you are going to have a hard time even finding it, much less doing anything with it.

While learning about data modeling through more formal methods, you will be exposed to terms like: Entity, Relationship, E-R Diagram, Attributes, Tuples, etc. I am going to try to keep the language here simpler. I will use terms I believe everyone already understands. Terms like: Table, Row, Column. I will leave some side notes here and there for those who want to geek out, but I really would like you to keep your focus on more on the concepts and less on the picky details.

Enough talk, let’s start modeling.

Our First Data Model

So, let us imagine you are working for a college, one that apparently just discovered computers, because they have asked you to come up with a way to track professors, the classes they teach, and what room the class is in.

That sounds simple enough. You go to you office and knock out the model below.

DataModel

Does it work? Technically yes. It does meet the requirements, but it has some flaws that could be fixed with a better model.

  1. Notice the box bordered in red. There is not B Larsen – that is a typo. It is supposed to say B Larson, but any model that relies on a human to correctly input information over and over again in the exact same fashion, is bound to run into issues like the this. This issue falls under the name Data Integrity.
  2. Notice the duplicated information in the Professor and Classroom columns. Having to right the same name over and over again in your model is inefficient and it leads to data integrity errors as highlighted above.

Solution: Normalization

If you pick up any book on database design or data modeling, you will undoubtly come across the phrase normalization. What normalization refers to is a systematic method of converting a data model like the one above into a data model suitable for use in a relational database. **a relational database is the official term for most databases you use on a daily basis. Relational refers to how entities(Tables) and attributes(Columns) in the database interact. It will be explained better later in the article.

Now most databases you will come across (data marts and warehouses excluded) are normalized to what is known as the Third Normal Form (3NF). Now if you want to learn the specifics what it means to be 3NF vs 2NF or 1Nf, I have added a link to a good site on normalization in databases.:

Normalization

For everyone else, rest assured you really don’t need to know that much detail. You just need to understand a few concepts.

Below we have the data model from above, only now it has been normalized for use in a database.

datamodel1

So what have we done? The first thing you will notice is that we now have 3 tables instead of only 1.Why did we do that? Well first I want you to notice the Professors and Classrooms tables. Notice how each name and classroom is only listed once in these tables? This is done to prevent repeated entry of the same information into the database. Not only does this save the data entry clerk time, but it also reduces the chance for spelling errors like seen in the first data model.

Next, notice the first column in Professors and Classrooms. They ID fields are what is known as primary keys. These numbers be unique as they are used to identify  unique data. For and example, let us say S Rider stood for Shawn Rider. Now halfway through the year Sarah Rider was hired. We could have assigned her a new name, or we could simply create another S Rider and assign it a Prof ID of 7. This way Shawn (2) and Sarah (7) can easily be told apart.

Finally, notice how in the Classes table, instead of the professor’s name and classroom number, the ID codes are used. This is the relationship part of the relational database I had mentioned earlier.

datamodel2

You can see in the image above how the 3 tables relate to each other. This is why I stated above that data modeling is so important for you to learn. At some point in your career you maybe given access to a database to pull your own data. When you open up a table like Classes, I want you to understand that the numbers in ProfID and ClassID are reference points to separate tables which store the information you are looking for.

SQL, as well as most BI tools, provide methods for joining the three tables back together to provide you with the data you need to do your analysis. But without a basic understanding of data modeling and how items such as primary keys function, extracting any meaningful data from the database.