In statistics, data transformations are used, for lack of a better phrase, to even out the numbers. While there are lots of reasons to perform transformations in the world of statistics, in data science and analytics our main concern is bringing distributions to the normal form (the classic bell curve).
As with many concepts, it is often easier to learn by doing. So let's jump into Excel. You can download the file to play along here: transform1
First things first – Install the Analysis ToolPak
In order to create histograms in Excel (which we will need for this exercise), you need to install the Analysis ToolPak.
Click on the green "File" tab in the upper left corner of Excel, then click Options.
Click Add-Ins > select Analysis ToolPak
Hit OK, and now let's look at our data.
We have two columns –
UNITSF = 10,000 records of house square footage
Bins = list of bin sizes we will use for the histogram
First we will create a histogram to look at the data.
To do so, go to the Data tab in the Ribbon bar. Then, on the far right, select Data Analysis.
For Input Range – select the number values from the UNITSF column.
For Bin Range – select the number values from the Bins column.
Now you get a Frequency Distribution. Let’s visualize it to see better.
In the Ribbon Bar, select Insert>Recommended Charts
Select the Frequency bar chart
Look at this frequency distribution. Notice how it is bunched to the left. In stats, they call this right-skewed, since the long tail trails out to the right.
The problem is, many popular data mining (machine learning) algorithms can be easily thrown off by data like this. Many algorithms (regressions, Naive Bayes, and others) tend to perform better with normally distributed data.
So to normalize this distribution, we have to transform the data. What that means in practice is to apply a mathematical operation to all the data in the column. Operations like: logs, square roots, cube roots, squaring, cubing, etc.
In our example, the natural log does a very good job of transforming our data.
To do so, let's start by applying the natural log (Excel's LN function) to the first element in our data.
Now hover over the little green box in the bottom-right corner of the cell until your cursor turns into a black plus sign. Double-click, and the natural log will be applied to all elements.
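The same transformation can be sketched outside Excel, too. Here is a minimal Python version; the square-footage values below are made-up stand-ins for the UNITSF column, not the actual file:

```python
import math

# Hypothetical square-footage values standing in for the UNITSF column
unitsf = [850, 1200, 1400, 2100, 3000, 4500, 9000]

# Excel's =LN() applied to every element, like dragging the fill handle
log_unitsf = [math.log(x) for x in unitsf]

print(log_unitsf)
```

Notice how the logged values are squeezed into a much narrower range, which is exactly what pulls that long right tail back in.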
Build a new bin column ranging from 3 to 11 in steps of 0.5.
Now let's build our frequency table: Data > Data Analysis > Histogram
Make sure you select the B column for the Input Range and the E column for the Bin Range.
Now go to your new distribution and graph it like we did before.
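If you want to see what Excel's Histogram tool is doing under the hood, the frequency table can be sketched in a few lines of Python. The logged values here are illustrative, and the bin edges match the 3-to-11-by-0.5 column we just built:

```python
# Hypothetical natural-log values (column B in the walkthrough)
logged = [6.7, 7.1, 7.1, 7.6, 8.0, 8.4, 9.1]

# Bin edges from 3 to 11 in steps of 0.5, like our bin column
edges = [3 + 0.5 * i for i in range(17)]  # 3.0, 3.5, ..., 11.0

# Count values above the previous edge and at or below each edge,
# mimicking how Excel's Histogram tool fills its Frequency column
frequency = []
prev = float("-inf")
for edge in edges:
    frequency.append(sum(1 for v in logged if prev < v <= edge))
    prev = edge

print(list(zip(edges, frequency)))
```

Every value lands in exactly one bin, so the frequencies always sum back to the number of data points.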
Notice how the new distribution looks a lot more like the classic bell curve.
Being a data professional requires a variety of skills, and you don't have to look far to find heated debates on which skills are most valuable: R vs. Python, SQL vs. Pig Script. But to me, before you can make use of any of these skills, you need to have a grasp on data modeling.
What is Data Modeling?
Simply put, data modeling is finding a way to represent objects or concepts through the use of data elements. Until recently, these elements were limited to a few data types: text, numeric, boolean, date, and time. Newer elements, such as spatial, temporal, video, and image files, have found their way into databases, but the job of the data modeler remains the same.
Now there are hundreds of books out there on data modeling, and having read a few, I can attest that the field of data modeling is vast. Luckily for us, we only have to dip our toes in.
Why is Data Modeling Important?
Don’t let training data sets fool you. In the real world, the majority of your time will be spent extracting data from multiple sources, merging it into 1 or more sets, and cleaning and transforming the data for analysis. If you do not understand the underlying principles of how data is stored, you are going to have a hard time even finding it, much less doing anything with it.
While learning about data modeling through more formal methods, you will be exposed to terms like Entity, Relationship, E-R Diagram, Attribute, and Tuple. I am going to try to keep the language here simpler, using terms I believe everyone already understands: Table, Row, Column. I will leave some side notes here and there for those who want to geek out, but I really would like you to keep your focus more on the concepts and less on the picky details.
Enough talk, let’s start modeling.
Our First Data Model
So, let us imagine you are working for a college, one that apparently just discovered computers, because they have asked you to come up with a way to track professors, the classes they teach, and what room the class is in.
That sounds simple enough. You go to your office and knock out the model below.
Does it work? Technically yes. It does meet the requirements, but it has some flaws that could be fixed with a better model.
Notice the box bordered in red. There is no B Larsen – that is a typo. It is supposed to say B Larson, but any model that relies on a human to correctly input the same information over and over again in the exact same fashion is bound to run into issues like this. This kind of issue falls under the name Data Integrity.
Notice the duplicated information in the Professor and Classroom columns. Having to write the same name over and over again in your model is inefficient, and it leads to data integrity errors like the one highlighted above.
If you pick up any book on database design or data modeling, you will undoubtedly come across the term normalization. Normalization refers to a systematic method of converting a data model like the one above into a data model suitable for use in a relational database. **A relational database is the official term for most databases you use on a daily basis. Relational refers to how entities (tables) and attributes (columns) in the database interact. It will be explained better later in the article.
Now most databases you will come across (data marts and warehouses excluded) are normalized to what is known as Third Normal Form (3NF). If you want to learn the specifics of what it means to be 3NF vs. 2NF or 1NF, I have added a link to a good site on database normalization:
For everyone else, rest assured you really don’t need to know that much detail. You just need to understand a few concepts.
Below we have the data model from above, only now it has been normalized for use in a database.
So what have we done? The first thing you will notice is that we now have 3 tables instead of only 1. Why did we do that? First, I want you to look at the Professors and Classrooms tables. Notice how each name and classroom is listed only once in these tables? This is done to prevent repeated entry of the same information into the database. Not only does this save the data entry clerk time, it also reduces the chance for spelling errors like the one we saw in the first data model.
Next, notice the first column in Professors and Classrooms. These ID fields are what is known as primary keys. These numbers must be unique, since they are used to identify unique rows of data. For example, let us say S Rider stood for Shawn Rider. Now, halfway through the year, Sarah Rider is hired. We could have come up with a different name for her, or we could simply create another S Rider entry and assign it a Prof ID of 7. This way Shawn (2) and Sarah (7) can easily be told apart.
Finally, notice how in the Classes table, instead of the professor’s name and classroom number, the ID codes are used. This is the relationship part of the relational database I had mentioned earlier.
You can see in the image above how the 3 tables relate to each other. This is why I said that data modeling is so important for you to learn. At some point in your career you may be given access to a database to pull your own data. When you open up a table like Classes, I want you to understand that the numbers in ProfID and ClassID are reference points to separate tables which store the information you are looking for.
SQL, as well as most BI tools, provides methods for joining the three tables back together to give you the data you need to do your analysis. But without a basic understanding of data modeling and how items such as primary keys function, extracting any meaningful data from the database will be an uphill battle.
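To make the join idea concrete, here is a small sketch of the normalized model using Python's built-in sqlite3 module. The table layout follows the article's example, but the specific room numbers, class names, and IDs are invented for illustration:

```python
import sqlite3

# In-memory database sketching the normalized model:
# Professors, Classrooms, and Classes linked by ID columns
conn = sqlite3.connect(":memory:")
cur = conn.cursor()
cur.execute("CREATE TABLE Professors (ProfID INTEGER PRIMARY KEY, Name TEXT)")
cur.execute("CREATE TABLE Classrooms (RoomID INTEGER PRIMARY KEY, Room TEXT)")
cur.execute("""CREATE TABLE Classes (
    ClassID INTEGER PRIMARY KEY, Class TEXT, ProfID INTEGER, RoomID INTEGER)""")

# Each professor and classroom is entered exactly once
cur.execute("INSERT INTO Professors VALUES (2, 'S Rider'), (3, 'B Larson')")
cur.execute("INSERT INTO Classrooms VALUES (1, 'Room 204'), (2, 'Room 310')")
# Classes stores only the ID codes, not the repeated names
cur.execute("INSERT INTO Classes VALUES (1, 'Algebra', 2, 1), (2, 'History', 3, 2)")

# Joining the three tables back together, as SQL or a BI tool would
rows = cur.execute("""
    SELECT c.Class, p.Name, r.Room
    FROM Classes c
    JOIN Professors p ON c.ProfID = p.ProfID
    JOIN Classrooms r ON c.RoomID = r.RoomID
""").fetchall()
print(rows)
```

The join follows the ID reference points back to the tables that hold the actual names, which is the "relational" part of a relational database in action.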
So exactly what is Analytics? Everyone is talking about it. Colleges and Universities are scrambling to develop programs in it. But what exactly does it mean?
The definition I like best is this:
Analytics: Discovering and communicating meaningful patterns in data.
Analytics is traditionally broken down into the following categories:
Descriptive Analytics: Most people are familiar with this form. So familiar in fact, they probably do not refer to it as analytics. This is looking at past and current data to describe what is going on. Most standard business reporting falls into this category.
Predictive Analytics: This is using available data to help predict future events or to provide best-guess answers to fill in gaps in data. Using predictive analytics, you can predict how much a house will sell for, or what items you should stock near the registers based on current conditions (example: Walmart discovered Pop-Tarts tend to sell well during hurricanes).
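The house-price prediction mentioned above can be sketched at its simplest: fit a straight line through past sales and use it for a best guess. The square-footage and price pairs below are made up for illustration:

```python
# Toy predictive model: fit a least-squares line through hypothetical
# (square footage, sale price) pairs, then predict a new house
data = [(1000, 150000), (1500, 210000), (2000, 280000), (2500, 330000)]

n = len(data)
mean_x = sum(x for x, _ in data) / n
mean_y = sum(y for _, y in data) / n

# Standard least-squares slope and intercept
slope = (sum((x - mean_x) * (y - mean_y) for x, y in data)
         / sum((x - mean_x) ** 2 for x, _ in data))
intercept = mean_y - slope * mean_x

# Best guess for an 1,800 sq ft house
predicted = intercept + slope * 1800
print(round(predicted))
```

Real predictive models use far more variables and fancier algorithms, but the core idea is the same: learn a relationship from past data, then apply it to new cases.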
Prescriptive Analytics: This is the cutting edge of analytics. Prescriptive analytics not only makes predictions about future events, but also uses decision-making algorithms to determine how to respond to those events. A prescriptive analytics engine could, using the Pop-Tarts example above, automatically reroute the shipment of Pop-Tarts to stores in hurricane-affected areas without any human intervention.
It should be noted that most companies today are still spending most of their time in the descriptive analytics world. That is not necessarily a bad thing. Being able to get the right information in front of a decision maker, in a format that is easily digestible, is a talent in itself.
Analytics is not a one-step process. It is a series of steps, often performed in an iterative manner. And just as each business problem is unique, so are the steps of the analytics process used to find the solution.
While the statement above is 100% true, I find it very unsatisfying. This is the kind of information I would find when I first developed an interest in analytics. So while I cannot give you a one-size-fits-all answer, I feel that I at least owe you a better explanation than that.
For me, perhaps the best way to understand analytics, is to look at some of the more common tasks performed.
Data Management: While designing, building, and maintaining databases and data warehouses may not typically fall under the responsibility of an analytics professional, having a general understanding of how they work is nonetheless important. Databases and data warehouses are where most businesses keep their data. If you want to be taken seriously as a data professional, you need to have a fundamental understanding of how data is stored and how to query it. (Example Technologies: Hadoop, SQL Server, Oracle)
Data Modeling: Data modeling is organizing data into logical structures so that it can be understood and manipulated by a machine. As a simple exercise, make a quick spreadsheet of sales amounts for 5 salespeople across 4 quarters. When you are done, look at the table you created. You have just modeled data. (Example Technologies: Excel, SQL Server, Oracle, Visio)
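That spreadsheet exercise translates directly into code. Here is one way to sketch it in Python; the names and dollar amounts are invented, and a dict of lists is just one of several reasonable structures:

```python
# The spreadsheet exercise as a data model: each key is a salesperson,
# each list holds that person's sales for the 4 quarters
sales = {
    "Alice": [12000, 15000, 11000, 17000],
    "Bob":   [ 9000, 10500, 12500,  9500],
    "Cara":  [14000, 13000, 16000, 15500],
    "Dan":   [ 8000,  9000,  7000, 10000],
    "Eve":   [11000, 12000, 13000, 12500],
}

# Once the data is modeled, a machine can manipulate it:
# e.g., total sales per quarter across all five salespeople
quarter_totals = [sum(q) for q in zip(*sales.values())]
print(quarter_totals)
```

The point is the same as in the spreadsheet: once the structure is logical and consistent, summarizing or reshaping the data becomes trivial.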
Data Cleaning: While this may not be the sexiest part of the job, it is the part you will spend the most time on – 60-80% of your time will be spent in this phase. And while there are some third-party software applications out there that can help ease the pain (Alteryx comes immediately to mind), they are expensive and not every boss will be willing to spring for them. My suggestion is to put some time aside to become very familiar with Excel. I do 90% of my data cleaning work in Excel and MS SQL Server. (Example Technologies: Excel, SQL Server, Oracle, Alteryx)
Data Mining (Machine Learning): Now this is the cool stuff everyone is talking about. Data mining or machine learning, whichever you prefer to call it, is the Artificial Intelligence (AI) portion of analytics. Data mining is difficult to explain simply, but I will try anyway: In traditional programming, the programmer provides explicit instructions to the computer as to how to perform a task. With data mining, data sets are fed through an algorithm, and the computer then determines the best way to solve the problem based on the data provided.
To help make this a little clearer, how about you try your hand at being the machine.
Look at the pattern above. Without me providing you with any more information, you should be able to determine that two blue squares in a row = SPAM. This is, at the most fundamental level, how data mining works. It pores over data and finds patterns. Knowing this pattern, if you were now shown only the first three columns, you would be able to predict whether the last column would be red or green. (Example Technologies: R, Python, SAS, XLMiner)
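You can play the machine's role in code, too. The rows below are a hypothetical encoding of a grid like the one in the figure (the actual figure's colors may differ), and the "learning" is simply checking whether the two-blues rule separates the labels:

```python
# Each row: three observed cells plus a label, standing in for the figure
rows = [
    (("blue", "blue", "red"),  "SPAM"),
    (("red",  "blue", "blue"), "SPAM"),
    (("red",  "blue", "red"),  "OK"),
    (("blue", "red",  "blue"), "OK"),
    (("blue", "blue", "blue"), "SPAM"),
]

def has_two_blues_in_a_row(cells):
    # True if any adjacent pair of cells is blue-blue
    return any(a == "blue" and b == "blue" for a, b in zip(cells, cells[1:]))

# "Learning" here means verifying the candidate rule explains every row
learned = all(
    (label == "SPAM") == has_two_blues_in_a_row(cells) for cells, label in rows
)
print(learned)
```

Real data mining algorithms search over many candidate patterns instead of checking one hand-picked rule, but the spirit is the same: find the pattern that best separates the labels, then use it on new rows.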
Data Visualization: DataViz is fun. It is the real showstopper in the data world. Visualizations make the patterns pop off the page. There are a lot of great programs out there for data visualization. (Again, do not discount Excel — it has some great DataViz features.) DataViz should rightfully be broken into two separate categories. The first is Exploratory: visualizations used by the data professional to help analyze and understand the data. The second is Production: the finished product that ends up on reports and dashboards for the business users to see. (Example Technologies: Excel, Tableau, R, SAS)
Optimization and Simulation: How often is there truly only one solution to a problem? In reality, sometimes the hardest part isn't coming up with a solution, but deciding which solution to use. Building optimization models and running simulations helps provide decision makers with quantitative evidence as to which solutions will be most effective. (Example Technologies: CPLEX, SAS, Solver)
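Here is a tiny flavor of what simulation-driven decision making looks like, tying back to the Pop-Tarts example. Every number here (cost, price, demand range, candidate stock levels) is invented purely for illustration:

```python
import random

# Toy simulation: pick how many cases of Pop-Tarts to stock by
# simulating uncertain demand and comparing average profit
random.seed(42)  # fixed seed so the run is repeatable

COST, PRICE = 2.0, 5.0  # hypothetical cost and sale price per case

def simulated_profit(stock_level, trials=10000):
    total = 0.0
    for _ in range(trials):
        demand = random.randint(20, 80)   # uncertain demand per day
        sold = min(stock_level, demand)   # can't sell more than we stocked
        total += sold * PRICE - stock_level * COST
    return total / trials

# Compare a handful of candidate stocking levels
candidates = [30, 50, 70, 90]
best = max(candidates, key=simulated_profit)
print(best)
```

Overstocking wastes money on unsold cases while understocking leaves sales on the table, and the simulation quantifies that trade-off so the decision maker doesn't have to guess.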
So I have to learn all of this…
That depends – if your goal is to be a Data Scientist, then yes, you need to learn everything mentioned above and then some (I hope you love statistics). However, if you are a business user just trying to add analytic skills to your toolbox, my recommendation is to focus your efforts on becoming proficient in data cleaning. In the real world, when trying to put a report together, you are often given data from multiple sources and have to cobble it together to make sense of it. Learning some data cleaning skills can save you hours on tasks like that.
Once you have workable data, take some time to learn some visualization techniques. An eye-popping chart will always garner more attention than pages of numeric columns. Also, take a little time to learn some data mining skills. No one is expecting you to write the complex algorithms the PhDs at Stanford and MIT are kicking out, but there actually are some pretty user-friendly data mining programs out there that help you cull real insight from your data.
However you decide to go about it, analytics is a fascinating, fast-growing field. It truly is a 21st-century skill. Here at Analytics4All.org, the philosophy is that everyone should develop some analytical talent. Computers were once the sole territory of the science geeks of the world, and now they are in everyone's pockets and purses. Analytics and data-driven decision making should also be accessible to all.