How I Found Love Using Pivot Tables

Okay, a little background is in order here. I work for a Clinical Engineering Department in a large hospital system. Our main purpose is to inspect and repair medical equipment, from MRIs and CT Scanners down to IV pumps.

Now I know the title says love, and I promise there is a love connection, just be patient.

While doing some database work on the system we use to track repairs, I decided to do a little data exploration (I don’t have a lot of hobbies). I asked myself, “What equipment is breaking down the most?” I figured this could be a valuable piece of information. So, I exported six months worth of repair history into a CSV file.

Using Excel, I started playing with Pivot Tables. I started by checking to see what types of equipment seemed to break down the most. Turns out it was infusion pumps, not a real surprise to anyone who has ever worked in the field.
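For readers who prefer scripting, the same one-field pivot is just a group-and-count. Below is a minimal Python sketch of the idea; the device names and records are invented stand-ins for the rows you would read out of the exported CSV:

```python
from collections import Counter

# Hypothetical repair records -- in practice these rows would come
# from the exported CSV (e.g. via the csv module).
work_orders = [
    {"device": "Infusion Pump", "hospital": "A"},
    {"device": "Infusion Pump", "hospital": "B"},
    {"device": "CT Scanner",    "hospital": "A"},
    {"device": "Infusion Pump", "hospital": "A"},
    {"device": "IV Pump",       "hospital": "B"},
]

# Count work orders per device type -- the core of a one-field pivot table.
by_device = Counter(row["device"] for row in work_orders)

# Sort so the most-repaired equipment type comes first.
for device, count in by_device.most_common():
    print(device, count)
```

With real data, this same pattern (group, count, sort descending) answers "what breaks down the most?" in a few lines.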

[Image: pivot table of repair counts by equipment type]

But looking a little more closely, I noticed something. One hospital in my system used Brand A pumps and wanted to get rid of them, believing they were unreliable. A quick look at the data, however, proved otherwise.

[Image: pivot table comparing repair counts by pump brand]

Okay, so after restoring the reputation of the Brand A pump, I decided, “Why not look at the repair rates of individual pieces of equipment?” (I know, I live a WILD life.)

Below is the list from one of my hospitals. Pay special attention to the area highlighted in red. The number of repair work orders opened for dental chairs was way beyond anything my 20-plus years of experience in the field would have led me to expect.

[Image: pivot table of work orders by individual device, with dental chairs highlighted in red]

So I decided to dig a little further. Well, it turns out all 78 work orders were opened by one of our younger (24-year-old) single technicians. A quick walk up to the dental department explained why the dental chairs needed so much attention. The problem was about 5’2″, long blond hair, a cute smile, and (even in scrubs) quite the little body.

So there you go. Young love revealed itself through the power of the pivot table.

8 Software Packages You Should Learn for Data Science

1. Excel

Before everyone starts booing me: Excel is a highly underrated tool. I blame it on overfamiliarity. We have all seen a computer-illiterate co-worker using Excel to make a phone list for their kid’s little league. It is hard to imagine that same tool can be a powerhouse when it comes to data.

Take some time to learn data cleaning and manipulation methods in Excel. Download the following free add-ons: Power Query, Power Pivot, and Solver. These tools offer a lot of the same functionality as a $30K SAS license, admittedly on a smaller scale.

2. R

R is a statistical programming language. While far from user-friendly – you are going to have to learn to program here – you will be hard-pressed to find something R cannot do. Even better, it is free. As part of the open source community, R is constantly being updated, and libraries are being created to cover almost anything you can imagine.

3. Python

Python is another free programming language that is quite popular in the data science world. Unlike R, Python is not specifically designed for data and statistics. It is a full-fledged object-oriented programming language that has plenty of uses in the computer world.

If you spend even a little time around Data Science forums, you will see the battle of R vs Python play out over and over again. My advice – take a little time to learn the fundamentals of both.

4. MS SQL Server

I chose MS SQL Server over Oracle for two reasons: 1) MS SQL Server is far easier to install and configure for most computer users. 2) While Oracle’s PL/SQL is definitely a more robust language, Microsoft’s T-SQL is easier to pick up.

My one caveat here would be if your desire is to become a database designer or database engineer. In that case, I would suggest learning Oracle first, as all the fundamentals you develop there will translate easily to MS SQL Server.

Another great advantage of MS SQL Server is the Developer Edition. For around $70, you can purchase a developer’s license that gives you the whole suite of tools, including SSIS and SSRS.

5. SSIS

SSIS (SQL Server Integration Services) is a robust ETL (Extract, Transform, Load) tool built around the Microsoft Visual Studio platform. It comes included with the Developer Edition and provides a graphical interface for building ETL processes.

6. Tableau

Until further notice, Tableau is the reigning king of data visualization. Just download a free copy of Tableau Public and you will wonder why you spent all those years fighting with Excel charts.

While Tableau’s analytics tools leave a lot to be desired, the time it will save you in data exploration will have you singing its praises.

7. Qlik

Admittedly, Qlik is focused more as an end user BI tool, but it provides robust data modeling capabilities. Also, Qlik’s interactive dashboards are not only easy to construct, but leave the competition in the dust when it comes to ease of use.

8. Hadoop

It’s Big Data’s world; we just live in it. If you want to be a data professional in this century, you are going to need to become familiar with Big Data platforms. My suggestion is to download the Hortonworks Sandbox edition of Hadoop. It is free, and they provide hours’ worth of tutorials. Time spent learning Pig and Hive (Big Data scripting and query languages) will be well worth your effort.

What about SAS, SPSS, Cognos, blah, blah, blah…

While these packages do have a dominant position in the world of analytics, they are not very generous about offering free or low-cost versions for people wanting to get into the profession. I wanted to fill my list with software that can be obtained for little or no cost.

If you are currently a college student, you are in a better position. Check with your college for software partnerships. Also, check with software vendors to see if they have student editions. As of this writing, I know Tableau, SAS, and Statistica offer student versions that you may want to look into.

Analytics: An Introduction

So exactly what is Analytics? Everyone is talking about it. Colleges are scrambling to develop programs in it. But what exactly does it mean?

Definition

The definition I like best is this:

Analytics: Discovering and communicating meaningful patterns in data.

Analytics is traditionally broken down into the following categories:

  • Descriptive Analytics: Most people are familiar with this form. So familiar in fact, they probably do not refer to it as analytics. This is looking at past and current data to describe what is going on. Most standard business reporting falls into this category.
  • Predictive Analytics: This is using available data to help predict future events or to provide best guess answers to fill in gaps in data. Using predictive analytics, you can predict how much a house will sell for or what items you should stock near the registers based on current conditions (example: Walmart discovered Pop-Tarts tend to sell well during hurricanes).
  • Prescriptive Analytics: This is the cutting edge of analytics. Prescriptive analytics not only makes predictions about future events, but it utilizes decision making algorithms to determine how to respond to the events. Prescriptive analytics engines could, using the Pop Tarts example above, automatically reroute the shipment of Pop Tarts to stores in hurricane affected areas without any human intervention.
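To make “predictive” concrete, here is a toy least-squares model in plain Python that fits past house sales and prices a new one. All numbers here are invented for illustration; a real model would use a library such as scikit-learn and far more features than square footage:

```python
# Toy predictive model: fit a line to past house sales (size -> price)
# and predict the price of an unseen house. Data values are made up.
sizes  = [1000, 1500, 2000, 2500]   # square feet
prices = [100,  150,  200,  250]    # price in $ thousands

n = len(sizes)
mean_x = sum(sizes) / n
mean_y = sum(prices) / n

# Ordinary least squares: slope = cov(x, y) / var(x)
slope = sum((x - mean_x) * (y - mean_y) for x, y in zip(sizes, prices)) \
        / sum((x - mean_x) ** 2 for x in sizes)
intercept = mean_y - slope * mean_x

# "Fill in the gap": predict the price of an 1800 sq ft house.
predicted = slope * 1800 + intercept
print(round(predicted))  # 180 ($ thousands)
```

That is all predictive analytics is at its core: learn a relationship from past data, then apply it to cases you have not seen yet.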

It should be noted that most companies today are still spending most of their time in the descriptive analytics world. That is not necessarily a bad thing. Being able to get the right information in front of a decision maker, in a format that is easily digestible, is a talent all within itself. 

Components

Analytics is not a one-step process. It is a series of steps, often performed iteratively. And just as each business problem is unique, so are the steps of the analytics process used to find the solution.

While the statement above is 100 percent true, I find it very unsatisfying. This is the kind of information I would find when I first developed an interest in analytics. So while I cannot give you a one-size-fits-all answer, I feel I at least owe you a better explanation than that.

For me, perhaps the best way to understand analytics, is to look at some of the more common tasks performed.

  • Data Management: While designing, building, and maintaining databases and data warehouses may not typically fall under the responsibility of an analytics professional, having a general understanding of how they work is nonetheless important. Databases and data warehouses are where most businesses keep their data. If you want to be taken seriously as a data professional, you need a fundamental understanding of how data is stored and how to query it. (Example Technologies: Hadoop, SQL Server, Oracle)
  • Data Modeling: Data modeling is organizing data into logical structures so that it can be understood and manipulated by a machine. As a simple exercise, make a quick spreadsheet of sales amounts for 5 salespeople across 4 quarters. When you are done, look at the table you created. You have just modeled data. (Example Technologies: Excel, SQL Server, Oracle, Visio)
  • Data Cleaning: While this may not be the sexiest part of the job, it is the part you will spend the most time on: 60-80% of your time will go to this phase. And while there are third-party applications that can help ease the pain (Alteryx comes immediately to mind), they are expensive, and not every boss will be willing to spring for them. My suggestion is to put some time aside to become very familiar with Excel. I do 90% of my data cleaning work in Excel and MS SQL Server. (Example Technologies: Excel, SQL Server, Oracle, Alteryx)
  • Data Mining (Machine Learning): Now this is the cool stuff everyone is talking about. Data mining or machine learning, whichever you prefer to call it, is the Artificial Intelligence (AI) portion of analytics. Data mining is difficult to provide a simple explanation for, but I will try anyway: in traditional programming, the programmer provides explicit instructions to the computer as to how to perform a task. With data mining, data sets are fed through an algorithm, and the computer determines the best way to solve the problem based on the data provided.

 To help make this a little clearer, how about you try your hand at being the machine.

[Image: grid of colored squares used as a toy SPAM-detection data set]

Look at the pattern above. Without me providing you with any more information, you should be able to determine that two blue squares in a row = SPAM. This is, at the most fundamental level, how data mining works. It pores over data and finds patterns. Knowing this pattern, if you were shown only the first three columns, you would be able to predict whether the last column would be red or green. (Example Technologies: R, Python, SAS, XLMiner)
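If you want to try being the machine in code, here is a toy Python sketch that “learns” the two-blue-squares rule by counting which patterns co-occur with which labels. The colors, rows, and labels are invented to mirror the figure:

```python
from collections import defaultdict

# Toy training data: each row is a tuple of square colors plus a label.
training = [
    (("blue", "blue", "red"),    "SPAM"),
    (("blue", "blue", "green"),  "SPAM"),
    (("red",  "blue", "green"),  "OK"),
    (("green", "red", "blue"),   "OK"),
]

# "Learn" by recording which labels each leading two-square pattern gets.
pattern_labels = defaultdict(set)
for squares, label in training:
    pattern_labels[squares[:2]].add(label)

def predict(squares):
    # Patterns never seen in training fall back to "OK".
    labels = pattern_labels.get(squares[:2], {"OK"})
    return "SPAM" if labels == {"SPAM"} else "OK"

print(predict(("blue", "blue", "red")))  # SPAM -- learned from the data
print(predict(("red", "blue", "red")))   # OK
```

Nothing in the code says “two blues means SPAM”; the rule falls out of the data, which is exactly the inversion data mining performs.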

  • Data Visualization: DataViz is fun. It is the real show stopper in the data world. Visualizations make the patterns pop off the page. There are a lot of great programs out there for data visualization. (Again, do not discount Excel; it has some great DataViz features.) DataViz should rightfully be broken into two separate categories. The first is Exploratory: visualizations used by the data professional to help analyze and understand the data. The second is Production: the finished product that ends up on reports and dashboards for business users to see. (Example Technologies: Excel, Tableau, R, SAS)
  • Optimization and Simulation: How often is there truly only one solution to a problem? In reality, the hardest part sometimes isn’t coming up with a solution but deciding which solution to use. Building optimization models and running simulations provides decision makers with quantitative data on which solutions will be most effective. (Example Technologies: CPLEX, SAS, Solver)
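As a flavor of what an optimization model does, here is a brute-force Python sketch that picks the best subset of stores to restock under a truck capacity limit. The store names, pallet counts, and dollar figures are all invented; a real problem of any size would go to a solver such as CPLEX or Excel’s Solver rather than plain enumeration:

```python
from itertools import combinations

# Toy optimization: which stores should the one truck restock?
stores = {          # store: (pallets needed, expected extra sales in $k)
    "Store A": (3, 40),
    "Store B": (2, 30),
    "Store C": (4, 45),
    "Store D": (1, 12),
}
capacity = 6  # pallets the truck can carry

best_value, best_plan = 0, ()
# Try every subset of stores -- fine at this scale, hopeless at real scale.
for r in range(len(stores) + 1):
    for plan in combinations(stores, r):
        load  = sum(stores[s][0] for s in plan)
        value = sum(stores[s][1] for s in plan)
        if load <= capacity and value > best_value:
            best_value, best_plan = value, plan

print(sorted(best_plan), best_value)
```

The point is the shape of the problem: many feasible answers, one objective to maximize, and constraints that rule candidates out. Solvers just search that space far more cleverly than a loop.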

So I have to learn all of this…

That depends. If your goal is to be a Data Scientist, then yes, you need to learn everything mentioned above and then some (I hope you love statistics). However, if you are a business user just trying to add analytic skills to your toolbox, my recommendation is to focus your efforts on becoming proficient at data cleaning. In the real world, when trying to put a report together, you are often given data from multiple sources and have to cobble it together to make sense of it. Learning some data cleaning skills can save you hours on tasks like that.

Once you have workable data, take some time to learn some visualization techniques. An eye-popping chart will always garner more attention than pages of numeric columns. Also, take a little time to learn some data mining skills. No one is expecting you to write the complex algorithms the PhDs at Stanford and MIT are kicking out, but there are some pretty user-friendly data mining programs out there that can help you pull real insight out of your data.

However you decide to go about it, analytics is a fascinating, fast-growing field. It truly is a 21st-century skill. Here at Analytics4All.org, the philosophy is that everyone should develop some analytical talent. Computers were once the sole territory of the science geeks of the world, and now they are in everyone’s pockets and purses. Analytics and data-driven decision making should also be accessible to all.