Big Data is Like Teenage Sex: Everyone talks about it, nobody really knows how to do it, everyone thinks everyone else is doing it, so everyone claims they are doing it.
– Dan Ariely
I remember when I first started developing an interest in Big Data and Analytics. One of the biggest frustrations I faced was that it seemed like everyone in the know was talking in code. They would toss words around like supervised machine learning, map reduce, hadoop, SAP HANA, in-memory, and the biggest buzz word of them all, Big Data.
So what is Big Data?
In all honesty, it is a buzzword. Big Data isn’t a single thing as much as it is a collection of technologies and concepts that surround the management and analysis of massive data sets.
What kind of data qualifies as Big Data?
The common consensus you will find in textbooks is that Big Data is concerned with the 3 V’s: Velocity, Volume, Variety.
Velocity: Velocity is not so much concerned with how fast the data gets to you. This is not something you can clock using network metrics. Instead, it is how fast data can become actionable. In the days of yore, managers would rely on monthly or quarterly reports to determine business strategy. Now these decisions are being made more dynamically. Data only 2 hours old can be viewed as outdated in this new high-velocity world.
Volume: Volume is the name of the game in Big Data. Think of the sheer volume of data produced by a wireless telecom company: every call, every tower connection, the length of each call, etc. These guys are racking up terabytes upon terabytes.
Variety: Big Data is all about variety. As a complete 180 from the rigid structure that makes relational databases work, Big Data lives in the world of unstructured data. Big Data repositories are full of videos, pictures, free text, audio clips, and all other forms of unstructured data.
How do you store all this data?
Storing and managing all this data is one of the big challenges. This is where specialized data management systems like Hadoop come into play. What makes Hadoop so great? First, there is Hadoop's ability to scale. By scale I mean Hadoop can grow and shrink on demand.
For those unfamiliar with the back-end storage methodology of standard relational databases (Oracle, DB2, SQL Server), they don't play well across multiple computers. Instead, you will find you need to invest in high-end servers and plan out any clusters ahead of time with a general idea of your storage needs in mind. If you build out a database solution designed to handle 10 terabytes and suddenly find yourself needing to manage 50, you are going to have some serious (and expensive) reconfiguration work ahead of you.
Hadoop, on the other hand, is designed to run easily across commodity hardware. This means you can have a rack full of mid-priced servers and Hadoop can provision and utilize them at will. So if you typically run a 10-terabyte database and there is a sudden need for another 50 terabytes (say your company is on-boarding a new company), Hadoop will just grow as needed (assuming you have 50 TB worth of commodity servers available). It will also free up the space when it is done. So if the 50 terabytes were only needed for a particular job, once that job is over, Hadoop can release the storage space for other systems to use.
What about MapReduce?
MapReduce is an algorithm designed to make querying or processing massive data sets possible. In a very simplified explanation, MapReduce works like so:
Map – The data is broken into chunks and handed off to mappers. These mappers perform the data processing job on their individual chunk of the data set. There can be hundreds (or many, many hundreds) of these mappers working in parallel, taking advantage of the different processors on the racks of commodity hardware we were talking about earlier.
Reduce – The output of all of these map jobs is then passed to the reducers. This part of the algorithm puts all the pieces back together, providing the user with one result set. The entire purpose behind MapReduce is speed.
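The map and reduce steps above can be sketched in a few lines of Python. This is a toy, single-machine illustration using the classic word-count example; a real framework like Hadoop would distribute the mappers and reducers across many machines.

```python
# Toy single-machine sketch of the MapReduce idea, using word counting.
from collections import defaultdict

def map_phase(chunk):
    # Each mapper emits (key, value) pairs for its own chunk of the data.
    return [(word, 1) for word in chunk.split()]

def reduce_phase(pairs):
    # The reducer groups the pairs by key and combines their values,
    # putting all the pieces back together into one result set.
    counts = defaultdict(int)
    for word, n in pairs:
        counts[word] += n
    return dict(counts)

# Pretend each string is a chunk handed to a separate mapper.
chunks = ["big data big buzz", "big data big deal"]
mapped = [pair for chunk in chunks for pair in map_phase(chunk)]
result = reduce_phase(mapped)
print(result)  # {'big': 4, 'data': 2, 'buzz': 1, 'deal': 1}
```

In the real thing, the mappers run in parallel on different machines and a shuffle step routes each key to the right reducer, but the shape of the computation is the same.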
Now that you have the data, you are going to want to make some sense of it. To pull information out of this mass of data requires specially designed algorithms running on high end hardware. Platforms like SAP HANA tout in memory analytics to drive up speed, while a lot of the buzz around deep learning seems to mention accessing the incredibly fast memory found in GPUs (graphical processor units).
At the root of all of this, you will still find some old familiar stand-bys. Regression is still at the top of the pack among prediction methods used by most companies. Other popular machine learning algorithms like Affinity Analysis (market basket) and Clustering are also commonly used with Big Data.
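To show just how unglamorous that old stand-by is, here is simple least-squares linear regression fit in a few lines of pure Python (the data here is made up; in practice you would reach for a library like scikit-learn):

```python
# Minimal least-squares fit of a line y = slope * x + intercept.
def fit_line(xs, ys):
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    # Slope = covariance of x and y divided by variance of x.
    slope = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys)) \
            / sum((x - mean_x) ** 2 for x in xs)
    intercept = mean_y - slope * mean_x
    return slope, intercept

xs = [1, 2, 3, 4]
ys = [3, 5, 7, 9]   # exactly y = 2x + 1
slope, intercept = fit_line(xs, ys)
print(slope, intercept)  # 2.0 1.0
```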
What really separates Big Data analytics from regular analysis methods is that with its sheer volume of data, it is not as reliant on inferential statistical methods to draw conclusions.
Think about an election in a city with 50,000 registered voters. The classic method of polling was to ask a representative sample (say 2,000 voters) how they were going to vote. Using that information, you could infer how the election would play out (with a margin of error, of course). With Big Data, we are asking all 50,000 voters. We do not need to infer anymore. We know the answer.
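The polling math behind that example looks something like this. The 52% vote share here is invented purely for illustration; the point is that the sample gets an estimate plus a margin of error, while the full count just gets the answer.

```python
# Sampling vs. asking everyone, using made-up numbers from the example.
import math

population = 50_000
sample_size = 2_000
p = 0.52  # hypothetical share of the 2,000 sampled voters backing candidate A

# Classic 95% margin of error for a proportion (finite-population
# correction omitted to keep the sketch simple).
moe = 1.96 * math.sqrt(p * (1 - p) / sample_size)
print(f"Poll estimate: {p:.0%} +/- {moe:.1%}")

# With data on every voter, there is nothing left to infer -- just count.
votes_for_a = 26_000  # hypothetical full count
print(f"Actual result: {votes_for_a / population:.0%}")
```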
Now imagine a more "real world" application. Think of a large manufacturing plant. A pretty common maintenance strategy in a plant like this is to have people perform periodic checks on motors, belts, fans, etc. They take readings from gauges, write them on a clipboard, and maybe the information is entered into a computer that can analyze trends to check for out-of-range parameters.
In the IoT (Internet of Things) world, we now have cheap, network-connected sensors on all of this equipment sending out readings every second. This data is fed through algorithms designed to analyze trends. The trends can tell you a fan is starting to go bad and needs to be replaced 2 or 3 days before most humans could.
Big Data is all about correlation. And this can be a sticking point for some people. As humans, we love to look for a root cause. If we can’t find one, we often create one to satisfy our desire for one – hence the Roman volcano gods.
With Big Data, cause is not the name of the game. By analyzing massive streams of data, we can find correlations that help us make better decisions. Using the fan example above, the algorithms may pick up that if the fan begins drawing more current while maintaining the same speed, the fan motor will soon fail. And using probabilistic algorithms, we can show you the increasing odds of it failing each minute, hour, or day you ignore it.
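That fan rule can be sketched as a simple check over sensor readings. Everything here is hypothetical: the thresholds, the sample values, and the function name are all invented for illustration, and a real system would run a far richer model over the live sensor stream.

```python
# Hypothetical sketch: flag a fan motor when current draw trends upward
# while fan speed stays flat -- the "If A Then B" correlation, no "why".
def flag_failing_fan(readings, current_rise=0.10, speed_tolerance=0.02):
    """readings: list of (amps, rpm) samples, oldest first."""
    first_amps, first_rpm = readings[0]
    last_amps, last_rpm = readings[-1]
    current_up = (last_amps - first_amps) / first_amps > current_rise
    speed_flat = abs(last_rpm - first_rpm) / first_rpm < speed_tolerance
    return current_up and speed_flat

# Current creeps from 4.0 A to 4.6 A while speed holds near 1750 RPM.
samples = [(4.0, 1750), (4.2, 1748), (4.4, 1752), (4.6, 1751)]
print(flag_failing_fan(samples))  # True
```

Notice the check says nothing about why rising current with constant speed predicts failure; it only encodes the correlation the data revealed.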
This we can do, with great certainty in many instances. But what we can't do is tell you why it is happening. That is for someone else. We leave the whys to the engineers and the academics. We are happy enough knowing If A Then B.
Correlation can also show us relationships we never knew existed. Like did you know Pop Tarts are hot sellers in the days leading up to a hurricane? Walmart does. They even know which flavor (I want to say strawberry, but don't quote me on this).
This pattern-finding power can be found everywhere from Walmart checkouts, to credit card fraud detection, dating sites, and even medicine.