Being a data professional requires a variety of skills. While you don’t have too look far to find heated debates on what skills are most valuable: R vs Python, SQL vs Pig Script. But to me, before you can make use of any of these skills, you need to have a grasp on data modeling.
What is Data Modeling?
Simply put…Data modeling is finding a way to represent objects or concepts through the use of data elements. Up until recently, these elements were limited to a few data types: text, numeric, boolean, date and time. Newer elements, such as spatial, temporal, video and image files have found their way into data bases, but the job of the data modeler remains the same.
Now there are hundreds of books out there on data modeling, and having read a few, I can attest that the field of data modeling is vast. Luckily for us, we only have to dip our toes in.
Why is Data Modeling Important?
Don’t let training data sets fool you. In the real world, the majority of your time will be spent extracting data from multiple sources, merging it into 1 or more sets, and cleaning and transforming the data for analysis. If you do not understand the underlying principles of how data is stored, you are going to have a hard time even finding it, much less doing anything with it.
While learning about data modeling through more formal methods, you will be exposed to terms like: Entity, Relationship, E-R Diagram, Attributes, Tuples, etc. I am going to try to keep the language here simpler. I will use terms I believe everyone already understands. Terms like: Table, Row, Column. I will leave some side notes here and there for those who want to geek out, but I really would like you to keep your focus on more on the concepts and less on the picky details.
Enough talk, let’s start modeling.
Our First Data Model
So, let us imagine you are working for a college, one that apparently just discovered computers, because they have asked you to come up with a way to track professors, the classes they teach, and what room the class is in.
That sounds simple enough. You go to you office and knock out the model below.
Does it work? Technically yes. It does meet the requirements, but it has some flaws that could be fixed with a better model.
- Notice the box bordered in red. There is not B Larsen – that is a typo. It is supposed to say B Larson, but any model that relies on a human to correctly input information over and over again in the exact same fashion, is bound to run into issues like the this. This issue falls under the name Data Integrity.
- Notice the duplicated information in the Professor and Classroom columns. Having to right the same name over and over again in your model is inefficient and it leads to data integrity errors as highlighted above.
If you pick up any book on database design or data modeling, you will undoubtly come across the phrase normalization. What normalization refers to is a systematic method of converting a data model like the one above into a data model suitable for use in a relational database. **a relational database is the official term for most databases you use on a daily basis. Relational refers to how entities(Tables) and attributes(Columns) in the database interact. It will be explained better later in the article.
Now most databases you will come across (data marts and warehouses excluded) are normalized to what is known as the Third Normal Form (3NF). Now if you want to learn the specifics what it means to be 3NF vs 2NF or 1Nf, I have added a link to a good site on normalization in databases.:
For everyone else, rest assured you really don’t need to know that much detail. You just need to understand a few concepts.
Below we have the data model from above, only now it has been normalized for use in a database.
So what have we done? The first thing you will notice is that we now have 3 tables instead of only 1.Why did we do that? Well first I want you to notice the Professors and Classrooms tables. Notice how each name and classroom is only listed once in these tables? This is done to prevent repeated entry of the same information into the database. Not only does this save the data entry clerk time, but it also reduces the chance for spelling errors like seen in the first data model.
Next, notice the first column in Professors and Classrooms. They ID fields are what is known as primary keys. These numbers be unique as they are used to identify unique data. For and example, let us say S Rider stood for Shawn Rider. Now halfway through the year Sarah Rider was hired. We could have assigned her a new name, or we could simply create another S Rider and assign it a Prof ID of 7. This way Shawn (2) and Sarah (7) can easily be told apart.
Finally, notice how in the Classes table, instead of the professor’s name and classroom number, the ID codes are used. This is the relationship part of the relational database I had mentioned earlier.
You can see in the image above how the 3 tables relate to each other. This is why I stated above that data modeling is so important for you to learn. At some point in your career you maybe given access to a database to pull your own data. When you open up a table like Classes, I want you to understand that the numbers in ProfID and ClassID are reference points to separate tables which store the information you are looking for.
SQL, as well as most BI tools, provide methods for joining the three tables back together to provide you with the data you need to do your analysis. But without a basic understanding of data modeling and how items such as primary keys function, extracting any meaningful data from the database.