Data scientist, data architect, data engineer, database administrator, data analyst: all of these jobs have 1 big thing in common – they all work with databases. So what exactly are databases?
A database is a software platform designed to store, manage, and manipulate data. They come in different flavors from small desktop based solutions to large enterprise databases sprawling across 100’s or 1000’s of servers.
Databases are used for all types of business and personal needs. Some common examples are transactional databases > like the ones run on cash registers in a retail store. They track inventory, prices, and sales.
Data warehouses: they are used by data analysts and business intelligence teams to run reports and build dashboards against. Also, there are unstructured databases that hold information like documents, images, audio, and free text.
From a very broad view, databases can be broken down into two categories: structure and unstructured. Another way you may describe it is SQL vs NoSQL. While not all structured databases run SQL, the vast majority do, so for the sake of this article, we will use SQL and unstructured interchangeably.
Structured SQL databases are still king of the data world at the moment. Most companies use them for both day to day functions as well as reporting and advanced analytics. These databases fall under the umbrella of relational databases.
RDBMS = Relational Database Management System
RDBMS databases stores data in one or more tables formatted into rows and columns (think Excel or Google Sheets). These tables are typically set up to handle some specific data, like an HR employee list or a student grade table. The tables can be joined using SQL to create rich datasets.
In the picture above, we have 2 tables. Using the ID from the first table, you can determine who the instructors are for each of the classes in table 2. This is, at the most basic level, a relational database.
Relational databases are normalized, meaning the duplicates are removed and big tables are broken up in to small more specific tables. This is done to save disk space as well as improve performance. I promise, I will have a whole write up dedicated solely to data normalization, but for now, just know it is a property of RDMS.
Other factors of Relation database design include:
ACID – set of rules to guide database design
Atomicity – integrity of an entire database transaction
Ex: if you are updating a person’s first name, the entire value must get updated to be successful (not just the first letter); no view of partial transactions in progress
Consistency – only data that pass validation rules are permitted
Ex: email address requires an @ otherwise it should not be written to the database
Isolation – the ability for a database to process multiple transactions simultaneously
Ex: if you are performing 2 updates, the first must complete before the second
Durability – once data is saved it should remain saved even when the machine shuts down
Ex: if a machine shuts down in the middle of an update, it should be rolled back; if an update is completed and a machine shuts down then the update should exist on the restart
Non-relational databases (NoSQL) are designed to fill in the gaps left by RDMS. These databases scale easily. Their loose approach to storing and managing data allow them to scale across multiple servers, allowing them to grow with demand and shrink as demand shrinks. They have a more flexible design, allowing for storage of data not easily handled by a RDMS such as: images, audio, documents.
When it comes to growth and scaling, there is a clear difference between the two approaches
The following articles will be based on RDMS, mostly due to the fact they are still the primary database in use today, and also they are easier to conceptualize. Don’t worry though, everything you learn here will help you to later understand the world of NoSQL unstructured data.
Back to Main Course Page: Course