In this lesson, we will be importing an Excel file into MS Access and learning to use the sort and filter functionality. If you want to follow along, download the Excel file below:
I am using the same database I build in the previous lesson. You can start a new one if you want, but since most of these early lessons will be sandboxing( a programming term for playing around with a software platform) in Access it doesn’t really matter if you just use the same database for everything.
From the Home screen in Access, click on the External Data tab and open up the External Data ribbon
Select New Data Source > From File > Excel
Browse for your excel file, select Import the source data into a new table and click ok (note, if you want to use a different Excel file, you can)
My excel file has column names in the first row, so I make sure that is check and I click next
Up top, you can change column names, and data types, but in this case, Access did a good job of assigning datatypes for me. Next>
I check Let Access add primary key, you could just select your own from the drop down though. Next >
Name your table whatever you want and hit Finish
If you spent a lot of time fixing data types and column names or if you are going to upload a file like this on a regular basis, you can save import steps, but I am going to skip this for now, just hit close
Click on your new table to open it up
Sorting
Filters and sorting work like they do in Excel, here I select my column and Sort Largest to Smallest
Filter
Here is select Model and filter down to a single model (Accuvix A30)
MS Access is a all in one database solution provided as part of MS Office. Access was much more popular years ago when more powerful solutions such as Oracle an SQL Server couldn’t be effectively run on desktops. While Access’s time in the sun is definitely in decline, for someone new to concepts of databases, Access can be a great introduction.
Let’s start by creating a simple database. Open up Access and click on Blank Database
Give your blank database a name in the pop up window
By default, you’re new database will start with a Table1 with an ID column. Click the drop-down on the next column space to add a new column. For this example, let us select Short Text as our data type. Short text can handle any text up to 255 characters
After you set the datatype, you can click on the column name and rename it
Let’s add another column, set the datatype to number
Now click the X next to table1 and you will be prompted to name your table
Another way to build a table is through the table design feature
You can name your columns and select data types: Note AutoNumber is an auto incrementing datatype that works to provide you with an auto populating ID number
Access requires primary keys for all tables (something not required, but recommended in other systems like MySQL or SQL Server). To create a Primary Key, right click on ClassID and select Primary Key
Finally, let’s just use SQL to create a table. Select Create an then Query Design from the ribbon menu
Select SQL View
Put in the SQL below
create table teacher (
teacherID int primary key,
teacher_name varchar(255))
Click on the new teacher table in the left and you can fill in some data
Now let’s create a relationship between the Class and Teacher tables. Click on Database Tools > Relationships
Drag the Class and Teacher table into the blank sheet and click Edit Relationships
Click create new, select your tables and columns from the drop down, and select Enforce Referential Integrity and select the Create button
A Relationship line will appear, showing you your connection
When start on the design of a database there are few questions you need to ask right away:
Not to be dismissive, but do not simply build a database to the specs of a request
Take stock of the data at hand, ask around if there is related data that also might be included
Think about what users will be doing with the data
As best you can, try to anticipate other uses for the data that come up
Focus on the type of data you are dealing with, and how it can be best used/store
Don’t only focus on the current use of the data, consider the data may find other uses as well
Now, let’s look at database development through the lens of a software development cycle
Keep in mind, this process is not simply linear. This process, in the real world, goes through many iterations. Change management is good idea to put in place. It helps you to deal with issues like:
New data sources
New requirements from business users
Technology platform changes in your organization
To get started, we need to come up with a problem statement, something that can describe the problem we are trying to solve. Use cases are commonly used when developing problem statements. Use cases help to demonstrate how a user will be interacting with your database:
Transactional – think like a cash register, adding new transactions to database, updating inventor
Reporting base – data doesn’t change often, people want to look more at aggregate numbers over the details of each transaction. (End of the day reporting – what is the final sales total from the cash register)
You can use Unified Modeling Language (UML) to create your use case. UML uses a series of diagramming techniques to help visualize system design. Below, the stick figure represents users and each oval will show one of the tasks the user hopes to be able to use the database for. UML is great for mock-ups, but you should also include a detailed document that provides deeper insight into the tasks below.
Now imagine we want to create a database to store plant data:
Using this data, we can come up with a few basic use cases
Use case 1: Enter (or edit) all the data we have about each plant; that is, plant ID, genus, species, common name, and uses.
Use case 2: Find or report information about a plant (or every plant) and see what it is useful for.
Use case 3: Specify a use and find the appropriate plants (or report for all uses).
How does our existing table design handle our use cases:
1.Can we maintain data? Yes
2.Given a plant can we return the uses? Yes
3.Given a use can we return the plant? Not really
We need to re-examine this data from a class point of view
We are dealing with 2 classes in the Plant data:
1.Plant
2.Use
•One plant can have many uses
•This is an example of a relationship between classes
•Relationships between classes are represented with a line in UML
•We need more details besides just a line to know how classes relate
•The pair of numbers at each end of the line indicates how many objects of one class can be associated with a particular object of the other class
•The first number is the minimum number (0 or 1 to indicate whether there must be a related object)
•The second number is the greatest number of related objects (1 or n to represent many)
Design Phase
•You spend a lot of time in the top 2 quadrants
•It is an iterative process, not simply linear
•Keep reviewing the model to ensure it satisfies the problem
•Once you have a model that works, move on to design
•For the database world, design involves:
Converting our Class Diagram into Database Objects: Tables, Keys, and Relationships
Application Phase
•Now that you have a database foundation in place, the Application phase builds upon it by adding additional functionality for the users
•The Application phase satisfies the Use Cases:
Input Use Cases will most likely be satisfied with Forms
Output Use Cases will most likely be satisfied with Reports
Database = A collection of structured tables designed to store data
Table = A database object that stores data by organizing it into rows and columns
Field = Each column in a database table is called a field
Record = Each row in a database table is called a record
We want to create databases that will:
1.Store our data
2.Let us interact and ask questions of the data
3.Retrieve or export the data
In order to do this, we need to have a properly designed database
Example Situation #1:
A teacher puts out a sign up sheet during back to school night to allow parents to donate various classroom supplies. The parents of the class fill out the sign up sheet and it looks like this:
The next day the teacher types up the donation sheet and puts it in Excel. The teacher decides that he wants to try to build a database to keep track of what each person donated. After noticing that at most someone is donating 3 items he builds a database table that looks like this:
In Excel:
Database:
The teacher is pleased. He set out to create a database to keep track of what each parent donated. Given a parent name he can easily see their donations. However, when he wants to see all the parents that have donated tissues he ran into some problems.
Problems:
•The table answers questions in 1 direction only (given a parent you can easily return donations)
•To answer the question in the other direction (what parents donated tissues?) you need to search across 3 different fields
•The table was designed without recognizing we are dealing with 2 different classes
With a parent name provided as input (ex: Greg Carter), a record in the table can be identified. Once we have identified a record, the values for all fields in that record can be retrieved.
So, this table works to answer questions like this:
“What donations did Greg Carter provide?”
Example Situation #1 Takeaways:
1.Do not let an existing form dictate your table design (a form can be simple like a paper signup sheet or more complex like the interface of an application)
2.You must get to know your data to be able to correctly identify how many classes of data you have
3.Be careful when designing your table to answer a particular question – will you ever want to ask that question in the opposite direction?
4.Don’t rush to simply load given data into a table – take the time to think through design
Repeated Information:
•Another common problem is storing the same piece of information several times
•If you find this happening, you may be placing too much emphasis on an existing form
Example:
A business’s paper order form has customer name, address, phone number. We wouldn’t want to create a database table to keep storing all the information for every single order. It is inefficient and will most likely lead to inconsistencies and future problems.
If you see repeated information in a table, it should be a huge red flag for you to re-examine the design.
Designing for a single report:
•You can’t allow an existing form to determine your table design
•Similarly, you can’t allow a report to determine your table design
•Think about reports as nicely presented data coming out of your database
Data scientist, data architect, data engineer, database administrator, data analyst: all of these jobs have 1 big thing in common – they all work with databases. So what exactly are databases?
A database is a software platform designed to store, manage, and manipulate data. They come in different flavors from small desktop based solutions to large enterprise databases sprawling across 100’s or 1000’s of servers.
Databases are used for all types of business and personal needs. Some common examples are transactional databases > like the ones run on cash registers in a retail store. They track inventory, prices, and sales.
Data warehouses: they are used by data analysts and business intelligence teams to run reports and build dashboards against. Also, there are unstructured databases that hold information like documents, images, audio, and free text.
From a very broad view, databases can be broken down into two categories: structure and unstructured. Another way you may describe it is SQL vs NoSQL. While not all structured databases run SQL, the vast majority do, so for the sake of this article, we will use SQL and unstructured interchangeably.
Structured SQL databases are still king of the data world at the moment. Most companies use them for both day to day functions as well as reporting and advanced analytics. These databases fall under the umbrella of relational databases.
RDBMS = Relational Database Management System
RDBMS databases stores data in one or more tables formatted into rows and columns (think Excel or Google Sheets). These tables are typically set up to handle some specific data, like an HR employee list or a student grade table. The tables can be joined using SQL to create rich datasets.
In the picture above, we have 2 tables. Using the ID from the first table, you can determine who the instructors are for each of the classes in table 2. This is, at the most basic level, a relational database.
Relational databases are normalized, meaning the duplicates are removed and big tables are broken up in to small more specific tables. This is done to save disk space as well as improve performance. I promise, I will have a whole write up dedicated solely to data normalization, but for now, just know it is a property of RDMS.
Other factors of Relation database design include:
ACID – set of rules to guide database design
Atomicity – integrity of an entire database transaction
Ex: if you are updating a person’s first name, the entire value must get updated to be successful (not just the first letter); no view of partial transactions in progress
Consistency – only data that pass validation rules are permitted
Ex: email address requires an @ otherwise it should not be written to the database
Isolation – the ability for a database to process multiple transactions simultaneously
Ex: if you are performing 2 updates, the first must complete before the second
Durability – once data is saved it should remain saved even when the machine shuts down
Ex: if a machine shuts down in the middle of an update, it should be rolled back; if an update is completed and a machine shuts down then the update should exist on the restart
Non-relational databases (NoSQL) are designed to fill in the gaps left by RDMS. These databases scale easily. Their loose approach to storing and managing data allow them to scale across multiple servers, allowing them to grow with demand and shrink as demand shrinks. They have a more flexible design, allowing for storage of data not easily handled by a RDMS such as: images, audio, documents.
When it comes to growth and scaling, there is a clear difference between the two approaches
The following articles will be based on RDMS, mostly due to the fact they are still the primary database in use today, and also they are easier to conceptualize. Don’t worry though, everything you learn here will help you to later understand the world of NoSQL unstructured data.
Data Analysts get a bad wrap. With the advent of the Data Scientist, Data Analysts are often viewed as Data Scientists lite, however I feel that is not the honest case. Truth is, there is a lot of overlap between the two fields. I will dive deeper into what a Data Scientist is in a future article, but just know my opinion is the definition of Data Scientist as a job is still a bit fuzzy and I think the job title may eventually be broken into a few different titles to better define the differences.
Data Analyst
So what does a Data Analyst do?
A lot actually. You could put 10 data analysts into a room and you would get ten different answers to this question. So the best I can do here is make sweeping generalities. As the old saying goes “Your results may vary”
In general, data analysts perform statistical analysis, create reporting, run ad-hoc queries from data warehouses, create data visualizations, create and maintain dashboards, perform data mining, and create machine learning models (yes, ML is not only for data scientists). Their assignments are business driven. A data analysts is either embedded with a business unit (financial planning, fraud, risk management, cyber security, etc.) or working in a centralized reporting/analytics team. They use their skills to provide reporting and analytics for the business.
Tools used by Data Analysts
SQL – MySql, SQL Server, Oracle, Teradata, Postgres – whether simply querying a data warehouse or creating and managing a local data mart, data analysts need to be advanced SQL programmers
Visualization tools – Tableau, Qlik, Power BI, Excel, analysts use these tools to create visualizations and dashboards
Python/R – Data analysts should be familiar with languages like Python or R to help manage data and perform statistical analysis or build machine learning models
Spreadsheets – Excel, Google Sheets, Smart Sheets are used to create reports, and pivot tables used to analyze the data
ETL tools – SSIS, Alteryx, Talend, Knime, these tools are design to move data to and from databases, CSV files, and spreadsheets. Until the data is in a usable format, analysis cannot be performed.
Educational Requirements
Typically a data analyst position will ask for a bachelors degrees, preferably in computer science, statistics, database management or even business. While the barrier to entry for a data analyst job is generally not as high as a data scientist, that does not mean you cannot make a meaningful and well paid career as a data analyst. Also, the demand for data professionals seems to keep going up and up and it most likely will for the foreseeable future.
When it comes to deleting data from a SQL server, you have 3 main options: Drop, Delete, and Truncate
Drop
The Drop command completely deletes a table from the database. Let’s take the table Employee
When I run the following code: Drop Table Employee;
You can see the entire table has been dropped from the database, data and all. There is no way to recover data from a dropped table, short of having a back up
Delete
Delete removes data from the table, not the table its. Also, Delete can be used in conjunction with a Where clause to choose exactly which data to delete
You see that only the row with ‘Chris’ was deleted
Without the Where clause, all data is deleted, however the table remains
Truncate
Truncate acts just like Delete but you can’t add a Where clause, it simply deletes all data from a table