Database Design: First Things to Consider

First lets start with some basic terminology

Some Relational Database basics before we begin:

Database = A collection of structured tables designed to store data

Table = A database object that stores data by organizing it into rows and columns

Field = Each column in a database table is called a field

Record = Each row in a database table is called a record

We want to create databases that will:

1.Store our data

2.Let us interact and ask questions of the data

3.Retrieve or export the data

In order to do this, we need to have a properly designed database

Example Situation #1:

A teacher puts out a sign up sheet during back to school night to allow parents to donate various classroom supplies.  The parents of the class fill out the sign up sheet and it looks like this:

The next day the teacher types up the donation sheet and puts it in Excel.  The teacher decides that he wants to try to build a database to keep track of what each person donated.  After noticing that at most someone is donating 3 items he builds a database table that looks like this:

In Excel:

Database:

The teacher is pleased.  He set out to create a database to keep track of what each parent donated.  Given a parent name he can easily see their donations.  However, when he wants to see all the parents that have donated tissues he ran into some problems. 

Problems:

•The table answers questions in 1 direction only (given a parent you can easily return donations)

•To answer the question in the other direction (what parents donated tissues?) you need to search across 3 different fields

•The table was designed without recognizing we are dealing with 2 different classes


With a parent name provided as input (ex: Greg Carter), a record in the table can be identified.  Once we have identified a record, the values for all fields in that record can be retrieved.

So, this table works to answer questions like this:

“What donations did Greg Carter provide?”

Example Situation #1 Takeaways:

1.Do not let an existing form dictate your table design (a form can be simple like a paper signup sheet or more complex like the interface of an application)

2.You must get to know your data to be able to correctly identify how many classes of data you have

3.Be careful when designing your table to answer a particular question – will you ever want to ask that question in the opposite direction?

4.Don’t rush to simply load given data into a table – take the time to think through design


Repeated Information:

•Another common problem is storing the same piece of information several times

•If you find this happening, you may be placing too much emphasis on an existing form

Example:

A business’s paper order form has customer name, address, phone number.  We wouldn’t want to create a database table to keep storing all the information for every single order.  It is inefficient and will most likely lead to inconsistencies and future problems. 

If you see repeated information in a table, it should be a huge red flag for you to re-examine the design. 

Designing for a single report:

•You can’t allow an existing form to determine your table design

•Similarly, you can’t allow a report to determine your table design

•Think about reports as nicely presented data coming out of your database

Back to Main Course Page: Course

Data Jobs: What do Data Scientists do?

While generalizing any profession is difficult since so many factors can come into play: (academic versus corporate — large company versus start-up), I find the role of Data Scientist can truly only be described by breaking it apart into three very different roles. For lack of better terms I will name these roles: Academic/Research Data Scientist, Applied Data Scientist, and finally Data Analyst/Scientist. These roles require different skill sets and different mindset that I will discuss below.

First off, I would like to put my bias out front, I am an Applied Data Scientist. Now, that does not mean I feel that my current role is superior to any other data scientist role out there. Also, I work for a very large company with lots of resources, so I also want to point out that the roles I will describe below can overlap depending on your working environment.

Academic/Research Data Scientist

These are the people who design new machine learning algorithms and push the boundaries of data science and AI. When you see advances in self driving cars and computer vision, these are the people behind those advances. These individuals work either in a university setting or as part of a research team for companies like Google, Facebook, Tesla. Most either hold a PhD or are currently working on one. These data scientists actively develop and conduct research experiments that are written up and published in scientific journals. These are the pure scientists among the data science world.

To be amongst this crowd of data scientists, you need to be at the top of your game with advanced mathematics and programming skills. If you really love diving deep into the field and can handle the often glacial pace research moves at, this could be the job for you. The biggest drawback is that these jobs are limited. There are only so many academic or industry research positions out there. Most data scientists today instead fall into the other two categories.

Applied Data Scientist

Applied Data Scientists are a bit more pragmatic. They work for a company that drives their goals, instead of being funded by research grants. They do not have the luxury of time often afforded those working on grants, so the solutions they build need to go into production sooner rather than later. (Now to be fair to researchers, applied data scientists also don’t have to deal with the headaches surrounding grant proposals).

Typically, data scientists working in industry (not part of a research team) are not out developing new algorithms or trying to push the limits of machine learning. Instead, they use tools created by others to explore and derive meaning from data that can be acted upon. When the boss wants actionable data and they need it now. Most applied data scientists keep a few algorithms on hand that they know work for certain scenarios and spend most of their time gathering, cleaning, and prepping the data to build out the models.

I kind of like this position. I view myself more of an applied scientist. Even while going through my PhD, I always leaned towards applied versus theoretical.

To work as an applied data scientist, a candidate should have a master’s degree or at least 6 years industry experience. They should be inquisitive and honestly interested in the domain in which they work in. I work in cyber security, I have spent a lot of time researching and studying the field so I can identify opportunities to provide a data driven solution. Candidates should also have a wide ranging skill set beyond just ML. An applied data scientist should be well versed in tools such as dashboard development, optimization modeling, forecasting, and simulation modeling.

Data Analyst/Scientist

This is probably the most common data science position right now. What this position is calling for is basically a top level analyst familiar with data science tools. Now I am not denigrating this positon. What these data scientists do is every bit as challenging and important as the other two I described above. They are expected to operate as a data scientist while also handling analyst or business intelligence duties as well.

People in this position are not less than or incapable of doing the other roles described above, they are instead part of an organization that either does not have a fully developed data science program or they are working outside the data science organization. These data scientists are often embedded in a department providing data management and analysis expertise. Truly jacks of all trades, these data scientist often stand-up and manage data warehouses or data marts for their department. They can be expected to handle reporting duties as well as machine learning model development.

Conclusion

To sum this all up, data science is a broad and still evolving field. Solid industry definitions are not common place and titles often do not represent actual job duties. However, most data scientists can be generalized under one of the three roles I discussed above: Academic/Research Data Scientist, Applied Data Scientist, or Data Analysts/Scientists. Neither role is inherently better or more important, but the differences in the rows can definitely attract different individuals.

Python: Working with Rows, Pandas DataFrames

Here is a quick lesson on working with rows in DataFrames

Let’s start by building a DateFrame

Lets start by deleting (or dropping) a row:


Let’s try dropping a couple of rows:

[-1] allows us to drop the last row:

Side note:

Here is another method from pandas you should know. This is the shape method. See it returns (rows, columns)

We can filter rows out using conditional logic

The index method returns the row indexes

We can use the .index as well as conditional logic to add a new column to our dataframe

Using append(), we can add a dictionary as a new row to the DataFrame

Finally, using .iloc we can iterate through the values of a row

Last Lesson: Pandas Renaming Columns

Next Lesson: Python: Working with Rows in DataFrames

Return to: Python for Data Science Course

Follow this link for more Python content: Python

Python: Closures

This is a more advanced programming topic, and honestly not one I make much use of myself. However, I got a request to make a lesson on Closures and Decorators. I am going to make it into two lessons to try to make it a bit clearer.

Now to understand closures and decorators, the first thing we need to discuss is nested functions. In programming, nested functions means wrapping a function inside of another function. Here is a very simple example of a nested function:

def outerFunction(x):
    innerFunction():
        print(x)
    innerFunction()

Now if I just call this with outerFunction(‘Hello World’), the argument x gets passed into the innerFunction which is in turn called by the outerFunction. The result is shown below:

Now why would anyone do this? Outside of just being able to show you that you can? I don’t know. There are reasons for nested functions that involve hiding variables from the code outside of the functions. That is a bit beyond the scope of this lesson. I will try to tackle it when I create my Object Oriented Programming lessons.

For now, let us get on to Closures. Let’s look at the code below:

It is very close to the first example, except notice 2 little changes. First, instead of just calling the innerFunction at the end of the function, we are returning it. Notice there is no () when we return innerFunction

Also notice we are passing the outerFunction to a variable and when we call the var we add () to the end: var()

What we have done now is created an official Closure, let me show you why that matters below.

First, notice I delete the outerFunction. When I try calling it returns an error saying outerFunction is not defined (meaning it doesn’t exist).

But, look what happens when I called var(). It is still there. The closure commits the value to memory when you create it. That is not something that would happen from a normal function or even nesting function.

SO WHY THE HELL SHOULD I CARE????

So why this is cool, and why you might use this in the future, is imagine a function that does processer heavy multi-step calculations. If you just keep calling the function as usual, each time you call it, it has to run the whole chain of calculations again. However, once you’ve called it as closure, you don’t have to run the calculations again, you just get the value.

So, how about a more concrete example

Here I created a closure that uses two functions which each take 1 argument, the arguments are then multiplied together.

Notice, I created a variable called double where I sent the outer function mult_by() a value of 2.

When I call double() I can pass it a value for (y)

I can create as many of these instances as I want. Look below, created one for Triple, and notice double is still functional

I’ll continue this in the Decorators lesson.

Just the main thing to keep in mind about closures, is that they improve program performance by no requiring the function to be run each time.

If you are still a little confused after reading this, don’t feel bad. It took me a few tries to finally understand closures myself.

Data Jobs: What does a Data Analyst Do?

Data Analysts get a bad wrap. With the advent of the Data Scientist, Data Analysts are often viewed as Data Scientists lite, however I feel that is not the honest case. Truth is, there is a lot of overlap between the two fields. I will dive deeper into what a Data Scientist is in a future article, but just know my opinion is the definition of Data Scientist as a job is still a bit fuzzy and I think the job title may eventually be broken into a few different titles to better define the differences.

Data Analyst

So what does a Data Analyst do?

A lot actually. You could put 10 data analysts into a room and you would get ten different answers to this question. So the best I can do here is make sweeping generalities. As the old saying goes “Your results may vary”

In general, data analysts perform statistical analysis, create reporting, run ad-hoc queries from data warehouses, create data visualizations, create and maintain dashboards, perform data mining, and create machine learning models (yes, ML is not only for data scientists). Their assignments are business driven. A data analysts is either embedded with a business unit (financial planning, fraud, risk management, cyber security, etc.) or working in a centralized reporting/analytics team. They use their skills to provide reporting and analytics for the business.

Tools used by Data Analysts

  • SQL – MySql, SQL Server, Oracle, Teradata, Postgres – whether simply querying a data warehouse or creating and managing a local data mart, data analysts need to be advanced SQL programmers
  • Visualization tools – Tableau, Qlik, Power BI, Excel, analysts use these tools to create visualizations and dashboards
  • Python/R – Data analysts should be familiar with languages like Python or R to help manage data and perform statistical analysis or build machine learning models
  • Spreadsheets – Excel, Google Sheets, Smart Sheets are used to create reports, and pivot tables used to analyze the data
  • ETL tools – SSIS, Alteryx, Talend, Knime, these tools are design to move data to and from databases, CSV files, and spreadsheets. Until the data is in a usable format, analysis cannot be performed.

Educational Requirements

Typically a data analyst position will ask for a bachelors degrees, preferably in computer science, statistics, database management or even business. While the barrier to entry for a data analyst job is generally not as high as a data scientist, that does not mean you cannot make a meaningful and well paid career as a data analyst. Also, the demand for data professionals seems to keep going up and up and it most likely will for the foreseeable future.

Python: Create a QR code with pyqrcode

We are using a module that does not come with the Anaconda distribution. To get this code, open up your Anaconda Prompt and type the following code:

> pip install pyqrcode

Next, you will import pyqrcode into python. Also, since the output of this program will be a png picture file, I am running pwd to determine where on my machine the file will be saved.

Next pick a URL you want your QRCode to go to. I chose my website. I assigned my website address to a variable url.

Then, using the pyqrcode module and its method create(), place your variable in the parenthesis ().

Note printing the QR code you created will not display it, but it will give you the basic information on it. To turn it into a QR code, we will need to covert it to a png.

*Note: Some of you may need to download the png module if you get an error. Just go to pip install png and add import png to the top of the script.

Otherwise, just give you file a name, and scale — the larger the scale, the larger your QR code will be.

Finally, go to your directory where the file was saved and open it.

Scanning the QR Code below should take you to my website’s main page.

Python Project 1: Lists and Dictionaries

For your first project in the course, I am giving you the code below. You will notice I am placing 2 dictionaries in a list:

d = {'Name': 'Ben', 'Age': 35}
e = {'Name': 'Christine', 'Age': 30}
a = []
a.append(d)
a.append(e)
print(a)

Your challenge, should you choose to accept it, will be to run this code on your machine or in the test python browser (click on the blue arrow below) 

Try your Python code in the free console

I then want you to

1) Print out the second dictionary from the list

2) Print out the name from the first dictionary

3) Print out both ages

Note: I have not shown how to work with dictionaries inside a list. So consider this a stretch project. Don’t be afraid to use Google to help find an answer.


Project Answer Notebook Download Available at Course Page below:

Back to Python Course: Course

Solution Video

Python Web Scraping / Automation: Connecting to Firefox with Selenium

Selenium is a Python package that allows you to control web browsers through Python. In this tutorial (and the following tutorials), we will be connecting to Googles Chrome browser, Selenium does work with other browsers as well.

First you will need to download Selenium, you can use the following commands depending on your Python distribution

c:\> Pip install selenium

c:\> Conda install selenium

If you are on a work computer or dealing with a restrictive VPN, the offline install option may help you: Selenium_Install_Offline

Next you need to download the driver that let’s you manage Firefox through Python.

Start by determining what version of Firefox you have on your computer

Click the three horizontal lines in the upper right corner > Help >About Firefox

Search for geckodriver to download the file that matches your Firefox version. (note, this is something you will need to do every time Firefox is updated, so get used to it.)

Open up the zipfile you downloaded, you will find a file called geckodriver.exe

Put it somewhere you can find, put in the following code to let Python know where to find it.

from selenium import webdriver
opts = webdriver.FirefoxOptions()
dr = webdriver.Firefox('C:/Users/larsobe/Desktop/geckodriver.exe',chrome_options=opts)

Now to see if this works, use the following line, (you can try another website if you choose)   

Note the message Firefoxis being controlled by automated test software.

You are now running a web browser via Python.

How do I get a job in Data Science?

This has to be the most common question on data science I am asked, and honestly it is a hard one to answer. For everyone out there trying to get your foot in the door on your first data job, believe me, I feel for you. Multiple interviews without any offers, or even not getting any interviews at all can be beyond frustrating. Now unfortunately, I do not have any magic trick to get your into the data field, but I can share how I did it.

So, how did I get into the data science field…

Honestly, I “Made” my first job. My first career out of the Army was as a biomedical equipment technician. I fixed medical equipment like patient monitors, ultrasounds, and x-ray machines.

We had a ticketing system called MediMizer where all the repairs and routine maintenance jobs were recorded. My bosses would run monthly reports out the system. I read some of the reports and just felt like we could do better.

I started with just Excel. I downloaded some data, created some pivot charts and made some basic visualizations. I asked new questions from the data. I looked at angles that weren’t covered in the existing reporting.

I showed these to my bosses, my co-workers, other department managers, basically anyone who would listen to me. Then I learned about Tableau, and using its free version I was able to create some more professional looking visualizations.

I learned how to make a dashboard, I started analyzing data sets from other departments, and I began feeding them my reports. I went back to school to get a degree and used what I was learning in school to improve my reporting skills.

While my job title didn’t change, I was now able to put data analysis skills on my resume. I was lucky enough to have very supportive management who saw the value in what I was doing, and allowed me to dedicate some of my time to it.

But most importantly, I was now a data professional (even if not in title). I was using data to solve real world problems. I put together a portfolio of some of the reporting I was doing. This allowed me to show my future employer that not only was I able to create reporting, but more importantly I was able to identify real world business problems and use data to help solve them.

The take away is don’t let your job title hold you back. Look around, what kind of problems do you see? Can you find a data-driven solution to help fix the problem? If you do this, you are a now a data professional (even if not in title). A portfolio made from real world examples can be more impressive than generic tutorial or Kaggle projects.

Remember, when trying to break into a new field, sometimes you need to make your own luck.

SQL: Load (Insert) data into a table

Here are the steps to add data to existing table:

Let’s use the employee table created on the create table page: Create_table

To insert some data into this table, first you need to know what the data types of each of the columns in the table are. In this example I am using my MySQL. An easy way to see column data types is to Right Click on Table name> Send to SQL Editor > Create Statement

The results will appear in the query window. In this table we can see the columns are a integer (INT), string (varchar), DOB (date)

The syntax for inserting data into the table is as follows

insert into <table name> (column1, column2,...)
values (value1, value2,...)

In the example below, we are loading 3 rows in, separating each row by a comma:

If we run this, we can then check on the results with a Select statement