SQL: Drop, Delete, Truncate commands

When it comes to deleting data from a SQL server, you have 3 main options: Drop, Delete, and Truncate

Drop

The Drop command completely deletes a table from the database. Let’s take the table Employee

When I run the following code: Drop Table Employee;

You can see the entire table has been dropped from the database, data and all. There is no way to recover data from a dropped table, short of having a back up

Delete

Delete removes data from the table, not the table its. Also, Delete can be used in conjunction with a Where clause to choose exactly which data to delete

You see that only the row with ‘Chris’ was deleted

Without the Where clause, all data is deleted, however the table remains

Truncate

Truncate acts just like Delete but you can’t add a Where clause, it simply deletes all data from a table

Python: Create a Word Cloud

Word Clouds are a simple way of visualizing word frequency in a corpus of text. Word Clouds typically work by displaying frequently used words in a text corpus, with the most frequent words appearing in larger text.

Here is the data file I will be using in this example if you want to follow along:

As far as libraries go, you will need pandas, matplotlib, os, and wordcloud. If you are using the Anaconda python distribution you should have all the libraries but wordcloud. You can install it using PIP or Conda install.

Lets start by loading the data

import pandas as pd
import matplotlib.pyplot as plt
from wordcloud import WordCloud
import os

#Set working directory
os.chdir('C:\\Users\\blars\\Documents')

#Import CSV
df = pd.read_csv("movies.csv")

#First look at the Data
df.head()

** Note: if you are using Jupyter notebooks to run this, add %matplotlib inline to the end of the import matplotlib line, otherwise you will not be able to see the word cloud

import matplotlib.pyplot as plt %matplotlib inline

We can use df.info() to look a little closer at the data

We have to decide what column we want to build our word cloud from. In this example I will be using the title column, but feel free to use any text column you would like.

Let look at the title column

As you can see, we have 20 movie titles in our data set. Next thing we have to do is merge these 20 rows into one large string

corpus = " ".join(tl for tl in df.title)

The code above is basically a one line for loop. For every Row in the Column df.title, join it with the next row, separating by a space ” “

Now build the word cloud

wordcloud = WordCloud(width=640, height=480, max_words=20).generate(corpus)

You can change the width and height, number of words that will appear. Play around with the numbers, see how it changes your output

Finally, let’s chart it, so we can see the cloud

plt.imshow(wordcloud,interpolation="bilinear")
plt.axis("off")
plt.show()

interpolation = “bilinear” is what lets the words so sideways and up and down

plt.axis(“off”) gets rid or axis markers (see below)

You can also go back to the word cloud and change the background color
wordcloud = WordCloud(width=640, height=480, background_color = 'white', max_words=25).generate(corpus)
plt.imshow(wordcloud,interpolation="bilinear")
plt.axis("off")
plt.show()

SQL: Load (Insert) data into a table

Here are the steps to add data to existing table:

Let’s use the employee table created on the create table page: Create_table

To insert some data into this table, first you need to know what the data types of each of the columns in the table are. In this example I am using my MySQL. An easy way to see column data types is to Right Click on Table name> Send to SQL Editor > Create Statement

The results will appear in the query window. In this table we can see the columns are a integer (INT), string (varchar), DOB (date)

The syntax for inserting data into the table is as follows

insert into <table name> (column1, column2,...)
values (value1, value2,...)

In the example below, we are loading 3 rows in, separating each row by a comma:

If we run this, we can then check on the results with a Select statement

SQL: Create a Table

Create a table in MySql and Sql Server

Create table <table name> (
           <column 1> data type,
           <column 2> data type )

As an example, let’s create an employee table with 3 columns, ID, Name, and Date of Birth

create table Employee(
   Id int,
  Emp_NM varchar(255),
  DOB date);

** remember that unless you set your database to be case sensitive, most SQL databases are not

You may need to refresh your database explore view, but after running this command, you should now have a new table

If you need a reference table for the types of data types available, check out this page: data types

SQL: Add, rename, or delete a column in an existing table

To add a column here is the syntax

alter table <table name>
add <column name> <datatype>;

example: (add a column named location to a table emp_info)

alter table emp_info
add location varchar(255);

To rename a column in sql, here is the syntax

alter table <table name>
rename column <old name> to <new name>;

example: (change name of column location to office)

alter table emp_info
rename column location to office;

To delete or remove a column, use the drop command

alter table <table name>
drop column <column name>;

Example:

alter table emp_info
drop column office;

Return to SQL page:

Python: Read all files in a folder

os.listdir() command will easily give you a list off all files in a folder.

So for this exercise I created a folder and threw a few files in it.

Using the following code, can iterate through the file list

import os
for files in os.listdir("C:/Users/blars/Documents/test_folder"):
    print(files)

Now if I wanted to read the files, I could use the Pandas command pd.read_excel to read each file in the loop

***Note, I made this folder with only Excel files on purpose for ease of demonstration. You could do this with multiple file types in a folder, it would however require some conditional logic to handle the different file types

To read all the Excel files in the folder:

import pandas as pd
import os

os.chdir('C:/Users/blars/Documents/test_folder')
for files in os.listdir("C:/Users/blars/Documents/test_folder"):
    print(files)
    file = pd.read_excel(files)
    print(file)

Data Science Interview Questions: Parts of a SQL Query

A common question I have seen good people get tripped up on. Remember KISS – Keep it Simple Stupid.

Don’t let the question trick you into overthinking. For a SQL query you need a SELECT command (what do we want), a FROM (what table is it in) and if you want to go for the extra credit point, throw in a WHERE (its our filter)

Return to questions page

Real Data Science Interview Questions

Below is a list (in no particular order) of real interview questions I have either asked, been asked, or saw asked in a real interview. While everyone has their own take on interview advice, mine is pretty clear cut. Answer the question, nothing more nothing less. Don’t get caught up in the trap of trying to add too much detail. Unless specifically asked for more detail, the interviewer more often than not just wants to make sure you have a grasp of the concepts.

1. Explain the difference between Regression and Classfiers:

Click here for answer

2. What does ETL stand for?

Click here for answer

3. What is the difference between a data warehouse and a transactional database

Answer is coming

4. Name 3 ETL tools

Answer is coming

5. Explain Probability vs Odds

Click here for answer

6. What are the basic elements or parts for writing SQL queries?

Click here for answer

7. Explain supervised versus unsupervised machine learning

Answer is coming

8. Explain the difference between a left join and an inner join

Answer is coming

9. What is ensemble modeling?

Answer is coming

10. What is a confusion matrix?

Answer is coming

11. What is DDL vs DML

Click Here for Answer

Real Data Science Interview Questions: Question 1

Explain the difference between Regression and a Classifier:

Link to video

While regression and classifiers are both popular machine learning model types, the difference sits in the results they return:

Regression returns distinct values: think height of a person, price of a house, weight of truck

Classifiers return categories: tall / short, expensive/mid-ranged/cheap, heavy/light

Return to interview questions

Data Jobs: What does a Data Architect do?

If experience has taught me anything, it is that while companies and organizations have gotten much better at collecting data, most of this data is still just stored away in unstructured data stores. So while the data is “technically” there, it is not a whole lot of use to anyone trying to build a report or create a machine learning model. In order to get actually use out of all the data being stored, it needs to organized into some sort usable structure: Enter the Data Architect.

Data Architect is a job title I have held on multiple occasions throughout my career. My best description when explaining the job to other people is that I was kind of like a data janitor. You see, everyone takes their data and dumps it into the storage closet. So you find some billing data on the second shelf, tucked away behind employee HR records. The floor is cluttered with server logs and six years worth of expense reports stored as PDFs with completely useless names like er123.pdf.

As a data architect, it was my job to try to organize this mess and put the data into some sort of structure that lends itself to reporting or modeling. So data architects have to be well versed in data modeling, data storage, and data governance.

Data Modeling

ERD diagram showing an HR database design

Data modeling is basically taking raw data dumps and organizing them into structure that fit the needs of company. It could involve creating an HR database like above or creating a series of aggregated tables designed for reporting or dashboarding. It is the job of the data architect to best fit business needs to a data platform: be it a transactional database, a data warehouse, or perhaps a data lake.

Data Storage

Data architects also need to address data storage. While I often defer to the server crew in the IT department, as a data architect I do advise on to how and where to store data. Cloud infrastructure and cheaper faster disk storage has made a lot of data storage decisions easier, but it is good to have a working understanding of storage platforms.

Data Governance

Data governance is all about how the data is managed from a regulatory and security standpoint. It is the practice of deciding who can have access to what data. Can some data be “outside facing” versus should some data sit behind multiple firewalls in a DMZ zone.

You will often work hand in hand with Legal and Security departments when figuring out how data governance practices will be implemented.