SQL: Load (Insert) data into a table

Here are the steps to add data to an existing table:

Let’s use the employee table created on the create table page: Create_table

To insert data into this table, you first need to know the data types of each of the columns in the table. In this example I am using MySQL. An easy way to see column data types is to right click on the table name > Send to SQL Editor > Create Statement

The results will appear in the query window. In this table we can see the columns are an integer (INT), a string (VARCHAR), and a date (DATE).

The syntax for inserting data into the table is as follows

insert into <table name> (column1, column2,...)
values (value1, value2,...)

In the example below, we load three rows, separating each row with a comma:
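Something along these lines would do it. The three rows below are made-up sample values (not from the original post), matching the Id, Emp_NM, and DOB columns defined earlier:

insert into Employee (Id, Emp_NM, DOB)
values (1, 'Maria Lopez', '1985-04-12'),
       (2, 'James Smith', '1990-09-30'),
       (3, 'Ken Suzuki', '1978-01-22');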

If we run this, we can then check the results with a SELECT statement.
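That check is as simple as:

select * from Employee;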

SQL: Create a Table

Create a table in MySQL and SQL Server

Create table <table name> (
    <column 1> data type,
    <column 2> data type );

As an example, let’s create an employee table with 3 columns, ID, Name, and Date of Birth

create table Employee (
    Id int,
    Emp_NM varchar(255),
    DOB date );

** Remember that most SQL databases are not case sensitive unless you explicitly configure them to be.

You may need to refresh your database explorer view, but after running this statement you should now have a new table.
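If you would rather confirm it from the query window than the explorer, a quick check looks like this (DESCRIBE is MySQL-specific; in SQL Server you can use sp_help instead):

describe Employee;

-- or simply pull back the (still empty) table
select * from Employee;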

If you need a reference for the data types available, check out this page: data types

SQL: Common Data Types in MySQL

Here is a table of the most commonly used data types in MySQL

Data Type – Description
CHAR(size) – Fixed-length string; unused spaces are padded out and take up storage. Size 0 to 255.
VARCHAR(size) – Variable-length string; unused space does not take up storage. Size 0 to 65,535.
MEDIUMTEXT – A string up to 16,777,215 characters long.
LONGTEXT – A string up to 4,294,967,295 characters long.
INT – Integer (whole number, no decimals).
DOUBLE(x, d) – Floating-point decimal number; x is the total number of digits, d is the number of digits after the decimal point.
BOOL or BOOLEAN – Binary choice; 0 = False and 1 = True.
DATE – Date, formatted "YYYY-MM-DD".
DATETIME – Date and time, formatted "YYYY-MM-DD HH:MM:SS".
TIME – Time, formatted "HH:MM:SS".
YEAR – Year in four-digit form (e.g. 1908, 1965, 2011).
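As a rough illustration of how several of these types fit together, here is a sample table definition (the table and column names below are made up for the example):

create table customer_demo (
   cust_id    INT,              -- whole numbers only
   status_cd  CHAR(2),          -- fixed length, always 2 characters
   cust_nm    VARCHAR(100),     -- variable length, up to 100 characters
   notes      MEDIUMTEXT,       -- long free-form text
   balance    DOUBLE(10, 2),    -- 10 digits total, 2 after the decimal
   is_active  BOOLEAN,          -- 0 = False, 1 = True
   signup_dt  DATETIME,         -- 'YYYY-MM-DD HH:MM:SS'
   birth_yr   YEAR );           -- four-digit year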

Data Jobs: What does a Data Architect do?

If experience has taught me anything, it is that while companies and organizations have gotten much better at collecting data, most of this data is still just stored away in unstructured data stores. So while the data is "technically" there, it is not much use to anyone trying to build a report or create a machine learning model. In order to get actual use out of all the data being stored, it needs to be organized into some sort of usable structure: enter the Data Architect.

Data Architect is a job title I have held on multiple occasions throughout my career. My best description when explaining the job to other people is that I was kind of like a data janitor. You see, everyone takes their data and dumps it into the storage closet. So you find some billing data on the second shelf, tucked away behind employee HR records. The floor is cluttered with server logs and six years' worth of expense reports stored as PDFs with completely useless names like er123.pdf.

As a data architect, it was my job to try to organize this mess and put the data into some sort of structure that lends itself to reporting or modeling. So data architects have to be well versed in data modeling, data storage, and data governance.

Data Modeling

ERD diagram showing an HR database design

Data modeling is basically taking raw data dumps and organizing them into structures that fit the needs of the company. It could involve creating an HR database like the one above, or creating a series of aggregated tables designed for reporting or dashboarding. It is the job of the data architect to best fit business needs to a data platform: be it a transactional database, a data warehouse, or perhaps a data lake.
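As a toy sketch (these table and column names are hypothetical, not taken from the diagram above), a tiny HR model might split employees and departments into separate tables so reports can join them cleanly:

create table department (
   dept_id   INT,
   dept_nm   VARCHAR(100) );

create table employee (
   emp_id    INT,
   emp_nm    VARCHAR(255),
   dept_id   INT,          -- points back to the department table
   hire_dt   DATE );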

Data Storage

Data architects also need to address data storage. While I often defer to the server crew in the IT department, as a data architect I do advise on how and where to store data. Cloud infrastructure and cheaper, faster disk storage have made a lot of data storage decisions easier, but it is still good to have a working understanding of storage platforms.

Data Governance

Data governance is all about how data is managed from a regulatory and security standpoint. It is the practice of deciding who can have access to what data: can some data be "outside facing," or should it sit behind multiple firewalls in a DMZ?

You will often work hand in hand with Legal and Security departments when figuring out how data governance practices will be implemented.

Where to find real data sets for data science training

Practice makes perfect, but finding good data sets to practice with can be a pain. Here is my current list of the best places to find practice data sets:

SSIS: Lesson 1 – Export to CSV

SSIS Tutorial Lesson 1

Exporting Data to a CSV file

One of the main purposes of SSIS is moving data from one location to another with ease. In this lesson we are going to export data from our SQL Server database to a CSV file.

If you want to follow along, you can use the following code to create the data table I will be working with.  I loaded this into a database I created for practice called Sandbox.

CREATE TABLE employee_id (
    emp_nm nvarchar(30) not null,
    emp_id nvarchar(8),
    b_emp_id nvarchar(8),
    PRIMARY KEY (emp_nm) );

INSERT INTO employee_id
    (emp_nm, emp_id)
VALUES
    ('Bob', 'A1234567'),
    ('Lisa', 'A1234568');

INSERT INTO employee_id
    (emp_nm, b_emp_id)
VALUES
    ('Priyanka', 'B1234567');

We will start by opening the Training_Project we created in the intro lesson (SSIS Tutorial: Introduction) and creating a new package.


I renamed my new package Lesson_1


Now we need to make a new connection so that SSIS knows what data source we will be communicating with. To do so, go to Solution Explorer, right click on Connection Manager and select New Connection Manager.


Since we will be connecting to SQL Server, select OLE DB from the list.


Since there are no existing connections to pick from, choose New


Okay, starting from the top, select Native OLE DB\SQL Server Native Client 11.0 (note: you might have 10.0 – that will work as well).

Since my SQL Server is installed locally on my machine, I am using Localhost as the Server Name; otherwise, provide your server name here.

Select your database from the drop down, again my database is Sandbox

Finally, hit Test Connection, you should get a connection successful message box.

Click Okay


You’ll see your new connection in the box, now click Okay


Now at the bottom of your Design Window, you’ll see your new connection in the box labeled Connection Managers


So next, we need to go to the SSIS Toolbox and drag a Data Flow Task over to the designer


Once in the designer, click on the Data Flow Task box, this will bring you to the Data Flow window


Data Flow is used whenever you need to move data between disparate systems. Since we will be moving data from SQL Server to a CSV file, we need to use a Data Flow Task

You should note that the SSIS Toolbox has changed, offering up new Data Flow related tools.

Scroll down to Other Sources and drag OLE DB Source to the design box


Double click on the new object


Make sure the Connection Manager you just created is selected in the top drop down. Leave Data access mode set to Table or view and select dbo.employee_id as your table.


If you click Preview in the bottom left, you will get a pop up of the data in the table


If you click Columns in the upper left, you will see the columns that will be exported from the table. You can change the Output Column names if you want to use a different name in your output file.

We will skip over Error Output for now, as that is easily an entire lesson all its own.


Now go back to the SSIS toolbox and under Other Destinations, click on Flat File Destination and drag it over to the design window.


Drag the blue arrow from the OLE DB Source box down to the Flat File Destination box so the two are connected.

Now click on Flat File Destination

Since we don’t have a current Flat File Connection, we will need to click on New to create one.


Select Delimited and click OK


Find the folder you want the file to end up in, select CSV files from the bottom right drop down, and name your file. Click OK. (Note: use the name of a file that does not currently exist – this step will create the file.)


Check the box: Column names in first data row


If you click on Columns in the upper left, you will see the names of your header columns.

Click Okay to go back to the Flat File Destination Window


If you click on Mappings, you will see you have three columns from your source going to three columns in what will be the new CSV file. In future lessons I will show how you can use this to match up differently named columns.

Click Okay to return to the design window


Go to Solution Explorer, right click on the package and click Execute Package


You should get green check marks to indicate success. This is a very small package, so they probably turned green instantly. In larger jobs, the green check will first be a yellow circle to indicate progress. It turns green when that step is complete.

Note that the connecting line between Source and Destination gives you a readout of how many rows were processed.


Go to the folder you chose as your destination, you will see your new CSV file there


Open the file, you will see your data has been exported into your CSV file


Python: Naive Bayes

Naive Bayes is a supervised machine learning classification algorithm based on Bayes' Theorem. If you don't remember Bayes' Theorem, here it is:

P(A|B) = ( P(B|A) * P(A) ) / P(B)

Seriously though, if you need a refresher, I have a lesson on it here: Bayes’ Theorem

The naive part comes from the idea that the probability for each column is computed on its own – the columns are "naive" to what the other columns contain.

You can download the data file here: logi2

Import the Data

import pandas as pd
df = pd.read_excel(r"C:\Users\Benjamin\Documents\logi2.xlsx")   # raw string so the backslashes are not treated as escapes
df.head()


Let’s look at the data. We have 3 columns – Score, ExtraCir, Accepted. These represent:

  • Score – Student Test Score
  • ExtraCir – Was the student in an extracurricular activity
  • Accepted – Was the Student Accepted

Now the Accepted column is our result column – or the column we are trying to predict. Having a result in your data set makes this a supervised machine learning algorithm.

Split the Data

Next, split the data into inputs (Score and ExtraCir) and results (Accepted).

y = df.pop('Accepted')
X = df

y.head()

X.head()


Fit Naive Bayes

Lucky for us, scikit-learn has a built-in Naive Bayes algorithm (MultinomialNB).

Import MultinomialNB and fit our split columns to it (X,y)

from sklearn.naive_bayes import MultinomialNB
classifier = MultinomialNB()
classifier.fit(X,y)


Run some predictions

Let's run the predictions below. The results come back as 1 (Accepted) or 0 (Not Accepted).

#--score of 1200, ExtraCir = 1
print(classifier.predict([[1200, 1]]))

#--score of 1000, ExtraCir = 0
print(classifier.predict([[1000, 0]]))


The Code

import pandas as pd
df = pd.read_excel(r"C:\Users\Benjamin\Documents\logi2.xlsx")
df.head()

y = df.pop('Accepted')
X = df

y.head()
X.head()

from sklearn.naive_bayes import MultinomialNB
classifier = MultinomialNB()
classifier.fit(X,y)

#--score of 1200, ExtraCir = 1
print(classifier.predict([[1200, 1]]))

#--score of 1000, ExtraCir = 0
print(classifier.predict([[1000, 0]]))

 

R: Filter Dataframes

When working with large data sets, often you only want to see a portion of the data. If you are running a query on the graduating class of a high school, you don't want your results clogged up with hundreds of freshmen, sophomores, and juniors. To accomplish this, we use filtering. Filtering a dataframe is a very useful skill to have, and luckily R makes the task relatively simple.

Download the data set here: boxplot2

Import the data

df <- read.csv(file.choose())
head(df)

Let's take a look at it. We have a data set with 4 columns: workorderNo – work order number, dif – change code since the last work order reading, quart – quarter (of the calendar year), and Employee – employee name.


The first thing we need to know when dealing with dataframes is how to use the $.

In R, a "$" placed between a dataframe name and a column name signifies which column you want to work with.

head(df$Employee)
head(df$quart)


Notice the line that says Levels: after the Employee list. This lets you know the column is a factor. You can see all of the unique elements (levels) in the column by using the levels() command.

levels(df$Employee)


Taking another look at our data, the quart column consists of 3 numbers (1, 2, 3) representing quarters of the year. If you think back to your statistics classes, these numbers are ordinal (meaning they only provide a ranking – you would never use them in a calculation; quart 2 + quart 3 has no real meaning). So we should make this column a factor as well.

R makes that easy to do.

df$quart <- factor(df$quart)
head(df$quart)


Okay, now onto filtering. Let’s see how easy it can be.

First we assign our filtering logic to a variable, then we place our variable inside the dataframe’s square brackets []

filter <- df$quart==2
df[filter,]


Or you can just put your logic directly in the dataframe’s square brackets[]

df[df$Employee=="Sally",]

Now we just see Sally


We can use the "&" (and) operator to filter by Sally and quarter 1

df[df$Employee=="Sally" & df$quart == 1,]

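Two related patterns worth knowing; the second employee name below is just a placeholder, so swap in whatever names actually appear in your Employee column:

# filter on more than one employee at once with %in%
df[df$Employee %in% c("Sally", "Bob"), ]

# the same Sally / quarter 1 filter, written with subset()
subset(df, Employee == "Sally" & quart == 1)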

 

Python: Co-variance and Correlation

In this lesson, I will be using data from a CSV file. You can download the file here: heightWeight

If you do not know how to import CSV files into Python, check out my lesson on it first: Python: Working with CSV Files

The Data

The data set includes 20 heights (inches) and weights(pounds). Given what you already know, you could tell me the average height and average weight. You could tell me medians, variances and standard deviations.


But all of those measurements are only concerned with a single variable. What if I want to see how height interacts with weight? Does weight increase as height increases?

** Note: while there are plenty of short heavy people and overly skinny tall people, when you look at the population at large, taller people tend to weigh more than shorter people. This kind of generalization is very common, as it gives you a big-picture view and is not easily skewed by outliers.

Populate Python with Data

The first thing we are going to focus on is co-variance. Let’s start by getting our data in Python.
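The original post showed this step as a screenshot; here is a minimal sketch of what it might have looked like, using the csv module (the file path and column order are assumptions):

import csv

hgt = []   # heights, read in as strings
wgt = []   # weights, read in as strings

with open(r"C:\Users\Benjamin\Documents\heightWeight.csv") as f:
    reader = csv.reader(f)
    next(reader)               # skip the header row
    for row in reader:
        hgt.append(row[0])     # assuming height is the first column
        wgt.append(row[1])     # assuming weight is the second column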


Now there is a small problem. Our lists are filled with strings, not numbers. We can’t do calculations on strings.

We can fix this by converting the values using int(). Below I create two new lists (height and weight), set up a for loop counting up to the number of values in our lists – range(len(hgt)) – and then fill the new lists using lst.append(int(value)).
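A sketch of that conversion, following the steps described above:

height = []
weight = []

# count up to the number of values in our lists and convert each string to an int
for i in range(len(hgt)):
    height.append(int(hgt[i]))
    weight.append(int(wgt[i]))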


**Now I know I could have resolved this in fewer steps, but this is a tutorial, so I want to provide more of a walk through.

Co-variance

Co-variance tells us how much two variables disperse from the mean together. There are multiple ways to find co-variance, but for me, using a dot product approach has always been the simplest.

For those unfamiliar with the dot product: imagine I had 2 lists (a, b) with 4 elements each. The dot product would be calculated as: a[0]*b[0] + a[1]*b[1] + a[2]*b[2] + a[3]*b[3]
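In code, that four-element dot product is simply (the numbers are made up, purely to illustrate):

a = [1, 2, 3, 4]
b = [5, 6, 7, 8]

dot = a[0]*b[0] + a[1]*b[1] + a[2]*b[2] + a[3]*b[3]
print(dot)    # 5 + 12 + 21 + 32 = 70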

Here is how it works:

If I take the individual variance (deviation from the mean) of height[0] and weight[0] and they are both positive or both negative, the product will be positive – the two variables are moving in the same direction.

If one is positive and one is negative, the product will be negative – the variables are moving in different directions.

Once you add them all up, a positive total means that overall your variables have a positive co-variance (if a goes up, b goes up; if a goes down, b goes down).

If the final result is negative, you have negative co-variance (if a goes up, b goes down; if a goes down, b goes up).

If your final answer is 0, your variables have no measurable interaction.

Okay, let’s program this thing

** We will be using numpy's mean() (mean), dot() (dot product), and corrcoef() (correlation coefficient) methods.

First we need to find the individual variances from mean for each list

I create a function called ind_var that uses a list comprehension to subtract the mean from each element in the list.


Now, let’s change out the print statement for a return, because we are going to be using this function inside another function.
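A sketch of that function with the return in place (mirroring the description above):

import numpy as np

def ind_var(lst):
    # subtract the mean from every element of the list
    m = np.mean(lst)
    return [x - m for x in lst]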


Co-variance function

Now let's build the co-variance function. Here we take the dot product of the individual variances of height and weight, then divide the result by N – 1 (the number of elements minus 1; the minus 1 is because we are dealing with sample data, not a population).
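A sketch of the co-variance function built on top of ind_var():

def covariance(a, b):
    # dot product of the two deviation lists, divided by N - 1 (sample data)
    return np.dot(ind_var(a), ind_var(b)) / (len(a) - 1)

print(covariance(height, weight))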


So what we are doing:

  • Take the first height (68 inches) and subtract the mean (66.8) from it (1.2)
  • Take the first weight (165 lbs) and subtract the mean(165.8)  from it (-0.8)
  • We then multiply these values together (-0.96)
  • We repeat this for each element
  • Add the elements up.
  • Divide by 19 (20-1)
  • 144.75789 is our answer

Our result is 144.75789  – a positive co-variance. So when height goes up – weight goes up, when height goes down – weight goes down.

But what does 144 mean? Not much, unfortunately. The co-variance doesn't convey any information about what units we are working with. 144 miles is a long way; 144 cm, not so much.

Correlation

So we have another measurement known as correlation. A very basic correlation equation divides the co-variance by the standard deviations of both height and weight. The result of a correlation is between -1 and 1, with -1 being perfect anti-correlation and 1 being perfect correlation. 0 means no correlation exists.

With my equation I get 1.028 – more than one. This equation is simplistic and prone to some error.
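A sketch of that basic version; the overshoot past 1 most likely comes from mixing the sample co-variance (divided by N – 1) with numpy's population standard deviation (divided by N), which inflates the result by a factor of N/(N – 1):

def correlation(a, b):
    # divide the co-variance by the product of the two standard deviations
    return covariance(a, b) / (np.std(a) * np.std(b))

print(correlation(height, weight))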


numpy's corrcoef() is more accurate. It shows us a correlation matrix. Ignore the 1's – they are just each variable's correlation with itself. Instead, look at the other number: 0.97739. That is about as close to one as you will ever get in reality. So even if my equation is off, it isn't too far off.
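Calling it takes one line:

print(np.corrcoef(height, weight))    # returns a 2 x 2 correlation matrix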


Now, just to humor me, create another list to play with.


Let’s run this against height in my correlation function


Run these values through the more accurate corrcoef(). This will show that my formula is still a bit off, but for the most part it is not all that bad.
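If you want to recreate that experiment, any third list of 20 numbers will do; the one below is made up and is not the list from the original screenshots:

# a made-up third list to correlate against height
shoe_size = [9, 10, 8, 11, 9, 10, 12, 8, 9, 11,
             10, 9, 8, 12, 11, 10, 9, 10, 11, 9]

print(correlation(height, shoe_size))    # home-made formula
print(np.corrcoef(height, shoe_size))    # numpy's version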



If you enjoyed this lesson, click LIKE below, or even better, leave me a COMMENT. 

Follow this link for more Python content: Python

 

8 Software Packages You Should Learn for Data Science

1. Excel

Before everyone starts booing me, hear me out: Excel is a highly underrated tool. I blame it on over-familiarity. We have all seen our computer-illiterate co-worker using Excel to make up a phone list for their kid's little league team. It is hard to imagine that same tool can be a powerhouse when it comes to data.

Take some time to learn the data cleaning and manipulation methods in Excel. Download the following free add-ins: Power Query, Power Pivot, and Solver. These tools offer a lot of the same functionality as a $30K SAS license, admittedly on a smaller scale.

2. R

R is a statistical programming language. While far from user friendly – you are going to have to learn to program here – you will be hard pressed to find something that R cannot do. Even better, it is free. As part of the open source community, R is constantly being updated, and new libraries are being created to cover almost anything you can imagine.

3. Python

Python is another free programming language that is quite popular in the data science world. Unlike R, Python is not specifically designed for data and statistics. It is a full-fledged object-oriented programming language that has plenty of uses in the computer world.

If you spend even a little time around Data Science forums, you will see the battle of R vs Python play out over and over again. My advice – take a little time to learn the fundamentals of both.

4. MS SQL Server

I chose MS SQL Server over Oracle for 2 reasons: 1) MS SQL Server is 100 times easier to install and configure for most computer users. 2) While Oracle's PL/SQL is definitely a more robust language, Microsoft's T-SQL is easier to pick up.

My one caveat here would be if your goal is to become a database designer or database engineer. In that case, I would suggest learning Oracle first, as all the fundamentals you develop there will translate easily to MS SQL Server.

Another great advantage of MS SQL Server is the developer’s edition. For around $70, you can purchase a developer’s license which gives you the whole suite of tools including: SSIS and SSRS.

5. SSIS

SSIS (SQL Server Integration Services) is a robust ETL (Extract, Transform, Load) tool built around the Microsoft Visual Studio platform. It comes included with the developer's edition and provides a graphical interface for building ETL processes.

6. Tableau

Until further notice, Tableau is the reigning king of data visualization. Just download a copy of the free Tableau Public and you will wonder why you spent all those years fighting with Excel charts.

While Tableau’s analytics tools leave a lot to be desired, the time it will save you in data exploration will have you singing its praises.

7. Qlik

Admittedly, Qlik is positioned more as an end-user BI tool, but it provides robust data modeling capabilities. Also, Qlik's interactive dashboards are not only easy to construct, but they leave the competition in the dust when it comes to ease of use.

8. Hadoop

It's Big Data's world, we just live here. If you want to be a data professional in this century, you are going to need to become familiar with Big Data platforms. My suggestion is to download the Hortonworks Sandbox edition of Hadoop. It is free, and they provide hours' worth of tutorials. Time spent learning Pig and Hive (Big Data scripting and query languages) will be well worth the effort.

What about SAS, SPSS, Cognos, blah, blah, blah…

While these packages do have a dominant position in the world of analytics, they are not very generous about offering free or low-cost versions for people trying to get into the profession and learn. I wanted to fill my list with software that can be obtained for little or no cost.

If you are currently a college student, you are in a better position. Check with your college for software partnerships. Also, check with software vendors to see if they have student editions. As of this writing, I know Tableau, SAS, and Statistica offer student versions that you may want to look into.