SSIS: Lesson 1 – Export to CSV

SSIS Tutorial Lesson 1

Exporting Data to a CSV file

One of the main purposes of SSIS is moving data from one location to another with ease. In this lesson we are going to export data from our SQL Server database to a CSV file.

If you want to follow along, you can use the following code to create the data table I will be working with.  I loaded this into a database I created for practice called Sandbox.

CREATE TABLE employee_id (
    emp_nm nvarchar(30) NOT NULL,
    emp_id nvarchar(8),
    b_emp_id nvarchar(8),
    PRIMARY KEY (emp_nm)
);

INSERT INTO employee_id
    (emp_nm, emp_id)
VALUES
    ('Bob', 'A1234567'),
    ('Lisa', 'A1234568');

INSERT INTO employee_id
    (emp_nm, b_emp_id)
VALUES
    ('Priyanka', 'B1234567');

We will start by opening the Training_Project we created in the intro lesson (SSIS Tutorial: Introduction) and creating a new package.

2018-03-08_10-00-21

I renamed my new package Lesson_1

2018-03-08_10-01-27

Now we need to make a new connection so that SSIS knows what data source it will be communicating with. To do so, go to Solution Explorer, right click on Connection Managers and select New Connection Manager.

2018-03-08_10-06-41

Since we will be connecting to a SQL Server, select OLE DB from the list below

2018-03-08_10-07-17

Since there are no existing connections to pick from, choose New

2018-03-08_10-07-35

Okay, starting from the top, select Native OLE DB\SQL Server Native Client 11.0 (note: you might have 10.0 – that will work as well).

Since my SQL Server is installed locally on my machine, I am using Localhost as the Server Name; otherwise, provide your server name here.

Select your database from the drop down; again, my database is Sandbox.

Finally, hit Test Connection, you should get a connection successful message box.

Click Okay

2018-03-08_10-10-11.png

You’ll see your new connection in the box, now click Okay

2018-03-08_10-10-55.png

Now at the bottom of your Design Window, you’ll see your new connection in the box labeled Connection Managers

2018-03-08_10-11-15

So next, we need to go to the SSIS Toolbox and drag a Data Flow Task over to the designer

2018-03-08_10-14-02

Once in the designer, click on the Data Flow Task box; this will bring you to the Data Flow window.

2018-03-08_10-14-21

Data Flow is used whenever you need to move data between disparate systems. Since we will be moving data from SQL Server to a CSV file, we need to use a Data Flow Task

You should note that the SSIS Toolbox has changed, offering up new Data Flow related tools.

Scroll down to Other Sources and drag OLE DB Source to the design box

2018-03-08_10-14-51

Double click on the new object

2018-03-08_10-15-15

Make sure the Connection Manager you just created is selected in the top drop down. Leave Data access mode at Table or view and select dbo.employee_id as your table.

2018-03-08_10-15-52

If you click Preview in the bottom left, you will get a pop up of the data in the table

2018-03-08_10-16-08

If you click Columns in the upper left, you will see the columns that will be exported from the table. You can change the Output Column names if you want to use a different name in your output file.

We will skip over Error Output for now, as that is easily an entire lesson all its own.

2018-03-08_10-16-28

Now go back to the SSIS toolbox and under Other Destinations, click on Flat File Destination and drag it over to the design window.

2018-03-08_10-16-54.png

Drag the blue arrow from the OLE DB source box to the Flat File Destination Box

2018-03-08_10-17-18

It should look like this when done

2018-03-08_10-17-39

Now click on Flat File Destination

Since we don’t have a current Flat File Connection, we will need to click on New to create one.

2018-03-08_10-18-04

Select Delimited and click OK

2018-03-08_10-18-42.png

Find the folder you want the file to end up in, select CSV files from the bottom right drop down, and name your file. Click OK. (Note: use the name of a file that does not currently exist – this step will create the file.)

2018-03-08_10-22-56

Check the box: Column names in first data row

2018-03-08_10-23-41.png

If you click on Columns in the upper left, you will see the names of your header columns.

Click Okay to go back to the Flat File Destination Window

2018-03-08_10-24-14.png

If you click on Mappings, you will see you have three columns from your source going to three columns in what will be the new CSV file. In future lessons I will show how you can use this to match up differently named columns.

Click Okay to return to the design window

2018-03-08_10-24-35

Go to Solution Explorer, right click on the package and click Execute Package

2018-03-08_10-25-11.png

You should get green check marks to indicate success. This is a very small package, so they probably turned green instantly. In larger jobs, the green check will first be a yellow circle to indicate progress. It turns green when that step is complete.

Note that on the connecting line between Source and Destination you get a readout of how many rows were processed.

2018-03-08_10-25-34

Go to the folder you chose as your destination and you will see your new CSV file there.

2018-03-08_10-26-36.png

Open the file and you will see that your data has been exported into your CSV file.

2018-03-08_10-27-35.png


Python: Naive Bayes

Naive Bayes is a supervised machine learning classification algorithm based on Bayes' Theorem. If you don't remember Bayes' Theorem, here it is:

P(A|B) = P(B|A) × P(A) / P(B)

Seriously though, if you need a refresher, I have a lesson on it here: Bayes’ Theorem

The naive part comes from the assumption that the probability of each column is computed on its own. The columns are "naive" to what the other columns contain.

You can download the data file here: logi2

Import the Data

import pandas as pd
df = pd.read_excel(r"C:\Users\Benjamin\Documents\logi2.xlsx")
df.head()

nb.jpg

Let’s look at the data. We have 3 columns – Score, ExtraCir, Accepted. These represent:

  • Score – Student Test Score
  • ExtraCir – Was the Student in an Extracurricular Activity
  • Accepted – Was the Student Accepted

Now the Accepted column is our result column – or the column we are trying to predict. Having a result in your data set makes this a supervised machine learning algorithm.

Split the Data

Next, split the data into inputs (Score and ExtraCir) and results (Accepted).

y = df.pop('Accepted')
X = df

y.head()

X.head()

nb1.jpg

Fit Naive Bayes

Lucky for us, scikit-learn has a built-in Naive Bayes algorithm – MultinomialNB.

Import MultinomialNB and fit our split columns to it (X,y)

from sklearn.naive_bayes import MultinomialNB
classifier = MultinomialNB()
classifier.fit(X,y)

nb2.jpg

Run some predictions

Let's run the predictions below. The results show 1 (Accepted) or 0 (Not Accepted).

#--score of 1200, ExtraCir = 1
print(classifier.predict([[1200, 1]]))

#--score of 1000, ExtraCir = 0
print(classifier.predict([[1000, 0]]))

nb3

The Code

import pandas as pd
df = pd.read_excel(r"C:\Users\Benjamin\Documents\logi2.xlsx")
df.head()

y = df.pop('Accepted')
X = df

y.head()
X.head()

from sklearn.naive_bayes import MultinomialNB
classifier = MultinomialNB()
classifier.fit(X,y)

#--score of 1200, ExtraCir = 1
print(classifier.predict([[1200, 1]]))

#--score of 1000, ExtraCir = 0
print(classifier.predict([[1000, 0]]))

 

R: Filter Dataframes

When working with large data sets, often you only want to see a portion of the data. If you are running a query on the graduating class of a high school, you don’t want your results to be clogged up with hundreds of freshmen, sophomores, juniors. To accomplish this, we use filtering. Filtering a dataframe is a very useful skill to have in R. Luckily R makes this task relatively simple.

Download the data set here: boxplot2

Import the data

df <- read.csv(file.choose())
head(df)

Let's take a look at it. We have a data set with 4 columns:

  • workorderNo – Work Order Number
  • dif – change code since the last work order reading
  • quart – quarter (referring to the calendar year)
  • Employee – employee name

dataframe1

The first thing we need to know when dealing with dataframes is how to use the $

In R, the "$" placed between a dataframe name and a column name signifies which column you want to work with.

head(df$Employee)
head(df$quart)

dataframeFilter1.jpg

Notice the line that says Levels: after the Employee list. This lets you know this column is a factor. You can see all the unique elements (factors) in your list by using the levels() command.

levels(df$Employee)

dataframeFilter2

Taking another look at our data, the quart column consists of 3 numbers (1,2,3) representing quarters of the year. If you think back to your statistics classes, these numbers are ordinal (meaning they only provide a ranking – you will never use them in a calculation – quart 2 + quart 3 has no real meaning.) So we should make this column a factor as well.

R makes that easy to do.

df$quart <- factor(df$quart)
head(df$quart)

dataframeFilter2

Okay, now onto filtering. Let’s see how easy it can be.

First we assign our filtering logic to a variable, then we place our variable inside the dataframe’s square brackets []

filter <- df$quart==2
df[filter,]

dataframeFilter3.jpg

Or you can just put your logic directly in the dataframe’s square brackets[]

df[df$Employee=="Sally",]

Now we just see Sally

dataframeFilter4.jpg

We can use the and operator "&" to filter by Sally and quart 1

df[df$Employee=="Sally" & df$quart == 1,]

dataframeFilter5.jpg

 

Python: Co-variance and Correlation

In this lesson, I will be using data from a CSV file. You can download the file here: heightWeight

If you do not know how to import CSV files into Python, check out my lesson on it first: Python: Working with CSV Files

The Data

The data set includes 20 heights (inches) and weights (pounds). Given what you already know, you could tell me the average height and average weight. You could tell me medians, variances and standard deviations.

correl

But all of those measurements are only concerned with a single variable. What if I want to see how height interacts with weight? Does weight increase as height increases?

**Note: while there are plenty of short heavy people and tall skinny people, when you look at the population at large, taller people will tend to weigh more than shorter people. This kind of generalization is very common, as it gives you a big-picture view and is not easily skewed by outliers.

Populate Python with Data

The first thing we are going to focus on is co-variance. Let’s start by getting our data in Python.

correl1.jpg
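Here is a minimal sketch of that step, assuming the file is saved as heightWeight.csv with a header row and the height and weight columns in that order (the exact code in the screenshot may differ):

import csv

hgt = []
wgt = []

# -- read the CSV into two lists; csv.reader hands back strings
with open("heightWeight.csv", "r") as f:
    reader = csv.reader(f)
    next(reader)              # skip the header row
    for row in reader:
        hgt.append(row[0])
        wgt.append(row[1])

print(hgt[:5])
print(wgt[:5])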

Now there is a small problem. Our lists are filled with strings, not numbers. We can’t do calculations on strings.

We can fix this by converting the values using int(). Below I created 2 new lists (height and weight), created a for loop counting up to the number of values in our lists: range(len(hgt)). Then I filled the new lists using lst.append(int(value)).

correl2.jpg
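A sketch of that conversion, continuing from the lists above (the screenshot's code may differ slightly):

height = []
weight = []

# -- convert each string value to an integer
for i in range(len(hgt)):
    height.append(int(hgt[i]))
    weight.append(int(wgt[i]))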

**Now I know I could have resolved this in fewer steps, but this is a tutorial, so I want to provide more of a walk through.

Co-variance

Co-variance tells us how much two variables disperse from the mean together. There are multiple ways to find co-variance, but for me, using a dot product approach has always been the simplest.

For those unfamiliar with the dot product: imagine I had 2 lists (a, b) with 4 elements each. The dot product is calculated like so: a[0]*b[0] + a[1]*b[1] + a[2]*b[2] + a[3]*b[3]

Here is how it works:

If I take the individual variances from the mean for height[0] and weight[0] and they are both positive or both negative, the product will be positive – both variables are moving in the same direction.

One positive and one negative will give a negative product – the variables are moving in different directions.

Once you add them all up, a positive total means that overall, your variables have a positive co-variance (if a goes up, b goes up – if a goes down, b goes down).

If the final result is negative, you have negative co-variance (if a goes up, b goes down – if a goes down, b goes up).

If your final answer is 0, your variables have no measurable interaction.

Okay, let’s program this thing

** we will be using numpy's mean() (mean), dot() (dot product), and corrcoef() (correlation coefficient) methods

First we need to find the individual variances from the mean for each list.

I create a function called ind_var that uses a list comprehension to subtract the mean from each element in the list.

correl3.jpg

Now, let’s change out the print statement for a return, because we are going to be using this function inside another function.

correl4.jpg
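A sketch of that function with the return in place, assuming numpy is imported as np (the screenshot's version may differ):

import numpy as np

# -- each element's variance (deviation) from the list's mean
def ind_var(lst):
    m = np.mean(lst)
    return [x - m for x in lst]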

Co-variance function

Now let's build the co-variance function. Here we are taking the dot product of the individual variances of height and weight. We then divide the result by N-1 (the number of elements minus 1; the minus 1 is due to the fact that we are dealing with sample data, not a population).

correl6
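A sketch of that co-variance function, building on ind_var above (the screenshot's version may differ):

# -- co-variance: dot product of the individual variances, divided by N-1
def covariance(a, b):
    return np.dot(ind_var(a), ind_var(b)) / (len(a) - 1)

print(covariance(height, weight))   # about 144.76 on this data set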

So what we are doing:

  • Take the first height (68 inches) and subtract the mean (66.8) from it (1.2)
  • Take the first weight (165 lbs) and subtract the mean(165.8)  from it (-0.8)
  • We then multiply these values together (-0.96)
  • We repeat this for each element
  • Add the elements up.
  • Divide by 19 (20-1)
  • 144.75789 is our answer

Our result is 144.75789  – a positive co-variance. So when height goes up – weight goes up, when height goes down – weight goes down.

But what does 144 mean? Not much unfortunately. The co-variance doesn’t relate any information as to what units we are working with. 144 miles is a long way, 144 cm not so much.

Correlation

So we have another measurement known as correlation. A very basic correlation equation divides the co-variance by the standard deviations of both height and weight. The result of a correlation is between -1 and 1, with -1 being perfect anti-correlation and 1 being perfect correlation. 0 means no correlation exists.

With my equation, I get 1.028 – more than one. This equation is simplistic and prone to some error.

correl7.jpg
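A sketch of what such a simple correlation function could look like, continuing from the co-variance function above (this is a guess at the screenshot's approach, not the exact code):

# -- simple correlation: co-variance divided by the two standard deviations
def correlation(a, b):
    return covariance(a, b) / (np.std(a) * np.std(b))

print(correlation(height, weight))

If the function looks like this, the small overshoot past 1 likely comes from mixing a sample co-variance (divided by N-1) with numpy's default population standard deviation (divided by N).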

numpy's corrcoef() is more accurate. It shows us a correlation matrix. Ignore the 1's – they just show that each variable is perfectly correlated with itself. Instead look at the other number: 0.97739. That is about as close to one as you will ever get in reality. So even if my equation is off, it isn't too far off.

correl9
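For reference, the call is simply:

print(np.corrcoef(height, weight))   # 2x2 correlation matrix; the off-diagonal value is the correlation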

Now just to humor me. Create another list to play with.

correl10

Let’s run this against height in my correlation function

correl11.jpg

Run these values through the more accurate corrcoef(). This will show my formula is still a bit off, but for the most part, it is not all that bad.

correl8


If you enjoyed this lesson, click LIKE below, or even better, leave me a COMMENT. 

Follow this link for more Python content: Python

 

8 Software Packages You Should Learn for Data Science

1. Excel

Before everyone starts booing me: Excel is a highly underrated tool. I blame it on overfamiliarity. We have all seen our computer-illiterate co-worker using Excel to make up a phone list for their kid's little league. It is hard to imagine that same tool can be a powerhouse when it comes to data.

Take some time to learn data cleaning and manipulation methods in Excel. Download the following free add-ins: Power Query, Power Pivot, and Solver. These tools offer a lot of the same functionality as a $30K license to SAS, admittedly on a smaller scale.

2. R

R is a statistical programming package. While far from user friendly – you are going to have to learn to program here – you will be hard pressed to find something that R cannot do. Even better, it is free. As part of the open source community, R is constantly being updated and libraries are being created to cover almost anything you can imagine.

3. Python

Python is another free programming language that is quite popular in the data science world. Unlike R, Python is not specifically designed for data and statistics. It is a full-fledged object-oriented programming language that has plenty of uses in the computer world.

If you spend even a little time around Data Science forums, you will see the battle of R vs Python play out over and over again. My advice – take a little time to learn the fundamentals of both.

4. MS SQL Server

I chose MS SQL Server over Oracle for 2 reasons: 1) MS SQL Server is 100 times easier to install and configure for most computer users, and 2) while Oracle's PL/SQL is definitely a more robust language, Microsoft's T-SQL is easier to pick up.

My one caveat here would be if your desire is to become a database designer or database engineer. In that case, I would suggest learning Oracle first, as all the fundamentals you develop there will translate easily to MS SQL Server.

Another great advantage of MS SQL Server is the Developer Edition. For around $70, you can purchase a developer's license, which gives you the whole suite of tools, including SSIS and SSRS.

5. SSIS

SSIS (SQL Server Integration Services) is a robust ETL (Extract, Transform, Load) tool built around the Microsoft Visual Studio platform. It comes included with the Developer Edition and provides a graphical interface for building ETL processes.

6. Tableau

Until further notice, Tableau is the reigning king of data visualization. Just download a copy of the free Tableau Public and you will wonder why you spent all that effort fighting with Excel charts all these years.

While Tableau’s analytics tools leave a lot to be desired, the time it will save you in data exploration will have you singing its praises.

7. Qlik

Admittedly, Qlik is focused more on being an end-user BI tool, but it provides robust data modeling capabilities. Also, Qlik's interactive dashboards are not only easy to construct, but leave the competition in the dust when it comes to ease of use.

8. Hadoop

It’s Big Data’s world, we just live here. If you want to be a data professional in this century, you are going to need to become familiar with Big Data platforms. My suggestion is to download the Hortonworks Sandbox edition of Hadoop. It is free and they provide hours worth of tutorials. Time spent learning Pig Script and Hive Script (Big Data query languages) will be well worth your effort.

What about SAS, SPSS, Cognos, blah, blah, blah…

While these packages do have a dominant position in the world of analytics, they are not very nice about offering free or low-cost versions for people wanting to get into the profession and learn. I wanted to fill my list with software that can be obtained for little or no cost.

If you are currently a college student, you are in a better position. Check with your college for software partnerships. Also, check with software vendors to see if they have student editions. As of this writing, I know Tableau, SAS, and Statistica offer student versions that you may want to look into.

Python: Numpy Part II

Beyond numpy’s usefulness in creating arrays and matrices, numpy also provides a great suite of math functions that – for anyone with any programming background – are fairly intuitive.

Here are some examples:

np.pi returns pi and np.sqrt() takes the square root of whatever value you feed it.

pyNp2.jpg
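A quick sketch of those calls, assuming numpy is imported as np:

import numpy as np

print(np.pi)          # 3.141592653589793
print(np.sqrt(16))    # 4.0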

Trig Functions

numpy handles most common trigonometry functions.

pyNp2_1
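For example (arguments are in radians):

print(np.sin(np.pi / 2))   # 1.0
print(np.cos(0))           # 1.0
print(np.tan(np.pi / 4))   # roughly 1.0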

Stats

Numpy handles many statistics functions.

Below we have mean and median. Unfortunately, just like in R, there is no mode command, but we can fake it using set().

pyNp2_2

using set to fake mode

pyNp2_4
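A sketch of those statistics, including one way to fake a mode with set() (the screenshot's approach may differ; the sample data here is made up):

data = [1, 2, 2, 3, 3, 3, 4, 5]

print(np.mean(data))     # 2.875
print(np.median(data))   # 3.0

# -- fake a mode: the unique value with the highest count
print(max(set(data), key=data.count))   # 3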

Numpy can also be used to find range, variance, and standard deviation.

pyNp2_5
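For example, continuing with the list above (np.ptp – peak to peak – is one way to get the range; the screenshot may simply use max minus min):

print(np.ptp(data))   # range (max minus min)
print(np.var(data))   # variance
print(np.std(data))   # standard deviation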

Rounding

Numpy has rounding features for dealing with decimals. It also has floor() and ceil() functions that bring the number down to the “floor” or up to the “ceiling”

pyNp2_6
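For example:

print(np.round(3.456, 2))   # 3.46
print(np.floor(3.9))        # 3.0
print(np.ceil(3.1))         # 4.0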

Use in creating graphs

use np.sin()

pyNp2_7.jpg

use np.log()

pyNp2_8

you can even put the two together

pyNp2_9
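A sketch of that kind of plot, assuming matplotlib is available (the ranges used in the screenshots may differ):

import matplotlib.pyplot as plt

x = np.arange(0.1, 10, 0.1)

plt.plot(x, np.sin(x))               # sine wave
plt.plot(x, np.log(x))               # log curve
plt.plot(x, np.sin(x) * np.log(x))   # the two combined
plt.show()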

linspace()

My final one today is a function called linspace(). It lets you specify a start point, a finish point, and how many elements you want. It then creates an evenly spaced list of numbers between start and finish.

np.linspace(start, finish, num=number of elements you want)

pyNp2_10
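For example:

print(np.linspace(0, 10, num=5))   # [ 0.   2.5  5.   7.5 10. ]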


If you enjoyed this lesson, click LIKE below, or even better, leave me a COMMENT. 

Follow this link for more Python content: Python

 

 

Python: Pandas Intro (Dataframes)

Before we continue on to DataFrames, I want to clear up something from the Series exercise. Note the line (from pandas import Series, DataFrame).

pandas1

Using that line, I can call upon Series or DataFrame directly in my code

pandas2

In the example below, I did not directly import Series and DataFrame, so when I tried x = Series() I got an error.

I had to use the full method name of pd.Series() for this to work.

pdDF
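A quick sketch of the difference (the values here are made up):

import pandas as pd
from pandas import Series, DataFrame

x = pd.Series([1, 2, 3])    # works whenever pandas is imported as pd
y = Series([1, 2, 3])       # only works because of the explicit "from pandas import" line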

DataFrame

DataFrames provide another level of data management for Python. Those of you who come from a more data-driven background will appreciate DataFrames.

Let's start by creating a dictionary.

pdDF1.jpg

Now, pass the dictionary to the method DataFrame()

Note: you now have a table-looking structure with named columns.

pdDF2
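A sketch of those two steps with a made-up dictionary (the screenshots use different data):

from pandas import DataFrame

d = {'name': ['Bob', 'Lisa', 'Priyanka'],
     'score': [1200, 1150, 1300]}

df = DataFrame(d)
print(df)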

You can call up a list of indexes or columns using the methods below:

pdDF3.jpg

DataFrame.info() will return a summary of your DataFrame

pdDF4
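Continuing with the df from the sketch above, those calls look like this:

print(df.index)     # the row index
print(df.columns)   # the column names

df.info()           # summary: row count, column dtypes, memory usage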

Head and Tail

Create a new DataFrame from a dictionary

pdDF5.jpg

If you want to see just the first few rows, you can use the head() method.

pdDF6

The tail() method does the last few. You can even choose how many rows you want.

pdDF7
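A sketch of head() and tail() on a new made-up DataFrame (continuing the imports above):

df2 = DataFrame({'x': list(range(10)), 'y': list(range(10, 20))})

print(df2.head())    # first 5 rows by default
print(df2.tail(3))   # last 3 rows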

Describe

The describe() method gives you some quick statistics on any numeric column

pdDF8.jpg

Slice

You can slice a DataFrame just as you would a list

pdDF9

Choose What to Display

DataFrames allow you to filter what rows to display by value or column

pdDF10
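Here is a sketch of describe(), slicing, and filtering, using the df2 from the head/tail example above (the screenshots use different data):

print(df2.describe())       # quick statistics on the numeric columns

print(df2[2:5])             # slice rows 2 through 4, just like a list

print(df2[df2['x'] > 5])    # filter: only rows where x is greater than 5
print(df2[['y']])           # display just one column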

There is a lot more you can do with Series and DataFrame in pandas, and we will be covering them in later lessons. For now though, I think you have a general idea.


If you enjoyed this lesson, click LIKE below, or even better, leave me a COMMENT. 

Follow this link for more Python content: Python

 

 

Python: Pandas Intro (Series)

Pandas adds some great data management functionality to Python. Pandas allows you to create series and dataframes.

Series

Think of a Series as a combination of a list and a dictionary.

To use pandas, first we need to import it.

pandas1

To create a Series with pandas, use the following syntax:

  • variable = Series([item1, item2, … item_n])

** note Series is capitalized. This method is case sensitive

pandas Series can contain numbers or strings. You call elements in a Series like you do with a list: variable[index]

pandas2
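A minimal sketch of creating a Series and indexing into it (the values here are made up):

from pandas import Series

s = Series([10, 20, 30, 40])
print(s)
print(s[2])   # 30 – elements are called just like a list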

You can create your own index in a Series as well, making it function a lot like a dictionary.

Notice with the syntax leg[leg==4] I am able to filter by items equal to 4

pandas3
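A sketch of a custom index and that filter – the variable name leg comes from the screenshot, but the animal data here is a guess at what it contained:

# -- a custom index makes the Series behave like a dictionary
leg = Series([4, 2, 4, 8], index=['dog', 'chicken', 'cat', 'spider'])

print(leg['dog'])      # look up by label
print(leg[leg == 4])   # filter: only items equal to 4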

If you have an existing dictionary, you can convert it into a pandas Series

pandas4

You can call the index and values of a Series using the following methods:

pandas5
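A sketch of converting a dictionary and pulling out the index and values (continuing the made-up data above):

d = {'dog': 4, 'chicken': 2, 'spider': 8}
s2 = Series(d)       # keys become the index, values become the data

print(s2.index)      # the index labels
print(s2.values)     # the underlying values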


If you enjoyed this lesson, click LIKE below, or even better, leave me a COMMENT. 

Follow this link for more Python content: Python

 

 

 

Python: Enumerate() and Sort

Enumerate

Enumerate is a useful function in Python, as it returns both the indexes and values of a list.

Let’s look at the code:

Running the list through a For loop, we assign the index to x and the value to y

pythonEnum
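A sketch of that loop (the list contents here are made up; the screenshot's list will differ):

lst = ['a', 'b', 'c', 'd', 'e', 'f']

# -- enumerate() hands back (index, value) pairs
for x, y in enumerate(lst):
    print(x, y)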

Let’s say we only wanted to print every other value in a list

To do that, I could nest a simple IF statement inside my loop stating that if the index is an even number, then print the value. (x % 2 == 0 is an easy trick for finding even or odd values. % is the modulus operator – it returns the remainder of a division. Any even number divided by 2 has no remainder, so the modulus will be 0.)

pythonEnum1
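Continuing with the list above, that looks like:

# -- print every other value: only indexes where x % 2 == 0
for x, y in enumerate(lst):
    if x % 2 == 0:
        print(y)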

Sort

Python comes with two built-in ways to sort a list.

sorted()

The first: sorted() – sorts the items in a list, but does not alter the list.

As you can see, sorted(lst) returns a sorted list, but calling lst again shows the list is still in its original order.

pythonEnum2
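For example, with a made-up list:

lst = [3, 1, 4, 1, 5]

print(sorted(lst))   # [1, 1, 3, 4, 5] – a new sorted list
print(lst)           # [3, 1, 4, 1, 5] – the original is unchanged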

To sort the items in descending order, use the reverse=True argument:

pythonEnum3
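Continuing the example:

print(sorted(lst, reverse=True))   # [5, 4, 3, 1, 1]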

list.sort()

list.sort() is a method – meaning it is a function found within the list class.

list.sort() sorts the items in place, altering their position in the actual list. So the next time you look at the list, the items are in sorted order.

Also note that reverse=True works with the list.sort() method as well.

pythonEnum4
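A quick sketch, reusing the same made-up list:

lst = [3, 1, 4, 1, 5]

lst.sort()                  # sorts in place and returns None
print(lst)                  # [1, 1, 3, 4, 5]

lst.sort(reverse=True)
print(lst)                  # [5, 4, 3, 1, 1]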


If you enjoyed this lesson, click LIKE below, or even better, leave me a COMMENT. 

Follow this link for more Python content: Python

 

Python: Generators

I apologize in advance, as this topic is going to get a little computer-sciency. Unfortunately there is no way around it. At first look, generators are going to resemble the other iterables we have covered so far. But unlike the loops we have used up to now, generators produce their values "lazily".

So what do I mean by “lazily”?

Let’s consider this infinite loop I created below. This will run forever, producing a List (at least until your computer’s memory runs out). And that is the issue. Eventually the computer will run out of memory.

pythonGen
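A sketch of what that kind of loop looks like – for illustration only, do not actually run it:

# -- WARNING: this loop never ends; the list grows until memory runs out
big_list = []
n = 0
while True:
    big_list.append(n)
    n += 1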

**note if you actually decide to run the code above, you will need to force it to stop. Just closing the notebook won’t do it. Go to File>Close and Halt to stop this loop from running and eventually crashing your PC.

pythonGen1

This becomes a consideration when working with large data sets. The more memory you chew up, the worse your performance. So a workaround is to use generators. Generators produce their data lazily, meaning they produce a value, yield it, and then forget it – they don't store the values in a list like the loop above does. They only yield one value at a time.

Notice my continued use of the word yield? There is a reason for that. Look at the code for a generator below:

Note that generators have to be functions. The yield keyword – which is what makes a function a generator – cannot be used outside of a function.

pythonGen2.jpg
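A minimal sketch of a generator function (the screenshot's code may differ):

# -- a generator function: yield hands back one value at a time
def count_up_to(limit):
    n = 0
    while n < limit:
        yield n
        n += 1

# -- only one value is held in memory at a time; no list is built
for value in count_up_to(5):
    print(value)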

Now, I know this looks like I am actually repeating work here – I am using two loops to do what I could do with one. The difference matters when we are dealing with large numbers of elements. If you are only iterating over a few hundred numbers, you probably don't need to worry about generators. However, if you are going to be iterating over 100,000 elements, you will see major memory and performance improvements by using the approach above.

Range() and Xrange()

Python 2 has a built-in lazy version of range() called xrange(). It produces a range of numbers – just like range() does – but it works "lazily" (i.e. it doesn't build the whole list in memory).

In Python 3, range() itself behaves this way, and xrange() no longer exists.


If you enjoyed this lesson, click LIKE below, or even better, leave me a COMMENT. 

Follow this link for more Python content: Python