SUMIF() and COUNTIF()

SUMIF()

SumIf is a handy Excel function to have in your tool box. It is basically a combination of SUM() and IF(). What it is saying is, SUM up all numbers in a range that meet certain condition.

The syntax is as follows:

=SUMIF(range, criteria, [sum_range])

Range = range of numbers you want to test the criteria against

Criteria = the logical test (is the data in the range =, >, < a value)

[sum_range] = this is an optional value. It tells Excel which cells to add. If you don’t use this value, Excel automatic defaults to the first Range

COUNTIF()

CountIF counts the number of cells that meet a certain criteria.

Syntax is similar to SUMIF():

=COUNTIF(range, criteria)

** Note this example uses wildcards – symbols that can be used in place of any symbol. In Excel, the two most popular wildcards are * and ?. ? can be used in place of one character (B?d = Bad, but B?d <> Bead). * can be used in place of many characters. B* = Brain and B*d = Bead)

=If()

Conditional formulas can really up your Excel game. This can provide a lot more functionality to your workbooks.

So what are conditional formulas? Conditional formulas work based on the following logic: IF condition THEN action.

examples: IF gas tank is low THEN  go to gas station

IF age > or = to 21 THEN you can drink

The basic Excel formula for this is =If()

Syntax

=If(logical test,[value if true],[value if false])

Example 1

In the first example, we are going to see if a value in Column A is > 10. If it is, we will return High. If not, it will return Low.

It also works with strings.

Example 2

In this example, we are asking if any name in column D matches Bob, then Yes, else No.

Power Query: Pull data from a website into Excel

For this tutorial, you will need to have Power Query installed. If you are running Office 2016, Power Query should already be available. For Excel 2010 and 2013, here is a link to the download: Power Query

Power Query makes pulling data from a website quick and easy. In this example we will be extracting data from the Wikipedia page “List of NCAA Men’s Division I Basketball champions“. Here is a link to the website: Link to Wikipedia Site

If you open up the Wikipedia page and scroll down a bit, you will see a table:  Championship games, by year, showing winners and losers, final scores and venues. This the table we will be extracting into Excel.

1. Open up a new Excel workbook. Locate the Power Query Tab on the Ribbon Bar
2. Select Web Page
3. In the pop up window, copy and paste the URL to the Wikipedia Page
4.  Click Ok

5. Notice as you click through the list of tables on the left side of the screen, the tables             appear in the preview screen on the right.

6. Select Championship games, by year….

Congratulations. You now have the table from the Web in an Excel sheet.

Power Query: Merge Data Sets

For this tutorial, you will need to have Power Query installed. If you are running Office 2016, Power Query should already be available. For Excel 2010 and 2013, here is a link to the download: Power Query

Here is a link to the practice file for this Lesson: Data Cleaning Power Query

If you open the Excel file, you will see 3 sheets with 3 tables (Work Order, Vehicles, Prices). We want to populate a single sheet combining the three tables into one data set.

Open up a New Excel Workbook (A new file. Do not try working from the practice file). From the Ribbon bar select Power Query > From File > From Excel

Select the practice file: dataCleaningPQuery.xlsx and when the Navigator pops up, select Work Orders

**Note if the Query Editor window pops up, just click Close and Load. We will be working in the Query Editor window later.

Now go to the sheet with the Work Order table on it. Power Query > Merge

A) The top drop down box should already be populated. If not, select Work Orders for top and then select Prices for the bottom.

B) Now highlight the Service Columns in both tables.

C) Leave the Join Kind at Left Outer.

A Left Outer Join works as seen below. The left table is displayed in full and the the right table adds data based on columns that match up against data in the left table.

Note the text below “The selection has matched 29 out of the first 29 rows.” This is because if you look at the two columns we selected, they both have matching text. It is through this matching that we are able to line up the two tables into one.

Click OK and the Query Editor window will pop up.

Select the dual arrow icon in the NewColumn header and deselect Service from the list.

Right click the NewColumn and Rename it Price. Click on it and hold down the mouse button. Now drag it so that price sits in between Service and Mech.

Hit Close & Load and your new merged table should look like this:

Ok, now to load the Vehicle sheet.

Power Query > From File > From Excel. Select excel practice file and select Vehicles. DO NOT HIT LOAD this time. Instead, select Edit.

If you look at the original data set, the Lic Plate column in the Work Order sheet and Licence Plate in Vehicles are the two column we need to match up. Unfortunately, they currently do not match. In the second table, the licence plates are preceded by the letters Lic. We need to remove this.

Right click on the Licence Plate> Replace Values

Type Lic  into Value to  Find and leave Replace with blank. **Note — the Value to Find is “Lic ” with a space after it. Make sure you add the space.

Click OK, the Licence Plate column will remove the Lic. Click Close & Load

Now it is time to merge. Go back to the Work Order sheet and click Merge

Set the top table to Work Orders. Set the bottom to Vehicles. Highlight Lic Plates and Licence Plate columns. Set Join to Left Outer Join and click okay.

Click the double arrows on the new column and un-check column 1.

Rename the new columns Make and Model. Highlight both column and move them in between Lic Plate and Service. Click Close&Load

Congratulations. You have now successfully merged 3 tables into 1 using Power Query. Now you can go forth and analyze the data.

Logistic Regression with Gretl

One of the most popular machine learning algorithms, Logistic Regression is actually a classification algorithm. Broken down to its simplest terms, binary logistic regression (the one we will be focusing on here) is answering a yes or no question. Will the customer buy or not? Is the email SPAM or not?

Score Accepted 982          0 1304         1 1256         1 1562         1 703          0

Above is a small sample from the data set we will be using for this lesson. In this set, student scores for an entrance test are listed in the first column and whether they were Accepted (1) or Not(0) is in the second column.

I ran a scatter plot on the data with Scores on the X axis. As you can see the dots for 2 horizontal lines at 1 and 0. You may notice that the 1 (Accepted) dots seem to cluster towards higher scores and 0 (Not Accepted) dots cluster towards lower scores.

Well since the point of Logistic Regression is help us make predictions, here is how the predictions work. The Logistic Regression, represented by my crudely drawn red S, goes from 1 to 0. And just like with Linear Regression, if we take a value for X, to make our prediction, we look for the value of Y on the line at that point.

In the case of a 1200 score, if we check the value of Y on the line, we get .80. This roughly translates to mean, that with a score of 1200, a student has an 80% chance of being accepted.

Let’s meet Gretl

While there are third party add-ons you can download for Excel that will do Logistical Regression, in its native form, Excel does not do a good job in this area. So I thought this would be a great opportunity to introduce you to a neat piece of FREE software called Gretl.

So why Gretl? Why not R or Python? I mean those are the languages real data scientists use right?

That is true, and R and Python can easily do a Logical Regression. The problem is  however, in order to use R and Python, you need to know how to program. Gretl, on the other hand, is GUI based. Think of it as a point and click light weight R. It is no where near as robust as R, but for learning how to do Logistical Regression, Gretl does a fine job.

After you install and start Gretl, the next step is to load in the data. Go to File>Open Data>User File. Search for the Excel file you downloaded previously in this lesson. Make sure you then select Excel from the file type at the bottom of the screen.

Select logi.xlsx. Leave the Start Import at window at 1 and 1. This is where the data starts in our Excel file: 1rst column, 1rst row. You will get a message letting you know how much data was imported.

The next pop up will noted that the data is undated. Click No on this window.

You data columns (Score, Passed) will appear in the  Gretl window. If you click on one, the data from that column will appear in a pop-up window. **note in the file you download, column 2 will be Accepted not Passed.

Let’s Model

Without further ado, let us do some modelings. From the menu bar Model>Limited dependent variable>Logit>Binary…

Now you have to select you Dependent variable and Regressors. Here is a hint, the dependent variable is what we want to find. What are we looking for? Will the person be Accepted. So Accepted goes in Depentdent variable and Score goes in Regressors. Pick the Show p-values radio button and then click Okay.

Below are the results of your Logistic Regression model

I am not going to give a Stats lesson here, but I will cover the important points.

1. The top red box contains some important information. First the coefficients represent the b and m values from the linear equation we will be using later: y=mX+b =y=0.0105216X + -11.2757
2. The p-value of Score = 0.0009 This is important as the p-value is a probabilistic value  that determines whether or not the regressor variable truly affects the dependent variable. The most common p-value threshold you are likely to come across is 0.05. If your regressor variable has a p-value above 0.05, you will want to reconsider your model.
3. The matrix at the bottom of the screen. This shows you how successfully your model predicted outcomes from the training data set. It translates of the 0’s (not accepted) the model got 19 out of 21 right. For 1’s(accepted) the model got 19 out of 21 right. That is a 90% success rate. Not bad.

Let’s Use the Model

Okay, so maybe you jumped ahead and tried 1200 in the linear formula we developed above. It is 1.325?? How is that? Isn’t this supposed to be between 0 and 1.

Well the problem is, we are not looking for Y we are looking for probability (p). Y in this case is not the Y intercept, but instead:

Well, we know Y = 1.325 for a score of 1200, how do we find p from that? We solve  for p. Now feel free to go and do the math yourself if you want, but I already did the work for you. The equation below solves for p. If you don’t trust me and want to do it yourself, be my guest, but I assure you the equation below is good.

Let’s Make a Prediction

Let us put the formula’s we have found into Excel

Now you have a working prediction model. Any value you place in the score cell will be calculated to Y and p (probability). As the example above shows, a score of 1200 give us a probability of .79.

Turns out my crummy drawing wasn’t so bad after all.

Transpose Data with Excel

Video Link for Transposing Data with Excel

Transposing a table or matrix is a method of rotating or pivoting a table so the rows become the columns and the columns become the rows. Excel offers a couple of methods of performing this action without having to manually recreate the table.

Method 1: Paste Special

This method is very simple.

1. Highlight the table, Right Click > Copy
2. Select a cell where you want the Transposed Table
3. Right Click > Paste Special > Select the Transpose Clipboard

Closer look at the Transpose Clipboard Icon

Method 2: Transpose()

While the Paste Special method works well, it does require user interaction. If you want to automate the process, you can use the Transpose() function.

1. Highlight some blank cells. Make sure you highlight enough to fit you data. In this case, we need to highlight 4 down and 5 across.
2. Now click inside function bar (ƒx)  Type the following: =transpose(
3. Highlight the table you wish to transpose: =transpose(A1:D5
4. Close the function: transpose(A1:D5)
5. Now hold down Shift+Ctrl and press Enter

Argument for Method 2

While the Paste Special method is easy to use, here is a major draw back to it. Notice the red circled cells. After creating Transposed tables using the Transpose() method and the Paste Special method, I changed the Sensor 1 Qrt 3 value from 23 to 24. Notice the Transpose() table updates with the new information, while the Paste Special table does not.

Linear Regression using Excel

Link to video on Linear Regression using Excel

Regression Analysis is still the most popular method used in Predictive Analytics. The main reason is that it works. It is well known and understood. With its different flavors, regression analysis covers a width swath of problems. Another great reason to use it, is that regression tools are easy to find.

What is Linear Regression?

Linear Regression is a method of statistical modeling where the value of a dependent variable based can be found calculated based on the value of one or more independent variables. The general idea, as seen in the picture below, is finding a line of best fit through the data. Using that line, you can then predict the value of Y given X.

I am not going to go too deep into the math here. I highly the Khan Academy video posted below if you are looking to brush up on your statistics.

Lets Start by Looking at the Data

If you download the Excel file at the top of the page, you will find 2 columns labeled Years and Salary. This example data set shows us the years of service and salary of 39 employees for an imaginary company.

What we are going to attempt to do is to develop a model using Linear Regression that will allow us to predict the salary of an employee given their years of service.

Step 1: Build a Scatter Plot

The first thing we want to do is build a scatter plot. Excel makes this simple enough. Just highlight all of your data > select the Insert Tab from the Ribbon > Select Scatter from Charts:

What you will get should look something like this:

We have a scatter chart with Salary on the Y Axis and Years on the X Axis. **Excel scatter charts set the left most column of the data set to the X Axis by default.

Before we move on, I want to take a moment to look at the scatter plot. Do you see a pattern? Can you see where you might be able to draw a line through the data?

I am not trying to just fill space here. I am asking a serious question. Because the answer is sometimes you will not see a pattern. Sometimes the scattering of data will be so random that there will no need to go forward with a linear regression. Learning to look for patterns in data visualizations is skill worth developing.

In this example there is a general pattern, or more accurately, we see what looks like Positive Correlation. We call it positive because it appears that as X increases so does Y. So now that our scatter chart has passed the visual test, it is time perform our regression.

Trend Line

Performing a simple linear regression in Excel is ridiculously easy. Simply click on your scatter plot > from the Ribbon select Chart Tools – Design > Add Chart Element > Trendline > Linear

Your trendline appears on your chart. I personally find the line a little hard to see as is, so I am going to format it a bit.

Start by double clicking on the trendline and the Format Trendline window will open on the right.

Line: — Color: Red  — Width: 3pt  — Dash type: Solid Line

Trendline Options — Select Display Equation on chart and Display R-squared value on chart

Alright, that line is much easier to read. Now let us talk about the numbers in the circle. Now I know I said I was not going to get too deep into the math, but I feel I can’t do this subject justice without at least a cursory explanation of what is going on.

What exactly did Excel do when it added the trendline? Technically it performed a statistical function known as Ordinary Least Squares.  What does that mean? Well if you wanted to attempt this by hand, one approach you could take would be to start by drawing a line that looked best to you. You would then measure the Residuals (the distance from the actual data points and line you drew)

You then repeat the process (picking a new line and measuring residuals) until you find the line that results in the lowest overall residual.Once you have it, you get the equation for your line:  y = 1357.9x+50974 (Luckily for us Excel makes the process a lot easier)

Now a quick refresher on the line formula: Y= mX + b (where m = Slope and b = Y-Intercept). This equation is what you would use to make predictions. In our equation a person with 0 years in service would have a salary of 50974: Y = 1357.9(0) + 50974 — Y= 50974. And each year of service would add 1357.90 to the salary.

Before we go start using your equation to start making predictions, we still need to discuss the R² you see below your line equation. I won’t bore you with how R² is calculated. You don’t really need to know how it is calculated to use linear regression, but you do need to know how to read it.

The simplest explanation I can give you for R² is that a value of 1 means perfect fit – every point in your data matches up to your line. 0 on the other hand, means your line doesn’t match anything. Our R² is 0.4423, which really is not that great. I generally prefer to aim for a R² value above 0.6.

How can we improve our R² value? My preference would be to get more data. We currently only have 39 tuples. More data could improve our accuracy. If more data is not available though, you can look at your outliers as Linear Regression can be greatly affected by outliers. Unfortunately outliers are often tricky to deal with. A person with 1 year of service making 100,000 a year would definitely be an outlier, but it is not an impossibility. If this employee is a highly experienced individual who just transferred from another company, it is totally feasible they could be earning 100,000.

The hard truth is, considering only the data we have, we cannot rightfully develop a reliable model. This happens more often than you might think. That is okay though, we will chalk this up as a learning experience and move on.

Excel: Learning to Use Formulas

The first step towards being an Excel Power User is learning to use formulas. Pulling out a calculator to add up Excel columns is a sure fire way to let everyone in the office know your computer skills are lacking. Luckily for you, basic Excel formulas are pretty easy learn.

Lesson 1: Basic Arithmetic Formulas

Video Link: Excel Formulas Part 1

Excel supports all standard Arithmetic functions – addition, subtraction, multiplication, division, etc. **Excel comes standard with advanced engineering and statistical functions, but that is well beyond the scope of this article.

To start a formula in Excel: select the cell where the result belongs and type =. Using = as the first character tells Excel you want to create a formula.

In this example, we want to add the first row of column A and B and place the answer in C. Select cell C2, enter =A2+B2. Press enter and 7 will appear in the space. Try changing the values in cells A2 and B2. Notice the value in C2 changes in kind.

Notice how as you type in a formula, it populates in the formula bar above as well. You can work in the formula bar instead if you chose (as your formulas get longer and more complex, you will find yourself doing more work in the formula bar).

You can try out some other arithmetic functions following the example below: **note: make sure to start your formulas with =

Lesson 2: Replicating Formulas

Video Link: Excel Formulas Part 2

If you had to input each formula individually every time you wanted to use one, this would not be any more efficient then simply doing the calculations on your old desktop calculator. Fortunately, Excel makes replicating your formulas across many cells as simple as clicking a mouse button.

In the bottom right corner of an active cell, there is a small green box. You can click on this box and drag it over the cells you would like the formula to replicate. Or, you can just double click corner and it will automatically replicate the remaining rows in your table.

Lesson 3: Functions

Excel has hundreds of built in functions (far too many to cover in this article). These functions save you having to write complex formulas manually.

Below are examples of 3 commonly used functions (SUM, PRODUCT, AVERAGE)

Let us break these functions down:

1. SUM() — adds the values of all cells listed in the parenthesis.  **In the example above, the SUM function has C2:C5 in the parenthesis. The colon(:) means between. So C2:C5 actually means all the values C2 through and including C5 (C2,C3,C4,C5)
2. PRODUCT() — multiplies the values of all cells listed in the parenthesis.
3. AVERAGE() — takes the average of the values of all cells listed in the parenthesis.

If you are curious as to what other built in functions Excel offers, go to the Formulas tab in the ribbon cable.

Numbers don’t add up in Excel

With the popularity of Excel, it is no wonder many third party applications offer a feature to allow you to export data directly into Excel. This is convenient and makes it easy to share information as almost everyone has Excel loaded on their computer. Unfortunately, often these files (due to poor coding from the third party application) do not have full functionality.

The Problem: Numeric Columns Don’t Add Up

Your boss receives a file from your purchasing department. It is a monthly breakdown parts expenditures for your maintenance department. The problem is, when he tries performing some basic arithmetic functions on the spreadsheet, the numbers don’t add up(literally). **Note the SUM() function in the picture below returns \$0.00 even though there are positive values in the column above.

After attempting all the solutions he could think of, your boss sends the file to you hoping you can help him out.

The Solution: Hidden Characters

After receiving the file, you discover the cause of the problem is hidden characters. Whatever application this spreadsheet was exported with, added a few hidden gems to make sure the file is utterly useless to anyone trying do much more than just read the file.

So how do we fix the issue? First you will need a keyboard with a numeric keypad. Time to start digging through your desk drawers or making nice with the office hoarder. I am sure he has a few dozen keyboards squirreled away somewhere and can loan you one. Although it will probably be up to you to dust off the doughnut powder.

Okay, so you have your keyboard. Now what?

1. Highlight cells you want to work with
2. Go to Find&Select > Replace in the upper right corner of your ribbon bar

3. In the Find what: field – hold down the ALT key and type 0160 with the numeric keypad. ***note nothing will appear in the space as you type this

4. Leave Replace with: blank and hit the Replace All radio button

Now all of your numeric columns will function properly.You can send the file back to your boss, further cementing your reputation as the office Excel guru.

Extract Date from DateTime in Excel

Have you ever exported an Excel file from a program and the formatting was all off? Even worse, when you try to fix the formatting, Excel won’t let you. One of the most common problems I run up against is date time formatting issues when I am dealing with Excel file that originated somewhere else.

The Problem: Date Time Format cannot be Changed

Here we have an Excel file in which one of the columns contains a date time stamp. The problem is, we do not care about the time, we only want the date. No matter what we do though, we can not seem to alter the date format.

None of the standard fixes help here. Formatting the cells, copying cells to a new sheet, and even using Special Paste results in a date time stamp we still cannot alter.

The Solution: Text to Columns

First things first – Make sure to select the date column in question.

From the Ribbon Bar, go to the Data tab. You will see an Icon labeled Text to Columns.

We are going to use Fixed width as the elements in the date time column all maintain a fixed size.

Hit next. Excel will place bars in the data preview window to show you how the data will be segmented. **Notice Date, Time, and AM/PM each occupy a separate column now.

Since everything looks good, we will hit next.

Now highlight the date column, select Date from the Column data format box and chose your preferred date format from the drop down menu.

For the next two columns, since we do not not need them in this example, we will just highlight each column and select Do not import column (skip)

Now hit Finish and here are the results.