Python: Selenium – Setting Chrome browser size

There are times when running automation on a web browser that you will want to adjust the window size of the browser. The most obvious reason I can think of is some that websites, (mine included) act display differently based on the window size.

For example: Full Size

Reduced size

Notice in the minimized window, my menu list is replaced by accordion button. For the purposes of automation and webscraping, the accordion button is actually easy to navigate than my multi-layered menu.

The code for opening the browser in full screen mode is below: note the line –start-maximized

To open the window in a smaller scale try: window-size=width, length. Play around with the values to get one that works for your screen.

Python Web Scraping: Using Selenium to automate web

This is follow up to how to connect to Chrome using Selenium. If you do not know how to get to a website on Chrome using Selenium, go here

To refresh. here is the code we used to open up a web page (in this case Wikipedia’s home page)

If you run this code, you should find yourself on the home page for Wikipedia

Okay, so now lets learn how to interact with page, the first thing I am going to do is to select the English language version of the page. There are a few ways go about this, but one of the easier approaches is to look at the HTML code that creates the page and to use xpaths or titles to find the object you are looking at.

Right click on the link for English and click inspect from the drop down.

If you get a body link first, you might need to right click and hit inspect again

To check if you have the right element, hover your mouse over it, and it will be highlighted on the webpage

Once you have the right element, right click on it, go to copy>Copy Xpath

Chose Xpath, not full Xpath, it makes for easier coding. You XPath should look something like this: //*[@id=”js-link-box-en”]/strong * When you go to try this, your XPath may look different. As websites are constantly updated, many of the Xpaths get updated as well. Go with the one you find when you Inspect the HTML code yourself

Now we are going to use selenium to “Find” the element we want. The code is dr.find_element_by_XPath(‘//*[@id=”js-link-box-en”]/strong’) *Note the use of single quote around the XPath, it is better to use them as many XPaths will contain double quotes

Once you have run that code, Selenium knows what element you are looking at, you can interact with it now. Let’s “click” the link

Note something i did in the code, I added a link= before my find element command. This assigned the element now to a variable. I can now use the “click()” method the variable inherited from the selenium.webdriver object to click on the English link

I could have just done this: dr.find_element_by_xpath(‘//*[@id=”js-link-box-en”]/strong’).click()

But by assigning the variable it is a) cleaner code and b) the link can be reused by my code later. Remember, it is a law of programming that you will always have to go back and fix something you haven’t seen in 6 months, so make the code as clean as possible to make future you less likely to develop a drinking problem due to having to fix poorly written code.

If you run the code above, you will move to the home English page

Lets try one more thing, lets typing a search into the search bar:

Right click > inspect the search bar, then right click>copy>copy xpath the selection in the HTML code

Now that you have the XPath, lets use the find_element_by_xpath code and a new command, send_keys() to input characters into the search box

Finally, right click on the magnifying glass>inspect>copy>copy Xpath and let us click on it to finish our search. (remember to hover over to make sure you have the right link)

Now you should find yourself on the Data Science page of Wikipedia

Now remember — the xpaths I have on this page will likely be out of date by the time you try this, so make sure to inspect the elements and get the correct XPaths for this work for you.

Python Web Scraping / Automation: Connecting to Chrome with Selenium

Selenium is a Python package that allows you to control web browsers through Python. In this tutorial (and the following tutorials), we will be connecting to Googles Chrome browser, Selenium does work with other browsers as well.

First you will need to download Selenium, you can use the following commands depending on your Python distribution

c:\> Pip install selenium

c:\> Conda install selenium

If you are on a work computer or dealing with a restrictive VPN, the offline install option may help you: Selenium_Install_Offline

Next you need to download the driver that let’s you manage Chrome through Python.

Start by determining what version of Chrome you have on your computer

Click on the three dots in the right corner of your Chrome browser, select Help> About Google Chrome

Go to https://chromedriver.chromium.org/downloads to download the file that matches your Chrome version. (note, this is something you will need to do every time Chrome is updated, so get used to it.)

Open up the zipfile you downloaded, you will find a file called chromedriver.exe

Put it somewhere you can find, put in the following code to let Python know where to find it.

from selenium import webdriver
options = webdriver.ChromeOptions()
options.add_argument("--start-maximized")
dr = webdriver.Chrome('C:/Users/larsobe/Desktop/chromedriver.exe',chrome_options=options)

Now to see if this works, use the following line, (you can try another website if you choose)   

Note the message Chrome is being controlled by automated test software.

You are now running a web browser via Python.

Python: Confusion Matrix

What is a confusion matrix?

A confusion matrix is a supervised machine learning evaluation tool that provides more insight into the overall effectiveness of a machine learning classifier. Unlike a simple accuracy metric, which is calculated by dividing the number of correctly predicted records by the total number of records, confusion matrices return 4 unique metrics for you to work with.

While I am not saying accuracy is always misleading, there are times, especially when working with examples of imbalanced data,  that accuracy can be all but useless.

Let’s consider credit card fraud. It is not uncommon that given a list of credit card transactions, that a fraud event might make up a little as 1 in 10,000 records. This is referred to a severely imbalanced data.  Now imaging a simple machine learning classifier running through that data and simply labeling everything as not fraudulent. When you checked the accuracy, it would come back as 99.99% accurate. Sounds great right? Except you missed the fraud event, the only reason to try to create the model in the first place.

A confusion matrix will show you more details, letting you know that you completely missed the fraud event. Instead of a single number result, a confusion matrix provides you will 4 metrics to evaluate. (note: the minority class – (in the case of fraud – the fraudulent events) – are labeled positive by confusion matrices. So a non-fraud event is a negative. This is not a judgement between the classes, only a naming convention)

TP = true positive – minority class (fraud) is correctly predicted as positive

FP = false positive – majority class (not fraud) is incorrectly predicted

FN = false negative – minority class (fraud) incorrectly predicted

TN = true negative – majority class (not fraud) correctly predicted

In matrix form:

confus

To run a confusion matrix in Python, Sklearn provides a method called confusion_matrix(y_test, y_pred)

y_test = actual results from the test data set

y_pred = predictions made by model on test data set

so in a pseudocode example:

model.fit(X,y)
y_pred = model.predict(X_test)

If this is at all confusing, refer to my Python SVM lesson where I create the training and testing set and run a confusion matrix (Python: Support Vector Machine (SVM))

To run a confusion matrix in Python, first run a model, then run predictions (as shown above) and then follow the code below:

from sklearn.metrics import confusion_matrix
confusion_matrix(y_test, y_pred)

Output looks like this:

Confu1

Now, if you want to capture the TP, TN, FP, FN into individual variables to work with, you can add the ravel() function to your confusion matrix:

TN,FP,FN,TP = confusion_matrix(y_test, y_pred).ravel()

Thank you for taking the time to read this, and good luck on your analytics journey.

Python: Support Vector Machine (SVM)

Support Vector Machine (SVM):

A Support Vector Machine, or SVM, is a popular binary classifier machine learning algorithm. For those who may not know, a binary classifier is a predictive tool that returns one of two values as the result, (YES – NO), (TRUE – FALSE), (1 – 0).  Think of it as a simple decision maker:

Should this applicant be accepted to college? (Yes – No)

Is this credit card transaction fraudulent? (Yes – No)

An SVM predictive model is built by feeding a labeled data set to the algorithm, making this a supervised machine learning model. Remember, when the training data contains the answer you are looking for, you are using a supervised learning model. The goal, of course, of a supervised learning model is that once built, you can feed the model new data which you do not know the answer to, and the model will give you the answer.

Brief explanation of an SVM:

An SVM is a discriminative classifier. It is actually an adaptation of a previously designed classifier called perceptron. (The perceptron algorithm also helped to inform the development of artificial neural networks).

The SVM works by finding the optimal hyperplane that can be used to discriminate between classes in the data set. (Classes refers to the label or “answer” column of each record.  The true/false, yes/no column in a binary set). When considering a two dimensional model, the hyperplane simply becomes a line that divides to the classes of data.

The hyperplane (or line in 2 dimensions) is informed by what are known as Support Vectors. A record from the data set is converted into a vector when fed through the algorithm (this is where a basic understanding of linear algebra comes in handy). Vectors (data records) closest to the decision boundary are called Support Vectors. It is on either side of this decision boundary that a vector is labeled by the classifier.

The focus on the support vectors and where they deem the decision boundary to be, is what informs the SVM as to where to place the optimal hyperplane. It is this focus on the support vectors as opposed to the data set as a whole, that gives SVM an advantage over a simple learner like a linear regression, when dealing with complex data sets.

Coding Exercise:

Libraries needed:

sklearn

pandas

This is the main reason I recommend the Anaconda distribution of Python, because it comes prepackaged with the most popular data science libraries.

from sklearn.model_selection import train_test_split
from sklearn.svm import SVC
from sklearn.metrics import metrics
from sklearn.metrics import confusion_matrix
import pandas as pd

Next, let’s look at the data set. This is the Pima Indians Diabetes data set. It is a publicly available data set consisting of 768 records. Columns are as follows:

  1. Number of times pregnant.
  2. Plasma glucose concentration a 2 hours in an oral glucose tolerance test.
  3. Diastolic blood pressure (mm Hg).
  4. Triceps skinfold thickness (mm).
  5. 2-Hour serum insulin (mu U/ml).
  6. Body mass index (weight in kg/(height in m)^2).
  7. Diabetes pedigree function.
  8. Age (years).
  9. Class variable (0 or 1).

Data can be downloaded with the link below

pima_indians

Once you download the file, load it into python (you’re file path will be different)

df = pd.read_excel(‘C:\\Users\\blars\\Documents\\pima_indians.xlsx’)

now look at the data:

df.head()

svm1

Now keep in mind, class is our target. That is what we want to predict.

So let us start by separating the target class.

We use the pandas command .pop() to remove the Class column to the y variable, and the remained of the dataframe is now in the X

Let’s now split the data into training and test sets:

X_train, X_test, y_train, y_test = train_test_split(X,y, test_size=0.33)

Now we will train (fit) the model. This example I am using Sklearns SVC() model for an SVM example. There are a lot of SVMs available to try if you would like to explore deeper.

Code for fitting the model:

model =SVC()
model.fit(X_train, y_train)

Now using the testing subset we withheld, we will test our model

y_pred = model.predict(X_test)

Now to see how good the model is, we will perform an accuracy test.  This simply takes all the correct guess and divides them by total guesses.

As, you can seen below, we compare the y_pred (predicted values) against y_test (actual values) and we get .7677 or 77% accuracy. Which is not a bad model for simply using defaults.

svm3

Let’s look at a confusion matrix to get just a little more in-depth info

svm4

For those not familiar with a confusion matrix, this will help you to interpret results:

First number 151 = True Negatives — this would be the number of 0’s or not diabetics correctly predicted

Second number 15 = False Positives — the number of 0’s (non-diabetics) falsely predicted to be a 1

Third number 44 = False negatives — the number of 1’s (diabetics) falsely predicted to be a 0

Fourth number 44 = True Positives — the number of 1 (diabetics) correctly predicted.

So, the model correctly identified 44 out of the 59 diabetics in the test data, and misdiagnoses 44 out the 195 non diabetics in the data sample.

To see a video version of this lesson, click the link here: Python: Build an SVM

Python: Simulate Blockchain Mining

In my earlier tutorial, I demonstrated how to use the Python library hashlib to create a sha256 hash function. Now, using Python, I am going to demonstrate the principle of blockchain mining. Again using BitCoin as my model, I will be trying to find a nonce value that will result in a hash value below a predetermined target.

We will start by simply enumerating an integer through our sha 256 hash function until we find a hash with 4 leading zeros.

I used a while loop, passing the variable “y” through my hashing function each time the loop runs. I then inspect the first 4 digits [:4] of my hash value. If the first four digits equal 0000 then I exit the loop by setting the found variable to 1

(*note, a hash value is a string – hence the need for quotes around ‘0000’)

2018-04-27_13-30-18

As you can see in the version above, it took 88445 iterations to find an acceptable hash value

Now, using the basic example of a blockchain I gave in an earlier lesson, let’s simulate mining a block

2018-04-04_12-39-34

You’ll see, I am now combining the block number, nonce, data, and previous hash of my simulated block and passing it through my encryption function.  Just like in BitCoin, the only value I change per iteration is the Nonce. I keep passing my block through the hashing function until I find the Nonce that gives me a hash below the target.

2018-04-27_13-36-38

Now, let’s lower the target value to 6 leading zeros. This should result in a longer runtime to get your hash

2018-04-27_13-37-16

To measure the run time difference, let’s add some time stamps to our code

2018-04-27_13-39-31.png

So, I am using the timestamp function twice. D1 will be our start time, d2 will be our end time, and I am subtracting d1 from d2 to get our elapsed time. In the example below, my elapsed time was 5 secs

2018-04-27_13-50-43

Now, let’s bump the target down to 7 leading zeros. Now this brings my elapsed computing time to 20 minutes. That is a considerable commitment of resources. You can see why they call it a “proof of work” now.

2018-04-27_14-26-04

 

Python 2.xx VS 3.xx

If you are trying to learn Python, especially for Data Science, you are going to come across a bunch a people (myself included) who have been very hesitant to move from Python 2.xx to Python 3.xx.

The reason?

Not Backwards Compatible

Well, when Python 3 first came out, it was made very clear that it was not backwards compatible. While most of the code remained the same, there were some changes that made programs coded in Python 2.xxx fail. A prime example is Print. In Python 2,

Print ‘Hello World!’

was perfectly acceptable, but in Python 3, it fails and throws up an exception. Python 3 had added the requirement for () with print statements

Print(‘Hello World’)   — Python 3 friendly

Well I have a whole bunch of code sitting on my hard drive I like to refer to, and having to go through it cleaning up all the new changes did not exactly seem like the best use of my time, since I am able to just continue using Python 2.xxx.

Libraries

Python is so great for Data Science because of the community of libraries out there providing the data horsepower we all love. (Pandas, Numpy, SciKit Learn, ect).  Well guess what, the changes to Python 3 made some of these libraries unstable. Strike 2, another reason not to waste my time with this new version.

Moving On

Well, it appears that enough time has passed and some people tell me all the bugs have been worked out with Python 3. While Python 2.7 will be supported til 2020 (those who love it, just keep using it I say), I have decided to try putting Python 3 through it’s paces.

My lessons will be reviewed one by one, with the 2.7 code being tested in a 3.4 environment. I will note any changes that need to made to the code to make it 3.4 compatible and add them to the  lesson. Each lesson that has been reviewed will have the following heading.

*Note: This lesson was written using Python 2.xx. If you are using Python 3.xxx any changes to the code will be annotated under headings: Python 3.xxx

2.7??

If you love 2, just keep using it. You’ve got 3 more years. By that point, whatever revision of 3 we are on may not even look like the current version. But if you want to look ahead, follow along with me as I update my code. (Note, all the original 2.7 code will remain on my site until the time that it is no longer supported)

Python: Pivot Tables with Pandas

Pandas provides Python with lots of advanced data management tools. Being able to create pivots tables is one of the cooler tools.

You can download the data set I will be working with here: KMeans1

First, let’s upload our data into Python

import pandas as pd
df = pd.read_excel("C:\Users\Benjamin\Documents\KMeans1.xlsx")
df.head()

pandasPivotTable.jpg

Let’s create our first pivot table.

The syntax is = pd.pivot_table(dataframe, index = Columns you want to group by, values = Columns you want to aggregate, aggfunc = type of aggregation)

pd.pivot_table(df, index='Department', values = 'AVG Labor', aggfunc = 'sum')

pandasPivotTable1.jpg

We can group by more than one column

pd.pivot_table(df, index=['Department','Model'], values = 'AVG Labor',  
aggfunc= 'mean')

pandasPivotTable2.jpg

You can also have multiple value columns

 pd.pivot_table(df, index='Department', values = ['AVG Labor','Labor Cost'],
  aggfunc= 'sum')

pandasPivotTable3.jpg

Now, one catch is, what if you want to have different aggregate functions. What if I want to get the mean of AVG Labor, but I want the sum of Labor Cost columns.

In this case we are going to use the groupby().aggregate()

import numpy as np
df.groupby('Department').aggregate({'AVG Labor':np.mean, 'Labor Cost': np.sum})

pandasPivotTable4.jpg

Python: An Interesting Problem with Pandas

I was writing a little tongue and cheek article for LinkedIn on fraud detection using frequency distributions (you can read the article here: LinkedIn). While this was a non-technical article, I wanted to use some histograms from a real data set, so I uploaded a spread sheet into Python and went to work.

While working with the data I ran into an interesting problem that had me chasing my tail for about 10 minutes before I figured it out. It is a fun little problem involving Series and Dataframes.

As always, you can upload the data set here: FraudCheck1

Upload the data.

import pandas as pd
df = pd.read_excel
("C:\\Users\\Benjamin\\OneDrive\\Documents\\article\\python\\FraudCheck1.xlsx")
df.head()

pandasProb.jpg

The data is pretty simple here. We are concerned with our answer column and the CreatedBy (which is the employee ID).  What I am trying to do is see if the “answer”  (a reading from an electric meter) are really random or if they have been contrived by someone trying to fake the data.

First, I want to get the readings for all the employees, so I used pop() to place the answer column into a separate list.

df1 = df

y = df1.pop("answer")

pandasProb1.jpg

Then, to make my histogram more pleasant looking, I decided to only use the last digit before the decimal. That way I will have 10 bars (0-9). (Remember, this is solely for making charts for an article. So I was not concerned with any more stringent methods of normalization)

What I am doing below is int(199.7%10). Remember % is the modulus – leaves you with the remainder and int converts your float to an integers. So 199.7 is cut to 199. The 199/10 remainder = 9.

a= []
i = 0 
while i < len(y):
     a.append(int(y[i]%10))
     i += 1
a[1:10]

pandasProb2

Then I created my histogram.

%matplotlib inline
from matplotlib import pyplot as plt
plt.hist(a)

pandasProb3.jpg

Now my problem

Now I want graph only the answers from employee 619, so first I filter out all rows but the ones for employee 619.

df2 = df.query('CreatedBy == 619')
y1 =df2.pop("answer")

Then I ran my loop to turn my answers into a single digit.

And I get an error.  Why?

pandasProb4.jpg

Well the answer lies in the datatypes we are working with. Pandas read_excel function creates a Dataframe.

When you pop a column from a dataframe, you end up with a Series. And remember a series is an indexed listing of values.

Let’s look at our Series. Check out the point my line is pointing to below. Notice how my index jumps from 31 to 62. My while loop counts by 1, so after 31, I went looking for y1[32] and it doesn’t exist.

pandasProb5.jpg

Using .tolist() converts our Series to a list and now our while loop works.

pandasProb6.jpg

And now we can build another histogram.

pandasProb7.jpg

The Code

import pandas as pd
df = pd.read_excel
("C:\\Users\\Benjamin\\OneDrive\\Documents\\article\\python\\FraudCheck1.xlsx")
df.head()

df1 = df
y =df1.pop("answer")

a= []
i = 0 
while i < len(y):
   a.append(int(y[i]%10))
   i += 1
a[1:10]

%matplotlib inline
from matplotlib import pyplot as plta1 = []
i = 0
while i < len(y2):
 a1.append(int(y2[i])%10)
 i = i+1
a1[1:10]

plt.hist(a)

df2 = df.query('CreatedBy == 619')
y1 =df2.pop("answer")

y2= y1.tolist()
type(y2)

a1 = []
i = 0
while i < len(y2):
    a1.append(int(y2[i])%10)
    i = i+1
a1[1:10]

plt.hist(a1)

 

Python: Naive Bayes’

Naive Bayes’ is a supervised machine learning classification algorithm based off of Bayes’ Theorem. If you don’t remember Bayes’ Theorem, here it is:

bayes

Seriously though, if you need a refresher, I have a lesson on it here: Bayes’ Theorem

The naive part comes from the idea that the probability of each column is computed alone. They are “naive” to what the other columns contain.

You can download the data file here: logi2

Import the Data

import pandas as pd
df = pd.read_excel("C:\Users\Benjamin\Documents\logi2.xlsx")
df.head()

nb.jpg

Let’s look at the data. We have 3 columns – Score, ExtraCir, Accepted. These represent:

  • Score – Student Test Score
  • ExtraCir – Was Student in an Extra Circular Activity
  • Accepted – Was the Student Accepted

Now the Accepted column is our result column – or the column we are trying to predict. Having a result in your data set makes this a supervised machine learning algorithm.

Split the Data

Next split the data into input(score and extracir) and results (accepted).

y = df.pop('Accepted')
X = df

y.head()

X.head()

nb1.jpg

Fit Naive Bayes

Lucky for us, scikitlearn has a bit in Naive Bayes algorithm – (MultinomialNB)

Import MultinomialNB and fit our split columns to it (X,y)

from sklearn.naive_bayes import MultinomialNB
classifier = MultinomialNB()
classifier.fit(X,y)

nb2.jpg

Run the some predictions

Let’s run the predictions below. The results show 1 (Accepted) 0 (Not Accepted)

#--score of 1200, ExtraCir = 1
print(classifier.predict([1200,1]))

#--score of 1000, ExtraCir = 0
print(classifier.predict([1000,0]))

nb3

The Code

import pandas as pd
df = pd.read_excel("C:\Users\Benjamin\Documents\logi2.xlsx")
df.head()

y = df.pop('Accepted')
X = df

y.head()
X.head()

from sklearn.naive_bayes import MultinomialNB
classifier = MultinomialNB()
classifier.fit(X,y)

#--score of 1200, ExtraCir = 1
print(classifier.predict([1200,1]))

#--score of 1000, ExtraCir = 0
print(classifier.predict([1000,0]))