Python: Webscraping using Requests and BeautifulSoup to identify website content

When learning any new skill, it is always helpful to see a practical application. I currently work as a data scientist in the Cyber-security department. One of our concerns is fake websites set up to look like a real website for my company. We search recently registered Domain names for names that might mimic our company brand or name. We need to know if these are possibly malicious sites looking to steal our employee or customer information.

The problem is, we get 100’s of new domains registered a day that fit our search criteria. Instead of looking at each on manually, I created a Python script that scans the list of new domains for key words. To do this, all you need is Requests and BeautifulSoup.

In this example, I am going to look at a couple of websites. I want to know if they discuss Python or podcasts or neither?

Let us start by looking at my site: Analytics4All.org

I imported requests and BeautifulSoup, then ran Request.get() on my website to pull the HTML code.

import requests as re
from bs4 import BeautifulSoup

r = re.get("https://analytics4all.org")
print(r.text)

If you are not sure how to do this, check out my earlier tutorials on this topic here:

Intro to Requests

Requests and BeautifulSoup

Now let us search the HTML for our keywords

Using .find() command, Python returns the first location in the HTML of the keywords. Notice podcast returned -1. That means podcasts are not mentioned in my site. But Python is listed at 35595. So I can label this site as mentioning Python, but not podcasts

r.text.find('Python')
r.text.find('podcast')

Let’s try another site, and use BeautifulSoup to search the code

In this example, we will look at iHeartRadio’s website: iheart.com

Using the same .find() from the last time we see that Podcasts are mentioned in this website

r1 = re.get("https://www.iheart.com/")
soup1 = BeautifulSoup(r1.content, 'html.parser')
print(soup1.prettify())  

r1.text.find('Podcasts')

Using BeautifulSoup we can limit our search to a more targeted element of the website, which is useful in reducing false positives, as sometimes you will find some weird stuff buried in the HTML that is honestly irrelevant to the website.

Above we just pulled the title from the website and looked for Podcasts. And we found it.

a = soup1.title.string
print(a)
str(a)
print(a.find('Podcasts'))

Finally, let us inspect the login page for Macy’s

Notice we are getting Access Denied when searching this site. Unfortunately Requests doesn’t work in all case. I will be explaining how to get around this problem in my lessons on Selenium.

r1 = re.get("https://www.macys.com/account/signin")
soup1 = BeautifulSoup(r1.content, 'html.parser')
print(soup1.prettify())    

But for now, just know Requests does work for most websites, and using it makes a simple way to automate scanning websites for keywords that can be used to categorize the site, and in my case, find bad operators and shut them down.

Python: Webscraping using BeautifulSoup and Requests

I covered an introduction to webscraping with Requests in an earlier post. You can check it out here: Requests

As a quick refresher, the requests module allows you call on a website through Python and retrieve the HTML behind the website. In this lesson we are going to add on to this functionality by adding the module BeautifulSoup.

BeautifulSoup

BeautifulSoup provides an useful HTML parser that makes it much easier to work with the HTML results from Requests. Let’s start by importing our libraries we will need for this lesson

Next we are going to use requests to call on my website.
We will then pass the HTML code to BeautifulSoup

The syntax is BeautifulSoup(HTML, ‘html.parser’)

The HTML I am sending to BeautifulSoup comes from my request.get() call. In the last lesson, I used r.text to print out the HTML to view, here I am passing r.content to BeautifulSoup and printing out the results.

Note I am also using the soup.prettify() command to ensure my printout is easier to read for humans

BeautifulSoup makes parsing the HTML code easier. Below I am asking to see soup.title – this returns the HTML code with the “title” markup.

To take it even another step, we can add soup.title.string to just get the string without the markup tags

soup.get_text() returns all the text in the HTML code without the markups or other code

In HTML ‘a’ and ‘href’ signify a link

We can use that to build a for loop that reads all the links on the webpage.

SQL: Drop, Delete, Truncate commands

When it comes to deleting data from a SQL server, you have 3 main options: Drop, Delete, and Truncate

Drop

The Drop command completely deletes a table from the database. Let’s take the table Employee

When I run the following code: Drop Table Employee;

You can see the entire table has been dropped from the database, data and all. There is no way to recover data from a dropped table, short of having a back up

Delete

Delete removes data from the table, not the table its. Also, Delete can be used in conjunction with a Where clause to choose exactly which data to delete

You see that only the row with ‘Chris’ was deleted

Without the Where clause, all data is deleted, however the table remains

Truncate

Truncate acts just like Delete but you can’t add a Where clause, it simply deletes all data from a table

Python: Create a Word Cloud

Word Clouds are a simple way of visualizing word frequency in a corpus of text. Word Clouds typically work by displaying frequently used words in a text corpus, with the most frequent words appearing in larger text.

Here is the data file I will be using in this example if you want to follow along:

As far as libraries go, you will need pandas, matplotlib, os, and wordcloud. If you are using the Anaconda python distribution you should have all the libraries but wordcloud. You can install it using PIP or Conda install.

Lets start by loading the data

import pandas as pd
import matplotlib.pyplot as plt
from wordcloud import WordCloud
import os

#Set working directory
os.chdir('C:\\Users\\blars\\Documents')

#Import CSV
df = pd.read_csv("movies.csv")

#First look at the Data
df.head()

** Note: if you are using Jupyter notebooks to run this, add %matplotlib inline to the end of the import matplotlib line, otherwise you will not be able to see the word cloud

import matplotlib.pyplot as plt %matplotlib inline

We can use df.info() to look a little closer at the data

We have to decide what column we want to build our word cloud from. In this example I will be using the title column, but feel free to use any text column you would like.

Let look at the title column

As you can see, we have 20 movie titles in our data set. Next thing we have to do is merge these 20 rows into one large string

corpus = " ".join(tl for tl in df.title)

The code above is basically a one line for loop. For every Row in the Column df.title, join it with the next row, separating by a space ” “

Now build the word cloud

wordcloud = WordCloud(width=640, height=480, max_words=20).generate(corpus)

You can change the width and height, number of words that will appear. Play around with the numbers, see how it changes your output

Finally, let’s chart it, so we can see the cloud

plt.imshow(wordcloud,interpolation="bilinear")
plt.axis("off")
plt.show()

interpolation = “bilinear” is what lets the words so sideways and up and down

plt.axis(“off”) gets rid or axis markers (see below)

You can also go back to the word cloud and change the background color
wordcloud = WordCloud(width=640, height=480, background_color = 'white', max_words=25).generate(corpus)
plt.imshow(wordcloud,interpolation="bilinear")
plt.axis("off")
plt.show()

SQL: Load (Insert) data into a table

Here are the steps to add data to existing table:

Let’s use the employee table created on the create table page: Create_table

To insert some data into this table, first you need to know what the data types of each of the columns in the table are. In this example I am using my MySQL. An easy way to see column data types is to Right Click on Table name> Send to SQL Editor > Create Statement

The results will appear in the query window. In this table we can see the columns are a integer (INT), string (varchar), DOB (date)

The syntax for inserting data into the table is as follows

insert into <table name> (column1, column2,...)
values (value1, value2,...)

In the example below, we are loading 3 rows in, separating each row by a comma:

If we run this, we can then check on the results with a Select statement

SQL: Create a Table

Create a table in MySql and Sql Server

Create table <table name> (
           <column 1> data type,
           <column 2> data type )

As an example, let’s create an employee table with 3 columns, ID, Name, and Date of Birth

create table Employee(
   Id int,
  Emp_NM varchar(255),
  DOB date);

** remember that unless you set your database to be case sensitive, most SQL databases are not

You may need to refresh your database explore view, but after running this command, you should now have a new table

If you need a reference table for the types of data types available, check out this page: data types

SQL: What is DDL and DML?

You might have heard of DDL and DML and been confused. Are they part of SQL or are they their own language? Actually Yes and No…

If you look at it as a purist computer programmer or an academic, then you will probably going to say they are all different languages, you would not be wrong. However for someone like me, I view them as a subset of commands used in the SQL language.

DDL: Data Definition Language is a set of commands used to create, modify, or drop databases, tables, views, indexes, schemas, and users.

DML: Data Manipulation Language is a set of commands used to add data to a table, move data around, read data, update data, or delete data.

SQL: Common Data Types in MySQL

Here is a table of the most commonly used data types in MySQL

Data TypeDescription
Char()Fixed length string, unused spaces get padded and eat up memory: size 0-255
Varchar()Variable length string, unused spaces don’t use memory: size 0 to 65535
MediumText()A string up to 16,777,215 characters long
LongText()A string up to 4,294,967,295 characters long
INTinteger (whole number, no decimals)
Double(x, d)floating point decimal number, x is size, d is number of places after the decimal
Bool or BooleanBinary choice, 0 = False and 1 = True
DateDate data type “YYYY-MM-DD” (if set to US settings)
DATETIMEdatetime data type “YYYY-MM-DD HH:MM:SS” (if set to US settings)
TIMETime “HH:MM:SS”
YEARyear in for digit representation (ex 1908,1965,2011)

SQL: Add, rename, or delete a column in an existing table

To add a column here is the syntax

alter table <table name>
add <column name> <datatype>;

example: (add a column named location to a table emp_info)

alter table emp_info
add location varchar(255);

To rename a column in sql, here is the syntax

alter table <table name>
rename column <old name> to <new name>;

example: (change name of column location to office)

alter table emp_info
rename column location to office;

To delete or remove a column, use the drop command

alter table <table name>
drop column <column name>;

Example:

alter table emp_info
drop column office;

Return to SQL page:

Python: Rename Pandas Dataframe Columns

Renaming columns is easy using pandas, first lets build a quick dataframe:

import pandas as pd
x= {'Job Title' :['Manager', 'Tech', 'Supervisor'],
    'Employee' : ['Jill', 'Will', 'Phil']}

df = pd.DataFrame(x)

now to rename, we have a few options

df.rename(columns={'oldName1': 'newName1', 'oldName2': 'newName2'}, inplace=True)
#or you can move it to a new dataframe if you want to keep the original intact
df1 = df.rename(columns={'oldName1': 'newName1', 'oldName2': 'newName2'})
#note I left off the inplace=True argument on the second since I didn't want to 
#overwrite the original

Here are a few other ways to do it, each will give you the same results

df2 = df.rename({'Job Title': 'Job_title', 'Employee': 'Emp'}, axis=1)  
df3 = df.rename({'Job Title': 'Job_title', 'Employee': 'Emp'}, axis='columns')
df4 = df.rename(columns={'Job Title': 'Job_title', 'Employee': 'Emp'})   

Last Lesson: Pandas Working with DataFrames

Next Lesson: Python: Working with Rows in DataFrames

Return to: Python for Data Science Course

Follow this link for more Python content: Python