Python: Webscraping using Requests and BeautifulSoup to identify website content

When learning any new skill, it is always helpful to see a practical application. I currently work as a data scientist in a cybersecurity department. One of our concerns is fake websites set up to look like my company's real website. We search recently registered domain names for names that might mimic our company brand or name, because we need to know whether these are malicious sites looking to steal our employee or customer information.

The problem is that hundreds of new domains matching our search criteria are registered every day. Instead of looking at each one manually, I created a Python script that scans the list of new domains for keywords. To do this, all you need is Requests and BeautifulSoup.

In this example, I am going to look at a couple of websites. I want to know whether they discuss Python, podcasts, or neither.

Let us start by looking at my site:

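I imported requests and BeautifulSoup, then ran requests.get() on my website to pull the HTML code. Note that the code imports requests under the alias re, so the call is written re.get(). That alias has nothing to do with Python's built-in re (regular expression) module.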

import requests as re
from bs4 import BeautifulSoup

r = re.get("")

If you are not sure how to do this, check out my earlier tutorials on this topic here:

Intro to Requests

Requests and BeautifulSoup

Now let us search the HTML for our keywords.

Using the .find() string method, Python returns the position of the first occurrence of each keyword in the HTML. Notice that podcast returned -1, which means podcasts are not mentioned on my site. Python, however, is found at position 35595. So I can label this site as mentioning Python, but not podcasts.
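Here is a minimal sketch of that keyword check wrapped in a small function. The function name keyword_positions and the sample HTML are made up for illustration, and lowercasing both sides to make the search case-insensitive is my own assumption, not something the original script necessarily did.

```python
def keyword_positions(html, keywords):
    """Return {keyword: index of first match, or -1 if absent}."""
    lowered = html.lower()  # assumption: we want a case-insensitive search
    return {kw: lowered.find(kw.lower()) for kw in keywords}


# Hypothetical page source standing in for r.text
sample_html = "<html><body><h1>Learn Python</h1><p>Data tips</p></body></html>"

positions = keyword_positions(sample_html, ["python", "podcast"])
print(positions)

# Keep only the keywords that were actually found
mentioned = [kw for kw, pos in positions.items() if pos != -1]
print(mentioned)
```

Any keyword whose position is -1 was not found, which is the same test used above to decide that my site mentions Python but not podcasts.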


Let’s try another site, and use BeautifulSoup to search the code

In this example, we will look at iHeartRadio’s website:

Using the same .find() approach as before, we see that podcasts are mentioned on this website.

r1 = re.get("")
soup1 = BeautifulSoup(r1.content, 'html.parser')


Using BeautifulSoup, we can limit our search to a more targeted element of the page. This helps reduce false positives, because the raw HTML often contains text buried in scripts or markup that has nothing to do with what the site is actually about.

Here we pull just the title from the website and look for Podcasts in it. And we find it.

a = soup1.title.string
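As a rough sketch of the title check, the snippet below parses a small piece of HTML and searches only the title text. The sample HTML is invented for illustration (it is not the real iHeartRadio page), and the guard for a missing title tag is my own addition.

```python
from bs4 import BeautifulSoup

# Hypothetical HTML standing in for r1.content
sample_html = """
<html>
  <head><title>Listen to Music, Radio and Podcasts</title></head>
  <body><p>Welcome</p></body>
</html>
"""

soup = BeautifulSoup(sample_html, "html.parser")

# soup.title is None when the page has no <title> tag, so guard against it
title = soup.title.string if soup.title and soup.title.string else ""

# Same .find() test as before, but limited to the title text
has_podcasts = title.find("Podcasts") != -1
print(title)
print(has_podcasts)
```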

Finally, let us inspect the login page for Macy’s

Notice that we get Access Denied when requesting this site. Unfortunately, Requests doesn’t work in all cases; some sites refuse requests that do not come from a real browser. I will explain how to get around this problem in my lessons on Selenium.

r1 = re.get("")
soup1 = BeautifulSoup(r1.content, 'html.parser')

But for now, just know that Requests works for most websites. It gives you a simple way to automate scanning websites for keywords that can be used to categorize each site and, in my case, to find bad operators and shut them down.
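To give an idea of how the pieces could fit together, here is a sketch of a scanner that loops over a list of domains. It is not my production script: the function names fetch_html and scan_domains, the 10 second timeout, the example domains and the keyword list are all assumptions made for illustration. Sites that cannot be reached, or that answer with something other than status 200 (such as the Access Denied case), are recorded as None so they can be reviewed another way.

```python
import requests


def fetch_html(url, timeout=10):
    """Return the page HTML, or None if the request fails or is refused."""
    try:
        resp = requests.get(url, timeout=timeout)
    except requests.RequestException:
        return None  # DNS failure, timeout, connection refused, etc.
    if resp.status_code != 200:
        return None  # e.g. 403 Access Denied or 404 Not Found
    return resp.text


def scan_domains(urls, keywords, fetch=fetch_html):
    """Map each URL to the keywords found in its HTML (None if unreachable)."""
    results = {}
    for url in urls:
        html = fetch(url)
        if html is None:
            results[url] = None
            continue
        lowered = html.lower()
        results[url] = [kw for kw in keywords if kw.lower() in lowered]
    return results


# Offline demonstration: canned pages stand in for live requests
canned_pages = {
    "http://brand-login.example": "<title>Login</title><p>Enter your password</p>",
    "http://blocked.example": None,  # simulates a site that denies access
}
report = scan_domains(canned_pages, ["login", "password", "podcast"],
                      fetch=canned_pages.get)
print(report)
```

Passing the fetch function in as a parameter keeps the demonstration offline. Leaving it at its default makes scan_domains call requests.get() on each URL.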

Python: Webscraping using BeautifulSoup and Requests

I covered an introduction to webscraping with Requests in an earlier post. You can check it out here: Requests

As a quick refresher, the requests module allows you to call a website through Python and retrieve the HTML behind it. In this lesson we are going to build on this functionality by adding the BeautifulSoup module.


BeautifulSoup provides a useful HTML parser that makes it much easier to work with the HTML results from Requests. Let’s start by importing the libraries we will need for this lesson.

Next we are going to use requests to call on my website.
We will then pass the HTML code to BeautifulSoup

The syntax is BeautifulSoup(HTML, 'html.parser')

The HTML I am sending to BeautifulSoup comes from my requests.get() call. In the last lesson, I used r.text to print out the HTML to view; here I am passing r.content to BeautifulSoup and printing out the results.

Note that I am also using the soup.prettify() command to make my printout easier for humans to read.

BeautifulSoup makes parsing the HTML code easier. Below I am asking to see soup.title – this returns the HTML code with the “title” markup.

To take it a step further, we can use soup.title.string to get just the string without the markup tags.

soup.get_text() returns all the text in the HTML code without the markup or other code.

In HTML, an 'a' tag with an 'href' attribute signifies a link.

We can use that to build a for loop that reads all the links on the webpage.
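The calls described above can be sketched together as follows. The HTML here is a made-up stand-in for r.content, so the titles and links are illustrative only.

```python
from bs4 import BeautifulSoup

# Hypothetical HTML standing in for r.content
sample_html = """
<html>
  <head><title>My Analytics Blog</title></head>
  <body>
    <p>Welcome to the site.</p>
    <a href="/python">Python lessons</a>
    <a href="/sql">SQL lessons</a>
    <a>Anchor without a link</a>
  </body>
</html>
"""

soup = BeautifulSoup(sample_html, "html.parser")

print(soup.title)          # the <title> tag, markup included
print(soup.title.string)   # only the text inside the tag
print(soup.get_text())     # all text on the page, tags stripped

# href=True skips <a> tags that have no href attribute
links = [a["href"] for a in soup.find_all("a", href=True)]
for link in links:
    print(link)
```

Filtering with href=True is a small design choice on my part. Without it, indexing a["href"] on an anchor that has no href would raise a KeyError.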

Python: Create a Word Cloud

Word Clouds are a simple way of visualizing word frequency in a corpus of text. They typically display the most frequently used words in the corpus, with more frequent words appearing in larger text.
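To make the idea of word frequency concrete, here is a tiny sketch that counts words with the standard library. This is only an illustration of the counting idea on an invented sentence. The wordcloud library does its own tokenizing and filtering, so its counts will not necessarily match a plain split like this one.

```python
from collections import Counter

# Hypothetical text standing in for a real corpus
corpus = "the data the cloud the data python"

# Count how often each word occurs
counts = Counter(corpus.split())

# most_common() sorts from most to least frequent
for word, n in counts.most_common(3):
    print(word, n)
```

In a word cloud, the words at the top of this list are the ones that would be drawn largest.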

Here is the data file I will be using in this example if you want to follow along:

As far as libraries go, you will need pandas, matplotlib, os, and wordcloud. If you are using the Anaconda Python distribution, you should have all of these except wordcloud, which you can install using pip or conda.

Let's start by loading the data

import pandas as pd
import matplotlib.pyplot as plt
from wordcloud import WordCloud
import os

#Set working directory

#Import CSV
df = pd.read_csv("movies.csv")

#First look at the Data

** Note: if you are using Jupyter notebooks to run this, add %matplotlib inline on its own line right after the import matplotlib line, otherwise you may not be able to see the word cloud

import matplotlib.pyplot as plt
%matplotlib inline

We can use to look a little closer at the data

We have to decide what column we want to build our word cloud from. In this example I will be using the title column, but feel free to use any text column you would like.

Let's look at the title column

As you can see, we have 20 movie titles in our data set. The next thing we have to do is merge these 20 rows into one large string

corpus = " ".join(tl for tl in df.title)

The code above is basically a one-line for loop (a generator expression). It takes every row in the column df.title and joins them into a single string, separating each title from the next with a space " "
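To show what the one-liner is doing, here it is next to an equivalent explicit loop. The three titles are invented stand-ins for the df.title column.

```python
# Hypothetical titles standing in for the df.title column
titles = ["Star Wars", "Jaws", "Toy Story"]

# One-line version, as used in the post
corpus = " ".join(tl for tl in titles)

# Equivalent explicit loop
parts = []
for tl in titles:
    parts.append(tl)
corpus_loop = " ".join(parts)

print(corpus)
```

One caveat: join() only accepts strings, so a column with missing values would raise a TypeError. Converting with astype(str) or dropping the missing rows first is one way around that.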

Now build the word cloud

wordcloud = WordCloud(width=640, height=480, max_words=20).generate(corpus)

You can change the width and height, as well as the number of words that will appear. Play around with the numbers and see how they change your output.

Finally, let’s chart it, so we can see the cloud


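interpolation = "bilinear" smooths the rendered image so the words look less pixelated (word orientation is controlled by WordCloud's prefer_horizontal parameter, not by this setting)

plt.axis("off") gets rid of the axis markers (see below)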

You can also go back to the word cloud and change the background color
wordcloud = WordCloud(width=640, height=480, background_color = 'white', max_words=25).generate(corpus)

Python: Rename Pandas Dataframe Columns

Renaming columns is easy using pandas. First, let's build a quick dataframe:

import pandas as pd
x= {'Job Title' :['Manager', 'Tech', 'Supervisor'],
    'Employee' : ['Jill', 'Will', 'Phil']}

df = pd.DataFrame(x)

Now, to rename, we have a few options (oldName1 and oldName2 are placeholders for your own column names)

df.rename(columns={'oldName1': 'newName1', 'oldName2': 'newName2'}, inplace=True)
# or you can assign the result to a new dataframe if you want to keep the original intact
df1 = df.rename(columns={'oldName1': 'newName1', 'oldName2': 'newName2'})
# note I left off the inplace=True argument on the second call since I didn't want to
# overwrite the original

Here are a few other ways to do it, each will give you the same results

df2 = df.rename({'Job Title': 'Job_title', 'Employee': 'Emp'}, axis=1)  
df3 = df.rename({'Job Title': 'Job_title', 'Employee': 'Emp'}, axis='columns')
df4 = df.rename(columns={'Job Title': 'Job_title', 'Employee': 'Emp'})   
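Here is a short sketch that checks the result of a rename on the dataframe built above. The errors='raise' part is an extra I am adding for illustration: by default rename() quietly skips names it cannot find, so a typo in a column name can go unnoticed.

```python
import pandas as pd

df = pd.DataFrame({'Job Title': ['Manager', 'Tech', 'Supervisor'],
                   'Employee': ['Jill', 'Will', 'Phil']})

renamed = df.rename(columns={'Job Title': 'Job_title', 'Employee': 'Emp'})
print(list(renamed.columns))
print(list(df.columns))  # the original is unchanged without inplace=True

# By default, names that do not exist are silently ignored.
# errors='raise' turns a misspelled column name into a KeyError instead.
try:
    df.rename(columns={'Job Titel': 'Job_title'}, errors='raise')
except KeyError as exc:
    print("Rename failed:", exc)
```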


Python: Convert Datetime to Date using Pandas

To convert a datetime column in a pandas dataframe to a date, convert it with pd.to_datetime() and then take the .dt.date accessor:

df['column'] = pd.to_datetime(df['column']).dt.date

To demonstrate, first let’s build a dataframe

import pandas as pd
df = pd.DataFrame({'Job_Start': ['Demolition','Construction', 'Cleanup'],
                   'time': ['2022-05-20 08:07:22', '2022-05-27 07:34:01', 
                   '2022-06-01 09:12:11']})

Now let's convert the "time" column to date instead of datetime

df['time'] = pd.to_datetime(df['time']).dt.date
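The sketch below shows two ways of dropping the time portion, using the same sample data. The new column names date and day are only for illustration, and which option is better depends on what you do next with the column.

```python
import pandas as pd

df = pd.DataFrame({'Job_Start': ['Demolition', 'Construction', 'Cleanup'],
                   'time': ['2022-05-20 08:07:22', '2022-05-27 07:34:01',
                            '2022-06-01 09:12:11']})

stamps = pd.to_datetime(df['time'])   # strings become a datetime64 column

# Option 1: plain Python date objects (the column dtype becomes object)
df['date'] = stamps.dt.date

# Option 2: keep a datetime64 dtype but reset the time to midnight
df['day'] = stamps.dt.normalize()

print(df[['date', 'day']])
print(df.dtypes)
```

Option 1 gives the cleanest printout. Option 2 keeps the column in a datetime type, which can be handier if you still want to use the .dt accessor or date arithmetic afterwards.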

Python: Read all files in a folder

The os.listdir() command will easily give you a list of all files in a folder.

So for this exercise I created a folder and threw a few files in it.

Using the following code, you can iterate through the file list

import os
for files in os.listdir("C:/Users/blars/Documents/test_folder"):
    print(files)

Now if I wanted to read the files, I could use the Pandas command pd.read_excel to read each file in the loop

***Note, I made this folder with only Excel files on purpose for ease of demonstration. You could do this with multiple file types in a folder, it would however require some conditional logic to handle the different file types

To read all the Excel files in the folder:

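import pandas as pd
import os

folder = "C:/Users/blars/Documents/test_folder"
for files in os.listdir(folder):
    # os.listdir() returns bare file names, so join each one to the folder path
    file = pd.read_excel(os.path.join(folder, files))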
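For a folder with mixed file types, the conditional logic mentioned in the note could be sketched like this. The READERS mapping and the load_folder name are hypothetical, and I am assuming only Excel and CSV files matter. Anything else is skipped, and each dataframe is kept in a dictionary so it is not overwritten on the next pass of the loop. Reading .xlsx files also assumes an Excel engine such as openpyxl is installed.

```python
from pathlib import Path

import pandas as pd

# Hypothetical mapping of file extension to the pandas reader for it
READERS = {
    ".xlsx": pd.read_excel,
    ".xls": pd.read_excel,
    ".csv": pd.read_csv,
}


def load_folder(folder):
    """Return {file name: DataFrame} for every supported file in folder."""
    frames = {}
    for path in sorted(Path(folder).iterdir()):
        reader = READERS.get(path.suffix.lower())
        if path.is_file() and reader is not None:
            frames[path.name] = reader(path)
    return frames
```

You would call it as frames = load_folder("C:/Users/blars/Documents/test_folder") and then look up a single file's data by its file name.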

Python Web Scraping: Get http code easily with the Requests module

The Requests module for Python makes capturing and working with HTML code from any website easy.

Requests comes installed with many Python distributions. You can test whether it is installed on your machine by running the command: import requests

If that command fails, then you’ll need to install the module using conda or pip

import requests
t = requests.get('')
print(t.text)

As you can see, using just 3 lines of code you can return the HTML from any website

You can see that all the text found on the web page is found in the HTML code, so parsing through the text can allow you to scrape the information off of a website

Requests has plenty more features. Here are a couple I use commonly:

t.status_code returns the status of your get request. If all goes well, it will return 200; otherwise you will get an error code like 404


You can also extract your results into json
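Here is a small sketch that puts the status code and JSON features together. The summarize function is a hypothetical helper of my own. To keep the example offline, it builds an empty Response object by hand with a 404 status instead of calling a real URL, so treat that part purely as a demonstration trick.

```python
import requests


def summarize(response):
    """Return the status code, the ok flag and the parsed JSON (or None)."""
    try:
        data = response.json()   # parses the body as JSON
    except ValueError:
        data = None              # body was HTML, empty, or otherwise not JSON
    return {"status": response.status_code, "ok": response.ok, "json": data}


# Offline demonstration: an empty Response built by hand instead of a live call
resp = requests.Response()
resp.status_code = 404
print(summarize(resp))
```

With a live request you would pass the result of requests.get() straight in, for example summarize(t). Note that .json() only works when the site actually returns JSON, which is why the sketch falls back to None for ordinary HTML pages.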


Python: Selenium – Setting Chrome browser size

There are times when running automation on a web browser that you will want to adjust the window size of the browser. The most obvious reason I can think of is that some websites (mine included) display differently based on the window size.

For example: Full Size

Reduced size

Notice that in the reduced window, my menu list is replaced by an accordion button. For the purposes of automation and webscraping, the accordion button is actually easier to navigate than my multi-layered menu.

The code for opening the browser in full screen mode is below. Note the line --start-maximized

To open the window at a smaller size, try --window-size=width,height. Play around with the values to find ones that work for your screen.
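A sketch of how these switches can be passed to Chrome is shown below. The helper names window_args and open_chrome are hypothetical. I am also assuming a Selenium version where webdriver.Chrome() accepts options= and can locate a matching chromedriver on its own or via your PATH, which may not match your setup.

```python
def window_args(width=None, height=None):
    """Return the Chrome command-line switches for the requested window size."""
    if width is None or height is None:
        return ["--start-maximized"]
    return [f"--window-size={width},{height}"]


def open_chrome(width=None, height=None):
    """Start Chrome at the given size (maximized when no size is given)."""
    # Imported here so the helper above can be used without Selenium installed
    from selenium import webdriver

    options = webdriver.ChromeOptions()
    for arg in window_args(width, height):
        options.add_argument(arg)
    # Assumes Selenium can locate a matching chromedriver on this machine
    return webdriver.Chrome(options=options)
```

For example, dr = open_chrome() would open a maximized window, while dr = open_chrome(800, 600) would open a smaller one.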

Python Web Scraping: Using Selenium to automate web

This is a follow-up to how to connect to Chrome using Selenium. If you do not know how to get to a website on Chrome using Selenium, go here

To refresh, here is the code we used to open up a web page (in this case Wikipedia’s home page)

If you run this code, you should find yourself on the home page for Wikipedia

Okay, so now let's learn how to interact with the page. The first thing I am going to do is select the English language version of the page. There are a few ways to go about this, but one of the easier approaches is to look at the HTML code that creates the page and use XPaths or titles to find the object you are looking for.

Right click on the link for English and click inspect from the drop down.

If you get a body link first, you might need to right click and hit inspect again

To check if you have the right element, hover your mouse over it, and it will be highlighted on the webpage

Once you have the right element, right click on it and go to Copy > Copy XPath

Choose XPath, not full XPath; it makes for easier coding. Your XPath should look something like this: //*[@id="js-link-box-en"]/strong
* When you try this, your XPath may look different. As websites are constantly updated, many of the XPaths change as well. Go with the one you find when you inspect the HTML code yourself

Now we are going to use Selenium to "find" the element we want. The code is dr.find_element_by_xpath('//*[@id="js-link-box-en"]/strong')
* Note the use of single quotes around the XPath. It is better to use them, as many XPaths contain double quotes

Once you have run that code, Selenium knows what element you are looking at, and you can interact with it. Let’s "click" the link

Note something I did in the code: I added link = before my find element command. This assigns the element to a variable. I can now use the click() method of the element that Selenium returned to click on the English link

I could have just done this: dr.find_element_by_xpath('//*[@id="js-link-box-en"]/strong').click()

But assigning the variable a) makes for cleaner code and b) lets the link be reused by my code later. Remember, it is a law of programming that you will always have to go back and fix something you haven’t seen in 6 months, so make the code as clean as possible to make future you less likely to develop a drinking problem due to having to fix poorly written code.

If you run the code above, you will move to the English home page

Let's try one more thing: let's type a search into the search bar.

Right click > inspect the search bar, then right click>copy>copy xpath the selection in the HTML code

Now that you have the XPath, let's use the find_element_by_xpath code and a new command, send_keys(), to input characters into the search box

Finally, right click on the magnifying glass>inspect>copy>copy Xpath and let us click on it to finish our search. (remember to hover over to make sure you have the right link)

Now you should find yourself on the Data Science page of Wikipedia
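Pulling these steps together, here is a sketch of the whole click, type and search sequence. A few assumptions to flag: the search box and search button XPaths are hypothetical placeholders (copy your own from the inspector), the function name search_wikipedia is mine, and I am using the generic find_element(by, value) call because newer Selenium releases no longer include the find_element_by_xpath shortcut. The locator string "xpath" is the same value Selenium exposes as By.XPATH.

```python
XPATH = "xpath"  # the same string Selenium exposes as By.XPATH

# XPaths copied from the browser inspector; they may well be out of date
ENGLISH_LINK = '//*[@id="js-link-box-en"]/strong'
SEARCH_BOX = '//*[@id="searchInput"]'        # hypothetical placeholder
SEARCH_BUTTON = '//*[@id="searchButton"]'    # hypothetical placeholder


def search_wikipedia(driver, term):
    """From the Wikipedia portal page, open English Wikipedia and search."""
    # Find the English link and click it
    link = driver.find_element(XPATH, ENGLISH_LINK)
    link.click()

    # Type the search term into the search box
    box = driver.find_element(XPATH, SEARCH_BOX)
    box.send_keys(term)

    # Click the magnifying glass to run the search
    driver.find_element(XPATH, SEARCH_BUTTON).click()
```

With a browser already open on the Wikipedia home page, you would call search_wikipedia(dr, "Data Science"). On a slow connection you may need to wait for the new page to load between steps.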

Now remember, the XPaths I have on this page will likely be out of date by the time you try this, so make sure to inspect the elements and get the correct XPaths for this to work for you.

Python Web Scraping / Automation: Connecting to Chrome with Selenium

Selenium is a Python package that allows you to control web browsers through Python. In this tutorial (and the following tutorials), we will be connecting to Google's Chrome browser, although Selenium works with other browsers as well.

First you will need to install Selenium. You can use one of the following commands, depending on your Python distribution

c:\> pip install selenium

c:\> conda install selenium

If you are on a work computer or dealing with a restrictive VPN, the offline install option may help you: Selenium_Install_Offline

Next you need to download the driver that lets you manage Chrome through Python.

Start by determining what version of Chrome you have on your computer

Click on the three dots in the right corner of your Chrome browser, select Help> About Google Chrome

Go to the ChromeDriver download site and download the file that matches your Chrome version. (Note: this is something you will need to do every time Chrome is updated, so get used to it.)

Open up the zip file you downloaded; you will find a file called chromedriver.exe

Put it somewhere you can find it, then use the following code to let Python know where it is.

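from selenium import webdriver
options = webdriver.ChromeOptions()
# Point Selenium at the chromedriver.exe you downloaded (Selenium 3 style call)
dr = webdriver.Chrome('C:/Users/larsobe/Desktop/chromedriver.exe',chrome_options=options)
# In Selenium 4 the path is passed as service=Service(path) and the options as options=options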

Now to see if this works, use the following line (you can try another website if you choose)

Note the message Chrome is being controlled by automated test software.

You are now running a web browser via Python.