Python: Web scraping using Requests and BeautifulSoup to identify website content

Ben Larson Ph.D.

4 years ago


When learning any new skill, it is always helpful to see a practical application. I currently work as a data scientist in a cybersecurity department. One of our concerns is fake websites set up to look like my company's real website. We search recently registered domain names for names that might mimic our company brand or name, because we need to know whether these are possibly malicious sites looking to steal our employees' or customers' information.

The problem is that we get hundreds of new domains registered each day that fit our search criteria. Instead of looking at each one manually, I created a Python script that scans the list of new domains for keywords. To do this, all you need is Requests and BeautifulSoup.

In this example, I am going to look at a couple of websites. I want to know whether each one discusses Python, podcasts, or neither.

Let us start by looking at my site: Analytics4All.org

I imported requests and BeautifulSoup, then ran requests.get() (called as re.get() below, because of the import alias) on my website to pull the HTML code.

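# Note: the alias "re" is also the name of Python's built-in regex module,
# so avoid "import re" in the same script
import requests as re
from bs4 import BeautifulSoup

# Download the page; r.text holds the HTML as a string
r = re.get("https://analytics4all.org")
print(r.text)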

If you are not sure how to do this, check out my earlier tutorials on this topic here:

Intro to Requests

Requests and BeautifulSoup

Now let us search the HTML for our keywords

The .find() string method returns the index of the first occurrence of a keyword in the HTML, or -1 if the keyword is not found. Notice that 'podcast' returned -1, which means podcasts are not mentioned on my site. Python, however, was found at position 35595. So I can label this site as mentioning Python, but not podcasts. Keep in mind that .find() is case-sensitive.

# .find() returns the index of the first match, or -1 if there is none
print(r.text.find('Python'))
print(r.text.find('podcast'))
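Because .find() is case-sensitive, a page that only says "PYTHON" or "python" would be missed by a search for "Python". Here is a minimal sketch of a case-insensitive check for several keywords at once. The helper name find_keywords and the sample HTML are made up for illustration; this is not the script I use at work.

```python
def find_keywords(html, keywords):
    """Return {keyword: index of first match in the lowercased HTML, or -1}."""
    lowered = html.lower()
    return {kw: lowered.find(kw.lower()) for kw in keywords}


# Made-up HTML used only for illustration; not the real page source.
sample_html = "<html><body><h1>Learn PYTHON here</h1></body></html>"
hits = find_keywords(sample_html, ["Python", "podcast"])
print(hits)
```

Any value other than -1 means the keyword appears somewhere in the page, whatever its capitalization.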

Let’s try another site, and use BeautifulSoup to search the code

In this example, we will look at iHeartRadio’s website: iheart.com

Using the same .find() method as last time, we see that Podcasts are mentioned on this website.

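r1 = re.get("https://www.iheart.com/")
# Parse the raw response bytes into a searchable BeautifulSoup object
soup1 = BeautifulSoup(r1.content, 'html.parser')
print(soup1.prettify())

# A result other than -1 means "Podcasts" appears somewhere in the HTML
print(r1.text.find('Podcasts'))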

Using BeautifulSoup, we can limit our search to a specific element of the page. This helps reduce false positives, because the raw HTML often contains text (scripts, metadata, and other weird stuff) that is honestly irrelevant to what the website is about.

Here we pull just the title from the website and look for Podcasts in it. And we find it.

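# Pull the text of the <title> tag and convert it to a plain Python string
a = str(soup1.title.string)
print(a)
# Any result other than -1 means the title contains "Podcasts"
print(a.find('Podcasts'))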
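One thing to watch for: soup.title is None when a page has no title tag, so calling .string on it raises an error. A minimal sketch of a safer title check follows. The function name title_mentions and the sample HTML strings are hypothetical, and the example parses a string directly so it runs without a network connection.

```python
from bs4 import BeautifulSoup


def title_mentions(html, keyword):
    """Return True if the page <title> contains keyword (case-insensitive)."""
    soup = BeautifulSoup(html, "html.parser")
    # soup.title is None when the page has no <title> tag
    if soup.title is None:
        return False
    title_text = soup.title.get_text()
    return keyword.lower() in title_text.lower()


# Made-up pages for illustration only.
with_title = "<html><head><title>Music, Radio and Podcasts</title></head><body></body></html>"
without_title = "<html><body>Podcasts are only mentioned in the body</body></html>"

print(title_mentions(with_title, "Podcasts"))
print(title_mentions(without_title, "Podcasts"))
```

The second page mentions podcasts only in the body, so a title-only search does not flag it. That is the trade-off of a targeted search: fewer false positives, but you may want to check more than one element.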

Finally, let us inspect the login page for Macy’s

Notice that we get Access Denied when requesting this page. Unfortunately, Requests doesn’t work in all cases. I will explain how to get around this problem in my lessons on Selenium.

r1 = re.get("https://www.macys.com/account/signin")
soup1 = BeautifulSoup(r1.content, 'html.parser')
print(soup1.prettify())
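When scanning many sites automatically, it helps to flag blocked responses so they are not mistaken for pages that simply lack the keywords. The sketch below is only a rough heuristic: the helper name looks_blocked, the set of status codes, and the "access denied" phrase are my assumptions, and real block pages vary from site to site.

```python
# Status codes treated as "blocked" here are an assumption, not an exhaustive list.
BLOCKED_STATUS_CODES = {401, 403, 429}


def looks_blocked(status_code, body):
    """Heuristic guess at whether a response is a block page rather than real content."""
    if status_code in BLOCKED_STATUS_CODES:
        return True
    return "access denied" in body.lower()


# With a live response this would be called as:
#   r = requests.get(url, timeout=10)
#   looks_blocked(r.status_code, r.text)
# Made-up values are used below so the example runs offline.
print(looks_blocked(403, "<h1>Access Denied</h1>"))
print(looks_blocked(200, "<title>Welcome</title>"))
```

Sites flagged this way can be set aside for a second pass with a browser-based tool.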

But for now, just know that Requests does work for most websites. It offers a simple way to automate scanning websites for keywords that can be used to categorize each site and, in my case, to find bad actors and shut them down.
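To show how the pieces fit together, here is a minimal sketch of scanning a list of domains for keywords. It is an illustration under stated assumptions, not my production script: the function names, the domains, and the keywords are all made up, and the page download is passed in as a fetch function so the demonstration can run offline with stand-in pages.

```python
def categorize(html, keywords):
    """Return the keywords that appear in the HTML, ignoring case."""
    lowered = html.lower()
    return [kw for kw in keywords if kw.lower() in lowered]


def scan_domains(domains, keywords, fetch):
    """Fetch each domain and record which keywords it mentions."""
    results = {}
    for domain in domains:
        try:
            html = fetch(domain)
        except Exception as exc:  # network errors, timeouts, blocked requests
            results[domain] = f"error: {exc}"
            continue
        results[domain] = categorize(html, keywords)
    return results


def fetch_with_requests(domain):
    """Live fetcher using Requests; not called in the offline demo below."""
    import requests

    response = requests.get(f"https://{domain}", timeout=10)
    response.raise_for_status()
    return response.text


# Offline demonstration with a stand-in fetcher and made-up pages.
fake_pages = {
    "example-login.test": "<title>Sign in to Example Corp</title>",
    "unrelated.test": "<title>Gardening tips</title>",
}
report = scan_domains(
    list(fake_pages),
    ["Example Corp", "sign in"],
    lambda domain: fake_pages[domain],
)
print(report)
```

To scan real sites, pass fetch_with_requests as the fetch argument instead of the stand-in. Domains that fail to load are recorded with an error message rather than stopping the whole scan.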
