I covered an introduction to web scraping with Requests in an earlier post. You can check it out here: Requests
As a quick refresher, the requests module allows you to call a website from Python and retrieve the HTML behind it. In this lesson we are going to build on that functionality by adding the BeautifulSoup module.
BeautifulSoup provides a useful HTML parser that makes it much easier to work with the HTML results from Requests. Let’s start by importing the libraries we will need for this lesson: requests and BeautifulSoup (which is imported from the bs4 package).
The syntax is BeautifulSoup(HTML, 'html.parser'). The first argument is the HTML to parse, and the second names the parser to use; 'html.parser' is the one built into Python.
The HTML I am sending to BeautifulSoup comes from my requests.get() call. In the last lesson, I used r.text to print out the HTML for viewing. Here I am passing r.content to BeautifulSoup and printing out the results.
Note that I am also using the soup.prettify() method, which indents the nested tags so the printout is easier for humans to read.
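Here is a minimal sketch of those steps. The make_soup() helper and the sample_html string are my own illustrations, not part of the original lesson; the sample stands in for r.content so the example runs without a network call.

```python
import requests
from bs4 import BeautifulSoup


def make_soup(url):
    """Hypothetical helper: fetch a page and hand its HTML to BeautifulSoup."""
    r = requests.get(url, timeout=10)
    r.raise_for_status()  # stop early on 4xx/5xx responses
    return BeautifulSoup(r.content, "html.parser")


# Assumed stand-in for r.content (bytes), so no real website is needed.
sample_html = (
    b"<html><head><title>Sample Page</title></head>"
    b"<body><p>Hello, soup!</p></body></html>"
)

soup = BeautifulSoup(sample_html, "html.parser")

# prettify() returns the parsed document as an indented string.
pretty = soup.prettify()
print(pretty)
```

To try it against a live page, you would call make_soup() with a URL of your choice and print the prettified result the same way.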
BeautifulSoup makes parsing the HTML code easier. Below I am asking to see soup.title, which returns the page's title element, including its opening and closing “title” tags.
To take it a step further, we can use soup.title.string to get just the text without the markup tags.
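A small sketch of the difference between the two, again using a made-up HTML string (an assumption of mine) in place of a real response:

```python
from bs4 import BeautifulSoup

# Assumed sample document standing in for the HTML returned by requests.
sample_html = (
    "<html><head><title>Sample Page</title></head>"
    "<body><p>Hello</p></body></html>"
)
soup = BeautifulSoup(sample_html, "html.parser")

title_tag = soup.title          # the whole element, tags included
title_text = soup.title.string  # only the text inside the element

print(title_tag)   # <title>Sample Page</title>
print(title_text)  # Sample Page
```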
soup.get_text() returns all the text in the HTML document with the markup tags stripped out.
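As a rough illustration, here is get_text() on a small made-up document (the sample HTML is my assumption). The separator and strip arguments are optional extras that get_text() accepts; I use them here only to tidy the spacing.

```python
from bs4 import BeautifulSoup

# Assumed sample document with a few nested tags.
sample_html = (
    "<html><head><title>Sample Page</title></head>"
    "<body><h1>Heading</h1><p>First <b>bold</b> paragraph.</p></body></html>"
)
soup = BeautifulSoup(sample_html, "html.parser")

# All text in the document, with the tags removed.
text = soup.get_text()
print(text)

# Same text, with each piece stripped and joined by a single space.
spaced_text = soup.get_text(separator=" ", strip=True)
print(spaced_text)
```

Notice that the title text is included too, since get_text() walks the whole document and not just the body.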
In HTML, the ‘a’ tag defines a link, and its ‘href’ attribute holds the URL the link points to.
We can use that to build a for loop that reads all the links on the webpage.
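One way such a loop might look is sketched below. The sample HTML and the links list are assumptions of mine for illustration; find_all('a') collects every 'a' tag and get('href') reads the attribute, returning None when a tag has no href.

```python
from bs4 import BeautifulSoup

# Assumed sample page: two real links and one anchor without an href.
sample_html = """
<html><body>
<a href="https://example.com/one">One</a>
<a href="/two">Two</a>
<a name="anchor-only">No href here</a>
</body></html>
"""
soup = BeautifulSoup(sample_html, "html.parser")

links = []
for a in soup.find_all("a"):
    href = a.get("href")  # None if the tag has no href attribute
    if href:
        links.append(href)
        print(href)
```

Using a.get('href') instead of a['href'] avoids a KeyError on anchors that have no href, which is a design choice of this sketch and not something the lesson prescribes.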