Python Web Scraping: Get http code easily with the Requests module

The Requests module for Python makes capturing and working with HTML code from any website.

Requests comes installed in many of the Python distributions, you can test if it is installed on yours machine by running the command: import requests

If that command fails, then you’ll need to install the module using Conda or Pip

import requests
t = requests.get('http://aiwithai.com')
print(t.text)

As you can see, using just 3 lines of code you can return the HTML from any website

You can see that all the text found on the web page is found in the HTML code, so parsing through the text can allow you to scrape the information off of a website

Requests has plenty more features, here are couple I use commonly

t.status_code == returns the status of your get request. If all goes well, it will return 200, otherwise you will get error codes like 404

t.headers

You can also extract your results into json

t.json()

Python Web Scraping: Using Selenium to automate web

This is follow up to how to connect to Chrome using Selenium. If you do not know how to get to a website on Chrome using Selenium, go here

To refresh. here is the code we used to open up a web page (in this case Wikipedia’s home page)

If you run this code, you should find yourself on the home page for Wikipedia

Okay, so now lets learn how to interact with page, the first thing I am going to do is to select the English language version of the page. There are a few ways go about this, but one of the easier approaches is to look at the HTML code that creates the page and to use xpaths or titles to find the object you are looking at.

Right click on the link for English and click inspect from the drop down.

If you get a body link first, you might need to right click and hit inspect again

To check if you have the right element, hover your mouse over it, and it will be highlighted on the webpage

Once you have the right element, right click on it, go to copy>Copy Xpath

Chose Xpath, not full Xpath, it makes for easier coding. You XPath should look something like this: //*[@id=”js-link-box-en”]/strong * When you go to try this, your XPath may look different. As websites are constantly updated, many of the Xpaths get updated as well. Go with the one you find when you Inspect the HTML code yourself

Now we are going to use selenium to “Find” the element we want. The code is dr.find_element_by_XPath(‘//*[@id=”js-link-box-en”]/strong’) *Note the use of single quote around the XPath, it is better to use them as many XPaths will contain double quotes

Once you have run that code, Selenium knows what element you are looking at, you can interact with it now. Let’s “click” the link

Note something i did in the code, I added a link= before my find element command. This assigned the element now to a variable. I can now use the “click()” method the variable inherited from the selenium.webdriver object to click on the English link

I could have just done this: dr.find_element_by_xpath(‘//*[@id=”js-link-box-en”]/strong’).click()

But by assigning the variable it is a) cleaner code and b) the link can be reused by my code later. Remember, it is a law of programming that you will always have to go back and fix something you haven’t seen in 6 months, so make the code as clean as possible to make future you less likely to develop a drinking problem due to having to fix poorly written code.

If you run the code above, you will move to the home English page

Lets try one more thing, lets typing a search into the search bar:

Right click > inspect the search bar, then right click>copy>copy xpath the selection in the HTML code

Now that you have the XPath, lets use the find_element_by_xpath code and a new command, send_keys() to input characters into the search box

Finally, right click on the magnifying glass>inspect>copy>copy Xpath and let us click on it to finish our search. (remember to hover over to make sure you have the right link)

Now you should find yourself on the Data Science page of Wikipedia

Now remember — the xpaths I have on this page will likely be out of date by the time you try this, so make sure to inspect the elements and get the correct XPaths for this work for you.

Python Web Scraping / Automation: Connecting to Chrome with Selenium

Selenium is a Python package that allows you to control web browsers through Python. In this tutorial (and the following tutorials), we will be connecting to Googles Chrome browser, Selenium does work with other browsers as well.

First you will need to download Selenium, you can use the following commands depending on your Python distribution

c:\> Pip install selenium

c:\> Conda install selenium

If you are on a work computer or dealing with a restrictive VPN, the offline install option may help you: Selenium_Install_Offline

Next you need to download the driver that let’s you manage Chrome through Python.

Start by determining what version of Chrome you have on your computer

Click on the three dots in the right corner of your Chrome browser, select Help> About Google Chrome

Go to https://chromedriver.chromium.org/downloads to download the file that matches your Chrome version. (note, this is something you will need to do every time Chrome is updated, so get used to it.)

Open up the zipfile you downloaded, you will find a file called chromedriver.exe

Put it somewhere you can find, put in the following code to let Python know where to find it.

from selenium import webdriver
options = webdriver.ChromeOptions()
options.add_argument("--start-maximized")
dr = webdriver.Chrome('C:/Users/larsobe/Desktop/chromedriver.exe',chrome_options=options)

Now to see if this works, use the following line, (you can try another website if you choose)   

Note the message Chrome is being controlled by automated test software.

You are now running a web browser via Python.

Power Query: Pull data from a website into Excel

For this tutorial, you will need to have Power Query installed. If you are running Office 2016, Power Query should already be available. For Excel 2010 and 2013, here is a link to the download: Power Query

Power Query makes pulling data from a website quick and easy. In this example we will be extracting data from the Wikipedia page “List of NCAA Men’s Division I Basketball champions“. Here is a link to the website: Link to Wikipedia Site

If you open up the Wikipedia page and scroll down a bit, you will see a table:  Championship games, by year, showing winners and losers, final scores and venues. This the table we will be extracting into Excel.

powerQuery-web.jpg

  1. Open up a new Excel workbook. Locate the Power Query Tab on the Ribbon Bar
  2. Select Web Page
  3. In the pop up window, copy and paste the URL to the Wikipedia Page
  4.  Click Ok

powerQuery-web1

5. Notice as you click through the list of tables on the left side of the screen, the tables             appear in the preview screen on the right.

6. Select Championship games, by year….

7. Click Load 

powerQuery-web2.jpg

Congratulations. You now have the table from the Web in an Excel sheet.

powerQuery-web3