Languages used: Python
Python packages used: requests, BeautifulSoup4, html5lib
There are many reasons why you'd want to scrape a web page. First things first – what is web scraping?
Web scraping: Extracting large amounts of data from websites in order to store or use that information.
Personally, I've used it for a few projects:
- A film showtimes portal that shows you all the films playing across the city, by reading film showtimes from ~10 individual cinema websites
- Checking availability on multiple online retailers' websites of a particular wireless keyboard that I wanted, and sending me an email as soon as it was available (very useful in beating out the WFH computer accessories shortage!)
- Downloading all the listings of cars from AutoTrader.co.uk into an Excel spreadsheet that helped me decide which car to buy (and which year's model, based on when the depreciation curve flattened out enough)
As you can see, it's a handy skill for anyone keen to build useful applications with Python.
That said – if you're looking to do something super simple, e.g. just check if a web page has been updated, there are probably easier ways to do it with tools people have built already. Nevertheless, the solution described here is quite versatile and will help you build applications that use the scraped data.
Before you begin, you should know that this process scrapes an HTML web page, but that isn't the only way to get data from a website. Some sites also serve the same data as JSON (it's usually harder to find the correct URL, but the data is easier to interpret once you've found it).
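As a minimal sketch of the JSON route – the endpoint URL here is hypothetical; you'd normally find the real one in your browser's Network tab – requests can parse a JSON response directly:
>>> import requests
>>> # Hypothetical URL – substitute the JSON endpoint you find for your site
>>> r = requests.get("https://example.com/api/products.json")
>>> data = r.json()  # parses the JSON body into Python dicts and lists
>>> print(data["price"])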
If you don't know Python, you can still follow Steps 1 and 2, and skim the rest to (hopefully) see how easy the whole process is.
Example: Scraping Amazon prices
Scraping a web page involves 4 simple steps:
- Understanding the structure of the web page
- Identifying the parts of the web page you wish to extract
- Loading up the web page in Python
- Parsing the web page with BeautifulSoup
Step 1: Understanding the web page structure
Every web page is written in HTML, which is a "language" that the web browser understands. If you're not familiar with HTML, right-click on any web page and click "Inspect"; this shows you the underlying HTML that your browser translates into the webpage you are seeing.
Look through the structure. You'll see lots of HTML "tags" such as <div>, <h1> and <p>. Each opening tag has a complementary closing tag which has a slash. So a paragraph of text might be defined by the following HTML:
<p>This is a sample text.</p>
Output: This is a sample text.
Tags often sit within tags, giving rise to a hierarchical structure for each web page. See this introduction by w3schools if you're interested in learning more.
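For instance, a (made-up) product snippet might nest a heading, a paragraph and a price inside a single div:
<div>
  <h1>Classic Light Bulb</h1>
  <p>A warm white bulb.</p>
  <span id="priceblock_ourprice">£10.32</span>
</div>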
In short, the HTML is the underlying information that your browser receives, and then translates (or "renders") into a useful format.
Step 2: Identifying the useful parts of a page
To scrape information, you need to know where it sits within a web page. You can do this using the "Inspect" feature discussed earlier. For example, the price of this light bulb on Amazon sits in a "span" tag with id="priceblock_ourprice".
This is already a great starting point. If you use Python to search an Amazon product page for a span tag with id="priceblock_ourprice", you would find the price. This might not work every time, for example if there are multiple tags with that id (though that's unlikely) or if the id keeps changing (websites can use random strings as ids to discourage scraping). In that case, you might have to use the structure of the web page and navigate within the HTML to find your target, e.g. look for the first span within the first div within the second div, as sketched below.
Identify the location within the HTML structure of the target data
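Jumping ahead a little to the BeautifulSoup object s built in Step 4, that kind of structural navigation might look like this (a sketch only – the exact chain of tags depends entirely on the page):
>>> # Hypothetical structure: first span, within first div, within second div
>>> second_div = s.find_all("div")[1]
>>> target = second_div.find("div").find("span")
>>> print(target.text)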
Step 3: Loading up the page in Python
OK, here's the easy bit. In Python, it is super simple to request a web page. Make sure you have the requests module installed:
$ pip install requests
Then run the following (shown here in an interactive Python session):
>>> import requests
>>> url = "https://www.google.com"
>>> r = requests.get(url)
>>> print(r.status_code)
200
>>> print(r.text)
<html><head>...
This requests the web page you want, and prints out the HTML of that webpage.
Common pitfall: a status code of 200 means the request succeeded. When a request fails, it's typically because the server recognised you as a bot rather than a human – in most cases, it's actually quite easy to pretend to be a web browser by adding the right headers to the request. For example, the code below works for Amazon:
>>> import requests
>>> url = "https://www.amazon.co.uk/Philips-Classic-Non-Dimmable-Light-Bulb/dp/B07667LFR4/"
>>> headers = {
...     'authority': 'www.amazon.co.uk',
...     'pragma': 'no-cache',
...     'cache-control': 'no-cache',
...     'dnt': '1',
...     'upgrade-insecure-requests': '1',
...     'user-agent': 'Mozilla/5.0 (X11; CrOS x86_64 8172.45.0) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/51.0.2704.64 Safari/537.36',
...     'accept': 'text/html,application/xhtml+xml,application/xml;q=0.9,image/webp,image/apng,*/*;q=0.8,application/signed-exchange;v=b3;q=0.9',
...     'sec-fetch-site': 'none',
...     'sec-fetch-mode': 'navigate',
...     'sec-fetch-dest': 'document',
...     'accept-language': 'en-GB,en-US;q=0.9,en;q=0.8',
... }
>>> r = requests.get(url, headers=headers)
>>> print(r.status_code)
200
>>> print(r.text)
<html><head>...
Request the page in Python using the Requests package
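If you plan to request several pages, it can be worth wrapping this pattern in a small helper. A sketch – the function name and the choice to return None on failure are my own:
import requests

# Browser-like headers; reuse the full set shown above (user-agent is the key one)
BROWSER_HEADERS = {
    'user-agent': ('Mozilla/5.0 (X11; CrOS x86_64 8172.45.0) '
                   'AppleWebKit/537.36 (KHTML, like Gecko) '
                   'Chrome/51.0.2704.64 Safari/537.36'),
}

def fetch_page(url):
    """Request url while presenting browser-like headers.
    Returns the HTML text, or None if the request failed."""
    r = requests.get(url, headers=BROWSER_HEADERS)
    return r.text if r.status_code == 200 else None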
Step 4: Parsing the page with BeautifulSoup
BeautifulSoup is a package that takes in an HTML page and provides an object that is easy to search and navigate: instead of scanning the raw text for '<span id="priceblock_ourprice"' and then untangling the HTML syntax around whatever you find, you simply specify your criteria and get the price (£10.32) back.
Make sure you have BeautifulSoup and html5lib installed:
$ pip install beautifulsoup4 html5lib
Then add the following code to your script:
>>> from bs4 import BeautifulSoup
.
.
.
>>> if r.status_code == 200:
...     s = BeautifulSoup(r.text, 'html5lib')
...
You've now created a BeautifulSoup object (stored in s), which is really easy to search and navigate. Now, getting the price is as simple as:
>>> price = s.find("span", {"id":"priceblock_ourprice"}).text
>>> print(price)
£10.32
Parse the received HTML with BeautifulSoup using html5lib as the parser, then locate the required data by specifying your criteria of tags and HTML attributes
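In a real script, it's also worth guarding against the tag being missing (find returns None when nothing matches) and converting the price string into a number. A sketch – the cleaning rules here are my own assumptions:
>>> tag = s.find("span", {"id": "priceblock_ourprice"})
>>> if tag is not None:
...     # Strip the currency symbol and thousands separators before converting
...     price = float(tag.text.strip().replace("£", "").replace(",", ""))
...     print(price)
...
10.32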
And that's it! For more information, see the BeautifulSoup documentation.