Given a http://swansonvitamins.com product page, the goal is to extract, or “scrape”, the product brand, name, SKU, price and size / quantity. We’ve been given http://www.swansonvitamins.com/swanson-premium-turmeric-720-mg-240-caps as an example URL.

The first thing we will do is to open the page in our browser and view the Inspector tab. I’ve done it here by right-clicking on the page and selecting Inspect Element.

We can then use the Selector tool (the first button on the panel to the left of Inspector) to click on a specific element on the page to display the HTML.

Do note that the Inspector tab shows your browser’s representation of the page after it has parsed the source HTML, so it may differ from the actual source HTML.

Looking at the HTML, we can see that each piece of data has its own unique itemprop value, apart from the size / quantity, which has its own unique class value.

<h2 class="item-brand hidden-xs" itemprop="brand">
  <a href="/swanson-premium" title="See all Swanson Premium products">Swanson Premium</a>
</h2>
<h1 class="item-detail-item-name hidden-xs" itemprop="name">Turmeric</h1>
<p>
  Item:&nbsp;<span itemprop="sku"><b>SW1075</b></span>
</p>
[...]
 <p class="id-size-and-quantity">720 mg 240 Caps<br>
                                Caps
[...]
<meta itemprop="availability" itemtype="http://schema.org/ItemAvailability" content="In Stock " />
<b itemprop="priceCurrency" content="USD"></b><b itemprop="price" content="9.99">$9.99</b>

Code

We’ll be using requests to fetch the HTML and BeautifulSoup with html5lib to parse it. You can install these using pip install beautifulsoup4 requests html5lib --user if you have not already.

>>> import requests
>>> from   bs4 import BeautifulSoup
>>>
>>> url = 'http://www.swansonvitamins.com/swanson-premium-turmeric-720-mg-240-caps'
>>> r   = requests.get(url, headers={'user-agent': 'Mozilla/5.0'})
>>>
>>> soup = BeautifulSoup(r.content, 'html5lib')

We’re setting the User-Agent header to Mozilla/5.0 because the default User-Agent sent by requests is commonly blocked.
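To see what is being replaced, the sketch below prints the User-Agent that requests sends when none is set, and wraps the fetch in a small helper. The helper name fetch, the 10 second timeout and the call to raise_for_status() are assumptions added for illustration; they are not part of the walk-through's own code.

```python
import requests

# The User-Agent string requests sends when we do not set one ourselves.
default_ua = requests.utils.default_user_agent()
print(default_ua)  # something of the form 'python-requests/<version>'


def fetch(url, user_agent='Mozilla/5.0', timeout=10):
    """Hypothetical helper: fetch a page with a browser-like User-Agent."""
    r = requests.get(url, headers={'user-agent': user_agent}, timeout=timeout)
    r.raise_for_status()  # raise requests.HTTPError on a 4xx/5xx response
    return r

# Usage (performs a real network request, so it is left commented out):
# r = fetch('http://www.swansonvitamins.com/swanson-premium-turmeric-720-mg-240-caps')
```

Failing early on an error status means a blocked request is noticed before any parsing is attempted.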

To find any tag whose attribute has a specific value, we can use the CSS Selector [name=value]. If we wanted to restrict the match to a specific tag, we could use tag[name=value].

In this case the itemprop value is specific enough without needing to use the tag name.

>>> soup.select_one('[itemprop=brand]').text
'\nSwanson Premium\n'
>>> soup.select_one('[itemprop=brand]').text.strip()
'Swanson Premium'
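As a small illustration of the difference between [name=value] and tag[name=value], the sketch below runs both forms against a trimmed copy of the markup shown earlier. It uses Python's built-in 'html.parser' rather than html5lib so that it has no extra dependency; that swap is an assumption made for the example.

```python
from bs4 import BeautifulSoup

# Trimmed copy of the product markup shown earlier.
html = '''
<h2 class="item-brand hidden-xs" itemprop="brand">
  <a href="/swanson-premium" title="See all Swanson Premium products">Swanson Premium</a>
</h2>
<h1 class="item-detail-item-name hidden-xs" itemprop="name">Turmeric</h1>
'''

soup = BeautifulSoup(html, 'html.parser')

any_tag = soup.select_one('[itemprop=brand]')    # any tag with itemprop="brand"
h2_only = soup.select_one('h2[itemprop=brand]')  # only an <h2> with itemprop="brand"
h1_only = soup.select_one('h1[itemprop=brand]')  # no <h1> has itemprop="brand"

print(any_tag.name)          # h2
print(h2_only.text.strip())  # Swanson Premium
print(h1_only)               # None
```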

We can use the same process to extract the name, sku and price.

>>> soup.select_one('[itemprop=name]').text.strip()
'Turmeric'
>>> soup.select_one('[itemprop=sku]').text.strip()
'SW1075'
>>> soup.select_one('[itemprop=price]').text.strip()
'$9.99'
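One thing to keep in mind is that select_one() returns None when nothing matches, so chaining .text onto it raises an AttributeError on a page that lacks the element. A minimal sketch of a guard is shown below; the helper name text_of and the non-matching itemprop value "no-such-prop" are hypothetical, and 'html.parser' is used only to keep the example dependency-free.

```python
from bs4 import BeautifulSoup


def text_of(soup, selector):
    """Hypothetical helper: stripped text of the first match, or None."""
    tag = soup.select_one(selector)
    return tag.text.strip() if tag is not None else None


# Trimmed copy of the SKU markup shown earlier.
html = '<p>Item:&nbsp;<span itemprop="sku"><b>SW1075</b></span></p>'
soup = BeautifulSoup(html, 'html.parser')

sku = text_of(soup, '[itemprop=sku]')               # element exists
missing = text_of(soup, '[itemprop=no-such-prop]')  # nothing matches

print(sku)      # SW1075
print(missing)  # None
```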

It wasn’t asked for, but we could also extract the availability.

>>> soup.select_one('[itemprop=availability]')['content'].strip()
'In Stock'

Do note that the In Stock value is contained in the content attribute as opposed to the .text content. We can use dict indexing to access the value of a tag’s attribute.
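The same attribute lookup applies to the price markup shown earlier, where the content attribute holds the bare number and the currency code. The sketch below reads both and converts the price to a Decimal; whether a numeric price is wanted is an assumption, since the task only asked for the price. Tag.get() is used in the last line to show a lookup that returns None instead of raising KeyError when an attribute is absent.

```python
from decimal import Decimal

from bs4 import BeautifulSoup

# The price markup shown earlier.
html = ('<b itemprop="priceCurrency" content="USD"></b>'
        '<b itemprop="price" content="9.99">$9.99</b>')
soup = BeautifulSoup(html, 'html.parser')

price_tag = soup.select_one('[itemprop=price]')

price = Decimal(price_tag['content'])                          # numeric value
currency = soup.select_one('[itemprop=priceCurrency]')['content']
no_title = price_tag.get('title')                              # attribute not present

print(price, currency)  # 9.99 USD
print(no_title)         # None
```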

For the size / quantity we can match against the class attribute using the CSS Selector .id-size-and-quantity

The . is a shortcut for matching against the class attribute.
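A class selector matches any one of the classes listed on a tag, which matters for markup such as the brand heading shown earlier, which carries two classes. The sketch below is a simplified illustration (the inner link is omitted and 'html.parser' is assumed).

```python
from bs4 import BeautifulSoup

# Simplified copy of the brand heading, which has two classes.
html = '<h2 class="item-brand hidden-xs" itemprop="brand">Swanson Premium</h2>'
soup = BeautifulSoup(html, 'html.parser')

by_first_class = soup.select_one('.item-brand')       # matches on the first class
by_second_class = soup.select_one('.hidden-xs')       # matches on the second class
by_both = soup.select_one('h2.item-brand.hidden-xs')  # tag name plus both classes

print(by_first_class['class'])  # ['item-brand', 'hidden-xs']
```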

>>> soup.select_one('.id-size-and-quantity').text
'720 mg 240 Caps\n\t\t\t\tCaps\n\t\t\t\t\tSize:\n\t\t\t\t\t\n\n\n\n\t\t\t\t\t\t\t00\n\n\n\n'

However, there is a lot of extra garbage we do not need. We could use the .splitlines() method to split the text into lines and then take the first one.

>>> soup.select_one('.id-size-and-quantity').text.splitlines()[0]
'720 mg 240 Caps'

There is also the .strings attribute on a BeautifulSoup tag object which gives you a generator that yields each individual string of its text content (whereas .text gives you all the strings joined together).

>>> for string in soup.select_one('.id-size-and-quantity').strings:
...     string
... 
'720 mg 240 Caps'
'\n\t\t\t\tCaps\n\t\t\t\t\tSize:\n\t\t\t\t\t'
'\n'
'\n'
'\n'
'\n\t\t\t\t\t\t\t00'
'\n'
'\n'
'\n'
'\n'

We could call next() to get just the first string (or line in this case).

>>> next(soup.select_one('.id-size-and-quantity').strings)
'720 mg 240 Caps'
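BeautifulSoup also provides .stripped_strings, which yields the same strings with surrounding whitespace removed and whitespace-only strings skipped. The sketch below applies it to a shortened copy of the size / quantity markup; the closing </p> and the use of 'html.parser' are assumptions made to keep the example self-contained. Passing None as the default to next() avoids a StopIteration if the tag turns out to be empty.

```python
from bs4 import BeautifulSoup

# Shortened copy of the size / quantity markup shown earlier.
html = ('<p class="id-size-and-quantity">720 mg 240 Caps<br>'
        '\n\t\t\t\tCaps\n\t\t\t\t\tSize:\n</p>')
soup = BeautifulSoup(html, 'html.parser')
tag = soup.select_one('.id-size-and-quantity')

pieces = list(tag.stripped_strings)     # every non-blank string, stripped
size = next(tag.stripped_strings, None)  # just the first one, or None

print(size)  # 720 mg 240 Caps
```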

Now that we can “scrape” all the data we want, let’s combine it all together.

import csv, requests, sys
from bs4 import BeautifulSoup

url = 'http://www.swansonvitamins.com/swanson-premium-turmeric-720-mg-240-caps'

writer = csv.writer(sys.stdout)

writer.writerow(
    ['Brand', 'Name', 'SKU', 'Size', 'Price', 'Availability']
)

with requests.session() as s:
    s.headers['user-agent'] = 'Mozilla/5.0'

    r    = s.get(url)
    soup = BeautifulSoup(r.content, 'html5lib')

    brand = soup.select_one('[itemprop=brand]').text.strip()
    name  = soup.select_one('[itemprop=name]').text.strip()
    sku   = soup.select_one('[itemprop=sku]').text.strip()
    price = soup.select_one('[itemprop=price]').text.strip()
    avail = soup.select_one('[itemprop=availability]')['content'].strip()
    size  = next(soup.select_one('.id-size-and-quantity').strings)

    writer.writerow([brand, name, sku, size, price, avail])

This is the output it produced:

$ python swansonvitamins.com.py
Brand,Name,SKU,Size,Price,Availability
Swanson Premium,Turmeric,SW1075,720 mg 240 Caps,$9.99,In Stock

We’re assuming this needs to be done for multiple product pages, so we’re using a session object to maintain “state” and keep track of cookies.

It also allows us to specify the User-Agent header once for all requests.
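One way the multi-page case might look is sketched below, with parsing separated from fetching so that the parsing can be tried on a saved or inline page. The names FIELDS, parse_product, scrape_all and SAMPLE are hypothetical, the 10 second timeout is an assumption, and 'html.parser' stands in for html5lib only to avoid the extra dependency; pass 'html5lib' instead to match the script above.

```python
import csv
import sys

import requests
from bs4 import BeautifulSoup

FIELDS = ['Brand', 'Name', 'SKU', 'Size', 'Price', 'Availability']


def parse_product(html):
    """Extract one CSV row from the HTML of a product page."""
    soup = BeautifulSoup(html, 'html.parser')
    return [
        soup.select_one('[itemprop=brand]').text.strip(),
        soup.select_one('[itemprop=name]').text.strip(),
        soup.select_one('[itemprop=sku]').text.strip(),
        next(soup.select_one('.id-size-and-quantity').strings),
        soup.select_one('[itemprop=price]').text.strip(),
        soup.select_one('[itemprop=availability]')['content'].strip(),
    ]


def scrape_all(urls, out=sys.stdout):
    """Fetch every URL through one shared session and write the rows as CSV."""
    writer = csv.writer(out)
    writer.writerow(FIELDS)
    with requests.Session() as s:
        s.headers['user-agent'] = 'Mozilla/5.0'  # set once, sent on every request
        for url in urls:
            r = s.get(url, timeout=10)
            r.raise_for_status()
            writer.writerow(parse_product(r.content))


# Offline check against a trimmed copy of the markup shown earlier.
SAMPLE = '''
<h2 itemprop="brand"><a href="/swanson-premium">Swanson Premium</a></h2>
<h1 itemprop="name">Turmeric</h1>
<p>Item:&nbsp;<span itemprop="sku"><b>SW1075</b></span></p>
<p class="id-size-and-quantity">720 mg 240 Caps<br>Caps</p>
<meta itemprop="availability" content="In Stock " />
<b itemprop="price" content="9.99">$9.99</b>
'''
row = parse_product(SAMPLE)
print(row)
```

Calling scrape_all() with a list of product URLs performs real network requests, so it is not invoked here.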

It was not stated what the final goal for the output was, but we’ve printed the output to sys.stdout using the csv module as an example.
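If a file is wanted instead, the same csv.writer calls work on an open file object. The sketch below writes the row from the output above to a file and reads it back; the file name products.csv and its location in the system temporary directory are assumptions made for the example.

```python
import csv
import os
import tempfile

header = ['Brand', 'Name', 'SKU', 'Size', 'Price', 'Availability']
row = ['Swanson Premium', 'Turmeric', 'SW1075', '720 mg 240 Caps', '$9.99', 'In Stock']

# Hypothetical output location.
path = os.path.join(tempfile.gettempdir(), 'products.csv')

# newline='' lets the csv module control line endings itself.
with open(path, 'w', newline='', encoding='utf-8') as f:
    writer = csv.writer(f)
    writer.writerow(header)
    writer.writerow(row)

# Read the file back to confirm what was written.
with open(path, newline='', encoding='utf-8') as f:
    rows = list(csv.reader(f))

print(rows[1])
```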