The goal is to extract or “scrape” the word of the day and its definitions from http://www.merriam-webster.com/word-of-the-day

The first thing we will do is to open the page in our browser and view the Inspector tab. I’ve done it here by right-clicking on the page and selecting Inspect Element.

We can then use the Selector tool (the first button on the panel to the left of Inspector) to click on a specific element on the page to display the HTML.

Do note that the Inspector tab shows your browser’s representation of the page after it has parsed the source HTML and as such it may differ from the actual source HTML.

Looking at the HTML, we can see that the word itself is inside an <h1> tag, and that tag sits inside a <div> tag with class="word-and-pronunciation", which makes it easy to locate.

<div class="word-header">
  <div class="word-and-pronunciation">
  <h1>microcosm</h1>

The definitions are slightly trickier, as they are a sequence of one or more <p> tags that follow the <h2>Definition</h2> tag.

<div class="wod-definition-container">
  <h2>Definition</h2>
  <p><strong>1 :</strong> a little world; ...
  <p><strong>2 :</strong> a community or ...
  <h2>Examples</h2>
  <p>...
  <p>...

This is very similar to what we faced in the wikipedia.com article.

Code

We’ll be using requests to fetch the HTML and BeautifulSoup with html5lib to parse it. You can install these using pip install beautifulsoup4 requests html5lib --user if you have not already.

>>> import requests
>>> from   bs4 import BeautifulSoup
>>>
>>> url = 'http://www.merriam-webster.com/word-of-the-day'
>>> r   = requests.get(url, headers={'user-agent': 'Mozilla/5.0'})
>>>
>>> soup = BeautifulSoup(r.content, 'html5lib')

We’re setting the User-Agent header to Mozilla/5.0 because the default User-Agent that requests sends (python-requests/ followed by the version number) is commonly blocked.
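To make the header change concrete, here is a minimal sketch that builds the request without sending it, so it runs offline. The helper name build_request is hypothetical; requests.Request(...).prepare() and requests.utils.default_user_agent() are part of requests itself. Note that headers has to be a dict mapping header names to values.

```python
import requests

URL = 'http://www.merriam-webster.com/word-of-the-day'

def build_request(url, user_agent='Mozilla/5.0'):
    """Prepare (but do not send) a GET request with a custom User-Agent.

    build_request is a hypothetical helper used only for illustration.
    """
    # headers must be a dict: {'name': 'value'}, not a set {'name', 'value'}
    request = requests.Request('GET', url, headers={'User-Agent': user_agent})
    return request.prepare()

prepared = build_request(URL)
default_agent = requests.utils.default_user_agent()

print(default_agent)                   # what requests would send by default
print(prepared.headers['User-Agent'])  # what this request would send
```

Header names are case-insensitive, so 'user-agent' and 'User-Agent' refer to the same header.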

There are several ways to approach isolating particular items. In this instance the <h1> tag that contains the word is the first <h1> tag on the page.

We can use BeautifulSoup’s find() method to return the first matching tag.

>>> soup.find('h1')
<h1>microcosm</h1>

BeautifulSoup also makes tags available as attributes, meaning soup.find('h1') is equivalent to soup.h1.

>>> soup.h1
<h1>microcosm</h1>
>>> soup.h1.text
'microcosm'
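One caveat applies to both forms: when no matching tag exists they return None rather than raising an error, so chaining .text onto the result fails with an AttributeError. A small sketch on an inline snippet follows; the HTML string and the use of Python's built-in html.parser are assumptions made to keep it self-contained.

```python
from bs4 import BeautifulSoup

# Assumed, simplified stand-in for the real page
html = '<div class="word-and-pronunciation"><h1>microcosm</h1></div>'
soup = BeautifulSoup(html, 'html.parser')

# Attribute access and find() locate the same tag
same_tag = soup.find('h1') == soup.h1

# A tag that is not in the document comes back as None
missing = soup.find('h3')

# Guard against None before reading .text
heading = soup.h1
word = heading.text if heading is not None else None

print(same_tag, missing, word)
```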

We could be more specific and match the class attribute of the parent div tag using a CSS Selector

>>> soup.select_one('.word-and-pronunciation')
<div class="word-and-pronunciation">\n<h1>microcosm</h1>\n...
>>> soup.select_one('.word-and-pronunciation').h1.text
'microcosm'

The . here means to match against the class attribute.
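CSS selectors can also express the parent and child relationship in one step. The sketch below runs against an inline snippet; the second <div> with class "other" is invented for illustration, and html.parser is used only so the example needs no extra parser installed.

```python
from bs4 import BeautifulSoup

# Assumed, simplified markup; the "other" div is made up for contrast
html = '''
<div class="word-header">
  <div class="word-and-pronunciation">
    <h1>microcosm</h1>
  </div>
</div>
<div class="other"><h1>not the word</h1></div>
'''
soup = BeautifulSoup(html, 'html.parser')

# '.word-and-pronunciation h1' matches an <h1> anywhere inside an element
# whose class attribute contains word-and-pronunciation
word = soup.select_one('.word-and-pronunciation h1').text

# select() returns every match as a list instead of only the first one
all_headings = [h.text for h in soup.select('h1')]

print(word)
print(all_headings)
```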

To locate the <p> tags we could first find the <h2> tag and work from there. Like the <h1> example it’s the first <h2> tag on the page.

>>> soup.h2
<h2>Definition</h2>

However if we needed to be more explicit with our match it would probably make sense to use the string argument.

>>> soup.find('h2', string='Definition')
<h2>Definition</h2>
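The string argument earns its keep when the heading we want is not the first one. A sketch on an inline snippet, where the paragraph texts are placeholders and html.parser is an assumption to keep the example self-contained:

```python
from bs4 import BeautifulSoup

# Assumed, simplified markup with placeholder paragraph text
html = '''
<div class="wod-definition-container">
  <h2>Definition</h2>
  <p>first definition</p>
  <h2>Examples</h2>
  <p>first example</p>
</div>
'''
soup = BeautifulSoup(html, 'html.parser')

first_h2 = soup.h2                                   # always the first <h2>
examples_h2 = soup.find('h2', string='Examples')     # matched by its text
no_match = soup.find('h2', string='Definitions')     # text must match exactly

print(first_h2.text, examples_h2.text, no_match)
```

A plain string has to equal the tag's text exactly, which is why 'Definitions' finds nothing here.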

From there we can jump to the next sibling tag using find_next_sibling()

>>> soup.h2.find_next_sibling()
<p><strong>1 :</strong> a little world; ...
>>> soup.h2.find_next_sibling().name
'p'

Just as we did in the wikipedia.com article we can loop over all the sibling <p> tags.

>>> tag = soup.h2.find_next_sibling()
>>> while tag.name == 'p':
...     print(tag.text)
...     tag = tag.find_next_sibling()
... 
1 : a little world; especially : the human race or human nature seen as an epitome of the world or the universe
2 : a community or other unity that is an epitome of a larger unity
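The while loop relies on a non-<p> tag following the definitions; if the last <p> had no later sibling, find_next_sibling() would return None and tag.name would raise an AttributeError. As a hedged alternative, the sketch below collects the paragraphs into a list using find_next_siblings(), which simply runs out when there are no more siblings. The helper name paragraphs_after, the shortened definition texts, and the use of html.parser are all assumptions for illustration.

```python
from bs4 import BeautifulSoup

# Assumed, simplified markup with shortened placeholder text
html = '''
<div class="wod-definition-container">
  <h2>Definition</h2>
  <p><strong>1 :</strong> a little world</p>
  <p><strong>2 :</strong> a community</p>
  <h2>Examples</h2>
  <p>an example sentence</p>
</div>
'''

def paragraphs_after(heading):
    """Return the text of the consecutive <p> siblings that follow heading.

    paragraphs_after is a hypothetical helper, not part of BeautifulSoup.
    """
    texts = []
    # find_next_siblings() yields the later sibling tags in document order
    for sibling in heading.find_next_siblings():
        if sibling.name != 'p':
            break  # stop at the first tag that is not a paragraph
        texts.append(sibling.text)
    return texts

soup = BeautifulSoup(html, 'html.parser')
definitions = paragraphs_after(soup.find('h2', string='Definition'))
examples = paragraphs_after(soup.find('h2', string='Examples'))

print(definitions)
print(examples)
```

The second call shows the case the while loop would trip over: the final <p> is the last tag in its parent, and the for loop just ends.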

If we combine it all together…

import requests
from bs4 import BeautifulSoup

url = 'http://www.merriam-webster.com/word-of-the-day'
r = requests.get(url, headers={'user-agent': 'Mozilla/5.0'})

soup = BeautifulSoup(r.content, 'html5lib')

word = soup.h1.text
print(word)

tag = soup.h2.find_next_sibling()
while tag.name == 'p':
    print(tag.text)
    tag = tag.find_next_sibling()

… and the output it produced.

$ python word-of-the-day.py
microcosm
1 : a little world; especially : the human race or human nature seen as an epitome of the world or the universe
2 : a community or other unity that is an epitome of a larger unity