The goal is to extract or “scrape” the links to the podcasts on http://scriptnotes.net and save the files to disk using Python.

First, we’ll open the page in our browser and take a look at the Inspector tab. I’ve done this here by right-clicking on the first download link and selecting Inspect Element, which brings the HTML of that element directly into focus in the Inspector panel.

Do note that the Inspector tab shows your browser’s representation of the page after it has parsed the source HTML, so it may differ from the actual source HTML.

With the Inspector tab open you can also use the Selector tool (the first button on the panel to the left of Inspector) to browse to the source of a specific element.

So now with it open we can see that the link to the podcast episode is inside the href attribute and that it ends with .mp3

Code

We’ll be using requests to fetch the HTML and BeautifulSoup with html5lib to parse it. You can install these using pip install beautifulsoup4 requests html5lib --user if you have not already.

So what we want to do is to find all <a> tags that contain an href attribute whose value ends with .mp3

Perhaps the simplest way to do that is to use a CSS Selector with BeautifulSoup’s select() method.

CSS Selectors are like a “language” for selecting elements. A brief subset includes (each form is demonstrated in the sketch after this list):

  • tag - matches <tag>
  • tag[attr] - matches <tag> that has attr="" regardless of the value of attr
  • tag[attr=string] - matches <tag> that has attr whose value is exactly string
  • tag[attr$=string] - matches <tag> that has attr whose value ends with string
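
To make these concrete, here is a minimal sketch (the HTML snippet is made up for illustration) showing each form against a parsed document:

>>> from bs4 import BeautifulSoup
>>>
>>> html = '<p><a>plain</a> <a href="/about">about</a> <a href="file.mp3">file</a></p>'
>>> soup = BeautifulSoup(html, 'html5lib')
>>>
>>> len(soup.select('a'))                  # every <a> tag
3
>>> len(soup.select('a[href]'))            # <a> tags that have an href
2
>>> len(soup.select('a[href="/about"]'))   # href exactly "/about"
1
>>> len(soup.select('a[href$=".mp3"]'))    # href ending with ".mp3"
1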

This means we can use the CSS Selector a[href$=".mp3"] to match an <a> tag whose href attribute ends with the string .mp3 (the value is quoted because .mp3 is not a bare CSS identifier; newer versions of BeautifulSoup reject the unquoted form).

select() returns a list of all matching tags, whereas select_one() returns just the first matching tag (or None if there is no match).

>>> import requests
>>> from bs4 import BeautifulSoup
>>> 
>>> url = 'http://scriptnotes.net'
>>>
>>> r = requests.get(url, headers={'user-agent': 'Mozilla/5.0'})
>>> soup = BeautifulSoup(r.content, 'html5lib')
>>>
>>> mp3s = soup.select('a[href$=".mp3"]')
>>> len(mp3s)
10
>>> mp3s[0]
<a href="http://traffic.libsyn.com/scriptnotes/scriptnotes_ep_299.mp3">here</a>
>>> mp3s[0]['href']
'http://traffic.libsyn.com/scriptnotes/scriptnotes_ep_299.mp3'
>>> mp3s[-1]['href']
'http://traffic.libsyn.com/scriptnotes/scriptnotes_ep_292.mp3'
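
As an aside, select_one() on the same soup object gives us the first match directly:

>>> soup.select_one('a[href$=".mp3"]')['href']
'http://traffic.libsyn.com/scriptnotes/scriptnotes_ep_299.mp3'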

So we have the links to the 10 podcasts on the page. Could we get them using regex?!?!?!

>>> import re
>>>
>>> len(re.findall(r'http\S+\.mp3', r.text))
20

The problem with this approach is that each link appears twice in the page source: once in the plain link and once, with a query string appended, inside another <a> tag.

<a href= "http://traffic.libsyn.com/scriptnotes/scriptnotes_ep_299.mp3">here</a>
[...]
<a href="http://traffic.libsyn.com/scriptnotes/scriptnotes_ep_299.mp3?dest-id=145488" >Download this Episode</a>

We could use a positive lookahead assertion (?=) to require that a " follows the .mp3; because a lookahead is zero-width, the quote is checked but not captured or included in the match.

>>> len(re.findall(r'http\S+\.mp3(?=")', r.text))
10
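
To confirm the quote is asserted but not included in the match, we can look at the first result (which should be the episode 299 link shown earlier):

>>> re.findall(r'http\S+\.mp3(?=")', r.text)[0]
'http://traffic.libsyn.com/scriptnotes/scriptnotes_ep_299.mp3'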

So now we know how to get all the links from a single page, what about multiple pages?

If we scroll down to the bottom of the page and click on the Page 2 link we will see the address that opens is http://scriptnotes.net/page/2/size/10

This means we can manually build the URL for a specific page by changing the n in page/n, which allows us to loop (or “crawl”) through the pages of results.

We also note the size/10 part; we got 10 mp3 links in our result, which suggests this “parameter” can be changed too. I tested with 20 and with 100 and both appear to work. This would allow us to get more podcast links with fewer requests, although increasing the size does seem to slow down each request. There may also be a limit on the number of returned results; you would have to experiment to find out.
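
If you did want to probe for such a limit, one sketch (the size values here are arbitrary guesses) is to request page 1 at increasing sizes and count the links that come back:

import requests
from bs4 import BeautifulSoup

with requests.Session() as s:
    s.headers['user-agent'] = 'Mozilla/5.0'
    for size in (10, 50, 100, 200):
        r = s.get('http://scriptnotes.net/page/1/size/{}'.format(size))
        soup = BeautifulSoup(r.content, 'html5lib')
        # if the count stops growing, we have likely hit the cap
        print(size, len(soup.select('a[href$=".mp3"]')))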

Let’s show an example of how we would loop through multiple pages.

First, the example output.

$ python scriptnotes.net.py 
http://traffic.libsyn.com/scriptnotes/scriptnotes_ep_299.mp3
http://traffic.libsyn.com/scriptnotes/scriptnotes_ep_298.mp3
http://traffic.libsyn.com/scriptnotes/scriptnotes_ep_297.mp3
http://traffic.libsyn.com/scriptnotes/scriptnotes_ep_296.mp3
http://traffic.libsyn.com/scriptnotes/scriptnotes_ep_strike_vote.mp3
http://traffic.libsyn.com/scriptnotes/scriptnotes_ep_295.mp3
http://traffic.libsyn.com/scriptnotes/scriptnotes_ep_294.mp3
http://traffic.libsyn.com/scriptnotes/scriptnotes_ep_99-2.mp3
http://traffic.libsyn.com/scriptnotes/scriptnotes_ep_293.mp3
[...]
http://traffic.libsyn.com/scriptnotes/scriptnotes_ep_219.mp3
http://traffic.libsyn.com/scriptnotes/scriptnotes_ep_218.mp3
http://traffic.libsyn.com/scriptnotes/scriptnotes_ep_217.mp3
http://traffic.libsyn.com/scriptnotes/scriptnotes_ep_216.mp3

The code.

import requests
from bs4 import BeautifulSoup

size = 50
url = 'http://scriptnotes.net/page/{}/size/' + str(size)

with requests.Session() as s:
    s.headers['user-agent'] = 'Mozilla/5.0'
    for page in range(1, 3):
        r = s.get(url.format(page))
        soup = BeautifulSoup(r.content, 'html5lib')
        for mp3 in soup.select('a[href$=".mp3"]'):
            print(mp3['href'])

When making multiple requests with requests you’ll usually want to use a Session object, which maintains “state” such as cookies and default headers across requests.
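
As a small illustration (using the page-2 address from earlier), headers set on the session are sent with every request, and any cookies the server sets are carried along automatically:

import requests

with requests.Session() as s:
    s.headers['user-agent'] = 'Mozilla/5.0'
    s.get('http://scriptnotes.net')                 # any cookies set here...
    s.get('http://scriptnotes.net/page/2/size/10')  # ...are re-sent here
    print(s.cookies)                                # whatever the server stored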

You’ll also pretty much always want to change the default User-Agent header, which we set here to Mozilla/5.0, as the default python-requests value tends to get blocked.

We’ve set size to 50 here and looped through the first 2 pages as an example.

Also, instead of just printing the URL you could fetch it and save the result to disk.
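
Here is one way that might look, as a minimal sketch: the download helper is hypothetical, the filename is taken from the last segment of the URL, and stream=True avoids holding a whole episode in memory.

import os
import requests
from urllib.parse import urlparse

def download(s, url, folder='podcasts'):
    # hypothetical helper: name the file after the last path
    # segment, e.g. scriptnotes_ep_299.mp3
    filename = os.path.basename(urlparse(url).path)
    os.makedirs(folder, exist_ok=True)
    r = s.get(url, stream=True)  # stream so we don't read the whole file at once
    with open(os.path.join(folder, filename), 'wb') as f:
        for chunk in r.iter_content(chunk_size=8192):
            f.write(chunk)

with requests.Session() as s:
    s.headers['user-agent'] = 'Mozilla/5.0'
    download(s, 'http://traffic.libsyn.com/scriptnotes/scriptnotes_ep_299.mp3')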