Somebody was asking for help scraping all of the comic images from a particular “chapter” of a mangastream.com comic using Python so they could read it offline.

The example URL given was http://mangastream.com/r/demons_plan/010/3997/

Let’s open it up in our browser and check out the Inspector tab.

To do that I right-click on the page and select Inspect Element.

We can also use the Selector Tool (the button to the left of Inspector) to bring the focus to a particular element of the page.

Do note that the Inspector tab shows your browser’s representation of the page after it has parsed the source HTML, so it may differ from the actual source HTML.

So the HTML for the comic image looks like:

<a href="http://mangastream.com/r/demons_plan/010/3997/2">
  <img id="manga-page" src="http://img.mangastream.com/cdn/manga/138/3997/01.png"></a>

The image is an <img> tag whose id attribute is manga-page, which should make it trivial to isolate.

Code

We’ll be using Python’s requests to fetch the HTML and BeautifulSoup (with html5lib) to parse it. If you don’t already have them installed, you can get them with pip install requests beautifulsoup4 html5lib --user.

When “scraping” with requests you’ll usually want to use a Session Object.

It keeps track of cookies, etc., and it means we can set headers once for all of our requests instead of passing them manually each time, e.g. requests.get(url, headers={...}).

Let’s open an interactive Python session and test it out:

>>> import requests
>>> from   bs4 import BeautifulSoup
>>>
>>> s = requests.session()
>>> s.headers['user-agent'] = 'Mozilla/5.0'
>>>
>>> r    = s.get('http://mangastream.com/r/demons_plan/010/3997/')
>>> soup = BeautifulSoup(r.content, 'html5lib')
>>>
>>> img = soup.find('img', id='manga-page')
>>> img
<img id="manga-page" src="http://img.mangastream.com/cdn/manga/138/3997/01.png"/>
>>> img.attrs
{'src': 'http://img.mangastream.com/cdn/manga/138/3997/01.png', 'id': 'manga-page'}
>>> img['src']
'http://img.mangastream.com/cdn/manga/138/3997/01.png'

User-Agent

The first thing you’ll normally want to do is set the value of the User-Agent header. It’s most common for requests to be blocked by filtering on the value of this header, so we set it to something that matches an “actual” browser.

find()

find() returns the first matching Tag. It looks like we just got a string back, but we can see by checking type() that it is a bs4.element.Tag object.

>>> type(img)
<class 'bs4.element.Tag'> 

I tend to choose the name of the tag as the variable name to store it in, e.g. img = soup.find('img', ...), but feel free to choose your own naming scheme.

BeautifulSoup provides “shorthand” syntax for several operations e.g. passing id='foo' to match a particular value for the id attribute.

This is shorthand for {'id': 'foo'} i.e. you can pass a dict of attribute names/values to match against when searching.
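For example, these should all find the same tag (reusing the soup from earlier):

>>> soup.find('img', id='manga-page')
<img id="manga-page" src="http://img.mangastream.com/cdn/manga/138/3997/01.png"/>
>>> soup.find('img', attrs={'id': 'manga-page'})
<img id="manga-page" src="http://img.mangastream.com/cdn/manga/138/3997/01.png"/>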

For matching against class there is class_= (note the underscore) because class is a reserved word in Python. However, if you pass a 2nd argument positionally (not by name), e.g. find('tag', 'value'), it will default to matching against class. That is to say:

find('tag', 'value') is the same as find('tag', {'class': 'value'})

.attrs is a dict that holds the attribute names and values; however, you can also use dict-indexing on the Tag object itself to access attribute values.

find_all()

How specific you need to be when isolating particular items depends on the structure of the HTML you’re working with. In this case the <img> tag we want is the only tag on the page with id="manga-page", so we can omit the tag name from the find(). Each page also appears to contain only a single matching <img> tag, which we can check by using find_all(), which returns a “list” of Tag objects.

>>> soup.find(id='manga-page')
<img id="manga-page" src="http://img.mangastream.com/cdn/manga/138/3997/01.png"/>
>>> soup.find_all(id='manga-page')
[<img id="manga-page" src="http://img.mangastream.com/cdn/manga/138/3997/01.png"/>]

Downloading the image

img['src'] holds the URL pointing to the comic image we want, so we can just pass that to s.get() and write the response’s .content out to a file. Do note we are opening the file with b for binary mode, which is important.

>>> comic_page = s.get(img['src'])
>>> with open('demons_plan_01.png', 'wb') as fh:
...     fh.write(comic_page.content)
...

If you want to stream the file (perhaps it’s too large to fit into memory) you can use the streaming example given in the requests docs.
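For this page a streaming version might look something like this (stream=True plus iter_content() as per the requests docs; the chunk size is arbitrary):

>>> comic_page = s.get(img['src'], stream=True)
>>> with open('demons_plan_01.png', 'wb') as fh:
...     for chunk in comic_page.iter_content(chunk_size=8192):
...         fh.write(chunk)
...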

Let’s check what we saved is a valid PNG file:

>>> import imghdr
>>> imghdr.what('demons_plan_01.png')
'png'
>>> import subprocess
>>> subprocess.check_output(['file', 'demons_plan_01.png'])
'demons_plan_01.png: PNG image data, 887 x 1300, 8-bit colormap, non-interlaced\n'

Navigating the pages

We’ve successfully downloaded Page 1 so how do we get all of the pages?

Let’s check again with the Inspector tab:

And what it looks like on the inside:

<div class="btn-group btn-reader-page">
<a class="btn btn-primary dropdown-toggle" data-toggle="dropdown" href="#">
Page 1 <span class="caret"></span>
</a>
<ul class="dropdown-menu">
<li><a href="http://mangastream.com/r/demons_plan/010/3997/1">First Page (1)</a></li>
<li><a href="http://mangastream.com/r/demons_plan/010/3997/2">Page 2</a></li>
<li><a href="http://mangastream.com/r/demons_plan/010/3997/3">Page 3</a></li>
<li><a href="http://mangastream.com/r/demons_plan/010/3997/4">Page 4</a></li>
<li><a href="http://mangastream.com/r/demons_plan/010/3997/5">Page 5</a></li>
<li><a href="http://mangastream.com/r/demons_plan/010/3997/6">Page 6</a></li>
<li><a href="http://mangastream.com/r/demons_plan/010/3997/7">Page 7</a></li>
<li><a href="http://mangastream.com/r/demons_plan/010/3997/8">Page 8</a></li>
<li><a href="http://mangastream.com/r/demons_plan/010/3997/9">Page 9</a></li>
<li><a href="http://mangastream.com/r/demons_plan/010/3997/10">Page 10</a></li>
<li><a href="http://mangastream.com/r/demons_plan/010/3997/11">Page 11</a></li>
<li><a href="http://mangastream.com/r/demons_plan/010/3997/12">Page 12</a></li>
<li><a href="http://mangastream.com/r/demons_plan/010/3997/13">Page 13</a></li>
<li><a href="http://mangastream.com/r/demons_plan/010/3997/14">Page 14</a></li>
<li><a href="http://mangastream.com/r/demons_plan/010/3997/15">Page 15</a></li>
<li><a href="http://mangastream.com/r/demons_plan/010/3997/16">Page 16</a></li>
<li><a href="http://mangastream.com/r/demons_plan/010/3997/20">Last Page (20)</a></li>
</ul>
</div>

pls y u no seventeen?

So firstly we note that the links are inside <div class="btn-group btn-reader-page">. The second thing to note is that not all of the pages are there. It jumps from Page 16 to Page 20.

Each of the links follows the same naming pattern, so let’s open up http://mangastream.com/r/demons_plan/010/3997/17 to see if it exists.

The answer is that yes, it does. This means that instead of extracting the links we will just have to extract the number of the last page and construct the URLs manually.

The last page number is available both in the URL in the href attribute and in the text content of the tag itself. The particular <a> we’re after is the last one inside the <div>.

find().find_all()

So we could first find() the <div> and then find_all() the <a> tags taking the last one:

>>> div = soup.find('div', 'btn-group btn-reader-page')
>>> div.find_all('a')[-1]
<a href="http://mangastream.com/r/demons_plan/010/3997/20">Last Page (20)</a>

Note that we’re calling div.find_all() and not soup.find_all(). Calling soup.find_all() would search the whole document, whereas calling it on the div variable only searches within the tag it represents, i.e. <div class="btn-group btn-reader-page">.
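To make the difference concrete (the actual counts depend on the page, so treat this as illustrative):

len(soup.find_all('a'))  # every <a> in the whole document
len(div.find_all('a'))   # only the <a> tags inside <div class="btn-group btn-reader-page">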

We could have also chained the calls without the need for the div variable:

>>> soup.find('div', 'btn-group btn-reader-page').find_all('a')[-1]

select()

Another option is select() which allows you to use CSS Selectors:

>>> soup.select('.btn-reader-page a')[-1]
<a href="http://mangastream.com/r/demons_plan/010/3997/20">Last Page (20)</a>

.btn-reader-page matches any tag that has btn-reader-page as a “word” inside its class attribute.

  • . is for matching against class
  • # is for matching against id

We could have been explicit and used div.btn-reader-page to specify the tag type. Also, in a selector like one two, the space means that two must be a descendant of one; a is the two in our selector, meaning the <a> tags must be descendants of the matched element.
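Being explicit about the tag should give the same result here:

>>> soup.select('div.btn-reader-page a')[-1]
<a href="http://mangastream.com/r/demons_plan/010/3997/20">Last Page (20)</a>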

string=

Another useful tool is the string= named argument of the find methods, which tests against the string/text content of a tag. You can give a string (which tests for an exact match) or you can pass a compiled re pattern object:

>>> import re
>>> soup.find('a', string=re.compile(r'^Last Page \(\d+\)$'))
<a href="http://mangastream.com/r/demons_plan/010/3997/20">Last Page (20)</a>

extracting the last page number

We’ll choose select() here because it’s less typing and I’m lazy:

>>> soup.select('.btn-reader-page a')[-1]
<a href="http://mangastream.com/r/demons_plan/010/3997/20">Last Page (20)</a>
>>> soup.select('.btn-reader-page a')[-1].text
'Last Page (20)'
>>> soup.select('.btn-reader-page a')[-1]['href']
'http://mangastream.com/r/demons_plan/010/3997/20'
>>> soup.select('.btn-reader-page a')[-1]['href'].split('/')
['http:', '', 'mangastream.com', 'r', 'demons_plan', '010', '3997', '20']
>>> soup.select('.btn-reader-page a')[-1]['href'].split('/')[-1]
'20'
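
For comparison, pulling the number out of the .text instead would need a regular expression, something like:

>>> import re
>>> last_a = soup.select('.btn-reader-page a')[-1]
>>> re.search(r'\((\d+)\)', last_a.text).group(1)
'20'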

It’s simpler to extract the page number from the URL than it is from the .text. We could then create a for loop to build the URLs:

building the URL list

>>> last = soup.select('.btn-reader-page a')[-1]['href'].split('/')[-1]
>>> for n in range(2, int(last) + 1):
...     'http://mangastream.com/r/demons_plan/010/3997/{}'.format(n)
... 
'http://mangastream.com/r/demons_plan/010/3997/2'
'http://mangastream.com/r/demons_plan/010/3997/3'
'http://mangastream.com/r/demons_plan/010/3997/4'
[...]
'http://mangastream.com/r/demons_plan/010/3997/20'

Note that we need to call int() on last. I’ve also chopped the output for brevity; it prints URLs for all pages 2 - 20.

THERE IS A LOT OF NEW CODE HERE

So we know how to save a page image and how to build a list of all the remaining pages. Let’s put it all together:

mangastream.com.py

from __future__ import print_function

import errno, os, requests, sys

from bs4 import BeautifulSoup

url = sys.argv[1]
dirname = os.path.join(*url.strip('/').split('/')[-3:-1])

try:
    print('MKDIR: ', dirname)
    os.makedirs(dirname)
except OSError as e:
    if e.errno == errno.EEXIST and os.path.isdir(dirname):
        pass
    else:
        raise

# Python >=3.2
#print('MKDIR: ', dirname)
#os.makedirs(dirname, exist_ok=True)

with requests.session() as s:
    s.headers['user-agent'] = 'Mozilla/5.0'

    print('GET:   ', url)
    r = s.get(url)
    soup = BeautifulSoup(r.content, 'html5lib')

    img = soup.find(id='manga-page')
    image_url = img['src']
    filename = image_url.split('/')[-1]
    path = os.path.join(dirname, filename)

    with open(path, 'wb') as fh:
        print('GET:   ', image_url)
        image = s.get(image_url).content
        print('CREATE:', path)
        fh.write(image)

    last = soup.select('.btn-reader-page a')[-1]['href'].split('/')[-1]

    for n in range(2, int(last) + 1):
        next_page = url.strip('/') + '/{}'.format(n)
        print('GET:   ', next_page)
        r = s.get(next_page)
        soup = BeautifulSoup(r.content, 'html5lib')

        img = soup.find(id='manga-page')
        image_url = img['src']
        filename = image_url.split('/')[-1]
        path = os.path.join(dirname, filename)

        with open(path, 'wb') as fh:
            print('GET:   ', image_url)
            image = s.get(image_url).content
            print('CREATE:', path)
            fh.write(image)

Let’s run it:

$ python mangastream.com.py http://mangastream.com/r/demons_plan/010/3997/
MKDIR:  demons_plan/010
GET:    http://mangastream.com/r/demons_plan/010/3997/
GET:    http://img.mangastream.com/cdn/manga/138/3997/01.png
CREATE: demons_plan/010/01.png
GET:    http://mangastream.com/r/demons_plan/010/3997/2
GET:    http://img.mangastream.com/cdn/manga/138/3997/01a.jpg
CREATE: demons_plan/010/01a.jpg
[...]
GET:    http://mangastream.com/r/demons_plan/010/3997/20
GET:    http://img.mangastream.com/cdn/manga/138/3997/19.png
CREATE: demons_plan/010/19.png

The output has been trimmed for brevity, but you may notice that the filename for page 2 is 01a.jpg and the last filename is 19.png. You could generate your own filenames instead if you wanted to.
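For example, one option (just a sketch; page_filename is a made-up helper, n is the page number from the loop and image_url is the img src) is to name each file after its page number while keeping the original extension:

import os

def page_filename(n, image_url):
    # Keep the original extension (.png / .jpg) but name the file after the
    # page number, zero-padded so the files sort in order: 01.png, 02.jpg, ...
    ext = os.path.splitext(image_url)[1]
    return '{:02d}{}'.format(n, ext)

You would then build path with os.path.join(dirname, page_filename(n, image_url)) inside the loop.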

funkytime!

So great, it works! However the code can be cleaned up a bit. What we have inside the for loop is essentially the same as the code before it. Having duplicated code like this is usually a sign to put it inside a function, so let’s give it a try:

from __future__ import print_function

import errno, os, requests, sys

from bs4 import BeautifulSoup


def makedirs(dirname):
    try:
        print('MKDIR: ', dirname)
        os.makedirs(dirname)
    except OSError as e:
        if e.errno == errno.EEXIST and os.path.isdir(dirname):
            pass
        else:
            raise


def get(url):
    print('GET:   ', url)
    r = s.get(url)
    return BeautifulSoup(r.content, 'html5lib')


def save_page(url):
    soup = get(url)

    url = soup.find(id='manga-page')['src']
    filename = url.split('/')[-1]
    path = os.path.join(dirname, filename)

    with open(path, 'wb') as fh:
        print('GET:   ', url)
        image = s.get(url).content
        print('CREATE:', path)
        fh.write(image)


if __name__ == '__main__':
    url = sys.argv[1]
    dirname = os.path.join(*url.strip('/').split('/')[-3:-1])

    makedirs(dirname)

    with requests.session() as s:
        s.headers['user-agent'] = 'Mozilla/5.0'

        soup = get(url)
        last = soup.select('.btn-reader-page a')[-1]['href'].split('/')[-1]

        for n in range(1, int(last) + 1):
            page = url.strip('/') + '/{}'.format(n)
            save_page(page)

The if __name__ line isn’t needed but I like using it as it forces an extra level of indentation which I think gives separation from the functions.

We extract the comic and chapter names from the URL so we can create our own directory structure (using makedirs()) to save the image files into. If we were to download multiple chapters, or even different comics, we could end up with duplicate filenames, so it’s a good idea to keep them organized.

The reason for our own makedirs() function and the try / except inside it is that os.makedirs() will raise an exception if the target directory already exists. Starting with Python 3.2 you can pass exist_ok=True to prevent this, which means you could remove def makedirs() completely and just use os.makedirs(dirname, exist_ok=True).

You may notice here that we’re requesting the first page twice. This was just to avoid code duplication.
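If the extra request bothers you, here’s a sketch of one way around it: rework the saving part to take an already-parsed soup (save_image is a made-up name), fetch page 1 once, and only fetch inside the loop for the remaining pages.

def save_image(soup):
    # Same as the body of save_page() but without the fetch, so any soup can be passed in.
    image_url = soup.find(id='manga-page')['src']
    path = os.path.join(dirname, image_url.split('/')[-1])
    with open(path, 'wb') as fh:
        fh.write(s.get(image_url).content)

soup = get(url)
save_image(soup)  # page 1: reuse the soup we already fetched
last = soup.select('.btn-reader-page a')[-1]['href'].split('/')[-1]

for n in range(2, int(last) + 1):
    save_image(get(url.strip('/') + '/{}'.format(n)))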

splat

Whilst on the topic let’s give a quick explanation of the dirname line:

dirname = os.path.join(*url.strip('/').split('/')[-3:-1])

So we split the URL on / and take the 3rd last and 2nd last items.

>>> url = 'http://mangastream.com/r/demons_plan/010/3997'
>>> url.split('/')[-3:-1]
['demons_plan', '010']

If the URL has a trailing /, however, we get an extra item in the list, so our indexing would break:

>>> url = 'http://mangastream.com/r/demons_plan/010/3997/'
>>> url.split('/')[-3:-1]
['010', '3997']

This is why we use strip('/') to remove any trailing slashes. strip() removes leading matches too, and there are lstrip() and rstrip() to target one side specifically. A leading slash would be an error in this case, so it doesn’t matter whether we use strip() or rstrip(), and strip() is less typing.

So the result of our slice is a 2-element list. os.path.join(), however, doesn’t expect a list.

>>> os.path.join(['foo', 'bar'])
['foo', 'bar']
>>> os.path.join(*['foo', 'bar'])
'foo/bar'

The * (sometimes called “splat”) unpacks the list, passing in 2 separate arguments instead, as if we had called:

>>> os.path.join('foo', 'bar')
'foo/bar'

Incidentally, ** exists for unpacking dicts.
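A quick illustration (the function and dict here are made up purely for the example):

>>> def greet(first, last):
...     return 'Hello {} {}'.format(first, last)
...
>>> d = {'first': 'Foo', 'last': 'Bar'}
>>> greet(**d)
'Hello Foo Bar'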

What now?

There’s no error checking for sys.argv[1]; you could use argparse or click to build a proper command-line interface.
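A minimal argparse sketch (the argument name and help text are just suggestions):

import argparse

parser = argparse.ArgumentParser(description='Download a chapter from mangastream.com')
parser.add_argument('url', help='chapter URL, e.g. http://mangastream.com/r/demons_plan/010/3997/')
args = parser.parse_args()

url = args.url  # replaces sys.argv[1]; argparse prints a usage message if it's missing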

There is also a dropdown for the chapter list; however, it doesn’t contain all of the chapters.

<div class="btn-group btn-reader-chapter">
...
  <li><a href="http://mangastream.com/manga/demons_plan">Full List</a></li>

The full list is available from an <a> tag in the containing <div>. You could build a URL list for each chapter from there. That is moving into “crawling” territory though, and in such cases using something like Scrapy might be a good idea.
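As a starting point you could pull the “Full List” link out of that <div> with the same tools we’ve been using (a sketch, assuming the structure shown above):

chapter_div = soup.find('div', 'btn-group btn-reader-chapter')
full_list_url = chapter_div.find('a', string='Full List')['href']
# Fetching full_list_url and collecting the chapter links from that page would
# give one URL per chapter to feed into the script above.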

Perhaps we’ll discuss such an approach in a future part of this article.


In Part 2 we’ll convert the Python code to a bash script/function that uses curl and grep, a.k.a. doing it “The Wrong Way”.