The goal is to automate the filling out of the form and clicking of the Generate new button on http://mazegenerator.net/ and then downloading the resulting maze image using Python’s requests library.

“Developer Tools”

Normally what happens when you click on a button like this is that a POST request is sent by the browser.

To view the requests being made by the browser we can use the Network tab from what is commonly referred to as “Developer Tools”.

To do so I “right-click” on the page and select Inspect Element which opens up the Inspector tab.

The Inspector tab shows the DOM representation of the page which may be of use to us later on.

From there I select the Network tab.

We then click on the Generate new button on the page to submit the form and the Network tab populates with 3 requests.

We can select individual requests, which brings up an Information Panel on the right. Here we’ve selected the Params tab to view the POST data that was sent along with the request.

We can also “right-click” on a request to get access to that information via the clipboard.

Copy as cURL

You may notice Copy as cURL in that list which gives you the full curl command used to replicate the request.

curl is a command-line tool which can be used as an “HTTP client” and is very useful for testing (or indeed for “scraping” itself).

In fact, we will go ahead and use the Copy as cURL functionality and run the generated command in the shell. We’ve added -o maze.html to the end of the command-line to store the output in a file named maze.html, as curl prints the output to “the screen” by default.

I’ve also used the grep command to search for ImageGenerator in the resulting maze.html file. If you look at the 3rd request made in the Network tab above, it is a GET request made to ImageGenerator.ashx and its response type is svg.

If we actually select that specific request and view the Response tab from the Information Panel on the right we can see that it is the maze image we want to save.

So we knew a request was being made to ImageGenerator.ashx which is where I got the search term ImageGenerator from.

Searching the source HTML for values sent in requests can be a useful step in the “debugging process”, and as we can see the search returned an <img> tag whose id attribute is MazeDisplay.

<img id="MazeDisplay" src="ImageGenerator.ashx?Tag=20170602062723&amp;MazeType=1&amp;Solution=0" alt="20 by 20 orthogonal maze" />

This means we can extract the URL from the src attribute of this tag (which is contained in the HTML generated from the POST request) and use it to make a GET request to the actual image file.
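As a minimal sketch of that extraction step, the snippet below pulls the src value out of the tag shown above using only the standard library's html.parser, so it runs without a network connection. The class name ImgSrcFinder and the variable names are made up for illustration and are not part of the final script. Note that the parser decodes the &amp;amp; entities in the attribute value for us.

```python
from html.parser import HTMLParser


class ImgSrcFinder(HTMLParser):
    """Remember the src attribute of the <img> tag with a given id."""

    def __init__(self, wanted_id):
        super().__init__()
        self.wanted_id = wanted_id
        self.src = None

    def handle_starttag(self, tag, attrs):
        # attrs is a list of (name, value) pairs, entities already decoded
        attrs = dict(attrs)
        if tag == 'img' and attrs.get('id') == self.wanted_id:
            self.src = attrs.get('src')


# The <img> tag found by grep in maze.html
html = ('<img id="MazeDisplay" '
        'src="ImageGenerator.ashx?Tag=20170602062723&amp;MazeType=1&amp;Solution=0" '
        'alt="20 by 20 orthogonal maze" />')

finder = ImgSrcFinder('MazeDisplay')
finder.feed(html)
print(finder.src)
```

The printed value is a relative URL, so it still has to be combined with the site address before it can be requested.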

__VIEWSTATE

Let’s take a closer look at the Params that were sent in the POST request.

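These “double-underscore” params are used in ASP.NET web applications. __VIEWSTATE carries the serialized “state” of the page between requests (somewhat like a cookie, but stored in a hidden form field) and __EVENTVALIDATION is used for “validating” that a submitted form matches the one the server rendered.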

As they are used for validation that means we must send them in our request but where do their values come from?

Well, you may recall that we searched the source HTML for ImageGenerator, so let’s try that again, this time with __VIEWSTATE.

$ grep __VIEWSTATE maze.html
<input type="hidden" name="__VIEWSTATE" id="__VIEWSTATE" value="stO5jq/0Fm1h...

The actual value is a rather large string of data so we have just shown the first few characters.

This means that we must first make a GET request to http://mazegenerator.net/ and extract the values of all the “double-underscore” params to then use in our POST request.

If we use the Copy POST data from earlier we can see the full list of Params sent.

__EVENTTARGET
__EVENTARGUMENT
__LASTFOCUS
__VIEWSTATE=stO5jq/0Fm1h...
__VIEWSTATEGENERATOR=CA0B0334
__EVENTVALIDATION=B/F/jIHl...
ShapeDropDownList=1
S1TesselationDropDownList=1
S1WidthTextBox=20
S1HeightTextBox=20
S1InnerWidthTextBox=0
S1InnerHeightTextBox=0
S1StartsAtDropDownList=1
AlgorithmParameter1TextBox=50
AlgorithmParameter2TextBox=100
GenerateButton=Generate

Once again the __VIEWSTATE and __EVENTVALIDATION values are usually very large strings of data so we have just displayed the first few characters here.

The rest of the params sent have to do with the options used in the form for customizing the maze image to be generated.
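To get a feel for what the browser actually sends, here is a small sketch that form-encodes a few of these params with the standard library. The __VIEWSTATE value is a short made-up placeholder (the real one is much longer and has to be taken from the page), and only three fields are included to keep the output readable.

```python
from urllib.parse import urlencode

# Placeholder values for illustration only: the real __VIEWSTATE is far
# longer and changes, so it must be scraped from the page each time.
form = {
    '__VIEWSTATE': 'abc/def+g=',
    'ShapeDropDownList': 1,
    'GenerateButton': 'Generate',
}

# Characters such as / + and = are percent-encoded in the request body
body = urlencode(form)
print(body)
```

This is roughly the encoding that requests applies for us when we pass a dict via the data argument, so we never need to build this string by hand.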

We can use the “Selector” tool (the button to the left of Inspector) to show us the DOM representation of a particular element from the page in the Inspector tab.

It is worth noting that what we see in the Inspector tab is the result after our browser (I’m using Firefox here) has parsed the source HTML, and as such it may differ from the actual source HTML.

Here is what the Shape section looks like.

<div class="ShapeSection">
  <label for="ShapeDropDownList" id="ShapeLabel">Shape:</label>
    <select name="ShapeDropDownList" onchange="..." id="ShapeDropDownList">
      <option selected="selected" value="1">Rectangular</option>
      <option value="2">Circular</option>
      <option value="3">Triangular</option>
      <option value="4">Hexagonal</option>

So if we wanted to generate a Circular maze we would pass a value of 2 instead of the default value 1 for a Rectangular maze.
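One way to express that in Python is to keep the default form values in a dict and override individual fields. This is only a sketch: build_form_data is a hypothetical helper name, and it assumes the remaining fields can stay at their rectangular defaults, which has not been checked against the site for the other shapes.

```python
# Default form values, copied from the POST data listed earlier
DEFAULTS = dict(
    ShapeDropDownList=1,
    S1TesselationDropDownList=1,
    S1WidthTextBox=20,
    S1HeightTextBox=20,
    S1InnerWidthTextBox=0,
    S1InnerHeightTextBox=0,
    S1StartsAtDropDownList=1,
    AlgorithmParameter1TextBox=50,
    AlgorithmParameter2TextBox=100,
    GenerateButton='Generate',
)


def build_form_data(**overrides):
    """Return a copy of DEFAULTS with the given fields replaced."""
    data = dict(DEFAULTS)
    data.update(overrides)
    return data


# 2 is the option value for "Circular" in the Shape <select>
circular = build_form_data(ShapeDropDownList=2)
print(circular['ShapeDropDownList'])
```

Copying the defaults first means the original dict is left untouched, so several variations can be built from it.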

Code

As mentioned at the beginning the goal was to use the requests library as our “HTTP Client” and we will also be using BeautifulSoup with html5lib to “scrape” the HTML to extract our needed data.

You can install these using pip install beautifulsoup4 requests html5lib --user if you have not already.

import requests
from bs4 import BeautifulSoup

url = 'http://mazegenerator.net/'

with requests.session() as s:
    s.headers['user-agent'] = 'Mozilla/5.0'

    r    = s.get(url)
    soup = BeautifulSoup(r.content, 'html5lib')

    state = {
        tag['name']: tag['value'] for tag in soup.select('input[name^=__]')
    }

    data = dict(
        ShapeDropDownList=1,
        S1TesselationDropDownList=1,
        S1WidthTextBox=20,
        S1HeightTextBox=20,
        S1InnerWidthTextBox=0,
        S1InnerHeightTextBox=0,
        S1StartsAtDropDownList=1,
        AlgorithmParameter1TextBox=50,
        AlgorithmParameter2TextBox=100,
        GenerateButton='Generate'
    )

    data.update(state)

    r    = s.post(url, data=data)
    soup = BeautifulSoup(r.content, 'html5lib')

    img = soup.find('img', id='MazeDisplay')

    print(url + img['src'])

    with open('maze.svg', 'wb') as f:
        maze = s.get(url + img['src']).content
        f.write(maze)

We have created the s variable here on line 6 which is a requests.session() object.

A session object “allows you to persist certain parameters across requests. It also persists cookies across all requests made from the Session instance.”

We want our headers and cookies to persist between our multiple requests so we must use one.

Generally speaking, if you’re making more than a single request you will want to use a session object.

This is why we are using s.get() and s.post() as opposed to requests.get() and requests.post() as we are calling them from our “Session instance”.

We’re setting the User-Agent header for our requests to Mozilla/5.0 as the default requests value is commonly blocked.
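The sketch below shows both points without sending anything over the network: it prints the default User-Agent that requests would otherwise use, then builds (but does not send) a request from a session to confirm that the session-level header is applied to it. The variable name prepared is just for illustration.

```python
import requests

# The default value identifies the client as python-requests/<version>
print(requests.utils.default_user_agent())

with requests.Session() as s:
    s.headers['user-agent'] = 'Mozilla/5.0'

    # prepare_request() merges the session's headers into the request
    # without actually sending it
    request  = requests.Request('GET', 'http://mazegenerator.net/')
    prepared = s.prepare_request(request)

    print(prepared.headers['User-Agent'])
```

Header names are matched case-insensitively, which is why setting user-agent replaces the default User-Agent entry.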

On line 12 we’re creating the state variable using a dict comprehension.

This is just shorthand syntax for creating a dict which we could have created using a regular for loop.

state = {}
for tag in soup.select('input[name^=__]'):
    key   = tag['name']
    value = tag['value']
    state[key] = value

BeautifulSoup

BeautifulSoup’s select() method takes a CSS Selector and will return all matches found.

There is also select_one() which will return the first match.

>>> soup.select_one('input[name^=__]')
<input id="__EVENTTARGET" name="__EVENTTARGET" type="hidden" value=""/>

The CSS Selector input[name^=__] matches <input> tags whose name attribute starts with __.

The ^= here being the startswith condition.

When used with select() this will extract all of the “double-underscore” params we need.

We can then use regular dict-indexing on the found Tag objects to extract the values of the tag attributes as we do here to extract the id attribute.

>>> soup.select_one('input[name^=__]')['id'] 
'__EVENTTARGET'

So we end up with state being a dict of the param names and param values to be used with our POST request.
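Here is a small offline sketch of that step, run against a hand-written stand-in for the form rather than the real page. The HTML snippet and its values are made up, html.parser is used instead of html5lib only to avoid the extra dependency, and the selector value is quoted, which is equivalent to the unquoted form used in the script.

```python
from bs4 import BeautifulSoup

# Made-up stand-in for the hidden fields on the real page
html = '''
<form>
  <input type="hidden" name="__EVENTTARGET" id="__EVENTTARGET" value="" />
  <input type="hidden" name="__VIEWSTATE" id="__VIEWSTATE" value="abc" />
  <input type="text" name="S1WidthTextBox" id="S1WidthTextBox" value="20" />
</form>
'''

soup = BeautifulSoup(html, 'html.parser')

# Only inputs whose name starts with __ are selected
state = {
    tag['name']: tag['value'] for tag in soup.select('input[name^="__"]')
}

print(state)
```

The S1WidthTextBox input is skipped because its name does not start with two underscores.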

We then create another dict on line 16 containing the default values for the maze generation.

We are using dict() to create it as opposed to the “regular” syntax.

>>> dict(foo=1, bar=2)
{'foo': 1, 'bar': 2}

The reason for using dict() is that it allows us to omit the quotes around the key names, which made it less work to adapt the output of the Copy POST data result.

On line 29 we call data.update(state) which merges the 2 dicts together.

>>> state = { '__VIEWSTATE': 'abc' }
>>> data  = { 'shape': 1, 'width': 20 }
>>> data.update(state)
>>> data
{'shape': 1, 'width': 20, '__VIEWSTATE': 'abc'}

We then perform our POST request and extract the src URL from the <img> tag.

find('img') will find the first <img> tag which in this case returns the tag we’re after as there are no other <img> tags in the result.

It can be useful however to be more explicit and we can pass id='MazeDisplay' (which is shorthand for {'id': 'MazeDisplay'}) to also match on the value of the id attribute of a particular tag.

>>> soup.find('img')
<img alt="20 by 20 orthogonal maze" id="MazeDisplay" src="ImageGenerator.ashx?Tag=20170602062723&amp;MazeType=1&amp;Solution=0"/>
>>> soup.find('img', id='MazeDisplay')
<img alt="20 by 20 orthogonal maze" id="MazeDisplay" src="ImageGenerator.ashx?Tag=20170602062723&amp;MazeType=1&amp;Solution=0"/>
>>> soup.find('img', id='MazeDisplay')['src']
'ImageGenerator.ashx?Tag=20170602062723&MazeType=1&Solution=0'

Finally we can use the dict-indexing as we did earlier to extract the value of the src attribute.

It is a relative URL so we prefix it with http://mazegenerator.net/ to get the full URL to the image, which we then fetch and write() to a file on lines 38-40.
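Simple concatenation works here because the base URL ends with a slash and the src value does not start with one. As a hedged alternative, the standard library's urljoin handles both cases; the src value below is the one from the earlier example and the variable names are only for illustration.

```python
from urllib.parse import urljoin

base = 'http://mazegenerator.net/'
src  = 'ImageGenerator.ashx?Tag=20170602062723&MazeType=1&Solution=0'

# Same result as base + src for this particular page
full = urljoin(base, src)
print(full)

# Also correct when the path is root-relative, where plain
# concatenation would produce a double slash
rooted = urljoin(base, '/' + src)
print(rooted)
```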

Note that as it is an image we’re saving, we pass the b mode to open() for “binary mode”.

The resulting maze image is small in size. When dealing with large downloads, however, you will want to stream the download and write to the file in chunks.
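A minimal sketch of a chunked download might look like the following. save_streamed is a hypothetical helper name and the 8192-byte chunk size is an arbitrary choice; the function expects a requests session such as the s object from the script.

```python
def save_streamed(session, url, path, chunk_size=8192):
    """Download url using session and write it to path chunk by chunk."""
    # stream=True defers fetching the body until we iterate over it
    with session.get(url, stream=True) as r:
        r.raise_for_status()
        with open(path, 'wb') as f:
            for chunk in r.iter_content(chunk_size=chunk_size):
                f.write(chunk)


# Possible usage inside the with-block of the script:
# save_streamed(s, url + img['src'], 'maze.svg')
```

Only one chunk is held in memory at a time, so the approach scales to files much larger than our maze image.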

Finally, we can run the code and check the output file using the file command.

$ python mazegenerator.net.py 
http://mazegenerator.net/ImageGenerator.ashx?Tag=20170604063729&MazeType=1&Solution=0
$ file maze.svg
maze.svg: SVG Scalable Vector Graphics image

Summary

To use mazes from mazegenerator.net for commercial purposes you need a commercial license.

The final code can also be found on GitHub.