Given the URL http://www.privataaffarer.se/borsguiden/analyser/ the goal is to extract or “scrape” the stock data from the table on the page using Python. There are multiple pages of results so we would like to loop or “crawl” through multiple pages of the results.

First we’ll open up the URL in our browser and view the Inspector tab. I’ve done this here by “right-clicking” on the page and selecting Inspect Element.

Do note that the Inspector tab shows your browser’s representation of the page after it has parsed the source HTML and as such it may differ from the actual source HTML.

Network

To debug HTTP requests we can view the Network tab. So let’s switch to the Network tab and then click on the “page 2” button under the stock data table.

Usually we’re looking for POST requests (the Method column) and luckily in this case it’s the first request that was made. We select the request and then use the Params tab (over on the right) to view the POST params that were sent.

The first thing we see is this __EVENTTARGET “variable”.

It appears these variables are used by webapps written in ASP.NET to track “state” (__VIEWSTATE) and to validate requests (__EVENTVALIDATION).

Let’s search for __EVENTTARGET inside the HTML source and see what we find.

If possible I normally fetch the HTML with the curl command (or Python’s requests library) and save it locally to perform searches using my editor e.g.

$ curl -A Mozilla/5.0 'http://blah.com' -o page.html

You can also “right-click” on a request in the Network tab and select Copy as cURL which will give you the full curl command to replicate the request.

You can add -o filename to the end of the command to save the output to filename

Searching for __EVENTTARGET in the HTML finds the following

<input type="hidden" name="__EVENTTARGET" id="__EVENTTARGET" value="" />
<input type="hidden" name="__EVENTARGUMENT" id="__EVENTARGUMENT" value="" />
<input type="hidden" name="__LASTFOCUS" id="__LASTFOCUS" value="" />
<input type="hidden" name="__VIEWSTATE" id="__VIEWSTATE" value="/wEPDwUKLTk4NDEz...

__VIEWSTATE and __EVENTVALIDATION contain HUGE strings of what appears to be some form of “hashed data”. What they contain isn’t exactly important; we just know we need to extract their values and send them with our request.

__EVENTTARGET has an empty value in the HTML which doesn’t match what our POST request sent but we’ll address that in a moment.

If we look at the rest of the POST params being sent we see these ctl00 type variables.

Let’s search through the HTML for the pattern <input type="hidden" name="ctl00 and see what we find.

The first result we get is

<input type="hidden" name="ctl00$FhMainContent$FhContent$ctl00$analysisFilter$filterParams" id="ctl00.." value="InstrumentIdFilter:...

It appears (for this page at least) the ctl00 variables have their value attribute set just like the __ type variables, which means we can extract them directly from the HTML.

We mentioned that __EVENTTARGET had an empty value in the HTML but in the POST params it had a value beginning with ctl00 so what exactly is going on?

If we view the HTML for the “page 2” button or we right-click it and “Copy Link Location” we will see the href value contains

javascript:__doPostBack('ctl00$FhMainContent$FhContent$ctl00$AnalysesCourse$CustomPager$dataPager$ctl01$ctl01','')

This ctl00 value matches the contents of __EVENTTARGET in the POST request.

If we inspect the “page 3” button we see

javascript:__doPostBack('ctl00$FhMainContent$FhContent$ctl00$AnalysesCourse$CustomPager$dataPager$ctl01$ctl02','')

All that changes here is 01 to 02 - this is how the “pagination” works.

  • Page 1 of the results is 00
  • Page 2 of the results is 01
  • Page 3 of the results is 02

… and so on.

When we click on a page button it sets the __EVENTTARGET variable to the corresponding ctl00 page value (using Javascript) and then posts the form.

Looking again at the POST params sent we also notice a couple of interesting values.

ctl00$FhMainContent$FhContent$ctl00$AnalysesCourse$CustomPager$max:   30
ctl00$FhMainContent$FhContent$ctl00$AnalysesCourse$CustomPager$total: 360

This suggests there are 360 pages of results with 30 items per result page. We could try sending a value higher than 30 to see if we can get more results per page (meaning we could send fewer requests). There are usually limits in place though, so the only way to find out would be by trial and error. We will just stick to using the default value of 30.

Code

We’ll be using requests to fetch the HTML and BeautifulSoup with html5lib to parse it. These can be installed using pip e.g.

pip install requests beautifulsoup4 html5lib --user

We will start with some (truncated) example output.

Behåll,Electrolux B,Danske Bank Markets,263,1,,02-maj
Sälj,Hexpol B,SEB Equities,98,45,82,02-maj
Köp,Swedish Orphan Biovitrum,Pareto Securities,136,8,200,02-maj
Behåll,Electrolux B,Pareto Securities,263,1,280,02-maj
Behåll,Hexpol B,Kepler Cheuvreux,98,45,102,02-maj
Sälj,Hexpol B,DNB Markets,98,45,89,02-maj
Behåll,Nobia,Danske Bank Markets,91,65,100,02-maj
Köp,Intrum Justitia,UBS,348,392,28-apr
Behåll,SCA B,Danske Bank Markets,297,7,315,28-apr
Behåll,SKF B,Danske Bank Markets,191,2,,28-apr
[...]

The code that generated it.

import requests

from bs4 import BeautifulSoup

url = 'http://www.privataaffarer.se/borsguiden/analyser/'

with requests.session() as s:
    s.headers['user-agent'] = 'Mozilla/5.0'

    r    = s.get(url)
    soup = BeautifulSoup(r.content, 'html5lib')

    target = (
        'ctl00$FhMainContent$FhContent$ctl00'
        '$AnalysesCourse$CustomPager$dataPager$ctl01$ctl{:02d}'
    )

    # unsupported CSS Selector 'input[name^=ctl00][value]'
    data = {
        tag['name']: tag['value']
        for tag in soup.select('input[name^=ctl00]')
        if tag.get('value')
    }

    state = {
        tag['name']: tag['value']
        for tag in soup.select('input[name^=__]')
    }

    data.update(state)

    # data['ctl00$FhMainContent$FhContent$ctl00$AnalysesCourse$CustomPager$total']
    last_page = int(soup.find('div', 'custom_pager_total_pages').input['value'])

    # for page in range(last_page + 1):
    for page in range(3):
        data['__EVENTTARGET'] = target.format(page)

        r    = s.post(url, data=data)
        soup = BeautifulSoup(r.content, 'html5lib')

        # unsupported CSS Selector 'tr:not(.tr_header)'
        for tr in soup.select('.analysis_table tr'):
            row = [ td.text.strip() for td in tr('td')[1:-1] ]
            if row:
                print(','.join(row))

Code breakdown

When making multiple requests with requests you’ll usually want to use a session object to maintain “state” and keep track of cookies.

You’ll also pretty much always want to change the default User-Agent header which we set here to Mozilla/5.0 as the default requests header tends to be blocked.

We must first send a GET request to the page so that we can extract the needed parameters to send in our subsequent POST requests.

target is just a string.

target = (
    'ctl00$FhMainContent$FhContent$ctl00'
    '$AnalysesCourse$CustomPager$dataPager$ctl01$ctl{:02d}'
)

Python will implicitly join adjacent string literals.

>>> 'this'    'is'      'one'      'string'
'thisisonestring'

To split it over multiple lines we need to add the surrounding (), which we do just for code formatting reasons, to keep under a certain line length.

>>> ( 'this'     'is'
...   'a'               'large'  '   string'
... )
'thisisalarge   string'

The {:02d} at the end of target is a .format() string specifier which will be used to “zero-pad” the page number in __EVENTTARGET.

>>> '{:02d}'.format(1)
'01'
>>> '{:02d}'.format(100)
'100'

The 2 specifies the minimum length of the result and the leading 0 means it will pad with 0 instead of a space character.

>>> '{:03d}'.format(1)
'001'
>>> '{:3d}'.format(1)
'  1'

The d specifies we want it output as a decimal integer. For more information about format() see the docs.

The declaration of data may look “odd” if you’ve not seen the syntax before.

data = {
    tag['name']: tag['value']
    for tag in soup.select('input[name^=ctl00]')
    if tag.get('value')
}

It is called a dict comprehension which is just “shorthand” syntax for building a dict. We could use a “regular” for loop instead.

data = {}
for tag in soup.select('input[name^=ctl00]'):
    if tag.get('value'):
        key, value = tag['name'], tag['value']
        data[key] = value

Feel free to use this form if you prefer but it’s good to be aware of what comprehensions look like.

We’re using BeautifulSoup’s select() here to find all <input> tags whose name attribute startswith ctl00 i.e. select('input[name^=ctl00]')

The ^= here represents the “startswith” operator within a CSS Selector.

There are some matching ctl00 tags that do not have a value attribute and we want to skip these. We can use the selector [attribute] to specify that attribute must exist.

Combined with input[name^=ctl00] we would get input[name^=ctl00][value], which is a valid selector; however, BeautifulSoup doesn’t support it.

For “more complete” CSS Selector support popular choices include lxml.html.cssselect() and parsel.

As we cannot use the unsupported selector we add the tag.get('value') check to the code. What does it do?

You can access the value of an attribute by using dict indexing on a BeautifulSoup Tag object (you can also access the .attrs dict directly e.g. tag.attrs['name']).

If we try to index the value key and it’s not present we will raise a KeyError exception.

>>> print({'name': 'input'}['value'])
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
KeyError: 'value'

If you use the dict.get() method however it will just return None (you can also supply get() with a default return value if desired).

>>> print({'name': 'input'}.get('value'))
None
>>> print({'name': 'input'}.get('value', 'o hai'))
o hai

None is a “falsey” value, so if tag.get('value') evaluates to False when there is no value attribute present.

find(find_all()) is False

Some people may ask why don’t you just use find() or find_all() instead of using select()?

It is certainly possible to use find_all() in this instance.

soup.find_all(lambda tag: tag and tag.name == 'input' and tag.get('name') and tag['name'].startswith('ctl00'))

But who wants to be typing all of that, amirite?

We’re extracting all of the __ variables (e.g. __VIEWSTATE etc) and storing them in the state dict.

state = { tag['name']: tag['value'] for tag in soup.select('input[name^=__]') }

It’s identical to the data declaration apart from the startswith condition, and there is no need to test whether a value attribute is present as the __ variables always have one (although it wouldn’t hurt to add the check).

We’re then using data.update(state) to add all the items from the state dict into the data dict. You could think of this as a “merge” operation.

>>> data  = {'a': 1, 'b': 2}
>>> state = {'c': 3, 'd': 4}
>>> data.update(state)
>>> data
{'a': 1, 'c': 3, 'b': 2, 'd': 4}

We don’t use the state variable anywhere else so we could have passed the dict comprehension directly to update() instead.

data.update(
    { tag['name']: tag['value'] for tag in soup.select('input[name^=__]') }
)

BeautifulSugar

The total number of pages is contained in one of the ctl00 variables but we could also extract it from the HTML using find() as shown in the code.

soup.find('div', 'custom_pager_total_pages').input['value']

This is using some syntactic sugar provided by BeautifulSoup and is shorthand for

find('div', {'class': 'custom_pager_total_pages'}).find('input')['value']

If you supply a second argument to find() or find_all() it defaults to matching against the class attribute. find('tag') can be replaced with .tag as BeautifulSoup makes tags available as attributes. Also, calling a soup or Tag object directly invokes find_all(), meaning find_all('table') can be shortened to just ('table').

Now that we have all our POST params set up we just need to loop through the page numbers and set __EVENTTARGET then make our request. We’ve just looped through the first 3 pages here as an example but you could loop through them all by using last_page + 1 in the range() instead just like the commented loop line in the code. It all depends on how far you want to go back.

To send form-encoded data along with our request (POST params) we pass a dictionary to the data argument à la s.post(url, data=data)

You may wonder why we’re sending a request for Page 1 as we already have that data from our initial .get() call.

The reason is that otherwise we would have to duplicate the code for “scraping” the stock data: once for the first page and then again inside the for loop for the subsequent pages.

Scraping tables

The stock data is contained inside a <table> tag.

<table class=" table_header analysis_table table">

We want to get “all” <tr> tags inside of this table. We’ve used soup.select('.analysis_table tr') to do so.

In a CSS Selector .word tests if word exists inside the class attribute of an item. We’ve omitted the tag name here but we could have used table.analysis_table which would state we only want to search for <table> tags but analysis_table is unique in the HTML so it’s not needed. How explicit / specific you need to be with your selectors depends on the HTML you’re dealing with.

There are 2 tables on the page; the data we want is in the first one. This means we could also use soup.table('tr') (which is equivalent to soup.find('table').find_all('tr')) to get the same results as the select() in this instance.

Obviously this assumes the data is always in the first <table> found which may be considered a more “fragile” approach compared to explicitly testing for the unique class word.

We said earlier we wanted all <tr> tags however that is not entirely true. The first <tr> in the table is inside the <thead> which contains the “headers” in <th> tags.

<tr>
  <th class="stock_status" data-columns="data_stock_course_0">

The “headers” are also repeated in multiple rows throughout the table.

<tr class="tr_header">
  <th class="stock_status" data-columns="data_stock_course_0" data-priority="1">

So when we try to find <td> tags inside those <tr> tags we will get an empty list as the result.

>>> [ td.text for td in soup.table.tr('td') ]
[]
>>> [ td.text.strip() for td in soup.table('tr')[-1]('td') ]
['', 'Neutral', 'Investor B', 'Swedbank', '399,6', '405', '25-apr', '']

An empty list is “falsey”, meaning our if row: check will only be True if any <td> tags were found, which in turn has the side-effect of skipping the unwanted header “rows”.

The [ ... for ... ] syntax used here is called a list comprehension. It is just shorthand syntax for building lists similar to dict comprehensions for building dicts.

It can be written using a “regular” for loop instead.

for tr in soup.select('.analysis_table tr'):
    row = []
    for td in tr('td'):
        row.append(td.text.strip())

So we’re processing each <tr> tag then finding all the <td> tags it contains and extracting their .text content and calling strip() to remove the surrounding whitespace.

You may have noticed from the previous result that both the first and last entries in the resulting list are empty strings. This is why we have [1:-1] in the code.

>>> row = ['', 'Neutral', 'Investor B', 'Swedbank', '399,6', '405', '25-apr', '']
>>> row[1:-1]
['Neutral', 'Investor B', 'Swedbank', '399,6', '405', '25-apr']

The [start:end] syntax is called slicing. It gives us back a new list from the start index up to (but not including) the end index.

Finally we use a simple print(','.join(row)) to produce the final output. Please note however that you should use the csv module if you actually want to produce proper CSV data.

That’s it!

So what we have works (the code is also on github) however when needing to loop or “crawl” through multiple pages of results you may want to consider using Scrapy as it handles a lot of things for you such as parallel requests, error handling, request retries, etc.