The goal is to “scrape” media posts from an Instagram page using Python’s BeautifulSoup and requests libraries; however, only the “first page” of results is being returned. Why is this?

We were not given an example URL so we will use https://www.instagram.com/thefatfoxcamden/ for testing purposes (as they post awesome pictures).

You should know that Instagram has an API.

  • You use it, taking the blue pill—the article ends.
  • You take the red pill—you stay in Wonderland, and I show you how deep a JSON response goes.

Remember: all I’m offering is the truth. Nothing more.

“Developer Tools”

So the first thing I do is open the URL in my browser (Firefox with Javascript disabled) and I see the following.

This tells us that Instagram doesn’t function without Javascript enabled. We were told, however, that using requests (which does not execute Javascript) returned the first page of results, so there must be some data contained in the source HTML.

So what we want to do is to debug the HTTP requests being made to fetch the “Next Page” of results. To do this we can use the Network tab of the browser which is located in what is commonly referred to as the “Developer Tools”.

I right-click on the page and select Inspect Element to open up the Inspector tab. (Although you can access it directly from the menu or using keyboard shortcuts.)

With the Inspector tab open I then select the Network tab. I also filter the type of requests displayed to only XHR, which stands for XMLHttpRequest. These are the types of requests made by Javascript, and we already know Javascript is doing the work, so we want to look at these requests specifically.

I then re-enable Javascript in my browser and refresh the page.

With the Network tab open we can select an individual request and an Information Panel will appear on the right. We can then select the Response tab within that panel to view information about the response received.

Let’s take a closer look at the response.

So the response is JSON data and we can see it contains start_cursor and end_cursor which look like they have something to do with “pagination”.

We see count: 402 which is the same number of posts on the page, and we also see nodes which appears to be a list containing the information for each media item.

Interesting values for each node include

  • code - this is used in the “post” URL e.g. https://www.instagram.com/p/BUGg5r6gf5T/
  • display_src - this appears to be a direct URL to the “source” (an image in this case)
  • id - the “node id”

You may notice that the id of the first “node” is the same value as the page start_cursor

  • 1515190529653195341

Scrolling down to look at the id of the last “node”..

.. we can see that it matches the value of the page end_cursor

  • 1511430693194883315

Let’s take a look at the Params tab to see what data was sent in the request.

In the q param we see ig_user(3612106348) which suggests that 3612106348 is the id of this Instagram user’s account.

There is also media.after(AQDJ-....,+12 which suggests the AQDJ-... string is an id for a particular media item and ,+12 looks like it’s saying “Give me the next 12 items after this id”.

This makes sense because as you may have noticed 12 is the number of media items on a page of results. The AQDJ-... string however, doesn’t seem to follow the naming pattern of the “node” ids we got in the response.

Looking back at the main Network tab we can see that a second POST request was also made so let’s take a closer look at that.

Let’s take a look at the Params sent with the request.

We can see ig_user is the same but the media.after() string is now 1511430693194883315,+12 and this 1511430693194883315 value matches the page end_cursor value from the first request made.

So it looks like we can extract the end_cursor value from a Response and use it in the q parameter to fetch the next page of results.
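In rough pseudo-Python the plan looks like this (resp and q_template are hypothetical names for the parsed JSON of a response and the query text; the JSON path assumes the nesting shown in the Response tab). We’ll build the real version of this later on.

# grab the cursor marking the end of the current page...
end_cursor = resp['media']['page_info']['end_cursor']
# ...and splice it into the next query's media.after(...,+12) call
q = q_template % end_cursor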

If we right-click on a request we get some useful options.

We can

  • Copy Response which will copy the Response tab data (i.e. the JSON data from above)
  • Copy POST Data which will copy the Params tab data (i.e. q, ref, query_id)
  • Copy as cURL which will give the full curl command to replicate the request

Copy as cURL

curl is a command-line tool for sending requests similar to how we use requests in Python.

Let’s copy the command and run it in our shell.

The actual command-line generated is rather large so [...] here just represents the remainder of the command.

$ curl 'https://www.instagram.com/query/' --2.0 -H 'Host: www.instagram.com' [...]
curl: option --2.0: is unknown
curl: try 'curl --help' or 'curl --manual' for more information

Oops, my version of curl doesn’t support --2.0, let’s try again with that option removed.

$ curl 'https://www.instagram.com/query/' -H 'Host: www.instagram.com' [...]
{"media": {"page_info": {"start_cursor": "1515043026307776083", "end_cursor": "1511400656827605604" ...

It works and we get our JSON response. I re-ran curl without the ref param and it still worked. I then ran it without the query_id param and it still worked. So it seems that q is the only param that gets checked by the query endpoint.

curl prints to stdout by default; we can use -o filename to write the output to filename instead, or we could use shell redirection, e.g. > filename
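For example, either of these would save the JSON response to out.json (with [...] again standing in for the rest of the command):

$ curl 'https://www.instagram.com/query/' [...] -o out.json
$ curl 'https://www.instagram.com/query/' [...] > out.json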

The -H option to curl is for setting HTTP Headers. You may have noticed when we right-clicked on our request above there was also a Copy Request Headers option which gives us a nice view of them.

POST /query/ HTTP/1.1
Host: www.instagram.com
User-Agent: Mozilla/5.0 (...
Accept: */*
Accept-Language: en-GB,en;q=0.5
Accept-Encoding: gzip, deflate, br
DNT: 1
X-CSRFToken: vzHPj1VZ7ga7LkszmFc1u7e6CqwI2Bvv
X-Instagram-AJAX: 1
Content-Type: application/x-www-form-urlencoded
X-Requested-With: XMLHttpRequest
Referer: https://www.instagram.com/thefatfoxcamden/
Content-Length: 559
Cookie: rur=ATN; csrftoken=vzHPj1VZ7ga7LkszmFc1u7e6CqwI2Bvv; ...
Connection: keep-alive

We could also view them in the Headers tab of the request Information Panel.

Through trial and error (simply retrying the request, omitting one value at a time) I discovered that the request fails without the Referer header set to an instagram.com URL.

It will also fail without a valid X-CSRFToken Header and the corresponding csrftoken Cookie entry.
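Putting that together, a trimmed-down command using just those pieces would look something like this (the header and cookie values are the ones from above, and the q data is elided):

$ curl 'https://www.instagram.com/query/' \
    -H 'Referer: https://www.instagram.com/thefatfoxcamden/' \
    -H 'X-CSRFToken: vzHPj1VZ7ga7LkszmFc1u7e6CqwI2Bvv' \
    -b 'csrftoken=vzHPj1VZ7ga7LkszmFc1u7e6CqwI2Bvv' \
    --data 'q=...'

-b is curl’s option for sending cookies and --data supplies the POST body.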

So where did this csrftoken value come from?

We mentioned in the beginning that it looked like there was some data of importance contained in the source HTML so let’s go look there.

We’ll use curl to fetch and save the HTML of the profile page, which allows us to search through it more easily using our favourite editor.

$ curl -A Mozilla/5.0 https://www.instagram.com/thefatfoxcamden/ -o page.html

-A sets the User-Agent header which we do as the default curl value is commonly blocked.

If we do a case-insensitive search for csrf inside page.html we find

<script type="text/javascript">window._sharedData = {
  "activity_counts": null, "config": {"csrf_token": "VCyFuVr7P1pZzFkxQmkA1z9GpNo1ZOub",
  ...

So there is a csrf_token contained in the source HTML inside this window._sharedData (Javascript) variable. If we take a closer look at the contents of this variable we can see that it also seems to contain the JSON response data from the first POST request we looked at.

While we are searching let’s search for the ig_user() id number from earlier.

"id": "3612106348"

Hopefully this demonstrates that searching in the source HTML for data that was sent along with the request you’re debugging can be an important step in solving the puzzle.

If we search start_cursor we find no match however we do find a match for end_cursor

{"has_next_page": true, "end_cursor": "AQBB1hheP9G_e4Ui95IhF1zt5jbd2KloyjJj...

… but it does seem to differ from the media.after() param that was sent in the very first request.

This does suggest though, that we can:

  • make an initial GET request to the profile page
  • extract id, csrf_token and end_cursor
  • use those values in our POST request to https://www.instagram.com/query/

Code

We can use the Copy POST Data functionality mentioned earlier to populate our code with the needed data and then make the appropriate edits to turn it into a Python dict for passing along to the data argument of requests.post()

Although in this instance we’re actually only using the q param, and we want to “inject” the values of ig_user and end_cursor into it, so we use the %s style string formatting to allow us to do so.
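As a quick reminder of what %s substitution does, here it is with the two ids we found earlier:

>>> 'ig_user(%s)+{+media.after(%s,+12)' % ('3612106348', '1511430693194883315')
'ig_user(3612106348)+{+media.after(1511430693194883315,+12)'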

I mentioned earlier that through trial and error I discovered what headers were needed but you could of course just copy ALL the headers from the Network tab instead of trying to figure that out.

import re, requests

profile = 'https://www.instagram.com/thefatfoxcamden/'
api     = 'https://www.instagram.com/query/'

q = '''\
ig_user(%s)+{+media.after(%s,+12)+{
++count,
++nodes+{
++++__typename,
++++caption,
++++code,
++++comments+{
++++++count
++++},
++++comments_disabled,
++++date,
++++dimensions+{
++++++height,
++++++width
++++},
++++display_src,
++++id,
++++is_video,
++++likes+{
++++++count
++++},
++++owner+{
++++++id
++++},
++++thumbnail_src,
++++video_views
++},
++page_info
}
+}\
'''

with requests.session() as s:
    s.headers['user-agent'] = 'Mozilla/5.0'

    r = s.get(profile)

    ig_user    = re.search(r'"id": "([^"]+)"',         r.text).group(1)
    csrftoken  = re.search(r'"csrf_token": "([^"]+)"', r.text).group(1)
    end_cursor = re.search(r'"end_cursor": "([^"]+)"', r.text).group(1)

    data    = {'q': q % (ig_user, end_cursor)}
    headers = {'Referer': r.url, 'X-CSRFToken': csrftoken}

    r = s.post(api, cookies={'csrftoken': csrftoken},
               data=data, headers=headers)

    print(r.json())

We have created the s variable here which is a requests.session() object.

A session object “allows you to persist certain parameters across requests. It also persists cookies across all requests made from the Session instance.”

Generally speaking, if you are making more than a single request (as we are doing) you will want to use a session object.

This is why we are using s.get() and s.post() as opposed to requests.get() and requests.post() as we are calling them from our “Session instance”.
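As a quick demonstration of that cookie persistence (using the httpbin.org testing service rather than Instagram):

>>> import requests
>>> s = requests.session()
>>> s.get('https://httpbin.org/cookies/set?foo=bar')  # server sets a cookie
<Response [200]>
>>> s.get('https://httpbin.org/cookies').json()       # cookie is sent back automatically
{'cookies': {'foo': 'bar'}}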

We’re setting the User-Agent header to Mozilla/5.0 as the default requests value is commonly blocked.

Yes!!! To extract id, csrf_token and end_cursor we’re using Regular Expressions!!!!! which according to the Internet we’re never supposed to use. Zomg!!!

We will have to let the h8rz h8 as they say.

(?:problems){99}(?!regex)

Each regex follows the same pattern:

  • "id": " matches literally the string "id": "
  • ( delimits the start of a Capture Group
  • [^"] matches any characters that is not " (this is called a Character Class)
  • + means the “previous atom” 1 or more times which in this case applies to [^"]
  • ) closes the Capture Group and the final " matches the closing "

So [^"]+ matches up to (but not including) the closing " (i.e. a sequence of 1 or more characters that are not ") thus matching the value contained inside the "" and because we have “captured” it we can refer to it using group(1)

>>> text = '"id": "hello"'
>>> re.search(r'"id": "([^"]+)"', text).group(1)
'hello'

.group(1) gives us the contents of the first Capture Group.

The other patterns are identical apart from the key name changing.

It can be common for the whitespace around : to change so we could replace "key": " with "key"\s*:\s*" to be more “robust”.

\s* here matches a sequence of 0 or more “whitespace” characters.
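For example, the “robust” version handles both the spaced and unspaced variants:

>>> re.search(r'"id"\s*:\s*"([^"]+)"', '"id":"hello"').group(1)
'hello'
>>> re.search(r'"id"\s*:\s*"([^"]+)"', '"id" : "hello"').group(1)
'hello'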

So after all of that let’s hope it works.

$ python post.py
{'status': 'fail', 'message': 'syntax error'}

FUUUUUUUUUU!

So it looks like our request “worked” but the error message suggests that there is something wrong with our q param.

When passing a dict to the data argument the values are form-encoded by requests however you can also pass a string which will go through untouched.
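We can see the difference by inspecting the body of a “prepared” request (requests.Request(...).prepare() is part of the requests API):

>>> import requests
>>> requests.Request('POST', 'http://example.com', data={'q': 'a+b'}).prepare().body
'q=a%2Bb'
>>> requests.Request('POST', 'http://example.com', data='q=a+b').prepare().body
'q=a+b'

Note how form-encoding turns + into %2B. The + signs in our q data appear to be encoded spaces, so encoding them a second time is presumably what was breaking the query.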

We know from debugging the POST request with netcat that passing a string to the data argument bypasses the setting of the Content-Type header so we must set it manually.

headers = {
    'Content-Type': 'application/x-www-form-urlencoded',
    ...

Also note that because we’re using a string we must manually add the param name in our string data (i.e. the q=)

q = '''\
q=ig_user(%s)+{...
...
...\
'''

When we pass data={'foo': 'bar'} it gets turned into foo=bar i.e. foo is the param name and bar is the param value.

data = q % (ig_user, end_cursor) # no longer a dict

As we are bypassing this functionality by passing a string we must manually add q= to the start of our content as previously shown.

With these 2 changes let’s re-run the code. We’re using -i this time to drop us into an interactive session.

$ python -i post.py
{'status': 'ok', 'media': {'count': 402, 'page_info': {'has_previous_page':...

So it works! And now that we are in an interactive session we can take a closer look at r, the response object.

>>> r.json().keys()
['status', 'media']
>>> r.json()['media'].keys()
['count', 'page_info', 'nodes']
>>> print(r.json()['media']['nodes'][0]['caption'])
Cure that Monday fear ☺️

We know that we’re receiving a JSON response from this request so we use the .json() method on a Response object which turns a JSON “string” into a Python structure (also see json.loads())
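For example:

>>> import json
>>> json.loads('{"status": "ok"}')
{'status': 'ok'}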

To see a pretty-printed version of the JSON data we can use json.dumps() with its indent argument.

>>> import json
>>> print(json.dumps(r.json(), indent=2, sort_keys=True))
{
  "media": {
    "count": 402, 
    "nodes": [
      {
        "__typename": "GraphImage", 
        "caption": "Cure that Monday fear \u263a\ufe0f", 
        "code": "BUHCcJHA8JN", 
        "comments": {
          "count": 4
        }, 
        "comments_disabled": false, 
        "date": 1494844808, 
        "dimensions": {
          "height": 1080, 
          "width": 1080
        }, 
        "display_src": "https://scontent.cdninstagram.com/...",
        "id": "1515190529653195341", 
        "is_video": false, 
        "likes": {
          "count": 160
        }, 
        "owner": {
          "id": "3612106348"
        }, 
        "thumbnail_src": "https://scontent.cdninstagram.com/..."
      }, 

I’ve truncated the output here due to its size. It can be useful in such cases to write the pretty-printed output to a file instead for further inspection.
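For example (output.json here is just a name we’ve picked):

>>> with open('output.json', 'w') as f:
...     json.dump(r.json(), f, indent=2, sort_keys=True)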

Now that we have our JSON response we can extract the end_cursor value to use in a POST request to fetch the next “page” of results.

>>> data = q % (ig_user, r.json()['media']['page_info']['end_cursor'])
>>> r = s.post(api, cookies={'csrftoken': csrftoken}, data=data, headers=headers)
>>> print(r.json()['media']['nodes'][0]['caption'])
Blurry eyes ?

… and again?

>>> data = q % (ig_user, r.json()['media']['page_info']['end_cursor'])
>>> r = s.post(api, cookies={'csrftoken': csrftoken}, data=data, headers=headers)
>>> print(r.json()['media']['nodes'][0]['caption'])
Lunch in the sun what more could you ask for!

GREAT SUCCESS!

So now that we have our “pagination” working we could use another regex to “scrape” all the URLs contained in the display_src values to give us direct links to the images.

src_urls = re.findall(r'"display_src": "([^"]+)"', r.text)

But, but, but… what about muh videoz?!?!?!?!?!?!!?

Extracting JSON from HTML

You may have noticed the "is_video": false in the JSON output above. In the case of a video it would obviously be true and not false; however, the display_src will be the “preview image” of the video and not a link to the video itself.

In order to get the direct link to the video more work is needed.

We know that code is used to build the URL to an individual “post” page and we know that Instagram is storing JSON data in the source HTML via this window._sharedData variable, so let’s fetch a video post page and take a look.

If we didn’t know the name of window._sharedData we could always search for some of the JSON keys instead e.g. is_video

We’ll be using https://www.instagram.com/p/BSfdmLugXjd/ as our URL for testing.

$ curl -A Mozilla/5.0 https://www.instagram.com/p/BSfdmLugXjd/ -o post.html

Locating the sharedData we can see the following.

<script type="text/javascript">window._sharedData = {
  "activity_counts": null, "config": 
  {"csrf_token": "Ht46CMuWhTX5ohHVf38i2JYDukekxTnj", "viewer": null}, 
  "entry_data": {"PostPage": 
    [{"graphql": {"shortcode_media": {"__typename": "GraphVideo", "id": "1486036569335888093", 
     "shortcode": "BSfdmLugXjd", "dimensions": {"height": 800, "width": 640}, 
     "video_url": "https://scontent.cdninstagram.com/t50.2886-16/17815456_1265625920187368_2182862195560284160_n.mp4", 
     "video_view_count": 856, "is_video": true

So we have a video_url in there with the direct link to the .mp4 file which we could extract by using the regex approach from earlier.
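That extraction would be the same one-liner pattern as before, run against the fetched post page:

video_url = re.search(r'"video_url": "([^"]+)"', r.text).group(1)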

This leaves us with the question of how do we process each “node” from our results page to

  • test if is_video is true
  • if it is: extract code - fetch URL - extract video_url
  • else, extract display_src

This means we would need to be able to group code, is_video and display_src together for each “node” entry.

We know that the data is contained inside the window._sharedData variable and it looks like it’s in JSON format.

<script type="text/javascript">window._sharedData = {
...
"show_app_install": true};</script>

If we could isolate this part of the HTML and just keep the {...} contents we could try passing it to json.loads() to turn it into a Python structure.

Let’s return to page.html we saved earlier for some testing.

>>> with open('page.html') as f:
...     text = f.read()
...
...     start = 'window._sharedData = {'
...     end   = ';</script>'
...
...     data = text[text.find(start) + len(start) - 1:]
...     data = data[:data.find(end)]
...
...     j = json.loads(data)
...     print(j.keys())
...
['show_app_install', 'platform', 'activity_counts', 'hostname', ...

For “teh lulz” we’ve used str.find() in combination with slicing to extract the needed data; however, if you’re not regexphobic you could have also gone that route.

data = re.search(r'window._sharedData = (\{.+?});</script>', text).group(1)

str.find() obviously works with exact string matching only, meaning the regex approach can be considered more “flexible” as we could make use of things like \s* to match any optional whitespace.
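For example, a more lenient version of the regex (also escaping the . in _sharedData, since an unescaped . is a regex metacharacter matching any character):

data = re.search(r'window\._sharedData\s*=\s*(\{.+?});</script>', text).group(1)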

You may see both techniques on your travels so it just serves as an example here.

So our plan seems to have worked…

Let’s use json.dumps() to get a pretty-printed version.

>>> print(json.dumps(j, indent=2, sort_keys=True))
{
  "config": {
    "csrf_token": "Fa4xUEE8aohXO7UEgZEXPAvyobAOQvVF", 
  }, 
  "country_code": "IE", 
  "entry_data": {
    "ProfilePage": [
      {
        "logging_page_id": "profilePage_3612106348", 
        "user": {
          "biography": "7-4 weekdays.", 
          "followed_by": {
            "count": 7423
          }, 
          "followed_by_viewer": false, 
          "follows": {
            "count": 94
          }, 
          "follows_viewer": false, 
          "full_name": "The Fat Fox", 
          "id": "3612106348", 
          "media": {
            "count": 402, 
            "nodes": [
              {

So we can see that nodes is buried down inside j['entry_data']['ProfilePage'][0]['user']['media']['nodes']

Let’s try iterating over it and extracting some values.

>>> for node in j['entry_data']['ProfilePage'][0]['user']['media']['nodes']:
...     print(node['code'], node['is_video'])
BUHCcJHA8JN False
BUGg5r6gf5T False
BUCMkOvgIQM False
[...]

This means we now have a way to process each “node” and deal with both images and videos correctly which we can incorporate into our code to process each page of results.

But wait, there’s more!

We mentioned at the start that Instagram had an API so let’s look at the docs for the Get most recent media endpoint.

If we look at the PARAMETERS we see

  • max_id Return media earlier than this max_id.
  • min_id Return media later than this min_id.

This sounds similar to what’s been going on with end_cursor and start_cursor.

Out of curiosity I tried to replace our POST request with a GET request to https://www.instagram.com/thefatfoxcamden/?max_id=end_cursor, which seemed to work; however, we get back HTML instead of JSON.

The HTML contains all of the JSON data just like the other pages and we just saw how to extract it and load it up into json.loads() so let’s give that approach a try here.

import re, requests

profile = 'https://www.instagram.com/thefatfoxcamden/'

with requests.session() as s:
    s.headers['user-agent'] = 'Mozilla/5.0'

    r = s.get(profile)

    caption    = re.search(r'"caption": "([^"]+)"',    r.text).group(1)
    end_cursor = re.search(r'"end_cursor": "([^"]+)"', r.text).group(1)

    print(caption)

Do note we’re using a regex to just extract the first caption from the JSON data as we’re only testing if the “pagination” is working.

Once again we’ll use Python’s -i option to drop us into an interactive session.

$ python -i get.py
Cure that Monday fear ☺️

Now let’s try to get the next page of results. Note that as it’s now a GET request we use params as opposed to the data argument.

>>> r = s.get(profile, params={'max_id': end_cursor})
>>> caption = re.search(r'"caption": "([^"]+)"'   , r.text).group(1)
>>> print(caption)
Blurry eyes ?

One more for good luck.

>>> end_cursor = re.search(r'"end_cursor": "([^"]+)"', r.text).group(1)
>>> r = s.get(profile, params={'max_id': end_cursor})
>>> caption = re.search(r'"caption": "([^"]+)"'   , r.text).group(1)
>>> print(caption)
Lunch in the sun what more could you ask for!

EVEN GREATER SUCCESS!

This means we now have no need for the disgusting q= POST param, csrftoken or Cookie, and all we need to do now is add in the code from earlier that processed each “node”.

instagram.py

import json, re, requests

user    = 'thefatfoxcamden'
profile = 'https://www.instagram.com/' + user

with requests.session() as s:
    s.headers['user-agent'] = 'Mozilla/5.0'

    end_cursor = ''

    for count in range(1, 3):
        print('PAGE: ', count)

        r = s.get(profile, params={'max_id': end_cursor})

        data = re.search(
            r'window._sharedData = (\{.+?});</script>', r.text).group(1)

        j = json.loads(data)

        media = j['entry_data']['ProfilePage'][0]['user']['media']

        # grab the next cursor now, while r still holds the profile page
        # (fetching a video post page below would overwrite r)
        end_cursor = media['page_info']['end_cursor']

        for node in media['nodes']:
            if node['is_video']:
                page = 'https://www.instagram.com/p/' + node['code']
                r = s.get(page)
                url = re.search(r'"video_url": "([^"]+)"', r.text).group(1)
                print('VIDEO:', url)
            else:
                print('IMAGE:', node['display_src'])

So the code should be easy enough to follow as we’ve discussed all the parts already.

We are looping over the first 2 pages of results as an example. end_cursor is initially empty so we will get the first page of results.

We then extract the JSON data and process each “node”, printing display_src if it’s an image and fetching the individual post page to extract the video_url if it’s a video.

We also extract the end_cursor (note we do this before processing the nodes, as fetching a video’s post page replaces r), which is used to fetch the next page of results, and repeat.

The output is a PAGE: line for each page of results, followed by an IMAGE: or VIDEO: line containing the direct URL for each node.

We’re obviously just printing the URLs here and the goal is probably to fetch them and save the result to disk instead.
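Fetching and saving could look something like this (a minimal sketch; s is our session, url is a display_src or video_url value, and the filename logic is just one arbitrary choice):

# download a single media URL and write it to disk
r = s.get(url)
filename = url.split('?')[0].rsplit('/', 1)[-1]  # last path segment, minus any query string
with open(filename, 'wb') as f:
    f.write(r.content)  # r.content holds the raw bytes of the response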

Summary

  • Instagram has an API and if you’re attempting to do anything remotely resembling “large-scale” scraping you should probably use it and abide by their API guidelines.

  • Initially having Javascript disabled can be helpful.

  • The Network tab is “teh total r0x0r”

  • Searching the source HTML for param names and values is a worthwhile exercise.

  • (?:problems){99}(?!regex)

  • You just “built an instagram scraper in 30 lines of Python!!!! Zomg!!!!”

Source code is also available on GitHub.