The goal is to extract or “scrape” information from the posts on the front page of a subreddit, e.g. http://reddit.com/r/learnpython/new/

You should know that Reddit has an API, and that PRAW exists to make using it easier.

  • You use it, taking the blue pill—the article ends.
  • You take the red pill—you stay in Wonderland, and I show you how deep a JSON response goes.

Remember: all I’m offering is the truth. Nothing more.

Reddit allows you to add a .json extension to the end of your request URL and will give you back a JSON response instead of HTML.

We’ll be using requests as our “HTTP client”, which you can install with pip install requests --user if you have not done so already.

>>> import requests
>>>
>>> url = 'http://reddit.com/r/learnpython/new/.json'
>>> r   = requests.get(url, headers={'user-agent': 'Mozilla/5.0'})
>>>
>>> len(r.json()['data']['children'])
25

We’re setting the User-Agent header to Mozilla/5.0 as the default requests value is blocked.
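
Out of curiosity, you can check what that default value looks like; the exact version string depends on the requests release you have installed:

>>> requests.utils.default_user_agent()
'python-requests/2.13.0'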

r.json()

We know that we’re receiving a JSON response from this request, so we use the .json() method on the Response object, which turns a JSON “string” into a Python structure (also see json.loads()).

>>> type(r.json())
<type 'dict'>
>>> type(r.json()['data'])
<type 'dict'>
>>> r.json()['data'].keys()
['modhash', 'children', 'after', 'before']
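
Under the hood .json() is essentially parsing r.text for us, so passing the raw body to json.loads() gives the same structure:

>>> import json
>>> json.loads(r.text) == r.json()
True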

To see a pretty-printed version of the JSON data we can use json.dumps() with its indent argument.

>>> import json
>>> print(json.dumps(r.json(), indent=2, sort_keys=True))
{
  "data": {
    "after": "t3_6bst7o", 
    "before": null, 
    "children": [
      {
        "data": {
          "approved_by": null, 
          "archived": false, 
          "author": "openflask", 
          "author_flair_css_class": null, 
          "author_flair_text": null, 
          "banned_by": null, 
          "brand_safe": true, 
          "can_gild": false, 
          "clicked": false, 
          "contest_mode": false, 
          "created": 1495117016.0, 
[...]

The output generated for this particular response is quite large so it makes sense to write the output to a file for further inspection.

>>> with open('reddit.json', 'w') as f:
...     print(json.dumps(r.json(), indent=2, sort_keys=True), file=f)

Note that if you’re using Python 2 you’ll need from __future__ import print_function to get the print() function that accepts a file argument (or you could just use json.dump()).
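
The json.dump() route writes straight to the file object and behaves the same on Python 2 and 3:

>>> with open('reddit.json', 'w') as f:
...     json.dump(r.json(), f, indent=2, sort_keys=True)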

Upon further inspection we can see that r.json()['data']['children'] is a list of dicts and each dict represents a submission or “post”.
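
A quick check shows the shape of each entry: the post itself lives under the data key, wrapped alongside a kind value (t3 is Reddit’s type prefix for a submission):

>>> children = r.json()['data']['children']
>>> type(children)
<type 'list'>
>>> sorted(children[0].keys())
['data', 'kind']
>>> children[0]['kind']
't3'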

There is also some information about the listing itself available.

>>> r.json()['data']['after']
't3_64o6gh'

These before and after values are used for result page navigation just like when you click on the next and prev buttons.

To get to the next page we can pass after=t3_64o6gh as a GET param.

>>> next_page_url = url + '?after=' + r.json()['data']['after']
>>> next_page = requests.get(next_page_url, headers={'user-agent': 'Mozilla/5.0'})
>>> next_page.json()['data']['children'][0]['data']['url']
'https://www.reddit.com/r/learnpython/comments/64o5yx/help_breakdown_list_comprehension_example/'

When making multiple requests, however, you will usually want to use a Session object, as sketched below.
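
Here is a rough sketch of the same pagination using a requests.Session; you set the User-Agent header once on the session, and the params argument builds the ?after=... query string for you (the variable names are just for illustration):

>>> s = requests.Session()
>>> s.headers.update({'user-agent': 'Mozilla/5.0'})
>>>
>>> first_page = s.get(url)
>>> after      = first_page.json()['data']['after']
>>> next_page  = s.get(url, params={'after': after})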

As mentioned, each submission is a dict and the important information is available inside the data key:

>>> posts = r.json()['data']['children']
>>> post  = posts[0]
>>>
>>> print(json.dumps(post, indent=2, sort_keys=True))
{
  "data": {
    "approved_by": null, 
    "archived": false, 
    "author": "HolyCoder", 
    "author_flair_css_class": null, 
    "author_flair_text": null, 
    "banned_by": null, 
    "brand_safe": true, 
    "clicked": false, 
    "contest_mode": false, 
    "created": 1491943248.0, 
    "created_utc": 1491914448.0, 
    "distinguished": null, 
    "domain": "self.learnpython", 
[...]

I’ve truncated the output here, but important values include author, selftext, title and url.

>>> post['data']['url']
'https://www.reddit.com/r/learnpython/comments/64qkav/efficiency_of_an_algorithm/'
>>> post['data']['title']
'Efficiency of an algorithm'

It’s pretty annoying having to index into ['data'] all the time, so we could instead have declared posts using a list comprehension.

>>> posts = [ post['data'] for post in r.json()['data']['children'] ]
>>> posts[0]['url']
'https://www.reddit.com/r/learnpython/comments/64qkav/efficiency_of_an_algorithm/'

One reason you may want to do this is to “scrape” the links from one of the “image posting” subreddits in order to access the images.

r/aww

One such subreddit is r/aww home of “teh cuddlez”.

>>> r = requests.get('http://www.reddit.com/r/aww/new/.json', headers={'user-agent': 'Mozilla/5.0'})
>>> for post in r.json()['data']['children']:
...     post['data']['url']
... 
'https://youtu.be/nJRf-fJNdJ4'
'https://i.redd.it/jmctfmktixqy.jpg'
'http://imgur.com/gallery/k5UvK'
'https://i.redd.it/q0ybf3nlixqy.jpg'
'http://i.imgur.com/JoF5FNd.jpg'
'http://i.imgur.com/NI5GuJf.gifv'
'http://i.imgur.com/UHg5RbU.jpg'
'https://i.redd.it/5jwksp64hxqy.jpg'
'http://i.imgur.com/Bninome.gifv'
'http://i.imgur.com/gs8rRg4.jpg'
'https://i.redd.it/m6xpbtuogxqy.jpg'
'https://i.redd.it/gwenjc8dgxqy.jpg'
'http://i.imgur.com/zLmZJTc.gifv'
'http://i.imgur.com/7ihgzyx.gifv'
'http://imgur.com/gallery/jjr6C'
'http://imgur.com/AkdxXuT.gifv'
'https://gfycat.com/UnpleasantBothJunco'
'https://i.redd.it/hk9y3kb8fxqy.jpg'
'http://imgur.com/ADLfEwY'
'https://i.redd.it/wfn3t1b5fxqy.jpg'
'https://i.redd.it/wa44h0zaexqy.jpg'
'https://i.redd.it/viy7gp1cexqy.jpg'
'https://i.redd.it/a1nw9bcydxqy.png'
'https://i.redd.it/wmbn7lf2owqy.jpg'
'https://i.redd.it/yestv4i5dxqy.jpg'

Some of these URLs would require further processing, though, as not all of them are direct links and not all of them are images.

In the case of the direct image links we could fetch them and save the result to disk.
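
As a rough sketch, assuming we only want the direct .jpg and .png links (the extension check and the use of the link’s basename as a filename are just illustrative choices):

>>> import os
>>>
>>> for post in r.json()['data']['children']:
...     link = post['data']['url']
...     if link.endswith(('.jpg', '.png')):
...         image = requests.get(link, headers={'user-agent': 'Mozilla/5.0'})
...         with open(os.path.basename(link), 'wb') as f:
...             f.write(image.content)
... 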

BeautifulSoup

You could of course just request the regular URL and process the HTML with BeautifulSoup and html5lib, which you can install using pip install beautifulsoup4 html5lib --user if you do not already have them.

>>> from bs4 import BeautifulSoup
>>>
>>> r = requests.get('http://reddit.com/r/aww/new/', headers={'user-agent': 'Mozilla/5.0'})
>>> soup = BeautifulSoup(r.content, 'html5lib')
>>> for div in soup.select('div.thing'):
...     div['data-url']
... 
'https://youtu.be/nJRf-fJNdJ4'
'https://i.redd.it/jmctfmktixqy.jpg'
'http://imgur.com/gallery/k5UvK'
'https://i.redd.it/q0ybf3nlixqy.jpg'
'http://i.imgur.com/JoF5FNd.jpg'
'http://i.imgur.com/NI5GuJf.gifv'
'http://i.imgur.com/UHg5RbU.jpg'
'https://i.redd.it/5jwksp64hxqy.jpg'
'http://i.imgur.com/Bninome.gifv'
'http://i.imgur.com/gs8rRg4.jpg'
'https://i.redd.it/m6xpbtuogxqy.jpg'
'https://i.redd.it/gwenjc8dgxqy.jpg'
'http://i.imgur.com/zLmZJTc.gifv'
'http://i.imgur.com/7ihgzyx.gifv'
'http://imgur.com/gallery/jjr6C'
'http://imgur.com/AkdxXuT.gifv'
'https://gfycat.com/UnpleasantBothJunco'
'https://i.redd.it/hk9y3kb8fxqy.jpg'
'http://imgur.com/ADLfEwY'
'https://i.redd.it/wfn3t1b5fxqy.jpg'
'https://i.redd.it/wa44h0zaexqy.jpg'
'https://i.redd.it/viy7gp1cexqy.jpg'
'https://i.redd.it/a1nw9bcydxqy.png'
'https://i.redd.it/wmbn7lf2owqy.jpg'
'https://i.redd.it/yestv4i5dxqy.jpg'

BeautifulSoup’s select() method locates items using CSS selectors, and div.thing here matches <div> tags that have thing as one of their class names, e.g. class="thing".

We can then use dict indexing on a BeautifulSoup Tag object to extract the value of a specific tag attribute.

In this case the URL is contained in the data-url="..." attribute of the <div> tag.
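
Dict indexing raises a KeyError if an attribute happens to be missing, so if you’re not sure an attribute will always be present you can use the Tag .get() method, which returns None instead (data-missing below is just a made-up attribute name to demonstrate the fallback):

>>> div = soup.select_one('div.thing')
>>> div.get('data-url') == div['data-url']
True
>>> div.get('data-missing') is None
True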

As already mentioned, Reddit does have an API with rules and guidelines, and if you want to do any kind of “large-scale” interaction with Reddit you should probably use it via the PRAW library.
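
For completeness, a minimal PRAW sketch of the same “new posts” listing might look like the following; it assumes a recent PRAW release and that you’ve registered a “script” app on Reddit to obtain the client_id and client_secret placeholders shown here:

>>> import praw
>>>
>>> reddit = praw.Reddit(client_id='YOUR_CLIENT_ID',
...                      client_secret='YOUR_CLIENT_SECRET',
...                      user_agent='my-user-agent')
>>>
>>> for submission in reddit.subreddit('learnpython').new(limit=25):
...     submission.title, submission.url
... 
[...]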