The task is to extract the message text from a forum post using Python’s BeautifulSoup library.

The problem is that within the message text there can be quoted messages which we want to ignore.

Here is the example HTML structure we are given.

<div class="message-container" id="m179492397">
  <div class="message-top">
    <!--############I need USER and DATE###########-->
    <b>From:</b> <a href="">User</a> |
    <b>Posted:</b> 5/23/2017 11:39:34 PM |
    <!--#########IM USING THE PIPES GIVEN TO SPLIT THIS STRING-->
    <a href="">Filter</a> |
    <a href="">Message Detail</a> |
    <a href="" onclick="return QuickPost.publish('quote', this);">Quote</a>
  </div>
  <table class="message-body">
    <tbody>
      <tr>
        <td msgid="t,9659435,179492397@0" class="message">
          <!--#############MESSAGE TOP DIV I WANT TO IGNORE############-->
          <div class="quoted-message" msgid="t,9659435,179492364@0">
            <div class="message-top">
              From: <a href="">username</a> | Posted: 5/23/2017 11:37:42 PM
              <a href="jump-arrow"></a>
            </div>
            <!--#############MESSAGE TOP DIV I WANT TO IGNORE############-->
            <div class="quoted-message" msgid="t,9659435,179492344@0">
              <div class="message-top">
                From: <a href="">user</a> | Posted: 5/23/2017 11:36:36 PM
                <a href="" class="jump-arrow"></a>
              </div>
              BUMPPPPPPPPPPP
            </div>
          </div>
          <br>
          <!--#########################-->
          <!--#############JUST WANT THIS MESSAGE###########-->
          for example, say you were trying to scrape this thread, but ignore the quotes.
          <br>
          how would you do it?<br>
          ---<br>
          <!--##################################################-->
          <!--USING THESE THREE DASHES AS A WAY TO SPLIT THE MESSAGE-->
          <!--EVERYTHING BELOW THESE DASHES IS THE SIGNATURE-->
          <!--I WANT TO IGNORE THIS TOO -->
          <!--##################################################-->
          LLs number one fan<br>
        </td>
        <td class="userpic">
          <div class="userpic-holder">
            <a href="">
              <span class="img-loaded" style="width:150px;height:107px" id="u0_8">
                <img src="/" width="150" height="107">
              </span>
            </a>
          </div>
        </td>
      </tr>
    </tbody>
  </table>
</div>

As well as the message text, we’ve also been asked to extract the “User” and “Posted date” of each message.

BeautifulSoup

We’ve condensed the sample HTML down to use in our code example.

from bs4 import BeautifulSoup

html = '''
<div class="message-container">
  <div class="message-top">
    <b>From:</b> 
    <a href="">User</a> | 
    <b>Posted:</b> 5/23/2017 11:39:34 PM | 
  </div>
  <table class="message-body">
    <tr>
      <td>
        <div class="quoted-message">
        quote1
        </div>
        <div class="quoted-message">
        quote2
        </div>
        text i want here
        ---<br>
        i dont want this
      </td>
    </tr>
  </table>
</div>
'''

soup = BeautifulSoup(html, 'html5lib')

We’re using BeautifulSoup with the html5lib parser to parse the HTML. If you do not already have them, you can install both with pip install beautifulsoup4 html5lib.

We’ll use python -i to execute our code and leave us in an interactive session.

$ python -i extract-forum-messages.py 
>>> print(soup.find('div', 'message-container').find('td').text)

        
        quote1
        
        
        quote2
        
        text i want here
        ---
        i dont want this

As you can see, using .text (or .get_text()) on the <td> tag that contains the full message text also includes the quoted messages.

One approach is to remove the <div class="quoted-message"> tags from the post body, which we can do using the decompose() method.

>>> body = soup.find('div', 'message-container').find('td')
>>> for quote in body.find_all('div', 'quoted-message'):
...     quote.decompose()
... 
>>> print(body.text)

        
        
        text i want here
        ---
        i dont want this

We can then split() on the --- marker, keeping only the text that comes before it.

>>> print(body.text.strip().split('---'))
['text i want here\n        ', '\n        i dont want this']
>>> print(body.text.strip().split('---')[0].strip())
text i want here
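
The two steps above can be wrapped in a small helper. This is only a sketch under a few assumptions: the function name message_text is made up for illustration, the signature is assumed to always start at the first ---, and Python's built-in html.parser is used instead of html5lib so that nothing extra needs installing. It also passes recursive=False so that only top-level quotes are removed directly; nested quotes disappear along with their parent.

```python
from bs4 import BeautifulSoup

html = '''
<table class="message-body">
  <tr>
    <td>
      <div class="quoted-message">
        quote1
        <div class="quoted-message">nested quote</div>
      </div>
      <div class="quoted-message">quote2</div>
      text i want here
      ---<br>
      i dont want this
    </td>
  </tr>
</table>
'''


def message_text(body, marker='---'):
    """Return the text of `body` without quotes or signature.

    Hypothetical helper. Note that it modifies `body` in place,
    because decompose() removes the quote tags from the tree.
    """
    # Only direct children: a nested quote is removed with its parent.
    for quote in body.find_all('div', 'quoted-message', recursive=False):
        quote.decompose()
    # Keep only what precedes the first signature marker.
    return body.get_text().split(marker)[0].strip()


soup = BeautifulSoup(html, 'html.parser')
result = message_text(soup.find('td'))
print(result)
```

Because the helper changes the tree, parse the HTML again if the quotes are needed later.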

With the message text extracted, we can move on to the “User”, which is simple enough as it is contained in the first <a> tag inside the message-container.

>>> soup.find('div', 'message-container').find('a')
<a href="">User</a>
>>> soup.find('div', 'message-container').find('a').text
'User'

The “Posted date” is not as simple. One approach is to target the 2nd <b> tag and then use .next to navigate to our destination: the first .next gives the text inside the <b> tag, and the second gives the text that follows it.

>>> soup.find('div', 'message-container').find_all('b')[1].next.next
' 5/23/2017 11:39:34 PM | \n  '
>>> soup.find('div', 'message-container').find_all('b')[1].next.next.strip('\n |')
'5/23/2017 11:39:34 PM'
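
As a possible alternative to chaining .next twice, the same text node can be reached with .next_sibling. The sketch below also looks the <b> tag up by its text rather than by position, which is a design choice of this example and not part of the original approach. It assumes the label is exactly Posted: and uses the built-in html.parser.

```python
from bs4 import BeautifulSoup

html = '''
<div class="message-top">
  <b>From:</b> <a href="">User</a> |
  <b>Posted:</b> 5/23/2017 11:39:34 PM |
</div>
'''

soup = BeautifulSoup(html, 'html.parser')

# Find the <b> tag whose text is exactly "Posted:".
label = soup.find('b', string='Posted:')

# .next_sibling is the node following the <b> tag at the same level,
# which here is the text node holding the date.
raw = label.next_sibling
posted = raw.strip(' \n|')
print(posted)
```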

Regex

Another option is to use a Regular Expression on the text content of the message-top <div> to extract it.

>>> import re
>>> soup.find('div', 'message-top').text
'\n    From: \n    User | \n    Posted: 5/23/2017 11:39:34 PM | \n  '
>>> re.search(r'Posted:\s*(.+?)\s*\|', soup.find('div', 'message-top').text)
<_sre.SRE_Match object at 0x7fdd5d485648>
>>> re.search(r'Posted:\s*(.+?)\s*\|', soup.find('div', 'message-top').text).group(1)
'5/23/2017 11:39:34 PM'

We’re using the pattern Posted:\s*(.+?)\s*\| which breaks down as follows:

  • Posted: matches the literal string
  • \s matches a “whitespace” character
  • * means the previous atom (\s in this case) 0 or more times
  • ( starts a Capture Group
  • .+? matches “anything”
  • ) ends our Capture Group
  • \s* matches 0 or more “whitespace” characters
  • \| matches the literal | character

| has special meaning in regex so needs to be escaped to be matched literally.
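
To illustrate the difference escaping makes, here is a small standalone sketch. The sample string is made up for the example.

```python
import re

text = 'Posted: 5/23/2017 11:39:34 PM | Filter'

# Unescaped, | separates two alternatives: "PM " or " Filter".
unescaped = re.search(r'PM | Filter', text).group(0)

# Escaped, \| matches the pipe character itself.
escaped = re.search(r'PM \|', text).group(0)

# re.escape() adds the backslash for us.
pipe = re.escape('|')

print(repr(unescaped))
print(repr(escaped))
print(pipe)
```

The unescaped pattern still matches, but it stops at the space after PM because the pipe was never part of what it was looking for.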

.+? matching “anything” may need some further explanation.

  • . matches “any character”
  • + means the previous “atom” 1 or more times
  • ? makes the .+ non-greedy

By non-greedy we mean that it matches as little as possible (shortest match) as opposed to the default of as much as possible (longest match).

>>> re.search(r'(.+) \|', 'one | two | three').group(1)
'one | two'
>>> re.search(r'(.+?) \|', 'one | two | three').group(1)
'one'

This means the (.+?)\s*\| part of our pattern says to Capture “anything” up until we encounter a sequence of “whitespace” followed by a | character.

We can then use the .group(1) call to access the contents of the first Capture Group that we matched.
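
Once the date string is captured, it can be turned into a datetime object. This is a sketch only: the function name posted_date is hypothetical, and the format string assumes the forum always writes dates as month/day/year with a 12-hour clock, as in the sample.

```python
import re
from datetime import datetime

# Same pattern as above, compiled once for reuse.
POSTED = re.compile(r'Posted:\s*(.+?)\s*\|')


def posted_date(text):
    """Return the posted date as a datetime, or None if not found."""
    match = POSTED.search(text)
    if match is None:
        return None
    # Assumed format: month/day/year hour:minute:second AM/PM
    return datetime.strptime(match.group(1), '%m/%d/%Y %I:%M:%S %p')


top = '\n    From: \n    User | \n    Posted: 5/23/2017 11:39:34 PM | \n  '
when = posted_date(top)
print(when.isoformat())
```

Checking for None before calling .group(1) avoids an AttributeError on messages where the pattern does not match.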

Code

Now that we know how to extract each needed item let’s combine them together.

First, we have the output.

$ python extract-forum-messages.py
User #1
5/23/2017 11:39:34 PM
text i want here #1
User #2
5/23/2017 11:39:34 PM
text i want here #2
User #3
5/23/2017 11:39:34 PM
text i want here #3

Then, the code.

from bs4 import BeautifulSoup

template = '''
<div class="message-container">
  <div class="message-top">
    <b>From:</b> 
    <a href="">User #{0}</a> | 
    <b>Posted:</b> 5/23/2017 11:39:34 PM | 
  </div>
  <table class="message-body">
    <tr>
      <td>
        <div class="quoted-message">
        quote1
        </div>
        <div class="quoted-message">
        quote2
        </div>
        text i want here #{0}
        ---<br>
        i dont want this
      </td>
    </tr>
  </table>
</div>
'''

# Build a "page" of three messages from the template.
html = ''.join(template.format(n) for n in range(1, 4))

soup = BeautifulSoup(html, 'html5lib')

for message in soup('div', 'message-container'):
    # Remove the quoted messages before reading the text.
    for quote in message.table('div', 'quoted-message'):
        quote.decompose()

    user = message.a.text
    date = message('b')[1].next.next.strip(' \n|')
    # Everything after --- is the signature.
    text = message.table.text.split('---')[0].strip()

    print(user)
    print(date)
    print(text)

We’ve used template here to build multiple messages as they would appear on a “forum page”.
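
If the results are needed for more than printing, the same steps could be collected into a list of dictionaries. The following is a sketch, not a drop-in replacement: parse_messages is a hypothetical name, the sample is trimmed to one quote per message, and it uses the built-in html.parser and explicit find() / find_all() calls.

```python
from bs4 import BeautifulSoup

template = '''
<div class="message-container">
  <div class="message-top">
    <b>From:</b> <a href="">User #{0}</a> |
    <b>Posted:</b> 5/23/2017 11:39:34 PM |
  </div>
  <table class="message-body">
    <tr>
      <td>
        <div class="quoted-message">quote1</div>
        text i want here #{0}
        ---<br>
        i dont want this
      </td>
    </tr>
  </table>
</div>
'''


def parse_messages(html):
    """Return one dict per message with user, date and text."""
    soup = BeautifulSoup(html, 'html.parser')
    records = []
    for message in soup.find_all('div', 'message-container'):
        body = message.find('table', 'message-body')
        # Remove quoted messages before reading the text.
        for quote in body.find_all('div', 'quoted-message'):
            quote.decompose()
        records.append({
            'user': message.find('a').get_text(),
            # The text node right after the second <b> holds the date.
            'date': message.find_all('b')[1].next_sibling.strip(' \n|'),
            'text': body.get_text().split('---')[0].strip(),
        })
    return records


records = parse_messages(''.join(template.format(n) for n in (1, 2)))
for record in records:
    print(record)
```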

You may notice the lack of any find() or find_all() calls in the code.

Calling a soup or tag object directly, without a method name, is the same as calling find_all() on it, meaning that the following are equivalent.

  • soup.find_all('div')
  • soup('div')

There is also a “shortcut” for find(): accessing the tag name as an attribute.

  • soup.find('div')
  • soup.div

If a string is passed as the second argument to either of the find methods, it is matched against the class attribute, meaning these are all equivalent.

  • soup('div', 'message-container')
  • soup('div', {'class': 'message-container'})
  • soup.find_all('div', {'class': 'message-container'})
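
The equivalences listed above can be checked with a short script. The sample HTML is made up for the example, and the class_ keyword form is included as one more spelling of the same search.

```python
from bs4 import BeautifulSoup

html = '''
<div class="message-container">one</div>
<div class="other">two</div>
<div class="message-container">three</div>
'''

soup = BeautifulSoup(html, 'html.parser')

# Three spellings of the same class search.
explicit = soup.find_all('div', {'class': 'message-container'})
keyword = soup.find_all('div', class_='message-container')
shortcut = soup('div', 'message-container')

# soup.div is shorthand for soup.find('div'): the first match only.
first = soup.div

print([tag.get_text() for tag in shortcut])
print(first.get_text())
```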

Not everybody appreciates this kind of “API” provided by BeautifulSoup, which is why some people recommend parsel or lxml.html instead.