The goal: given the Wikipedia page of a film, extract the text content of its Plot section using Python. We’ve been given https://en.wikipedia.org/wiki/Dangal_(film) as an example URL.

The first thing we will do is open the page in our browser and view the Inspector tab. I’ve done this here by right-clicking on the page and selecting Inspect Element.

I then use the Selector tool (the first button on the panel to the left of Inspector) and click the Plot header to display that specific section of the HTML.

Do note that the Inspector tab shows your browser’s representation of the page after it has parsed the source HTML and as such it may differ from the actual source HTML.

So we can see that the Plot header is inside a <span> tag with id="Plot" which is itself inside a <h2> tag.

<h2>
  <span class="mw-headline" id="Plot">Plot</span>
</h2>
<p>
  plot 1
</p>
<p>
  plot 2
</p>
[...]

The actual Plot content is contained in one or more <p> tags that follow the <h2> tag. In this particular example there are six “sibling” <p> tags.

Code

We’ll be using requests to fetch the HTML and BeautifulSoup with html5lib to parse it. You can install these using pip install beautifulsoup4 requests html5lib --user if you have not already.

We will start by first isolating the <span> tag.

>>> import requests
>>> from   bs4 import BeautifulSoup
>>> 
>>> r    = requests.get('https://en.wikipedia.org/wiki/Dangal_(film)')
>>> soup = BeautifulSoup(r.content, 'html5lib')
>>> 
>>> soup.select_one('#Plot')
<span class="mw-headline" id="Plot">Plot</span>
>>> soup.find('span', {'id': 'Plot'})
<span class="mw-headline" id="Plot">Plot</span>

There are various ways of selecting particular elements and we’ve shown two examples here. First we use the CSS selector #Plot, which will match any element that has id="Plot". The # here matches against the id attribute.

We’ve omitted the tag name because no other element matches that id; we could have been more explicit and used span#Plot to restrict the search to <span> tags.

There is also find(), to which we pass the tag name and a dict of attribute names and values to match against.

We will use the select_one() approach for now as it’s less typing.
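To confirm the two approaches locate the same element, here’s a minimal sketch using an inline HTML fragment mimicking the structure shown earlier. It uses Python’s built-in html.parser rather than html5lib, which is fine for a snippet this small:

```python
from bs4 import BeautifulSoup

# an inline fragment with the same shape as the Wikipedia markup
html = '<h2><span class="mw-headline" id="Plot">Plot</span></h2>'
soup = BeautifulSoup(html, 'html.parser')

# CSS selector approach: # matches against the id attribute
by_css = soup.select_one('span#Plot')

# find() approach: tag name plus a dict of attributes
by_find = soup.find('span', {'id': 'Plot'})

print(by_css is by_find)  # True - both return the same tag object
print(by_css.text)        # Plot
```

Both calls return the exact same Tag object from the parsed tree, so anything we chain onto one works identically on the other.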

From the <span> tag we can then search “upwards” using find_parent()

>>> soup.select_one('#Plot').find_parent('h2')
<h2><span class="mw-headline" id="Plot">Plot</span></h2>

We can then navigate to the sibling <p> tag using find_next_sibling()

>>> soup.select_one('#Plot').find_parent('h2').find_next_sibling()
<p>Mahavir Singh Phogat is an amateur wrestler ...

This gives us the first <p> tag but there can be multiple <p> tags containing the plot text. This means we want to keep moving to the next sibling tag until we reach a tag that is not a <p> tag.

We can use the .name attribute to access the name of a tag.

>>> tag = soup.select_one('#Plot').find_parent('h2').find_next_sibling()
>>> tag.name
'p'

We could combine this with a while loop to fetch the .text content of each individual <p> tag and store them all in a list.

>>> plot = []
>>> tag  = soup.select_one('#Plot').find_parent('h2').find_next_sibling()
>>> 
>>> while tag.name == 'p':
...     plot.append(tag.text)
...     tag = tag.find_next_sibling()
... 
>>> len(plot)
6
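To see the loop working end-to-end without a network request, here is a sketch using an invented fragment of the same shape as the Wikipedia markup (the plot text is made up, and we parse with the stdlib html.parser rather than html5lib):

```python
from bs4 import BeautifulSoup

# an invented fragment: a Plot heading, two plot paragraphs,
# then the next section heading which terminates the loop
html = '''
<h2><span class="mw-headline" id="Plot">Plot</span></h2>
<p>plot 1</p>
<p>plot 2</p>
<h2><span class="mw-headline" id="Cast">Cast</span></h2>
'''
soup = BeautifulSoup(html, 'html.parser')

plot = []
tag = soup.select_one('#Plot').find_parent('h2').find_next_sibling()

# keep collecting siblings until we reach something that is not a <p> tag
while tag.name == 'p':
    plot.append(tag.text)
    tag = tag.find_next_sibling()

print(plot)  # ['plot 1', 'plot 2']
```

Note that find_next_sibling() skips over the whitespace text nodes between tags and returns the next tag, which is why the loop moves cleanly from <p> to <p> and stops at the Cast <h2>.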

The actual goal was to extract, or “scrape”, the Plot summary for multiple films, so here is an example of how one could create a loop to do so.

import requests
from   bs4 import BeautifulSoup

films = 'Udta Punjab', 'Dangal (film)'
url   = 'http://en.wikipedia.org/wiki/'

with requests.session() as s:
    s.headers['user-agent'] = 'Mozilla/5.0'

    for film in films:
        r    = s.get(url + film)
        soup = BeautifulSoup(r.content, 'html5lib')

        plot = []
        tag  = soup.select_one('#Plot').find_parent('h2').find_next_sibling()

        while tag.name == 'p':
            plot.append(tag.text)
            tag = tag.find_next_sibling()

        # do something with plot

We’re setting the User-Agent header here to Mozilla/5.0 as the default requests value is commonly blocked. In this case Wikipedia does not block it so it’s not needed.

When making multiple requests it usually makes sense to use a session object to maintain “state” and keep track of cookies.

It also allows us to specify the User-Agent header once for all requests.
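As a minimal sketch (no network request is made here), headers set on the session are stored once and then sent with every request the session makes:

```python
import requests

with requests.session() as s:
    # set once; every subsequent s.get()/s.post() sends this header
    s.headers['user-agent'] = 'Mozilla/5.0'

    print(s.headers['user-agent'])  # Mozilla/5.0
```

Session headers are a case-insensitive dict, so s.headers['User-Agent'] refers to the same value.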