Given a http://fightmetric.com profile page, the goal is to extract, or “scrape”, the fighter’s details, i.e. name, record and stats. We’ve been given http://www.fightmetric.com/fighter-details/2f5cbecbbe18bac4 as an example URL.

The first step is to open the page in a browser and view the Inspector tab. I’ve done it here by right-clicking on the page and selecting Inspect Element.

We can then use the Selector tool (the first button on the panel to the left of Inspector) to click on a specific element on the page to display the HTML.

Do note that the Inspector tab shows your browser’s representation of the page after it has parsed the source HTML and as such it may differ from the actual source HTML.

So it looks like the name, record and nickname can be easily extracted.

<h2 class="b-content__title">
  <span class="b-content__title-highlight">
    Shamil Abdurakhimov
  </span>
  <span class="b-content__title-record">
    Record: 16-4-0
  </span>
</h2>
<p class="b-content__Nickname">
  Abrek
</p>

They each have a unique class attribute which we can target. The “stats” sections, however, look rather messy.

<div class="b-list__info-box b-list__info-box_style_small-width js-guide">
  <ul class="b-list__box-list">
    <li class="b-list__box-list-item b-list__box-list-item_type_block">
      <i class="b-list__box-item-title b-list__box-item-title_type_width">
        Height:
      </i>
      6' 3"
    </li>
    <li class="b-list__box-list-item b-list__box-list-item_type_block">
      <i class="b-list__box-item-title b-list__box-item-title_type_width">
        Weight:
      </i>
      235 lbs.
    </li>
    [...]

Each “stat” name is inside an <i> tag whose class attribute contains b-list__box-item-title. The <i> tag is itself inside an <li> tag, and the stat value is the text that follows the <i> tag. If we isolate the <i> tag, we can then use .next to navigate to the stat value.

Code

We’ll be using requests to fetch the HTML and BeautifulSoup with html5lib to parse it. You can install these using pip install beautifulsoup4 requests html5lib --user if you have not already.

>>> import requests
>>> from   bs4 import BeautifulSoup
>>>
>>> url = 'http://www.fightmetric.com/fighter-details/2f5cbecbbe18bac4'
>>> r   = requests.get(url, headers={'user-agent': 'Mozilla/5.0'})
>>>
>>> soup = BeautifulSoup(r.content, 'html5lib')

We’re setting the User-Agent header to Mozilla/5.0 because the default requests User-Agent, which identifies itself as python-requests, is commonly blocked.
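As a rough sketch of the point above, the snippet below prints the User-Agent that requests sends by default and wraps the fetch in a small helper. The helper name fetch_html, the 10 second timeout and the call to raise_for_status() are assumptions added for illustration; they are not part of the original walkthrough.

```python
import requests

# The User-Agent requests sends when we don't override it,
# of the form 'python-requests/<version>'.
default_ua = requests.utils.default_user_agent()
print(default_ua)

def fetch_html(url, user_agent='Mozilla/5.0', timeout=10):
    """Hypothetical helper: fetch a page with a browser-like User-Agent."""
    r = requests.get(url, headers={'user-agent': user_agent}, timeout=timeout)
    r.raise_for_status()  # raise an exception on 4xx/5xx responses
    return r.content
```

Calling fetch_html(url) with the example URL should return the same bytes as r.content in the session above, assuming the site is reachable.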

We mentioned that the name had a unique class attribute and we can use the CSS Selector .b-content__title-highlight to match it. The . here matches against the class attribute.

We could be more explicit and use span.b-content__title-highlight, i.e. also specify the tag name, but there are no other tags with that class so it can be omitted.

We’re using the select_one() method which takes a CSS Selector and returns the first matching tag.

>>> soup.select_one('.b-content__title-highlight')
<span class="b-content__title-highlight">\n     Shamil Abdurakhimov\n\n         </span>
>>> soup.select_one('.b-content__title-highlight').text
'\n                Shamil Abdurakhimov\n\n            '
>>> soup.select_one('.b-content__title-highlight').text.strip()
'Shamil Abdurakhimov'

We can use the same process to extract the record and nickname.

>>> soup.select_one('.b-content__title-record').text.strip()
'Record: 16-4-0'
>>> soup.select_one('.b-content__title-record').text.strip().split()[-1]
'16-4-0'
>>> soup.select_one('.b-content__Nickname').text.strip()
'Abrek'
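If the record is needed as numbers rather than as a string, it could be split further. The sketch below uses a hypothetical helper, parse_record, and assumes the record is always exactly three hyphen-separated integers (wins-losses-draws), which is what this one page shows but may not hold for every fighter.

```python
def parse_record(record):
    """Split a 'wins-losses-draws' string such as '16-4-0' into integers."""
    wins, losses, draws = (int(part) for part in record.split('-'))
    return {'wins': wins, 'losses': losses, 'draws': draws}

text = 'Record: 16-4-0'       # the stripped text of the record <span>
record = text.split()[-1]     # '16-4-0'
print(parse_record(record))   # {'wins': 16, 'losses': 4, 'draws': 0}
```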

Moving on to the stats we can use the CSS Selector .b-list__box-item-title to isolate each <i> tag inside the <li> tag.

Again, we could be explicit and specify the tag name using i.b-list__box-item-title but there are no other matches here so it can be omitted.

You may notice that the class attribute does not contain that exact value.

<i class="b-list__box-item-title b-list__box-item-title_type_width">

When a class attribute has multiple class names or “words” we only need to specify one of them in order to match.
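A minimal sketch of that matching rule, run against a one-line copy of the tag instead of the live page. It uses Python's built-in html.parser in place of html5lib so that it runs without the extra dependency; that swap is an assumption made for this example only.

```python
from bs4 import BeautifulSoup

html = '<i class="b-list__box-item-title b-list__box-item-title_type_width">Height:</i>'
soup = BeautifulSoup(html, 'html.parser')

# Either class name on its own is enough to match the tag.
by_first  = soup.select_one('.b-list__box-item-title')
by_second = soup.select_one('.b-list__box-item-title_type_width')
# Chaining both class names requires the tag to carry both of them.
by_both   = soup.select_one('.b-list__box-item-title.b-list__box-item-title_type_width')

print(by_first is by_second is by_both)   # True: all three find the same tag
print(by_first['class'])                  # the class attribute as a list of names
```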

From here we can chain multiple .next calls to navigate to the value.

>>> soup.select_one('.b-list__box-item-title')
<i class="b-list__box-item-title b-list__box-item-title_type_width">\n   Height:\n   </i>
>>> soup.select_one('.b-list__box-item-title').next
'\n     Height:\n   ' 
>>> soup.select_one('.b-list__box-item-title').next.next
'\n      6\' 3"\n    '
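The two hops can be seen on a small copy of the HTML. The sketch below spells the attribute as next_element, which is the name used in the BeautifulSoup documentation; the .next used in this walkthrough appears to behave the same way. It also shows next_sibling as an alternative way to reach the value in one step. The html.parser choice is an assumption to keep the example dependency-free.

```python
from bs4 import BeautifulSoup

html = '''
<li class="b-list__box-list-item b-list__box-list-item_type_block">
  <i class="b-list__box-item-title b-list__box-item-title_type_width">
    Height:
  </i>
  6' 3"
</li>
'''
soup = BeautifulSoup(html, 'html.parser')
i = soup.select_one('.b-list__box-item-title')

label       = i.next_element                # the text node inside <i>
via_next    = i.next_element.next_element   # the text node after </i>
via_sibling = i.next_sibling                # the same node, reached directly

print(repr(label.strip()))        # 'Height:'
print(repr(via_sibling.strip()))  # '6\' 3"'
print(via_next is via_sibling)    # True
```

Both routes depend on the value being a bare text node directly after the <i> tag, as it is in the markup shown here.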

As mentioned, select_one() returns only the first match.

We can use select() to get a list of all matches.

>>> for i in soup.select('.b-list__box-item-title'):
...     i.text.strip(), i.next.next.strip()
... 
('Height:', '6\' 3"')
('Weight:', '235 lbs.')
('Reach:', '76"')
('STANCE:', 'Orthodox')
('DOB:', 'Sep 02, 1981')
('Career statistics:', '')
('SLpM:', '2.48')
('Str. Acc.:', '45%')
('SApM:', '2.50')
('Str. Def:', '58%')
('', '')
('TD Avg.:', '1.40')
('TD Acc.:', '22%')
('TD Def.:', '77%')
('Sub. Avg.:', '0.3')

However, we have two unwanted results here:

  • ('Career statistics:', '')
  • ('', '')

Career statistics matches because of the class attribute.

<div class="b-list__info-box-left">
  <i class="b-list__box-item-title">
    Career statistics:
  </i>
  <ul class="b-list__box-list b-list__box-list_margin-top">
    <li class="b-list__box-list-item b-list__box-list-item_type_block">

Do note though that this matching tag is not inside an <li> tag. This means we can avoid this unwanted match by specifying that it must be inside an <li> tag.

We can use the selector li .b-list__box-item-title to specify that.

This states that .b-list__box-item-title must match somewhere inside an <li> tag. In general, a selector of the form one two (two parts separated by a space) states that two must be a descendant of one, at any depth.
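To illustrate the descendant selector, here is a sketch against a trimmed-down copy of the “Career statistics” markup (simplified class lists, a single stat). The html.parser choice is again an assumption made so the example needs nothing beyond bs4.

```python
from bs4 import BeautifulSoup

html = '''
<div class="b-list__info-box-left">
  <i class="b-list__box-item-title">Career statistics:</i>
  <ul class="b-list__box-list">
    <li class="b-list__box-list-item">
      <i class="b-list__box-item-title">SLpM:</i>
      2.48
    </li>
  </ul>
</div>
'''
soup = BeautifulSoup(html, 'html.parser')

# Class selector alone: matches both <i> tags.
everywhere = [i.text.strip() for i in soup.select('.b-list__box-item-title')]
# Descendant selector: only the <i> that sits inside an <li>.
inside_li  = [i.text.strip() for i in soup.select('li .b-list__box-item-title')]

print(everywhere)  # ['Career statistics:', 'SLpM:']
print(inside_li)   # ['SLpM:']
```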

The 2nd unwanted result is due to a “blank” <i> tag in the HTML.

<i class="b-list__box-item-title b-list__box-item-title_type_width">\n\n          </i>

An empty string is a falsy value in Python, so we can use a plain if test to filter it out. We will also use a dictionary to store the stat names and values.

>>> fighter = {}
>>> for i in soup.select('li .b-list__box-item-title'):
...    stat, value = i.text.strip(), i.next.next.strip()
...    if stat: 
...        fighter[stat] = value
... 

We can use the json module to pretty-print the resulting dict.

>>> import json
>>> 
>>> print(json.dumps(fighter, indent=2, sort_keys=True))
{
  "DOB:": "Sep 02, 1981", 
  "Height:": "6' 3\"", 
  "Reach:": "76\"", 
  "SApM:": "2.50", 
  "SLpM:": "2.48", 
  "STANCE:": "Orthodox", 
  "Str. Acc.:": "45%", 
  "Str. Def:": "58%", 
  "Sub. Avg.:": "0.3", 
  "TD Acc.:": "22%", 
  "TD Avg.:": "1.40", 
  "TD Def.:": "77%", 
  "Weight:": "235 lbs."
}
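Every key still carries its trailing colon and every value is a string. If tidier data is wanted, a post-processing step along the following lines is one option. The helper name clean_stat and the conversion rules (percentages to fractions, plain numbers to floats, everything else left as text) are assumptions for illustration, not something the page or the original code prescribes.

```python
def clean_stat(name, value):
    """Hypothetical helper: tidy one (name, value) pair from the fighter dict."""
    name = name.rstrip(':')                   # 'Str. Acc.:' -> 'Str. Acc.'
    if value.endswith('%'):
        return name, int(value[:-1]) / 100    # '45%' -> 0.45
    try:
        return name, float(value)             # '2.48' -> 2.48
    except ValueError:
        return name, value                    # leave e.g. '235 lbs.' as text

fighter = {'SLpM:': '2.48', 'Str. Acc.:': '45%', 'Weight:': '235 lbs.'}
cleaned = dict(clean_stat(k, v) for k, v in fighter.items())
print(cleaned)   # {'SLpM': 2.48, 'Str. Acc.': 0.45, 'Weight': '235 lbs.'}
```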

We could also add the name, record, and nickname that we extracted earlier to the dict.

Here’s the full code.

import json, requests
from   bs4 import BeautifulSoup

url = 'http://www.fightmetric.com/fighter-details/2f5cbecbbe18bac4'

with requests.session() as s:
    s.headers['user-agent'] = 'Mozilla/5.0'

    r    = s.get(url)
    soup = BeautifulSoup(r.content, 'html5lib')

    name   = soup.select_one('.b-content__title-highlight').text.strip()
    record = soup.select_one('.b-content__title-record').text.strip().split()[-1]
    nick   = soup.select_one('.b-content__Nickname').text.strip()

    fighter = dict(Name=name, Record=record, Nickname=nick)

    for i in soup.select('li .b-list__box-item-title'):
        stat, value = i.text.strip(), i.next.next.strip()
        if stat:
            fighter[stat] = value

    print(json.dumps(fighter, indent=2, sort_keys=True))

Here is the output it produced.

$ python fightmetric.com.py 
{
  "DOB:": "Sep 02, 1981", 
  "Height:": "6' 3\"", 
  "Name": "Shamil Abdurakhimov", 
  "Nickname": "Abrek", 
  "Reach:": "76\"", 
  "Record": "16-4-0", 
  "SApM:": "2.50", 
  "SLpM:": "2.48", 
  "STANCE:": "Orthodox", 
  "Str. Acc.:": "45%", 
  "Str. Def:": "58%", 
  "Sub. Avg.:": "0.3", 
  "TD Acc.:": "22%", 
  "TD Avg.:": "1.40", 
  "TD Def.:": "77%", 
  "Weight:": "235 lbs."
}

We’re assuming this needs to be done for multiple fighter pages, so we’re using a session object to maintain “state” and keep track of cookies between requests.

It also allows us to specify the User-Agent header once for all requests.
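For the multi-page case, the full code could be reorganised roughly as follows. This is only a sketch: the function names parse_fighter and scrape_fighters are hypothetical, the parser defaults to html.parser so the example runs without html5lib (pass 'html5lib' to match the walkthrough), and next_element is used in place of .next.

```python
import requests
from bs4 import BeautifulSoup

def parse_fighter(html, parser='html.parser'):
    """Extract the name and stats from one profile page's HTML."""
    soup = BeautifulSoup(html, parser)
    fighter = {'Name': soup.select_one('.b-content__title-highlight').text.strip()}
    for i in soup.select('li .b-list__box-item-title'):
        # First hop: the label text inside <i>; second hop: the value after </i>.
        stat, value = i.text.strip(), i.next_element.next_element.strip()
        if stat:
            fighter[stat] = value
    return fighter

def scrape_fighters(urls):
    """Fetch every URL through one shared session and parse each page."""
    with requests.Session() as s:
        s.headers['user-agent'] = 'Mozilla/5.0'   # set once, sent with every request
        return [parse_fighter(s.get(url).content) for url in urls]
```

Calling scrape_fighters([url]) with the example URL would return a one-element list holding a dict like the one printed above, assuming the page is reachable and its markup is unchanged.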