The goal is to perform a YouTube search and to extract or “scrape” the video URL and title of the first page of results using Java’s jsoup library.

This is the user’s first time using an HTML parser so we will try to be as verbose as possible with the explanation.

The details given in this article are not specific to Java and there is also a solution offered using requests, BeautifulSoup and html5lib if you’re using Python.

GET request

When we search for something we can see the resulting URL in the address bar is https://www.youtube.com/results?search_query=dj liquid raving

However, if we copy the URL and paste it somewhere, we see the space characters have been replaced with %20.

This is called percent-encoding (or URL encoding), which in its most basic form simply replaces a character with a % followed by the character’s corresponding hexadecimal value.
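As a rough illustration of that rule, here is a minimal sketch of percent-encoding written by hand. The class and method names are made up for this example, and it assumes the simplest policy of encoding every byte that is not a letter, digit or one of - . _ ~ (real libraries are more selective about what they encode).

```java
import java.nio.charset.StandardCharsets;

class PercentEncodingSketch {
    // Characters left as-is in this sketch; everything else gets encoded.
    private static final String UNRESERVED =
        "ABCDEFGHIJKLMNOPQRSTUVWXYZabcdefghijklmnopqrstuvwxyz0123456789-._~";

    // Hypothetical helper: replaces each "unsafe" byte with % plus two hex digits.
    static String percentEncode(String text) {
        StringBuilder out = new StringBuilder();
        for (byte b : text.getBytes(StandardCharsets.UTF_8)) {
            int value = b & 0xFF;            // treat the byte as unsigned
            char c = (char) value;
            if (UNRESERVED.indexOf(c) >= 0) {
                out.append(c);
            } else {
                out.append('%').append(String.format("%02X", value));
            }
        }
        return out.toString();
    }

    public static void main(String[] args) {
        // The space character has the hexadecimal value 20.
        System.out.println(percentEncode("dj liquid raving")); // dj%20liquid%20raving
    }
}
```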

In our URL everything before the ? character is the actual “location” and everything after is called the “query string”.

The “query string” is used to pass data through and it’s passed along in the URL when we make a GET request (as opposed to the other common form of request which is called POST).

The query string consists of name=value pairs which are separated by & e.g.

  • name=me&age=539

So because ? and & are part of the “URL syntax” what would we do if one of our “param” names or values contained one of those characters?

This is the reason for URL encoding: certain “reserved” characters are encoded so they can be passed along as data and not interpreted as part of the URL / query string syntax themselves.
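To make that concrete, the sketch below builds a query string from name=value pairs using the standard library's URLEncoder. The class name, the buildQuery helper and the sample parameter q are made up for illustration. Note that URLEncoder uses form encoding, so a space comes out as + rather than %20; both are read back as a space in a query string.

```java
import java.net.URLEncoder;
import java.nio.charset.StandardCharsets;
import java.util.LinkedHashMap;
import java.util.Map;
import java.util.StringJoiner;

class QueryStringSketch {
    // Hypothetical helper: joins encoded name=value pairs with '&'.
    static String buildQuery(Map<String, String> params) {
        StringJoiner query = new StringJoiner("&");
        for (Map.Entry<String, String> e : params.entrySet()) {
            query.add(URLEncoder.encode(e.getKey(), StandardCharsets.UTF_8)
                + "=" + URLEncoder.encode(e.getValue(), StandardCharsets.UTF_8));
        }
        return query.toString();
    }

    public static void main(String[] args) {
        Map<String, String> simple = new LinkedHashMap<>();
        simple.put("name", "me");
        simple.put("age", "539");
        System.out.println(buildQuery(simple)); // name=me&age=539

        // A value containing '&' and '?' is encoded so it cannot be
        // mistaken for query string syntax.
        Map<String, String> tricky = new LinkedHashMap<>();
        tricky.put("q", "this & that?");
        System.out.println(buildQuery(tricky)); // q=this+%26+that%3F
    }
}
```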

So back to our task it looks like we need to:

  • make a GET request to https://www.youtube.com/results
  • passing the param name search_query
  • that has a value dj liquid raving

“Developer Tools”

To see what is happening with our webpage we can use the Inspector tab in our browser from what is commonly referred to as its “Developer Tools”.

I’ve done it here by right-clicking on the page and selecting Inspect Element.

We can then use the Selector tool (the first button on the panel to the left of Inspector) to click on a specific element on the page to display the HTML.

You can also right-click directly on an element and choose Inspect Element to have that element in focus when the panel opens.

Do note that the Inspector tab shows your browser’s representation of the page after it has parsed the source HTML and as such it may differ from the actual source HTML.

HTML

So if we take a closer look at the HTML structure of the first result we can see

<h3 class="yt-lockup-title ">
  <a href="/watch?v=fLnFHbmyd_I" class="yt-uix-tile-link yt-ui-ellipsis yt-ui-ellipsis-2 yt-uix-sessionlink spf-link "
     data-sessionlink="itct=CEcQ3DAYACITCJGukrLA-9MCFQopFgodvQQGWSj0JFIQZGogbGlxdWlkIHJhdmluZw"
     title="DJ Liquid - I Can't Stop Raving" rel="spf-prefetch" aria-describedby="description-id-801577" dir="ltr">
    DJ Liquid - I Can't Stop Raving
  </a>
  <span class="accessible-description" id="description-id-801577"> - Duration: 4:26.</span>
</h3>

The h3 here is called a tag and the class is called an attribute of that tag.

<h3> represents the opening of the tag (I’ve omitted the attribute definition here) and </h3> is the closing of the tag.

HTML could be classified as a “tree like” structure.

<A>
  <B>
    <C></C>
  </B>
</A>

In this example:

  • A is the parent
  • B is a child of A
  • C is a child of B
  • C is also a grandchild of A
  • B and C are both descendants of A

This means that in the HTML of the first search result the <a> tag is a direct child of the <h3> tag which will be important in helping us “scrape” the needed data from the results.
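The parent / child / descendant relationships from the A, B, C example can be modelled with a tiny tree class. This is only a sketch to illustrate the terminology; the Node class and its methods are made up here and are not how jsoup represents a document internally.

```java
import java.util.ArrayList;
import java.util.List;

class TreeSketch {
    // Hypothetical node type: a name, a parent and a list of children.
    static class Node {
        final String name;
        Node parent;
        final List<Node> children = new ArrayList<>();

        Node(String name) { this.name = name; }

        // Attaches child under this node and returns the child.
        Node add(Node child) {
            child.parent = this;
            children.add(child);
            return child;
        }

        // A direct child has this exact node as its parent.
        boolean isChildOf(Node other) { return parent == other; }

        // A descendant has the node somewhere above it in the tree.
        boolean isDescendantOf(Node other) {
            for (Node n = parent; n != null; n = n.parent) {
                if (n == other) return true;
            }
            return false;
        }
    }

    public static void main(String[] args) {
        Node a = new Node("A");
        Node b = a.add(new Node("B"));
        Node c = b.add(new Node("C"));

        System.out.println(b.isChildOf(a));      // true
        System.out.println(c.isChildOf(a));      // false (C is a grandchild)
        System.out.println(c.isDescendantOf(a)); // true
    }
}
```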

Jsoup

We’re going to just start with the code.

import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;
import org.jsoup.nodes.Element;

import java.io.IOException;

public class YouTubeSearch {
    public static void main (String[] args) throws IOException {
        String url = "http://www.youtube.com/results";
        String query = "dj liquid raving";

        Document doc = Jsoup.connect(url)
            .data("search_query", query)
            .userAgent("Mozilla/5.0")
            .get();

        for (Element a : doc.select(".yt-lockup-title > a[title]")) {
            System.out.println(a.attr("href") + " " + a.attr("title"));
        }
    }
}

It (hopefully) should be relatively easy to follow.

We’re using the Jsoup.connect(String url) method.

The .data() call sets up our query string (it also handles the URL encoding).

With .userAgent() we’re setting the User-Agent header to Mozilla/5.0 as the default value is commonly blocked.

Finally we call .get() which sends a GET request and the result is stored in our doc variable.

We then use select() to “select” the particular tags we are interested in (which are the <a> tags in this case) and extract the watch URL and the video title.

Running it from the command-line we get the following.

$ java YouTubeSearch
/watch?v=fLnFHbmyd_I DJ Liquid - I Can't Stop Raving
/watch?v=AxFL35Gfwo8 Dj liquid- Tetris (Rave Mix)
/watch?v=AaNlSOZMTvs DJ Liquid | I Can't Stop Raving
/watch?v=iAVJ9gModh0 Dj Liquid - I Cant Stop Raving
/watch?v=fLnFHbmyd_I&list=RDfLnFHbmyd_I Mix - DJ Liquid - I Can't Stop Raving
/watch?v=vSioSwQH_84 Trance-Rave-Jungle--Dj Liquid- Final Fantasy
/watch?v=X9eR0st-kZQ DUNE - I Can't Stop Raving
/watch?v=EJjQEDYcWSg RAVE GENERATION 2 - DISC 2
/watch?v=vcSlX7n6VOA Dj Liquid I cant stop raving
/watch?v=d_zxv0UWowM Old Skool Hardcore Breakbeat Rave Mix - 1992-1993 Classics
/watch?v=XLjGsLIAOSY Dj Liquid - Birth Of Liquid Dreams
/watch?v=r-D5jGMuBtI 1992 Rave in 7 Minutes
/watch?v=g67WJ1f_W54 Jericho Liquid Rave
/watch?v=XbC0S1UtbNs DJ Liquid   Trance Rave Jungle   Japanese Techno
/watch?v=YnZaXfWH1sA Liquid Sky & Free Trance | Vegas | By Up Team Audiovisual
/watch?v=AU5UYHcrd30 DJ Liquid Transformers Mix
/watch?v=fLnFHbmyd_I&list=PLA1EB849BF8EE7AFD raving
/watch?v=gBkNLGLWTUc Dj Ravin @ LIQUID The Club Sibiu
/watch?v=ONdG3KpKgaU Dj Liquid - Platinum
/watch?v=9XvjrKNCyRE DJ Liquid Jogja special party - Lagu Terbaik

If we only wanted to get the first result we could remove the for loop and use the Elements.first() method instead.

Element a = doc.select(".yt-lockup-title > a[title]").first();

CSS Selectors

In our select() call we used .yt-lockup-title > a[title] which is called a CSS Selector.

From the docs:

.class matches elements with a class name of “class”

This means that .yt-lockup-title matches any tag that has yt-lockup-title as an entry in its class attribute.

This matches our <h3> tag.

<h3 class="yt-lockup-title ">

We could be more explicit and use h3.yt-lockup-title, which would state that we should only search <h3> tags, but no other tags have a class attribute matching yt-lockup-title so the tag name can be omitted.

E > F matches an F direct child of E

E in our selector matches the <h3> tag and F in our selector is a[title], meaning that a[title] must be a direct child of the <h3> tag.

[attr] matches elements with an attribute named “attr” (with any value)

This means [title] matches any tag with a title attribute. However, we want to only match <a> tags so we also specify the tag name with a[title].

The reason we specify title here is because there are other child <a> tags that we do not want to match and they do not have a title attribute.

The watch URL is located inside the href attribute of the <a> tag and the video title is located inside the title attribute which we access by using the .attr() method.

You may have noticed that we had 2 “playlist” results in the output.

/watch?v=fLnFHbmyd_I&list=RDfLnFHbmyd_I Mix
/watch?v=fLnFHbmyd_I&list=PLA1EB849BF8EE7AFD

How could we exclude these from our matches?

When extracting .attr("href") we could test that it doesn’t contain &list=. However, we could also do it in our selector by utilizing

[attr*=valContaining] matches elements with an attribute named “attr”, and value containing “valContaining”

… and

:not(selector) matches elements that do not match the selector.

We could use [href*=&list=] to match any tag that has an href attribute whose value contains &list= and we could then use :not() around that to “invert” the match.

To have it apply to our already existing a[title] selector we simply chain them together i.e. a[title]:not([href*=&list=])

This first matches all <a> tags with a title attribute and then filters out any that contain &list= in their href attribute.
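If you want to try the selectors without making a request, the sketch below runs them against a small hand-written HTML string using Jsoup.parse. The markup is a made-up, cut-down imitation of the search results (one normal result with an extra untitled link, and one playlist result), and the class name and hrefs helper are hypothetical. It assumes jsoup is on the classpath.

```java
import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;
import org.jsoup.nodes.Element;

import java.util.ArrayList;
import java.util.List;

class SelectorSketch {
    // Hand-written sample markup, not real YouTube output.
    static final String SAMPLE_HTML =
        "<h3 class=\"yt-lockup-title \">"
      + "<a href=\"/watch?v=fLnFHbmyd_I\" title=\"DJ Liquid - I Can't Stop Raving\">video</a>"
      + "<a href=\"/watch?v=fLnFHbmyd_I\">no title here</a>"
      + "</h3>"
      + "<h3 class=\"yt-lockup-title \">"
      + "<a href=\"/watch?v=fLnFHbmyd_I&amp;list=RDfLnFHbmyd_I\" title=\"Mix\">playlist</a>"
      + "</h3>";

    // Hypothetical helper: returns the href of every element the selector matches.
    static List<String> hrefs(String selector) {
        Document doc = Jsoup.parse(SAMPLE_HTML);
        List<String> found = new ArrayList<>();
        for (Element a : doc.select(selector)) {
            found.add(a.attr("href"));
        }
        return found;
    }

    public static void main(String[] args) {
        // Every <a> that is a direct child of the matching <h3> tags.
        System.out.println(hrefs(".yt-lockup-title > a"));
        // Only those <a> tags that also have a title attribute.
        System.out.println(hrefs(".yt-lockup-title > a[title]"));
        // As above, minus any whose href contains &list=
        System.out.println(hrefs(".yt-lockup-title > a[title]:not([href*=&list=])"));
    }
}
```

Each selector is narrower than the one before it, so each printed list should be shorter than the previous one.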

With this modification to our selector we no longer match the playlist URLs.

$ java YouTubeSearch
http://www.youtube.com/watch?v=fLnFHbmyd_I DJ Liquid - I Can't Stop Raving
http://www.youtube.com/watch?v=AxFL35Gfwo8 Dj liquid- Tetris (Rave Mix)
http://www.youtube.com/watch?v=AaNlSOZMTvs DJ Liquid | I Can't Stop Raving
http://www.youtube.com/watch?v=vSioSwQH_84 Trance-Rave-Jungle--Dj Liquid- Final
http://www.youtube.com/watch?v=iAVJ9gModh0 Dj Liquid - I Cant Stop Raving
http://www.youtube.com/watch?v=X9eR0st-kZQ DUNE - I Can't Stop Raving
http://www.youtube.com/watch?v=EJjQEDYcWSg RAVE GENERATION 2 - DISC 2
http://www.youtube.com/watch?v=vcSlX7n6VOA Dj Liquid I cant stop raving
http://www.youtube.com/watch?v=d_zxv0UWowM Old Skool Hardcore Breakbeat Rave Mi
http://www.youtube.com/watch?v=g67WJ1f_W54 Jericho Liquid Rave
http://www.youtube.com/watch?v=XLjGsLIAOSY Dj Liquid - Birth Of Liquid Dreams
http://www.youtube.com/watch?v=XbC0S1UtbNs DJ Liquid   Trance Rave Jungle   Jap
http://www.youtube.com/watch?v=r-D5jGMuBtI 1992 Rave in 7 Minutes
http://www.youtube.com/watch?v=YnZaXfWH1sA Liquid Sky & Free Trance | Vegas | B
http://www.youtube.com/watch?v=AU5UYHcrd30 DJ Liquid Transformers Mix
http://www.youtube.com/watch?v=AyXtFuWDTSA Dance-Techno-Trance-RAVE-Happy Hardc
http://www.youtube.com/watch?v=ONdG3KpKgaU Dj Liquid - Platinum
http://www.youtube.com/watch?v=9XvjrKNCyRE DJ Liquid Jogja special party - Lagu

We’ve also added http://www.youtube.com to the output to give us the “absolute” URL as the href attribute only contained a “relative” one.
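One way to turn a relative href into an absolute URL, other than plain string concatenation, is to resolve it against the page URL with java.net.URI from the standard library. This is a sketch of that idea rather than the exact change made to the program above; the class and method names are made up, and URI.create will throw an IllegalArgumentException if the href contains characters that are not legal in a URI. (jsoup elements also offer an absUrl method that serves a similar purpose.)

```java
import java.net.URI;

class AbsoluteUrlSketch {
    // Hypothetical helper: resolves a relative href against the page it came from.
    static String toAbsolute(String pageUrl, String href) {
        return URI.create(pageUrl).resolve(href).toString();
    }

    public static void main(String[] args) {
        String page = "http://www.youtube.com/results";
        System.out.println(toAbsolute(page, "/watch?v=fLnFHbmyd_I"));
        // http://www.youtube.com/watch?v=fLnFHbmyd_I
    }
}
```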

Python

If you’ve come here as a Python user, here is the equivalent code using requests to fetch the HTML and BeautifulSoup with html5lib to parse it.

If you have not already installed these libraries you can do so with pip.

  • pip install beautifulsoup4 requests html5lib --user

Onto the code…

import requests
from bs4 import BeautifulSoup

with requests.session() as s:
    s.headers['user-agent'] = 'Mozilla/5.0'

    url    = 'http://www.youtube.com/results'
    params = {'search_query': 'dj liquid raving'}

    r = s.get(url, params=params)

    soup = BeautifulSoup(r.content, 'html5lib')

    for a in soup.select('.yt-lockup-title > a[title]'):
        if '&list=' not in a['href']:
            print('http://www.youtube.com' + a['href'], a['title'])

BeautifulSoup has “limited” CSS Selector support and does not support the exact selector we used with Jsoup meaning we must filter out the playlist URLs separately.

That’s it!

You should be aware that YouTube does have an API which you may wish to use.

The final versions of the code examples used here are also available on GitHub.

♫ ♫ ♪ ♪ I Can’t Stop Ravin’ ♪ ♪ ♫ ♫