The goal is to enter a zipcode into the Community Facts search on the https://factfinder.census.gov page and scrape the resulting 2010 Census General Population and Housing Characteristics table data if present.

We want a solution in Python that uses the requests library, if possible.

“Developer Tools”

This process involves multiple steps, so we want to inspect the HTTP requests made during each step in order to replicate them in Python.

To do this we can use the Network tab in our browser, which is located in what is commonly referred to as the “Developer Tools”.

The first thing I do is open https://factfinder.census.gov in my browser.

I’m using Firefox with Javascript disabled and I see this “requires an Internet Browser with Javascript enabled” message. I also try to click on the Go button and nothing happens. This tells us that Javascript is used to perform the search and also tells us what type of requests we should look for in the Network tab.

I right-click on the page and select Inspect Element to open up the Inspector tab. (You can also open it directly from the menu or with a keyboard shortcut.)

From there I select the Network tab. I also filter the type of requests displayed to only XHR which stands for XMLHttpRequest. These are the type of requests made by Javascript and we know to look for these specifically due to the hint we received earlier.

I then re-enable Javascript in my browser and refresh the page.

We’ve chosen the zipcode 99501 as an example and we click on the Go button which brings us to a search result page with a list of links.

Selecting a request in the Network tab opens an Information Panel. If we select the last request made (i.e. the one made when we clicked Go), we can see it was a GET request to https://factfinder.census.gov/rest/communityFactsNav/nav and that the zipcode was passed along as the searchTerm param.

N, log and _ts seem unimportant so we will ignore them for now.
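To make the shape of that request concrete, here is a minimal sketch of how the searchTerm param ends up in the URL, using only the standard library. The helper name nav_request_url is made up for illustration, and N, log and _ts are deliberately left out since we are ignoring them.

```python
from urllib.parse import urlencode

# Endpoint observed in the Network tab.
NAV_URL = 'https://factfinder.census.gov/rest/communityFactsNav/nav'


def nav_request_url(zipcode):
    """Return the GET URL for a Community Facts search (hypothetical helper)."""
    # Only searchTerm is sent; N, log and _ts are omitted on purpose.
    return NAV_URL + '?' + urlencode({'searchTerm': zipcode})


print(nav_request_url('99501'))
```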

If we click on the Response tab of the Information Panel we can see that we received JSON data (even though the Type column says plain) and scrolling down through the response we see measureAndLinksContent which looks like a good place to investigate.

It looks like HTML, which it is. In fact, it is the HTML used to generate the search results page, meaning it contains the link we need to click.

<div>
  <h2>99501</h2>
  <div class='datapoint'>
  ...
  <div class="linkswithhyper">
    <a href="/bkmk/table/1.0/en/DEC/10_DP/DPDP1/8600000US99501">
    General Population and Housing Characteristics (Population, Age, Sex, Race, Households and Housing, ...)
    </a>
  </div>
  ...

The next step is to click on the 2010 General Population link and view the request made in the Network tab.

We can see it makes a GET request to https://factfinder.census.gov/tablerestful/tableServices/renderProductData. However, there doesn’t seem to be any important data sent in the params (_ts just stands for timestamp). This suggests that the data is being sent via some other method, e.g. using Cookies.

If we take a look at the Response tab we can see it is similar to the first request. It is JSON data and scrolling down we can see productDataTable which contains HTML and like before it contains the HTML of the table we want to scrape.

Code

Now that we know what HTTP requests are being made, we can try to replicate them in Python.

>>> import requests
>>> 
>>> s = requests.session()
>>> s.headers['user-agent'] = 'Mozilla/5.0'
>>> 
>>> r = s.get('https://factfinder.census.gov/rest/communityFactsNav/nav', params={'searchTerm': 99501})

We know we need to make multiple requests so we use a session object which will maintain cookies, headers, etc between requests.

We’re setting the User-Agent header to Mozilla/5.0 as the default requests value is commonly blocked.

We’re expecting a JSON response from this request, so we can check that by calling the .json() method on the Response object, which turns a JSON “string” into a Python structure (see json.loads()).

>>> type(r.json())
<type 'dict'>

We get back a dict so let’s check its .keys()

>>> r.json().keys()
['CFMetaData']
>>> r.json()['CFMetaData'].keys()
['leftNavSelection', 'displayNoDataAvailableMsg', 'disambiguationContent', 'measureAndLinksContent', ... 

You may remember measureAndLinksContent from earlier.

>>> r.json()['CFMetaData']['measureAndLinksContent']
'<div>\n<h2>99501</h2>\n<div class=\'datapoint\'>\n<div class="actionbar">\n...

It contains the HTML we need to extract the link from so let’s save it for later use.

>>> html = r.json()['CFMetaData']['measureAndLinksContent']

We now attempt to make the second request that is made, this time to the renderProductData endpoint.

>>> r = s.get('https://factfinder.census.gov/tablerestful/tableServices/renderProductData')
>>> r.json().keys()
['Exception']

We get an Exception … hmmm….

Well, we are missing a step here. In our browser we clicked the actual 2010 report link before the request to renderProductData was made, so let’s request the report page to simulate the click.

>>> s.get('https://factfinder.census.gov/bkmk/table/1.0/en/DEC/10_DP/DPDP1/8600000US99501')
<Response [200]>

… and we will try renderProductData again.

>>> r = s.get('https://factfinder.census.gov/tablerestful/tableServices/renderProductData')
>>> r.json().keys()
['ProductData']
>>> r.json()['ProductData'].keys()
['presentationDownloadRowLimit', 'headerNotes', 'noDataFound', ...

It works! This suggests that something in the Cookie data keeps track of what page we’re viewing and renderProductData checks this data when it is called.

If it’s not present/valid it returns an Exception as we saw above.

If you recall from earlier the key we’re looking for is productDataTable and it is present in the result and holds the HTML of the table.

>>> 'productDataTable' in r.json()['ProductData'].keys()
True
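Since a missing or invalid cookie produces an Exception key instead of ProductData, a script could guard against that before reaching for the table. The sketch below assumes the decoded JSON has one of the two shapes seen above. The helper name and the stand-in dicts are made up, and what the Exception value actually contains is not known here.

```python
def table_html_from_payload(payload):
    """Return the table HTML from a decoded renderProductData response.

    Hypothetical helper: raises if the server sent the 'Exception'
    payload instead of 'ProductData'.
    """
    if 'Exception' in payload:
        raise RuntimeError('renderProductData returned an Exception; '
                           'was the report page requested first?')
    return payload['ProductData']['productDataTable']


# Stand-in dicts shaped like the two responses (values invented).
ok = {'ProductData': {'productDataTable': "<div class='actionbar'></div>"}}
bad = {'Exception': {}}

print(table_html_from_payload(ok))
try:
    table_html_from_payload(bad)
except RuntimeError as exc:
    print('failed:', exc)
```

In the session above this would be called as table_html_from_payload(r.json()).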

So this worked but we knew the URL of the 2010 report page i.e. https://factfinder.census.gov/bkmk/table/1.0/en/DEC/10_DP/DPDP1/8600000US99501 because we copied it from the Network tab.

How would we implement this step in our code?

Well, it is present in the html variable we saved earlier and we could extract it from there, but you may notice that the last part of the URL, i.e. 8600000US99501, contains the zipcode we searched for, i.e. 99501.
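For completeness, pulling the link out of the saved HTML might look something like the sketch below. It uses the standard library’s html.parser rather than BeautifulSoup. LinkCollector and find_report_link are made-up names, and the sample string is a trimmed copy of the fragment shown earlier. It returns the first link whose text starts with the given label, and the real fragment may well contain several such links.

```python
from html.parser import HTMLParser
from urllib.parse import urljoin


class LinkCollector(HTMLParser):
    """Collect (href, text) pairs for every <a href=...> in a fragment."""

    def __init__(self):
        super().__init__()
        self.links = []
        self._href = None
        self._text = []

    def handle_starttag(self, tag, attrs):
        if tag == 'a':
            self._href = dict(attrs).get('href')
            self._text = []

    def handle_data(self, data):
        # Only keep text that appears inside a link with an href.
        if self._href is not None:
            self._text.append(data)

    def handle_endtag(self, tag):
        if tag == 'a' and self._href is not None:
            self.links.append((self._href, ''.join(self._text).strip()))
            self._href = None


def find_report_link(html,
                     label='General Population and Housing Characteristics'):
    """Return the absolute URL of the first link starting with label."""
    parser = LinkCollector()
    parser.feed(html)
    for href, text in parser.links:
        if text.startswith(label):
            return urljoin('https://factfinder.census.gov', href)
    return None


# Trimmed copy of the measureAndLinksContent fragment shown earlier.
sample = '''
<div class="linkswithhyper">
  <a href="/bkmk/table/1.0/en/DEC/10_DP/DPDP1/8600000US99501">
  General Population and Housing Characteristics (Population, Age, ...)
  </a>
</div>
'''
print(find_report_link(sample))
```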

If we choose another random zipcode 10002 and run it through the search we see the resulting 2010 report URL is https://factfinder.census.gov/bkmk/table/1.0/en/DEC/10_DP/DPDP1/8600000US10002

This suggests the URLs follow the same naming pattern, so we just need to add the zipcode to the end of it.

It also suggests that we could skip the first request to rest/communityFactsNav/nav because we can just build the URL directly.

>>> import requests
>>> s = requests.session()
>>> s.headers['user-agent'] = 'Mozilla/5.0'
>>> 
>>> s.get('https://factfinder.census.gov/bkmk/table/1.0/en/DEC/10_DP/DPDP1/8600000US99501')
<Response [200]>
>>> r = s.get('https://factfinder.census.gov/tablerestful/tableServices/renderProductData')
>>> r.json().keys()
['ProductData']
>>> r.json()['ProductData']['productDataTable'][:50]
"<div class='actionbar'><div id='notes-div1' class="

… and indeed it works.
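The URL pattern can be captured in a small helper. This is only a sketch: report_url is a made-up name, the pattern has only been checked against the two zipcodes above, and the zero-padding is an assumption added so that zipcodes passed as integers keep their leading zeros.

```python
import re

REPORT_PREFIX = ('https://factfinder.census.gov'
                 '/bkmk/table/1.0/en/DEC/10_DP/DPDP1/8600000US')


def report_url(zipcode):
    """Build the 2010 report URL for a five-digit zipcode (hypothetical)."""
    # Assumption: pad with zeros so that e.g. 501 becomes '00501'.
    zipcode = str(zipcode).zfill(5)
    if not re.fullmatch(r'[0-9]{5}', zipcode):
        raise ValueError('expected a five-digit zipcode, got %r' % zipcode)
    return REPORT_PREFIX + zipcode


print(report_url('99501'))
print(report_url(10002))
```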

So let’s combine all the steps together.

We’re assuming this needs to be done for multiple zipcodes and that the table needs to be “parsed” in some way.

For the “parsing” we could use BeautifulSoup along with html5lib which can be installed using the command pip install beautifulsoup4 html5lib --user if you do not already have them.

import requests
from bs4 import BeautifulSoup

zipcodes = ['99501']

base = 'https://factfinder.census.gov/'
report = base + 'bkmk/table/1.0/en/DEC/10_DP/DPDP1/8600000US'
render = base + 'tablerestful/tableServices/renderProductData'

with requests.session() as s:
    s.headers['user-agent'] = 'Mozilla/5.0'

    for zipcode in zipcodes:
        s.get(report + zipcode)
        r = s.get(render)

        html = r.json()['ProductData']['productDataTable']
        soup = BeautifulSoup(html, 'html5lib')

        ...
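What the “parsing” step looks like depends on the markup of productDataTable, which isn’t shown here beyond its first few characters. As a rough idea only, the sketch below collects the cell text of each table row using the standard library’s html.parser as a stand-in for BeautifulSoup and html5lib. TableRows and table_rows are made-up names and the sample table is invented purely for illustration.

```python
from html.parser import HTMLParser


class TableRows(HTMLParser):
    """Collect the text of each <th>/<td> cell, grouped by <tr>."""

    def __init__(self):
        super().__init__()
        self.rows = []
        self._row = None
        self._cell = None

    def handle_starttag(self, tag, attrs):
        if tag == 'tr':
            self._row = []
        elif tag in ('th', 'td') and self._row is not None:
            self._cell = []

    def handle_data(self, data):
        # Ignore text that is not inside a table cell.
        if self._cell is not None:
            self._cell.append(data)

    def handle_endtag(self, tag):
        if tag in ('th', 'td') and self._cell is not None:
            # Collapse runs of whitespace inside the cell.
            self._row.append(' '.join(''.join(self._cell).split()))
            self._cell = None
        elif tag == 'tr' and self._row is not None:
            self.rows.append(self._row)
            self._row = None


def table_rows(html):
    """Return a list of rows, each a list of cell strings."""
    parser = TableRows()
    parser.feed(html)
    return parser.rows


# Invented sample; the real table markup may differ.
sample = '''
<table>
  <tr><th>Subject</th><th>Number</th></tr>
  <tr><td>Example row</td><td> 123 </td></tr>
</table>
'''
print(table_rows(sample))
```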