Extract JSONs like a pro with chompjs and JMESPath

Posted on June 3, 2021 by Roy Healy. Read time: 4 mins.

Handling JavaScript objects is an important skill for any web data extraction developer.

Extracting data from the nested dictionaries buried in blocks of JavaScript code can seem daunting at first, but I am going to introduce you to two packages I use that make it a breeze.

You might only start dipping your toes into this area when dealing with dynamic pages, but you will then quickly see that <script> tags are a good way to get data in general.

In this article, you will learn how to use two packages, chompjs and JMESPath, to extract data from JSONs quickly and efficiently. 

Parse script text using chompjs library

Chompjs is a web scraping library that can be used to turn JavaScript objects embedded in web pages into valid Python dictionaries.

As a starting point, let’s assume you’ve extracted this text from a script tag:

__DATA__ = {"data":{"type":"@products", "products":[{"id":12345678, "name":"Bacon", "brand": "Some Brand", "price":2.50, "instock": false},{"id":12345679, "name":"Ham", "price":3.50, "instock": true},{"id":12345680, "name":"Beef", "price":1.50, "instock": false}]}};
some_javascript(data) {results = do_stuff(data); return results};
new beep_boop_js_var = some_javascript(__DATA__)

This text has a lot of elements that you don’t want, but it also contains something that looks a lot like a Python dictionary holding data about several products.

That’s a JavaScript object.

Normally you’d use a JSON package to turn it into a dictionary.

But what do you do when it’s not clean JSON?

Fellow Zytan Mariusz Obajtek made a package to help us in this situation: chompjs.

from chompjs import parse_js_object
script = """__DATA__ = {"data":{"type":"@products", "products":[{"id":12345678, "name":"Bacon", "brand": "Some Brand", "price":2.50, "instock": false},{"id":12345679, "name":"Ham", "price":3.50, "instock": true},{"id":12345680, "name":"Beef", "price":1.50, "instock": false}]}};
some_javascript(data) {results = do_stuff(data); return results};
new beep_boop_js_var = some_javascript(__DATA__)"""
data = parse_js_object(script)

In this case, the parse_js_object function looks through the script to find the first JavaScript object, extracts it, and then turns it into a Python dictionary.
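
To see why a plain JSON parser is not enough here, the sketch below uses only the standard library. It is not how chompjs works internally; it only shows that json.loads rejects the raw script text, and that a naive workaround is to start decoding at the first "{" with json.JSONDecoder.raw_decode. That workaround succeeds only because this particular sample object happens to be valid JSON. JavaScript objects that are not valid JSON (for example with unquoted keys or single-quoted strings) would still fail, which is where chompjs earns its keep.

```python
import json

# Same sample script text as in the chompjs example above.
script = """__DATA__ = {"data":{"type":"@products", "products":[{"id":12345678, "name":"Bacon", "brand": "Some Brand", "price":2.50, "instock": false},{"id":12345679, "name":"Ham", "price":3.50, "instock": true},{"id":12345680, "name":"Beef", "price":1.50, "instock": false}]}};
some_javascript(data) {results = do_stuff(data); return results};
new beep_boop_js_var = some_javascript(__DATA__)"""

# The json module cannot parse the script as a whole: it is code, not a JSON document.
try:
    json.loads(script)
    whole_script_is_json = True
except json.JSONDecodeError:
    whole_script_is_json = False

# Naive workaround: decode a single JSON value starting at the first "{"
# and ignore everything that follows it.
start = script.index("{")
data, end = json.JSONDecoder().raw_decode(script, start)

print(whole_script_is_json)            # False
print(data["data"]["type"])            # @products
print(len(data["data"]["products"]))   # 3
```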

This is just the tip of the iceberg with chompjs. Check out the examples on its GitHub to see other, more difficult formats you can parse easily with it.

Extract data using JMESPath

So, now that you have your dictionary, what’s the best way to get your data out of it?

With nested dictionaries, it can be annoying to pick out the fields you need, but you can make it much easier by using another package: JMESPath.

For example if you want to get the list of products from that dictionary, you can do that with a single function call:

import jmespath
jmespath.search('data.products', data)

It doesn’t stop there. Say you want just the names of the products. You can do:

jmespath.search('data.products[].name', data)

The change here indicates that I want to go through the products list and pull out the name fields, which leaves me with a list of product names. That is already very useful, but we can go a bit deeper.

Say you want the dict for only one of these products, the one called “Bacon”. You can put a query inside the square brackets to filter the results:

jmespath.search('data.products[?name==`Bacon`]', data)

As before, we can also pull a specific field out:

jmespath.search('data.products[?name==`Bacon`].price', data)

Now, let's do something a bit more interesting. Say I want to find products that are over a certain price. Well, I can do other sorts of conditionals in those brackets too:

jmespath.search('data.products[?price>`2`]', data)

You may have noticed that only one of the items has a brand, so the following query returns just the brand name for that one product. Take care here: when some results are missing, you won’t know which of the dictionaries each value actually comes from:

jmespath.search('data.products[].brand', data)

Finally, you may have noticed that the instock field in our sample holds a boolean value, so to get only the names of in-stock items we can do:

jmespath.search('data.products[?instock].name', data)

Conclusion

These are probably two of the more important packages I use when extracting web data.

JSON, or JavaScript Object Notation, is a readable text-based format for structuring data. It is used primarily to transmit data between a server and a web application. A JSON string is virtually identical to the code for a JavaScript object, making it easy to work with in JavaScript as well as in other programming languages.

Many sites embed standard JSON or JavaScript objects in the scripts in their source, which chompjs can extract for you. In these cases, and with most API responses, you will likely end up with nested dictionaries, and JMESPath makes sifting through them a breeze.

Using these tools will enable you to level up your web scraping capabilities and get the data you need, minus the fuss.

Try our Smart Browser, a single API solution with browser and javascript rendering. 

At Zyte, we know the value of data, and we want to empower you with the right tools to web scrape quickly and efficiently. 

If you require custom solutions, talk to our experts today to see what works best for your needs. 

Looking for additional resources? 

Check out other useful open source packages for parsing HTML and extracting data:

If you would like to see these packages in action, check out my video on YouTube.