JSON is the preferred format for moving data around on the web. jq is a great command line tool for parsing and querying JSON either from a web resource or a locally stored file. It allows you to query, filter, and transform JSON with very little overhead or setup. It’s perfect for quickly exploring and analyzing data returned from a web service.
But what if the JSON you want to work with is in a very large file, maybe 500 MB or more?
Serving up an entire data store over HTTP, request by request, is not desirable for many reasons, so a full export to JSON is a quick and easy way for a service to provide all of its open data to consumers at once for analysis or for building applications.
The default behavior for most utilities and software that parse JSON is to load the entire JSON array into RAM before parsing it. If you’ve ever tried this with a very large JSON file, you know it will never complete: it ends up hammering your computer’s RAM and swap, rendering the machine unusable before it can even start parsing the data.
Fortunately, jq has the capacity to stream your JSON file to the parser. Instead of loading the entire file at once, streaming sends the contents of the file through the parser one item at a time in an “event driven” fashion. As the items pass through, you can filter and select as you go, without exhausting your computer’s memory.
Example Usage
Below are examples for selecting a particular value from each record and for extracting only the items that meet a certain condition (say a specific item format). For these examples I’ll refer to the Digital Public Library of America’s bulk download.
Quick Introduction to jq
Without streaming, say I want to use jq to select all of the titles from an API call to the DPLA for the search term ‘computers’.
I can request this using: http://api.dp.la/v2/items?q=computers&api_key=[API KEY HERE].
From this, I get in return:
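(The response below is an abridged, illustrative sketch of the shape of the JSON that comes back; an actual response contains many more fields per record, and the title shown is a placeholder.)

    {
      "docs": [
        {
          "sourceResource": {
            "title": "Early computers in the classroom",
            ...
          },
          ...
        },
        ...
      ],
      ...
    }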
If I wanted to get only the titles from that request, I might use:
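(A sketch of that command, reconstructed from the pieces described below; quoting of the URL may vary by shell.)

    curl "http://api.dp.la/v2/items?q=computers&api_key=[API KEY HERE]" | jq '.docs[].sourceResource.title'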
The result of which would be:
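(Illustrative output only; the titles below are placeholders rather than actual DPLA results.)

    "Early computers in the classroom"
    "Computers and data processing"
    "An introduction to computers"
    ...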
Here I am using the command line utility curl to request the resource and then using the | (or “pipe”) to send the results of that request to the jq utility, which grabs only the title.
The grabbing is accomplished by using .docs[].sourceResource.title to access the entire “docs” array, where the results are provided, and then specifying that I want only the values of the title property of the sourceResource object for each item returned from the DPLA for this request. There is a good tutorial for getting started with jq if you are interested in knowing more.
Selecting Using a Stream
This is great, but what if I wanted to pull every title for every item in the DPLA and perform some sort of text analysis on it? I could do this by downloading the entire dataset in bulk from the bulk downloads page (about 5 GB compressed) and using jq to extract only the titles with something like:
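(A sketch, assuming the downloaded bulk file is named all.json.gz; the _source nesting is described in note [1].)

    zcat all.json.gz | jq '.[]._source.sourceResource.title'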
But we would run into the scenario described above: our computer’s memory would overload, and the machine would probably freeze up indefinitely before the command completes [1].
Instead, I can use the --stream argument. What jq sees when the data are streamed, though, is a little different. I will show the streaming equivalent of the command and then try to explain a little bit of the differences.
Note: Since we’re now using the bulk downloads, the JSON in them is formatted a bit differently from the JSON returned by the DPLA API. For instance, the individual records are not nested inside a docs array, so we do not need to select that first before getting into the individual records.
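(A sketch of the streaming equivalent, again assuming the bulk file is named all.json.gz.)

    zcat all.json.gz | jq --stream 'select(.[0][1] == "_source" and .[0][2] == "sourceResource" and .[0][3] == "title") | .[1]'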
There’s a lot going on here that is different from the load-into-memory jq command [2].
With streaming, a record ends up looking more like this to the parser:
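(An abridged, illustrative view; a real record produces many more of these path/value pairs, and the values other than the title are placeholders.)

    ...
    [[0, "_source", "sourceResource", "title"], "Blindman’s bluff, c. 1750-1800"]
    [[0, "_source", "sourceResource", "type"], "image"]
    ...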
As we can see here, it no longer looks like a JSON object. Each property of a record gets turned into [<path>, <leaf-value>] as an input for the jq parser. That is, each line in the above is a path to a value in the JSON, which can get rather verbose for deeply nested values.
So if we are looking for the title for each record, we would be interested only in the input:
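(Continuing with the illustrative record above:)

    [[0, "_source", "sourceResource", "title"], "Blindman’s bluff, c. 1750-1800"]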
Our jq streaming query above does just that. In the select() function that we use to query the stream, we are asking: “check if the element at index 1 of the path (the 0 index of the input) equals ‘_source’, the element at index 2 equals ‘sourceResource’, and the element at index 3 equals ‘title’; if so, return the leaf value of that path.” For this item, that value should be “Blindman’s bluff, c. 1750-1800”.
Written out, it is not a compelling set of instructions, but hopefully seeing what the parser sees in streaming mode, as shown above, helps explain why the query logic is so different.
Filtering a Stream
But say that, instead of picking out a particular value, we would rather return only the whole records that have a particular value for one of their properties. In this example, we want to return only the items in the DPLA that are sound recordings, ignoring anything else that is an image, text, moving image, or physical object. We could do that with:
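(A sketch of the full pipeline, again assuming the bulk file is named all.json.gz and the _source nesting from note [1]; here any() iterates over the values of the sourceResource object and keeps records where one of them, in practice the type, equals “sound”. If type is an array rather than a plain string, the generator would need adjusting.)

    zcat all.json.gz \
      | jq -cn --stream 'fromstream(1|truncate_stream(inputs))' \
      | jq 'select(any(._source.sourceResource[]?; . == "sound"))'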
In this example, I am using one more pipe before selecting the items that I want. I am also using the any(generator; condition) built-in function in the second invocation of jq, which allows for quickly applying a specific condition (in this case, that the type of the item returned must be ‘sound’) to a specified input (in this case, the sourceResource object in the item).
But before that, in the first jq invocation, I use the fromstream() and truncate_stream() functions to turn the stream of paths and leaves back into a stream of JSON objects.
If we look at the first item that comes out of this initial transformation, we see the output as:
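(An abridged sketch of the shape, using the same illustrative record as above; a real record carries many more fields.)

    {
      "_source": {
        "sourceResource": {
          "title": "Blindman’s bluff, c. 1750-1800",
          "type": "image",
          ...
        },
        ...
      },
      ...
    }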
It looks like JSON again. As this is now the input for the select() using any(), we evaluate the type property of the sourceResource object from each DPLA item in the stream and return only the items whose type equals “sound”. And we don’t crash our system doing so!
Conclusion
Streaming serialized data so it can be acted on dynamically, without stressing our RAM, is nothing new. And there are plenty of other tools that can be used to stream JSON data in particular. Libraries in Python and Ruby, for example, wrap YAJL, a streaming JSON parsing library written in C. But jq is a great tool because it is easily and quickly accessible from the command line, making it perfect for exploring and shaping JSON data on the fly or in a pinch.
The above is only the tip of the iceberg. There is plenty more that can be done with jq, and you can explore all of its capabilities in the documentation.
Notes
[1] Here I am using the full data download and I am accessing it locally, so there are a few differences in the command syntax from the intro command. First, I am using a utility called zcat on the compressed JSON file because I just want to send the JSON content to jq, rather than uncompress it to a new file first. Also, the structure of the bulk data download is a little different from what is returned by the DPLA API in that the sourceResource object is nested within a _source object. More information about the format of the bulk downloads can be found here. [back]
[2] The developers of jq provide a more technical explanation of streaming in jq here, though I found it a little hard to grasp. I’ve attempted to describe it a bit more casually here, but my description admittedly lacks technical detail and rigor. [back]