Handling Large JSON Files with Streaming

gif of json data streaming through jq — Probably going to want to scroll down before you start reading

Working With Large JSON Files

JSON is the preferred format for moving data around on the web. jq is a great command line tool for parsing and querying JSON either from a web resource or a locally stored file. It allows you to query, filter, and transform JSON with very little overhead or setup. It’s perfect for quickly exploring and analyzing data returned from a web service.

But what if the JSON you want to work with is in a very large file, maybe 500mb or more?

There are a number of reasons why you might be working with a large JSON file rather than pulling it from an API. Perhaps a service provides a full data dump, such as the Digital Public Library of America, which provides a 5gb (and expanding) compressed JSON file of all of its metadata.

Serving up an entire data store over HTTP is not desirable for many reasons. The ability to provide a full export to JSON is a quick and easy way for a service to provide all of its open data to its consumers at once for analysis or building applications.

The default behavior for any utility or software that parses JSON generally is to load the entire JSON array into RAM first before being able to parse it. If you’ve ever tried to do this with a very large JSON file, you know that it will never complete because it will end up hammering your computer’s RAM and swap, rendering it unuasable before it can even start parsing the data.

Image of macintosh crash screen — 5gb of JSON to load into RAM? I'll just be over here crashing.

Fortunately jq implemented the capacity to stream your JSON file to the parser. Instead of loading the entire file up at once, streaming sends the contents of the file through the parser one item at a time in an “event driven” fashion. As the items pass through, you can filter and select as you go, without compromising your computer’s memory.

Example Usage

Below are examples for selecting a particular value from each record and for extracting only the items that meet a certain condition (say a specific item format). For these examples I’ll refer to the Digital Public Library of America’s bulk download.

Quick Introduction to jq

Without streaming, say I want to use jq to select all of the titles from an API call to the DPLA for the search term ‘computers’.

I can request this using: http://api.dp.la/v2/items?q=computers&api_key=[API KEY HERE].

From this, I get in return:

{
  "count": 4121,
  "start": 0,
  "limit": 10,
  "docs": [
    {
      "@context": "http://dp.la/api/items/context",
      "isShownAt": "http://collections.mnhs.org/cms/display.php?irn=10463525",
      "dataProvider": "Minnesota Historical Society",
      "@type": "ore:Aggregation",
      "provider": {
        "@id": "http://dp.la/api/contributor/mdl",
        "name": "Minnesota Digital Library"
      },
      "object": "http://collections.mnhs.org/cms/web5/media.php?thumb=yes&irn=10269009",
      "ingestionSequence": 20,
      "id": "0b1340a232731dd6c54bc63ce91d4396",
      "ingestDate": "2016-03-18T21:54:14.478119Z",
      "_rev": "7-e246af1e802f4d4e64e6d93d4d6899bc",
      "aggregatedCHO": "#sourceResource",
      "_id": "minnesota--http://collections.mnhs.org/cms/display.php?irn=10463525",
      "sourceResource": {
        "title": "Computers",
        "description": "Cutting edge program to acquaint students with computers shared with the Pillsbury Company. Tom Aker, reporter, comments from school grounds.",
        "subject": [
          {
            "name": "Computers"
          },
          {
            "name": "Education Data processing"
          }
        ],
        "rights": "http://www.mnhs.org/copyright",
        "format": "Moving Images",
        "collection": {
          "id": "b97327edc09d4a8daf286afc58bdc4cf",
          "title": "Collections Online",
          "@id": "http://dp.la/api/collections/b97327edc09d4a8daf286afc58bdc4cf"
        },
        "stateLocatedIn": {
          "iso3166-2": "US-MN"
        },
        "@id": "http://dp.la/api/items/0b1340a232731dd6c54bc63ce91d4396#sourceResource",
        "type": "image",
        "date": [
          {
            "displayDate": "Created: 06/06/1966"
          }
        ],
        "creator": [
          "Hubbard Broadcasting Corporation, KSTP television (Channel 5)",
          "Aker, Tom",
          "Smicht, Rallie",
          "Pillsbury Company",
          "Robbinsdale Cooper High School"
        ]
      },
      "admin": {
        "validation_message": "{u'iso3166-2': u'US-MN'} is not of type 'array'",
        "sourceResource": {
          "title": "Computers"
        },
        "valid_after_enrich": false
      },
      "ingestType": "item",
      "@id": "http://dp.la/api/items/0b1340a232731dd6c54bc63ce91d4396",
      "originalRecord": {
        "tags": [
          "dpla"
        ],
        "record": {
          "@context": "http://dp.la/api/items/context",
          "record_hash": "57e2b2b546f19b8c6d14df202907bd1a3fc02e9f",
          "title": "Computers",
          "isShownAt": "http://collections.mnhs.org/cms/display.php?irn=10463525",
          "dataProvider": "Minnesota Historical Society",
          "sourceResource": {
            "title": "Computers",
            "description": "Cutting edge program to acquaint students with computers shared with the Pillsbury Company.  Tom Aker, reporter, comments from school grounds.",
            "subject": [
              {
                "name": "Computers"
              },
              {
                "name": "Education Data processing"
              }
            ],
            "rights": "http://www.mnhs.org/copyright",
            "format": "Moving Images",
            "collection": {
              "title": "Collections Online"
            },
            "stateLocatedIn": {
              "iso3166-2": "US-MN"
            },
            "date": {
              "displayDate": "Created: 06/06/1966"
            },
            "type": "imagemimageoimagevimageiimagenimagegimage imageiimagemimageaimagegimageeimagesimage",
            "creator": [
              "Hubbard Broadcasting Corporation, KSTP television (Channel 5)",
              "Aker, Tom",
              "Smicht, Rallie",
              "Pillsbury Company",
              "Robbinsdale Cooper High School"
            ]
          },
          "provider": "Minnesota Digital Library",
          "object": "http://collections.mnhs.org/cms/web5/media.php?thumb=yes&irn=10269009",
          "identifier": "http://collections.mnhs.org/cms/web5/media.php?thumb=yes&irn=10269009",
          "originalRecord": {
            "family_name_facets": {
              "family_name_facet": [
                "Aker",
                "Smicht"
              ]
            },
            "fulltexts": {
              "fulltext": [
                "Cutting edge program to acquaint students with computers shared with the Pillsbury Company.  Tom Aker, reporter, comments from school grounds.",
                "Transcribed Title: Computers",
                "Film reel",
                "Standard sound aperture, reduced frame",
                "Color Sound on medium Magnetic sound track on film Magnetic stripe sound Original reversal film B wind",
                "KSTP-TV Archive, Minnesota Historical Society",
                "film",
                "moving images"
              ]
            },
            "subject_facets": {
              "subject_facet": [
                "Computers",
                "Education Data processing"
              ]
            },
            "given_name_facets": {
              "given_name_facet": [
                "Tom",
                "Rallie"
              ]
            },
            "event_dates": {
              "event_date": [
                "1966-06-06T00:00:00Z",
                "1966-06-06T23:59:59Z"
              ]
            },
            "subject_texts": {
              "subject_text": [
                "Computers",
                "Education Data processing"
              ]
            },
            "link": "http://collections.mnhs.org/cms/display.php?irn=10463525",
            "family_name_displays": {
              "family_name_display": [
                "Aker",
                "Smicht"
              ]
            },
            "type_displays": {
              "type_display": "Moving Images"
            },
            "name_displays": {
              "name_display": [
                "Hubbard Broadcasting Corporation, KSTP television (Channel 5)",
                "Aker, Tom",
                "Smicht, Rallie",
                "Pillsbury Company",
                "Robbinsdale Cooper High School"
              ]
            },
            "collection_name": "Collections Online",
            "image_link": "http://collections.mnhs.org/cms/web5/media.php?thumb=yes&irn=10269009",
            "given_name_displays": {
              "given_name_display": [
                "Tom",
                "Rallie"
              ]
            },
            "descriptions": {
              "description": [
                "Cutting edge program to acquaint students with computers shared with the Pillsbury Company.  Tom Aker, reporter, comments from school grounds.",
                "Transcribed Title: Computers"
              ]
            },
            "type_texts": {
              "type_text": "Moving Images"
            },
            "name_texts": {
              "name_text": [
                "hubbard broadcasting corporation, kstp television (channel 5)",
                "aker, tom",
                "smicht, rallie",
                "pillsbury company",
                "robbinsdale cooper high school",
                "cooper high school"
              ]
            },
            "event_date_label": "Dates",
            "titles": {
              "title": "Computers"
            },
            "event_display_dates": {
              "event_display_date": "Created: 06/06/1966"
            },
            "family_name_texts": {
              "family_name_text": [
                "aker",
                "smicht"
              ]
            },
            "type_facets": {
              "type_facet": "Moving Images"
            },
            "given_name_texts": {
              "given_name_text": [
                "tom",
                "rallie"
              ]
            },
            "name_facets": {
              "name_facet": [
                "Hubbard Broadcasting Corporation, KSTP television (Channel 5)",
                "Aker, Tom",
                "Smicht, Rallie",
                "Pillsbury Company",
                "Robbinsdale Cooper High School"
              ]
            },
            "subject_displays": {
              "subject_display": [
                "Computers",
                "Education Data processing"
              ]
            }
          }
        },
        "_id": "57e2b2b546f19b8c6d14df202907bd1a3fc02e9f",
        "ingested_at": "2015-12-29T16:24:09Z",
        "provider": {
          "@id": "http://dp.la/api/contributor/mdl",
          "name": "Minnesota Digital Library"
        },
        "collection": {
          "id": "b97327edc09d4a8daf286afc58bdc4cf",
          "title": "Collections Online",
          "@id": "http://dp.la/api/collections/b97327edc09d4a8daf286afc58bdc4cf"
        },
        "import_job_id": 4,
        "import_job_name": "MHS",
        "published": true,
        "record_id": "57e2b2b546f19b8c6d14df202907bd1a3fc02e9f"
      },
      "score": 10.649403
    },
    [...WAY MORE RECORDS HERE...]
  ]
}

If I wanted to get only the titles from that request, I might use:

curl 'http://api.dp.la/v2/items?q=computers&api_key=[API KEY HERE]' | jq '.docs[].sourceResource.title'

The result of which would be:

"Computers"
"on computers"
"Computers"
"Computers"
"Computers"
[
  "Computers"
]
[
  "Computers"
]
"on computers"
"Personal computers"
"Computers, 1958"

Here I am using the command line utility curl to request the resource and then I’m using the | (or “pipe”) to tell it that I want to send the results of that request to the jq utility for grabbing only the title.

The grabbing is acommplished by using .docs[].sourceResource.title to access the entire “docs” array where the results are provided and then specifying that I want only values for the title property of the sourceResource object for each item returned from the DPLA for this request. There is a good tutorial for getting started with jq if you are interested in knowing more.

Selecting Using a Stream

This is great, but what if I wanted to pull every title for every item in the DPLA and perform some sort of text analysis on it? I can do this if I were to download the entire dataset in bulk from the bulk downloads page (again, about 5gb compressed) and use jq to extract only the titles with something like:

zcat < all.json.gz | jq '.[]._source.sourceResource.title'

we would have the scenario as described above that our computer’s memory would overload and it would probably freeze up indefinitely before completing [1].

Instead, I can use the --stream argument. Though what jq sees when the data are streamed is a little different. I will show the stream equivalent command and then try to explain a little bit of the differences.

Note: Since we’re now using the bulk downloads, the JSON in them is formatted a bit differently than the JSON returned by the DPLA API. For instance, the individual records are not nested inside of a docs array so we do not need to select that first before getting into the individual records.

zcat < all.json.gz | jq --stream 'select(.[0][1] == "_source" and .[0][2] == "sourceResource" and .[0][3] == "title") | .[1]'

There’s a lot going on here that is different from the load-into-memory jq command [2].

With streamimg, a record ends up looking more like this to the parser:

[[0,"_index"],"dpla-20150410-144958"]
[[0,"_type"],"item"]
[[0,"_id"],"getty--GETTY_ROSETTAIE638128"]
[[0,"_score"],0]
[[0,"_source","@context"],"http://dp.la/api/items/context"]
[[0,"_source","isShownAt"],"http://primo.getty.edu/primo_library/libweb/action/dlDisplay.do?vid=GRI&afterPDS=true&institution=01GRI&docId=GETTY_ROSETTAIE638128"]
[[0,"_source","dataProvider"],"Getty Research Institute"]
[[0,"_source","@type"],"ore:Aggregation"]
[[0,"_source","provider","@id"],"http://dp.la/api/contributor/getty"]
[[0,"_source","provider","name"],"J. Paul Getty Trust"]
[[0,"_source","provider","name"]]
[[0,"_source","object"],"http://rosettaapp.getty.edu:1801/delivery/DeliveryManagerServlet?dps_pid=IE638128&dps_func=thumbnail"]
[[0,"_source","ingestionSequence"],6]
[[0,"_source","ingestDate"],"2016-01-09T02:53:07.536775Z"]
[[0,"_source","_rev"],"5-1f33d45f65d30018800e166e0ff93274"]
[[0,"_source","id"],"509667c03617c3f9dabcb5641aec3fe5"]
[[0,"_source","aggregatedCHO"],"#sourceResource"]
[[0,"_source","_id"],"getty--GETTY_ROSETTAIE638128"]
[[0,"_source","admin","validation_message"],null]
[[0,"_source","admin","sourceResource","title"],"Blindman's bluff, c. 1750-1800"]
[[0,"_source","admin","sourceResource","title"]]
[[0,"_source","admin","valid_after_enrich"],true]
[[0,"_source","admin","valid_after_enrich"]]
[[0,"_source","sourceResource","title",0],"Blindman's bluff, c. 1750-1800"]
...<<TRUNCATED FOR SPACE>>

As we can see here, it no longer looks like a JSON obect. Each property of a record gets turned into [<path>, <leaf-value>] as an input for the jq parser. That is, each line in the above is a path to a value in the JSON, which can get rather verbose for very nested values.

So if we are looking for the title for each record, we would be interested only in the input:

[[0,"_source","sourceResource","title",0],"Blindman's bluff, c. 1750-1800"]

Our jq streaming query above then does just that. In the select() function that we use to query the stream, we are asking: “check if the element at index 1 of the 0 index of the input equals ‘_source’ and index 2 of the 0 index of the input equals ‘sourceResource’ and index 3 of the 0 index of the input equals ‘title’ then return the leaf value of that path” For this item, it should be “Blindman’s bluff, c. 1750-1800”.

Written out, it is not compelling as a set of instructions, but hopefully demonstrating what the parser is seeing in streaming mode above is helpful to understand why the query logic is so different.

Filtering a Stream

But say that, instead of picking out a particular value, we would rather return only the whole records that have a particular value for one of its properties. In this example, we only want to return the items in the DPLA that are sound recordings, ignoring anything else that is an image, text, moving image, or physical object. We could do that with:

zcat < all.json.gz | jq --stream "fromstream(1|truncate_stream(inputs))" | jq "select(any(._source.sourceResource; .type=="sound"))"

In this example, I am using one more pipe before selecting the items that I want. I am also using the any(generator; condition) built-in function in the second invocation of jq, which allows for quickly applying a specific condition (in this case the type of the item returned must be ‘sound’) to a specified input (in this case the sourceResource object in the item).

But before I do that, in the first jq invocation, I want to utilize the fromstream() and truncate_stream() functions in order to first turn the stream of paths and leaves back into a stream of JSON objects.

If we look at first item that comes out of this initial transformation, we see the output as:

{
  "_type": "item",
  "_id": "getty--GETTY_ROSETTAIE638128",
  "_score": 0,
  "_source": {
    "@context": "http://dp.la/api/items/context",
    "isShownAt": "http://primo.getty.edu/primo_library/libweb/action/dlDisplay.do?vid=GRI&afterPDS=true&institution=01GRI&docId=GETTY_ROSETTAIE638128",
    "dataProvider": "Getty Research Institute",
    "@type": "ore:Aggregation",
    "provider": {
      "@id": "http://dp.la/api/contributor/getty",
      "name": "J. Paul Getty Trust"
    },
    "object": "http://rosettaapp.getty.edu:1801/delivery/DeliveryManagerServlet?dps_pid=IE638128&dps_func=thumbnail",
    "ingestionSequence": 6,
    "ingestDate": "2016-01-09T02:53:07.536775Z",
    "_rev": "5-1f33d45f65d30018800e166e0ff93274",
    "id": "509667c03617c3f9dabcb5641aec3fe5",
    "aggregatedCHO": "#sourceResource",
    "_id": "getty--GETTY_ROSETTAIE638128",
    "admin": {
      "validation_message": null,
      "sourceResource": {
        "title": "Blindman's bluff, c. 1750-1800"
      },
      "valid_after_enrich": true
    },
    "sourceResource": {
      "title": [
        "Blindman's bluff, c. 1750-1800"
      ],
      "extent": "1 image of 1 tapestry",
      "description": [
        "Tapestry Dimensions: H 8'2\" x W 7'5\"",
        "Tapestry Materials/Techniques: unknown",
        "Culture: French",
        "Weaving Center: Aubusson",
        "Ownership History: French  sold to Charles Sternberg 2/7/1966.",
        "In landscape with trees  young man standing blind-folded  others seated, watching the scene; river & buildings in distance",
        "(BRD) string of pearls with spiral garland",
        "A similar tapestry with the same border, albeit composition reversed, is illustrated in Göbel Vol. 2, part 2, ill. 278.",
        "French & Co. stock sheet in archive, 45851",
        "Göbel, Wandteppiche II:2 (1928), Ill. 278",
        "Related Works: Compositionally similar tapestries: GCPA 0243088, 0243090-0243094, 0243096-0243097, 0243089 (composition reversed); compositionally similar tapestry (larger composition), GCPA 0243086"
      ],
      "subject": [
        {
          "name": "Genre: Country Life"
        }
      ],
      "rights": "Digital images courtesy of the Getty's Open Content Program.",
      "@id": "http://dp.la/api/items/509667c03617c3f9dabcb5641aec3fe5#sourceResource",
      "collection": {
        "id": "632c46648b97b7838bcbe3c95509ec23",
        "title": "Study photographs of tapestries",
        "@id": "http://dp.la/api/collections/632c46648b97b7838bcbe3c95509ec23"
      },
      "date": {
        "displayDate": "c. 1750-1800",
        "end": "1800",
        "begin": "1750"
      },
      "type": "image",
      "identifier": [
        "97.P.7",
        "Original DB ID: 3230",
        "Web DB ID: 308146"
      ],
      "creator": "Huet, Jean-Baptiste (French, 1745-1811) (designed after) [painter]",
      "specType": [
        "Photograph/Pictorial Works"
      ]
    },
    "@id": "http://dp.la/api/items/509667c03617c3f9dabcb5641aec3fe5",
    "ingestType": "item",
    "originalRecord": {
      "PrimoNMBib": {
        "record": {
          "control": {
            "sourceid": "GETTY_ROSETTA",
            "sourceformat": "DC",
            "recordid": "GETTY_ROSETTAIE638128",
            "sourcesystem": "Other",
            "sourcerecordid": "IE638128"
          },
          "sort": {
            "creationdate": "1750",
            "author": "Huet, Jean-Baptiste (French, 1745-1811) (designed after) [painter]",
            "title": "Blindman's bluff, c. 1750-1800"
          },
          "addata": {
            "genre": "unknown",
            "btitle": "Blindman's bluff, c. 1750-1800",
            "ristype": "GEN",
            "risdate": "1750",
            "format": "book",
            "date": "1750",
            "au": "Huet, Jean-Baptiste (French, 1745-1811) (designed after) [painter]"
          },
          "search": {
            "addtitle": [
              "Access File Blindman's bluff Image 1 Preservation Master File Blindman's bluff Image 1",
              "Study photographs of tapestries (http://hdl.handle.net/10020/cat373681)",
              "GRI Photo Archive"
            ],
            "scope": "GETTY_ROSETTA",
            "lsr34": "Study photographs of tapestries",
            "subject": "Genre: Country Life",
            "format": "1 image of 1 tapestry",
            "creationdate": "c. 1750-1800",
            "general": [
              "http://hdl.handle.net/10020/97p7_308146",
              "97.P.7",
              "Original DB ID: 3230",
              "Web DB ID: 308146",
              "Study photographs of tapestries (http://hdl.handle.net/10020/cat373681)",
              "GRI Photo Archive",
              "Access File Blindman's bluff Image 1 Preservation Master File Blindman's bluff Image 1",
              "Digital images courtesy of the Getty's Open Content Program."
            ],
            "creatorcontrib": "Huet, Jean-Baptiste (French, 1745-1811) (designed after) [painter]",
            "title": "Blindman's bluff",
            "startdate": "17500101",
            "description": [
              "Tapestry Dimensions: H 8'2\" x W 7'5\"",
              "Tapestry Materials/Techniques: unknown",
              "Culture: French",
              "Weaving Center: Aubusson",
              "Ownership History: French  sold to Charles Sternberg 2/7/1966.",
              "In landscape with trees  young man standing blind-folded  others seated, watching the scene; river & buildings in distance",
              "(BRD) string of pearls with spiral garland",
              "A similar tapestry with the same border, albeit composition reversed, is illustrated in Göbel Vol. 2, part 2, ill. 278.",
              "French & Co. stock sheet in archive, 45851",
              "Göbel, Wandteppiche II:2 (1928), Ill. 278",
              "Related Works: Compositionally similar tapestries: GCPA 0243088, 0243090-0243094, 0243096-0243097, 0243089 (composition reversed); compositionally similar tapestry (larger composition), GCPA 0243086"
            ],
            "searchscope": "GETTY_ROSETTA",
            "sourceid": "GETTY_ROSETTA",
            "enddate": "17501231",
            "lsr08": [
              "Tapestry Dimensions: H 8'2\" x W 7'5\"",
              "Tapestry Materials/Techniques: unknown",
              "Culture: French",
              "Weaving Center: Aubusson",
              "Ownership History: French  sold to Charles Sternberg 2/7/1966.",
              "In landscape with trees  young man standing blind-folded  others seated, watching the scene; river & buildings in distance",
              "(BRD) string of pearls with spiral garland",
              "A similar tapestry with the same border, albeit composition reversed, is illustrated in Göbel Vol. 2, part 2, ill. 278.",
              "French & Co. stock sheet in archive, 45851",
              "Göbel, Wandteppiche II:2 (1928), Ill. 278",
              "Related Works: Compositionally similar tapestries: GCPA 0243088, 0243090-0243094, 0243096-0243097, 0243089 (composition reversed); compositionally similar tapestry (larger composition), GCPA 0243086"
            ],
            "recordid": "GETTY_ROSETTAIE638128",
            "rsrctype": [
              "digital_entity",
              "Textiles - Tapestries",
              "Still image"
            ]
          },
          "browse": {
            "author": "$$DHuet, Jean-Baptiste (French, 1745-1811) (designed after) [painter]$$EHuet, Jean-Baptiste (French, 1745-1811) (designed after) [painter]",
            "subject": "$$DGenre: Country Life$$EGenre: Country Life",
            "title": "$$DBlindman's bluff$$EBlindman's bluff"
          },
          "facets": {
            "topic": "Genre: Country Life",
            "prefilter": "digital_entities",
            "creatorcontrib": "Huet, Jean-Baptiste (French, 1745-1811) (designed after) [painter]",
            "toplevel": "online_resources",
            "frbrtype": "6",
            "frbrgroupid": "687489380",
            "lfc02": [
              "GRI Digital Collections",
              "GRI Photo Archive"
            ],
            "creationdate": "1750",
            "lfc05": "DPLA",
            "rsrctype": "digital_entities"
          },
          "links": {
            "linktorsrc": "$$Tgetty_rosetta_linktorscr$$DDisplay item",
            "thumbnail": "$$Tgetty_rosetta_thumb",
            "lln02": "$$Tgetty_rosetta_linktorscr$$D",
            "openurlfulltext": "$$Topenurlfull_journal"
          },
          "display": {
            "lds29": "http://hdl.handle.net/10020/97p7_308146",
            "lds27": "Digital images courtesy of the Getty's Open Content Program.",
            "subject": "Genre: Country Life",
            "format": "1 image of 1 tapestry",
            "type": "digital_entity",
            "creationdate": "c. 1750-1800",
            "creator": "Huet, Jean-Baptiste (French, 1745-1811) (designed after) [painter]",
            "lds14": [
              "97.P.7",
              "Original DB ID: 3230",
              "Web DB ID: 308146"
            ],
            "lds26": "Still image",
            "title": "Blindman's bluff, c. 1750-1800",
            "source": "GETTY_ROSETTA",
            "lds04": [
              "Tapestry Dimensions: H 8'2\" x W 7'5\"",
              "Tapestry Materials/Techniques: unknown",
              "Culture: French",
              "Weaving Center: Aubusson",
              "Ownership History: French  sold to Charles Sternberg 2/7/1966.",
              "In landscape with trees  young man standing blind-folded  others seated, watching the scene; river & buildings in distance",
              "(BRD) string of pearls with spiral garland",
              "A similar tapestry with the same border, albeit composition reversed, is illustrated in Göbel Vol. 2, part 2, ill. 278.",
              "French & Co. stock sheet in archive, 45851",
              "Göbel, Wandteppiche II:2 (1928), Ill. 278",
              "Related Works: Compositionally similar tapestries: GCPA 0243088, 0243090-0243094, 0243096-0243097, 0243089 (composition reversed); compositionally similar tapestry (larger composition), GCPA 0243086"
            ],
            "lds47": "Collection description",
            "lds34": "Study photographs of tapestries"
          },
          "delivery": {
            "delcategory": "Online Resource"
          },
          "ranking": {
            "booster2": "1",
            "booster1": "1"
          }
        },
        "xmlns": "http://www.exlibrisgroup.com/xsd/primo/primo_nm_bib"
      },
      "RANK": "1.0",
      "sear:GETIT": {
        "deliveryCategory": "Online Resource",
        "GetIt2": "http://1.1.1.1?ctx_ver=Z39.88-2004&ctx_enc=info:ofi/enc:UTF-8&ctx_tim=2016-01-08T17%3A55%3A38IST&url_ver=Z39.88-2004&url_ctx_fmt=infofi/fmt:kev:mtx:ctx&rfr_id=info:sid/primo.exlibrisgroup.com:primo3-Journal-GETTY_ROSETTA&rft_val_fmt=info:ofi/fmt:kev:mtx:book&rft.genre=unknown&rft.atitle=&rft.jtitle=&rft.btitle=Blindman's%20bluff,%20c.%201750-1800&rft.aulast=&rft.auinit=&rft.auinit1=&rft.auinitm=&rft.ausuffix=&rft.au=Huet,%20Jean-Baptiste%20(French,%201745-1811)%20(designed%20after)%20%5Bpainter%5D&rft.aucorp=&rft.volume=&rft.issue=&rft.part=&rft.quarter=&rft.ssn=&rft.spage=&rft.epage=&rft.pages=&rft.artnum=&rft.issn=&rft.eissn=&rft.isbn=&rft.sici=&rft.coden=&rft_id=info:doi/&rft.object_id=&rft_dat=IE638128&rft.eisbn=&rft_id=info:oai/&req.language=",
        "GetIt1": "http://rosettaapp.getty.edu:1801/delivery/DeliveryManagerServlet?dps_pid=IE638128"
      },
      "_id": "GETTY_ROSETTAIE638128",
      "SEARCH_ENGINE_TYPE": "Local Search Engine",
      "ID": "1433634",
      "provider": {
        "@id": "http://dp.la/api/contributor/getty",
        "name": "J. Paul Getty Trust"
      },
      "SEARCH_ENGINE": "Local Search Engine",
      "collection": {
        "id": "632c46648b97b7838bcbe3c95509ec23",
        "title": "Study photographs of tapestries",
        "@id": "http://dp.la/api/collections/632c46648b97b7838bcbe3c95509ec23"
      },
      "sear:LINKS": {
        "sear:thumbnail": "http://rosettaapp.getty.edu:1801/delivery/DeliveryManagerServlet?dps_pid=IE638128&dps_func=thumbnail",
        "sear:openurlfulltext": "http://1.1.1.1?ctx_ver=Z39.88-2004&ctx_enc=info:ofi/enc:UTF-8&ctx_tim=2016-01-08T17%3A55%3A38IST&url_ver=Z39.88-2004&url_ctx_fmt=infofi/fmt:kev:mtx:ctx&rfr_id=info:sid/primo.exlibrisgroup.com:primo3-Journal-GETTY_ROSETTA&rft_val_fmt=info:ofi/fmt:kev:mtx:book&rft.genre=unknown&rft.atitle=&rft.jtitle=&rft.btitle=Blindman's%20bluff,%20c.%201750-1800&rft.aulast=&rft.auinit=&rft.auinit1=&rft.auinitm=&rft.ausuffix=&rft.au=Huet,%20Jean-Baptiste%20(French,%201745-1811)%20(designed%20after)%20%5Bpainter%5D&rft.aucorp=&rft.volume=&rft.issue=&rft.part=&rft.quarter=&rft.ssn=&rft.spage=&rft.epage=&rft.pages=&rft.artnum=&rft.issn=&rft.eissn=&rft.isbn=&rft.sici=&rft.coden=&rft_id=info:doi/&rft.object_id=&svc_val_fmt=info:ofi/fmt:kev:mtx:sch_svc&svc.fulltext=yes&rft_dat=IE638128&rft.eisbn=&rft_id=info:oai/&req.language=",
        "sear:linktorsrc": "http://rosettaapp.getty.edu:1801/delivery/DeliveryManagerServlet?dps_pid=IE638128",
        "sear:lln02": "http://rosettaapp.getty.edu:1801/delivery/DeliveryManagerServlet?dps_pid=IE638128"
      },
      "NO": "10918"
    }
  }
}

It looks like JSON again. As this is now the input for the select() using any(), we evaluate the type property of the sourceResource object from each DPLA item in the stream and only return the items that have a type that equals “sound”. And we don’t crash our system doing so!

Conclusion

Streaming serialized data so it can be acted on dynamically, without stressing our RAM, is nothing new. And there are plenty of other tools that can be used to stream JSON data in particular. Libraries in Python and Ruby, for example, implement the YAJL streaming JSON parsing library from C. But jq is a great tool because it is easily and quickly accessible from the command line, making it perfect for exploring and shaping JSON data on the fly or in a pinch.

The above is only the tip of the iceberg. There is plenty more that can be done with jq and you can explore all of its capabilities at the documentation.

Notes

[1] Here I am using the full data download and I am accessing it locally, so there’s a few differences in the command syntax from the intro command. First, I am using a utility called zcat on the compressed json file because I just want to send the JSON content to jq, rather than uncompress it to a new file first. Also, the structure for the bulk data download is a little different from what is returned from the DPLA API in that the sourceResource object is nested within a _source object. More information about the format of the bulk downloads can be found here [back]

[2] The developers of jq provide a bit of a more technical explanation of streaming in jq here, though I found this a little hard to grasp. I’ve attempted to describe it here in a bit more casually, but it admitedly lacks technical detail and rigor. [back]