
Reading contents of large arrays as a stream #4413

Open
philrz opened this issue Feb 28, 2023 · 0 comments
philrz (Contributor) commented Feb 28, 2023

As of the filing of this issue, Zed is at commit 08ee509.

This StackOverflow post is a reminder that users often have JSON data consisting of one large array and want to work with its contents. As the responses in the thread discuss, tools often approach this by reading the contents as a stream rather than trying to read the entire array into memory.
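As a sketch of that streaming approach (a minimal example in Go, since that's the language Zed is written in; this is the standard-library idiom, not Zed's actual reader), encoding/json's Decoder can unwrap a top-level array and decode one element at a time, so memory use is bounded by the largest single element rather than the whole array:

package main

import (
	"encoding/json"
	"fmt"
	"log"
	"os"
)

func main() {
	dec := json.NewDecoder(os.Stdin)
	// Consume the opening '[' of the top-level array.
	if _, err := dec.Token(); err != nil {
		log.Fatal(err)
	}
	var count int
	for dec.More() {
		// Decode one element at a time; only the current
		// element is ever held in memory.
		var elem json.RawMessage
		if err := dec.Decode(&elem); err != nil {
			log.Fatal(err)
		}
		count++
	}
	// Consume the closing ']'.
	if _, err := dec.Token(); err != nil {
		log.Fatal(err)
	}
	fmt.Println(count)
}

For example, gunzip -c fdns-10GB.json.gz | go run count.go would count the array's elements in bounded memory.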

At the moment, here's where things stand with the Zed tooling. To reproduce the symptoms, s3://brim-sampledata/fdns/fdns-10GB.json.gz can be used as sample data, as it uncompresses to a single JSON array almost 10 GB in size.

  1. Reading via auto-detect fails quickly. This is known and tracked in "buffer exceeded max size" when reading JSON array via auto-detect #3865.
$ zq 'count()' fdns-10GB.json.gz
fdns-10GB.json.gz: format detection error
...
	json: buffer exceeded max size trying to infer input format
...
  2. Reading via -i json consumes all available memory. For example, on an AWS m6idn.xlarge instance with 16 GB of memory and no swap, running zq -i json 'count()' fdns-10GB.json.gz effectively hangs the system. When zed load -i json fdns-10GB.json.gz was used to load it into a lake fronted by zed serve, the load process was killed via OOM.

In the past, the Zed tooling automatically unwrapped top-level JSON arrays and streamed their contents as input, similar to what users proposed in the StackOverflow post. It's been proposed that we bring this back as an option, e.g., -i jsonarray. While this would handle the specific case of a top-level array, it's also been noted that such arrays can exist deeper inside objects, where they would still pose memory problems. Therefore it's also been proposed that we do something similar to jq --stream to handle this case.
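For the nested case, a reader has to work at the token level rather than on whole values. Here's a hypothetical Go sketch of that idea (the helper streamArrayUnder and the key name "records" are invented for illustration, and the comment notes a key/value ambiguity a real implementation would need to handle by tracking nesting):

package main

import (
	"encoding/json"
	"fmt"
	"log"
	"os"
)

// streamArrayUnder scans the token stream for an array that appears as
// the value of the object key `key`, then emits that array's elements
// one at a time. Caveat: a string *value* equal to `key` could
// false-match; a real implementation would track object/array nesting.
func streamArrayUnder(dec *json.Decoder, key string, emit func(json.RawMessage) error) error {
	sawKey := false
	for {
		tok, err := dec.Token()
		if err != nil {
			return err
		}
		switch t := tok.(type) {
		case string:
			sawKey = t == key
			continue
		case json.Delim:
			if t == '[' && sawKey {
				// Positioned just inside the target array: stream it.
				for dec.More() {
					var elem json.RawMessage
					if err := dec.Decode(&elem); err != nil {
						return err
					}
					if err := emit(elem); err != nil {
						return err
					}
				}
				return nil
			}
		}
		sawKey = false
	}
}

func main() {
	dec := json.NewDecoder(os.Stdin)
	var count int
	err := streamArrayUnder(dec, "records", func(json.RawMessage) error {
		count++
		return nil
	})
	if err != nil {
		log.Fatal(err)
	}
	fmt.Println(count)
}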

Regardless of what flag(s) we may add, the possibility remains that a user might ignore them and shoot themselves in the foot by trying to read such an array in its entirety, hitting the same memory problems. For this reason, it would be helpful to find a more graceful way to fail. #4025 already exists to cover the wider topic of memory management, so perhaps we can revisit this topic when that's taken up.
