At the time of the filing of this issue Zed is at commit 08ee509.
This StackOverflow post is a reminder that users often have JSON data consisting of one large array and want to work with its contents. As the responses in the thread discuss, a common approach in other tools is to read the contents as a stream rather than trying to read the entire array into memory.
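For reference, here's a minimal Go sketch (not Zed code; the filename and the counting logic are just for illustration) of the streaming approach the thread describes: `encoding/json`'s `Decoder` can consume a top-level array one element at a time, so only a single element is ever held in memory.

```go
// Sketch: stream a top-level JSON array element by element instead of
// unmarshaling the whole array at once.
package main

import (
	"encoding/json"
	"fmt"
	"log"
	"os"
)

func main() {
	// Assumes the sample data has already been uncompressed.
	f, err := os.Open("fdns-10GB.json")
	if err != nil {
		log.Fatal(err)
	}
	defer f.Close()

	dec := json.NewDecoder(f)

	// Consume the opening '[' of the top-level array.
	if _, err := dec.Token(); err != nil {
		log.Fatal(err)
	}

	var count int
	for dec.More() {
		// Decode one array element at a time; only this element is in memory.
		var elem map[string]interface{}
		if err := dec.Decode(&elem); err != nil {
			log.Fatal(err)
		}
		count++
	}

	// Consume the closing ']'.
	if _, err := dec.Token(); err != nil {
		log.Fatal(err)
	}

	fmt.Println(count)
}
```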
At the moment, here's where things stand with the Zed tooling. To reproduce the symptoms, `s3://brim-sampledata/fdns/fdns-10GB.json.gz` can be used as sample data, as it uncompresses to a single JSON array almost 10 GB in size.
```
$ zq 'count()' fdns-10GB.json.gz
fdns-10GB.json.gz: format detection error
...
json: buffer exceeded max size trying to infer input format
...
```
Reading the data via `-i json` instead consumes all available memory. For example, on an AWS m6idn.xlarge instance with 16 GB of memory and no swap, running `zq -i json 'count()' fdns-10GB.json.gz` effectively hangs the system. When attempting `zed load -i json fdns-10GB.json.gz` to load it into a lake fronted by `zed serve`, the loading process was killed via OOM.
At one time in the past, the Zed tooling automatically unwrapped top-level JSON arrays and streamed their contents as input, similar to what users were proposing in the StackOverflow post. It's been proposed that we could bring this back as an option, e.g., `-i jsonarray`. While this would handle the specific case of a top-level array, it's also been noted that such arrays could exist deeper inside objects as well, and those would still pose memory problems. Therefore it's also been proposed that we could do something similar to `jq --stream` to handle this case.
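To make the `jq --stream`-style idea concrete, here's a rough Go sketch (purely illustrative, not a proposed implementation; the `walk` and `emit` names are hypothetical) of a token-level walk that hands each scalar value and its path to a callback, so even arrays nested deep inside objects never have to be materialized:

```go
// Sketch: token-level streaming over arbitrarily nested JSON, in the spirit
// of jq --stream's [path, value] events.
package main

import (
	"encoding/json"
	"fmt"
	"io"
	"log"
	"os"
	"strings"
)

// walk reads one JSON value from dec and calls emit for every scalar it
// encounters, along with the path of keys/indexes leading to it.
func walk(dec *json.Decoder, path []string, emit func(path []string, v interface{})) error {
	tok, err := dec.Token()
	if err != nil {
		return err
	}
	switch t := tok.(type) {
	case json.Delim:
		switch t {
		case '{':
			for dec.More() {
				keyTok, err := dec.Token() // object key (always a string)
				if err != nil {
					return err
				}
				if err := walk(dec, append(path, keyTok.(string)), emit); err != nil {
					return err
				}
			}
			_, err := dec.Token() // consume '}'
			return err
		case '[':
			for i := 0; dec.More(); i++ {
				if err := walk(dec, append(path, fmt.Sprint(i)), emit); err != nil {
					return err
				}
			}
			_, err := dec.Token() // consume ']'
			return err
		}
	default:
		emit(path, t) // scalar: string, float64, bool, or nil
	}
	return nil
}

func main() {
	dec := json.NewDecoder(os.Stdin)
	err := walk(dec, nil, func(path []string, v interface{}) {
		fmt.Printf("%s = %v\n", strings.Join(path, "."), v)
	})
	if err != nil && err != io.EOF {
		log.Fatal(err)
	}
}
```

This roughly mirrors the path/value event stream that `jq --stream` emits; the design question for Zed would be how to surface those values back to the user as records.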
Regardless of what flag(s) we may add, the possibility still remains that a user might ignore them and shoot themselves in the foot by trying to read such an array in its entirety and still hit the memory problems. For this reason, it seems it would be helpful to find a more graceful way to fail. #4025 already exists to cover the wider topic of memory management, so perhaps we can revisit this topic when that's being taken up.