
Reading contents of large arrays as a stream #4413

Open
philrz opened this issue Feb 28, 2023 · 0 comments
philrz (Contributor) commented Feb 28, 2023

As of the filing of this issue, Zed is at commit 08ee509.

This StackOverflow post is a reminder that users often have JSON data consisting of one large array and want to work with its contents. As the responses in the thread discuss, tools often approach this by reading the contents as a stream rather than trying to read the entire array into memory.
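As a sketch of that streaming approach (a minimal example in Go, since that's the language Zed is written in; this is the standard-library idiom, not Zed's actual reader), encoding/json's Decoder can unwrap a top-level array and decode one element at a time, so memory use is bounded by the largest single element rather than the whole array:

package main

import (
	"encoding/json"
	"fmt"
	"log"
	"os"
)

func main() {
	dec := json.NewDecoder(os.Stdin)
	// Consume the opening '[' of the top-level array.
	if _, err := dec.Token(); err != nil {
		log.Fatal(err)
	}
	var count int
	for dec.More() {
		// Decode one element at a time; only the current
		// element is ever held in memory.
		var elem json.RawMessage
		if err := dec.Decode(&elem); err != nil {
			log.Fatal(err)
		}
		count++
	}
	// Consume the closing ']'.
	if _, err := dec.Token(); err != nil {
		log.Fatal(err)
	}
	fmt.Println(count)
}

For example, gunzip -c fdns-10GB.json.gz | go run count.go would count the array's elements in bounded memory.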

At the moment, here's where things stand with the Zed tooling. To reproduce the symptoms, s3://brim-sampledata/fdns/fdns-10GB.json.gz can be used as sample data, as it uncompresses to a single JSON array almost 10 GB in size.

  1. Reading via auto-detect fails quickly. This is known and tracked in "buffer exceeded max size" when reading JSON array via auto-detect #3865.
$ zq 'count()' fdns-10GB.json.gz
fdns-10GB.json.gz: format detection error
...
	json: buffer exceeded max size trying to infer input format
...
  2. Reading via -i json consumes all available memory. For example, on an AWS m6idn.xlarge instance with 16 GB of memory and no swap, running zq -i json 'count()' fdns-10GB.json.gz effectively hangs the system. When zed load -i json fdns-10GB.json.gz was used to load it into a lake fronted by zed serve, the load process was killed via OOM.

In the past, the Zed tooling automatically unwrapped top-level JSON arrays and streamed their contents as input, similar to what users proposed in the StackOverflow post. It's been proposed that we bring this back as an option, e.g., -i jsonarray. While this would handle the specific case of a top-level array, it's also been noted that such arrays can exist deeper inside objects, where they would still pose memory problems. Therefore it's also been proposed that we do something similar to jq --stream to handle this case.
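For the nested case, a reader has to work at the token level rather than on whole values. Here's a hypothetical Go sketch of that idea (the helper streamArrayUnder and the key name "records" are invented for illustration, and the comment notes a key/value ambiguity a real implementation would need to handle by tracking nesting):

package main

import (
	"encoding/json"
	"fmt"
	"log"
	"os"
)

// streamArrayUnder scans the token stream for an array that appears as
// the value of the object key `key`, then emits that array's elements
// one at a time. Caveat: a string *value* equal to `key` could
// false-match; a real implementation would track object/array nesting.
func streamArrayUnder(dec *json.Decoder, key string, emit func(json.RawMessage) error) error {
	sawKey := false
	for {
		tok, err := dec.Token()
		if err != nil {
			return err
		}
		switch t := tok.(type) {
		case string:
			sawKey = t == key
			continue
		case json.Delim:
			if t == '[' && sawKey {
				// Positioned just inside the target array: stream it.
				for dec.More() {
					var elem json.RawMessage
					if err := dec.Decode(&elem); err != nil {
						return err
					}
					if err := emit(elem); err != nil {
						return err
					}
				}
				return nil
			}
		}
		sawKey = false
	}
}

func main() {
	dec := json.NewDecoder(os.Stdin)
	var count int
	err := streamArrayUnder(dec, "records", func(json.RawMessage) error {
		count++
		return nil
	})
	if err != nil {
		log.Fatal(err)
	}
	fmt.Println(count)
}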

Regardless of what flag(s) we may add, the possibility remains that a user might ignore them and shoot themselves in the foot by trying to read such an array in its entirety, hitting the same memory problems. For this reason, it would be helpful to find a more graceful way to fail. #4025 already exists to cover the wider topic of memory management, so perhaps we can revisit this topic when that's taken up.
