Describe the bug
I was using wr.s3.download on a 2 GiB memory VM and noticed that when I download a 1006 MiB GZIP file from S3 it allocates ~2295 MiB, both with and without the use_threads parameter. It was measured using this memory profiler.

Obviously my script fails with an OOM error on the 2 GiB memory machine with 2 CPUs. dmesg gives a slightly different memory estimation:

$ dmesg | tail -1
Out of memory: Killed process 10020 (python3) total-vm:2573584kB, anon-rss:1644684kB, file-rss:4kB, shmem-rss:0kB, UID:1000 pgtables:3844kB oom_score_adj:0

It turns out that wr.s3.download by default uses botocore's s3.get_object and fits the whole response into memory:

aws-sdk-pandas/awswrangler/s3/_fs.py
Lines 65 to 75 in 7e83b89

Is it possible to chunkify reading of the botocore response in awswrangler to be more memory efficient?

For instance, using the following snippet I got my file without any issues on the same machine (here s3 is a boto3 S3 client and kwargs holds the Bucket/Key):

raw_stream = s3.get_object(**kwargs)["Body"]
with open("test_botocore_iter_chunks.gz", 'wb') as f:
    for chunk in iter(lambda: raw_stream.read(64 * 1024), b''):
        f.write(chunk)

I tried the wr.config.s3_block_size parameter expecting it to chunkify the response, but it does not help. After setting s3_block_size to be less than the file size you fall into this if condition:

aws-sdk-pandas/awswrangler/s3/_fs.py
Line 326 in 7e83b89

which just fits the whole response into memory.
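For reference, this is roughly what I tried; the bucket, key, and local path below are placeholders:

import awswrangler as wr

# Lower the block size well below the ~1006 MiB object, hoping reads become chunked
wr.config.s3_block_size = 8 * 1024 * 1024  # 8 MiB

# Memory usage is the same as with the default settings
wr.s3.download(path="s3://my-bucket/big-file.gz", local_file="/tmp/big-file.gz")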
How to Reproduce
Use a memory profiler on wr.s3.download of a ~1 GiB S3 object on a machine with 2 GiB of memory, as sketched below.
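A minimal repro sketch, assuming the memory_profiler package is installed and using placeholder bucket/key/paths:

from memory_profiler import profile

import awswrangler as wr

@profile
def download() -> None:
    # Download a ~1 GiB object; on a 2 GiB machine this gets OOM-killed
    wr.s3.download(path="s3://my-bucket/big-file.gz", local_file="/tmp/big-file.gz")

if __name__ == "__main__":
    download()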
Expected behavior
Please let me know if it is already possible to read the response in chunks.
Your project
No response
Screenshots
No response
OS
Linux
Python version
3.6.9 -- this is old, but I can double check on newer versions
AWS SDK for pandas version
2.14.0
Additional context
No response