wr.s3.download fits the whole file into memory, with 2x memory allocation #2831

Open
roykoand opened this issue May 22, 2024 · 0 comments
Labels
bug (Something isn't working), needs-triage

Describe the bug

I was using wr.s3.download on a VM with 2 GiB of memory and noticed that downloading a 1006 MiB GZIP file from S3 allocates ~2295 MiB, both with and without the use_threads parameter. This was measured with a memory profiler.

Naturally, my script fails with an OOM error on the 2 GiB machine with 2 CPUs. dmesg gives a slightly different memory estimate:

$ dmesg  | tail -1
Out of memory: Killed process 10020 (python3) total-vm:2573584kB, anon-rss:1644684kB, file-rss:4kB, shmem-rss:0kB, UID:1000 pgtables:3844kB oom_score_adj:0

It turns out that wr.s3.download by default calls botocore's s3.get_object and reads the whole response into memory:

resp = _utils.try_it(
    f=s3_client.get_object,
    ex=_S3_RETRYABLE_ERRORS,
    base=0.5,
    max_num_tries=6,
    Bucket=bucket,
    Key=key,
    Range=f"bytes={start}-{end - 1}",
    **boto3_kwargs,
)
return start, resp["Body"].read()

Would it be possible for awswrangler to read the botocore response in chunks, to be more memory efficient?

For instance, with the following snippet I downloaded the same file without any issues on the same machine:

import boto3

s3 = boto3.client("s3")
raw_stream = s3.get_object(**kwargs)["Body"]  # kwargs holds at least Bucket and Key

with open("test_botocore_iter_chunks.gz", "wb") as f:
    # read the streaming body in 64 KiB chunks instead of all at once
    for chunk in iter(lambda: raw_stream.read(64 * 1024), b""):
        f.write(chunk)
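
For reference, botocore's StreamingBody also exposes iter_chunks(), which gives the same streaming behaviour (a minimal sketch, reusing the s3 client and kwargs from above):

resp = s3.get_object(**kwargs)
with open("test_botocore_iter_chunks.gz", "wb") as f:
    for chunk in resp["Body"].iter_chunks(chunk_size=64 * 1024):
        f.write(chunk)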

I tried the wr.config.s3_block_size parameter, expecting it to chunk the response, but it does not help. After setting s3_block_size to a value smaller than the file size, you fall into this if condition:

if end - start >= self._s3_block_size: # Fetching length greater than cache length

which still reads the whole response into memory.
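
For completeness, my configuration attempt looked roughly like this (a sketch; the exact block size is not important, only that it is smaller than the ~1006 MiB object):

import awswrangler as wr

# 64 MiB block size, well below the object size, so reads should have been split
wr.config.s3_block_size = 64 * 1024 * 1024
wr.s3.download(path=path, local_file=local_file)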

How to Reproduce

Run a memory profiler on:

wr.s3.download(path, local_file)
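
A minimal repro sketch, assuming the memory_profiler package and a hypothetical ~1 GiB object at s3://my-bucket/big-file.gz:

import awswrangler as wr
from memory_profiler import profile

@profile
def download() -> None:
    # hypothetical object; per the numbers above, peak usage is roughly 2x the object size
    wr.s3.download(path="s3://my-bucket/big-file.gz", local_file="big-file.gz")

if __name__ == "__main__":
    download()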

Expected behavior

Please let me know if it is already possible to read the response in chunks.

Your project

No response

Screenshots

No response

OS

Linux

Python version

3.6.9 -- this is old, but I can double check on newer versions

AWS SDK for pandas version

2.14.0

Additional context

No response
