Describe the bug
I was using wr.s3.download on a 2 GiB memory VM and noticed that when I download a 1006 MiB GZIP file from S3 it allocates ~2295 MiB, both with and without the use_threads parameter. It was measured using this memory profiler.

Obviously my script fails with an OOM error on the 2 GiB memory machine with 2 CPUs. dmesg gives a slightly different memory estimation:

$ dmesg | tail -1
Out of memory: Killed process 10020 (python3) total-vm:2573584kB, anon-rss:1644684kB, file-rss:4kB, shmem-rss:0kB, UID:1000 pgtables:3844kB oom_score_adj:0

It turns out that wr.s3.download by default uses botocore's s3.get_object and fits the whole response into memory:

aws-sdk-pandas/awswrangler/s3/_fs.py
Lines 65 to 75 in 7e83b89

Is it possible to chunkify reading of the botocore response in awswrangler to be more memory efficient?

For instance, using the following snippet I got my file without any issues on the same machine (here s3 is a boto3 S3 client and kwargs holds the Bucket/Key):

raw_stream = s3.get_object(**kwargs)["Body"]
with open("test_botocore_iter_chunks.gz", 'wb') as f:
    for chunk in iter(lambda: raw_stream.read(64 * 1024), b''):
        f.write(chunk)

I tried the wr.config.s3_block_size parameter expecting it to chunkify the response, but it does not help. After setting s3_block_size to be less than the file size you fall into this if condition:

aws-sdk-pandas/awswrangler/s3/_fs.py
Line 326 in 7e83b89

which just fits the whole response into memory.
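For reference, this is roughly what I tried; the bucket, key, and local path below are placeholders:

import awswrangler as wr

# Lower the block size well below the ~1006 MiB object, hoping reads become chunked
wr.config.s3_block_size = 8 * 1024 * 1024  # 8 MiB

# Memory usage is the same as with the default settings
wr.s3.download(path="s3://my-bucket/big-file.gz", local_file="/tmp/big-file.gz")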
How to Reproduce
Use a memory profiler on wr.s3.download of a ~1 GiB S3 object on a machine with 2 GiB of memory, as sketched below.
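A minimal repro sketch, assuming the memory_profiler package is installed and using placeholder bucket/key/paths:

from memory_profiler import profile

import awswrangler as wr

@profile
def download() -> None:
    # Download a ~1 GiB object; on a 2 GiB machine this gets OOM-killed
    wr.s3.download(path="s3://my-bucket/big-file.gz", local_file="/tmp/big-file.gz")

if __name__ == "__main__":
    download()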
Expected behavior
Please let me know if it is already possible to read the response in chunks.
Your project
No response
Screenshots
No response
OS
Linux
Python version
3.6.9 -- this is old, but I can double check on newer versions
AWS SDK for pandas version
2.14.0
Additional context
No response