
[Performance] Escalate chunk partial reads to full chunk downloads #2526

Closed

Conversation


@nvoxland (Contributor) commented Aug 7, 2023

🚀 🚀 Pull Request

Impact

  • Bug fix (non-breaking change which fixes expected existing functionality)
  • Enhancement/New feature (adds functionality without impacting existing logic)
  • Breaking change (fix or feature that would cause existing functionality to change)

Description

For some workloads (such as optimizing a view), many requests are made to the underlying storage system for small byte ranges from the same chunk files. For example, if 100 items in a chunk are included in the optimized view, the copy transform reads each of those items as an individual byte-range download: bytes 10313-11400, then 11401-12500, then 12501-13600, and so on.

This PR enhances the LRUCache so that, rather than always passing byte-range requests along to the underlying storage, it checks whether "enough" pieces of a chunk have been requested previously. Once enough have been, it downloads the whole chunk at once and then serves subsequent byte ranges from the cached file.

The count of previously downloaded pieces is tracked against the file's entry in the LRUCache, so if the file falls out of the cache, the count restarts the next time the file is fetched.
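
Here is a minimal sketch of the escalation logic described above. Only the `_CHUNK_PARTIAL_READ_THRESHOLD` name comes from this PR; `EscalatingCache`, its `get_bytes` signature, and the storage interface are illustrative stand-ins, not deeplake's actual LRUCache API.

```python
_CHUNK_PARTIAL_READ_THRESHOLD = 5  # arbitrary value used in this PR


class EscalatingCache:
    """Hypothetical cache wrapper that escalates repeated partial reads
    of the same key into one full download."""

    def __init__(self, storage):
        self.storage = storage    # underlying provider (e.g. S3), assumed
        self.cache = {}           # key -> fully downloaded bytes
        self.partial_reads = {}   # key -> count of byte-range requests

    def get_bytes(self, key, start, end):
        # Serve from the cached full chunk if we already escalated.
        if key in self.cache:
            return self.cache[key][start:end]

        # Count partial reads; in the real change the count lives with the
        # cache entry, so it restarts if the file falls out of the cache.
        count = self.partial_reads.get(key, 0) + 1
        self.partial_reads[key] = count

        if count >= _CHUNK_PARTIAL_READ_THRESHOLD:
            # Escalate: download the whole chunk once, then slice locally.
            self.cache[key] = self.storage.get_bytes(key)
            del self.partial_reads[key]
            return self.cache[key][start:end]

        # Below the threshold: pass the byte-range request through.
        return self.storage.get_bytes(key, start, end)
```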

Things to be aware of

  • The "enough" constant of _CHUNK_PARTIAL_READ_THRESHOLD was set arbitrarily at 5. I didn't try to do any benchmarking with different sizes or how it imacts different workloads
  • I've not been able to see a wall-clock time difference when this is enabled, but I do see many less network requests. But maybe there is a small enough overhead on downloading pieces of s3 vs. the whole file that it overall doesn't matter in time?

Things to worry about

Is it correctly returning the requested byte portion after downloading the full chunk? I've not been able to test all the way through a view optimization because my sample dataset is so large. When I wasn't ignoring the header (LRUCache line 258), I got decoding errors, which suggests the current logic is right, but I'm not certain.
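
To make that worry concrete, here is a hedged self-check against the sketch above: slicing an escalated full download should match a direct range request byte for byte. `FakeStorage` and its `get_bytes` signature are assumptions made for this sketch, not deeplake's provider API.

```python
import os


class FakeStorage:
    """In-memory stand-in for a storage provider (assumed interface)."""

    def __init__(self, blobs):
        self.blobs = blobs

    def get_bytes(self, key, start=None, end=None):
        data = self.blobs[key]
        return data[start:end] if start is not None else data


blob = os.urandom(20_000)
storage = FakeStorage({"chunk0": blob})
cache = EscalatingCache(storage)

# Ten partial reads: the fifth triggers escalation to a full download,
# after which ranges are served by slicing the cached chunk. Every read
# must still match a direct slice of the original blob.
for start in range(0, 10_000, 1_000):
    got = cache.get_bytes("chunk0", start, start + 1_000)
    assert got == blob[start:start + 1_000]
```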

Commit: …er of reads to avoid the overhead of many small reads
@nvoxland nvoxland marked this pull request as draft August 7, 2023 18:27

@nvoxland (Contributor, Author) commented Sep 8, 2023

I'm going to close this as "probably not needed", and we can always find it again if we discover we do need it.

@nvoxland nvoxland closed this Sep 8, 2023
@activesoull activesoull deleted the escalate_chunk_downloads branch September 27, 2024 08:19