[Performance] Escalate chunk partial reads to full chunk downloads #2526
🚀 🚀 Pull Request
Impact
Description
For some workloads (such as optimizing a view), many requests are made to the underlying storage system for small byte ranges from the same chunk files. For example, if 100 items in a chunk are being included in the optimized view, the copy transform reads each of those items as a separate byte-range download: bytes 10313-11400, then 11401-12500, then 12501-13600, and so on.
This PR enhances the LRUCache so that, rather than always passing byte range requests along to the underlying storage, it checks whether "enough" pieces of a chunk have been requested previously. Once enough have been, it downloads the whole chunk at once and serves subsequent byte ranges from the cached file.
The count of previously downloaded pieces is based on what the LRUCache currently holds for that file. So if the file falls out of the cache, the count restarts the next time it's fetched.
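To make the escalation behaviour concrete, here is a minimal, self-contained sketch of the idea. It is not the code in this PR: `DictStorage`, `EscalatingCache`, and their method names are hypothetical stand-ins, and the choice to escalate on the read after the threshold is reached (rather than on the threshold-th read itself) is an assumption made for illustration.

```python
from collections import OrderedDict

PARTIAL_READ_THRESHOLD = 5  # same arbitrary value this PR uses


class DictStorage:
    """Hypothetical in-memory stand-in for the underlying storage.

    It counts requests so the effect of escalation is observable.
    """

    def __init__(self, blobs):
        self.blobs = dict(blobs)
        self.range_requests = 0
        self.full_requests = 0

    def get_bytes(self, path, start, end):
        self.range_requests += 1
        return self.blobs[path][start:end]

    def get_full(self, path):
        self.full_requests += 1
        return self.blobs[path]


class EscalatingCache:
    """Hypothetical LRU cache that escalates partial reads to a full download."""

    def __init__(self, storage, max_items=2, threshold=PARTIAL_READ_THRESHOLD):
        self.storage = storage
        self.max_items = max_items
        self.threshold = threshold
        # path -> int (partial reads seen so far) or bytes (the full chunk).
        # The counter lives in the cache entry, so eviction resets it.
        self.entries = OrderedDict()

    def get_bytes(self, path, start, end):
        entry = self.entries.get(path, 0)
        if isinstance(entry, bytes):
            # Full chunk already cached: serve the slice locally.
            self.entries.move_to_end(path)
            return entry[start:end]
        if entry >= self.threshold:
            # Enough partial reads seen: download the whole chunk once.
            data = self.storage.get_full(path)
            self._put(path, data)
            return data[start:end]
        # Below the threshold: count the read and pass the range through.
        self._put(path, entry + 1)
        return self.storage.get_bytes(path, start, end)

    def _put(self, path, value):
        self.entries[path] = value
        self.entries.move_to_end(path)
        while len(self.entries) > self.max_items:
            self.entries.popitem(last=False)  # evict least recently used
```

With this sketch, ten sequential range reads against one chunk result in five range requests and one full download, with the remaining reads served from the cache. Because the counter is stored in the cache entry itself, evicting the entry restarts the count, which matches the behaviour described above.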
Things to be aware of
_CHUNK_PARTIAL_READ_THRESHOLD was set arbitrarily at 5. I didn't benchmark other values or measure how the threshold impacts different workloads.
Things to worry about
Is it correctly returning the byte portion after downloading? I've not been able to test all the way through a view optimization because my sample dataset is so large. When I wasn't ignoring the header (LRUCache line 258), though, it gave decoding errors, so that logic seems to be right.