
[Performance] Escalate chunk partial reads to full chunk downloads #2526

Closed

Conversation


@nvoxland (Contributor) commented Aug 7, 2023

🚀 🚀 Pull Request

Impact

  • Bug fix (non-breaking change which fixes expected existing functionality)
  • Enhancement/New feature (adds functionality without impacting existing logic)
  • Breaking change (fix or feature that would cause existing functionality to change)

Description

For some workloads (such as optimizing a view), many requests are made to the underlying storage system for small byte ranges from the same chunk files. For example, if 100 items in a chunk are included in the optimized view, the copy transform reads each of those items as an individual byte-range download: bytes 10313-11400, then 11401-12500, then 12501-13600, and so on.

This PR enhances the LRUCache so that, rather than always passing byte-range requests along to the underlying storage, it checks whether "enough" pieces of a chunk have been requested previously. Once enough have been, it downloads the whole chunk at once and then serves subsequent byte ranges from the cached file.

The count of previously downloaded pieces is tracked against the file's entry in the LRUCache, so if the file falls out of the cache, the count restarts the next time the file is fetched.
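
Here is a minimal sketch of the escalation logic described above. Only the `_CHUNK_PARTIAL_READ_THRESHOLD` name comes from this PR; `EscalatingCache`, its `get_bytes` signature, and the storage interface are illustrative stand-ins, not deeplake's actual LRUCache API.

```python
_CHUNK_PARTIAL_READ_THRESHOLD = 5  # arbitrary value used in this PR


class EscalatingCache:
    """Hypothetical cache wrapper that escalates repeated partial reads
    of the same key into one full download."""

    def __init__(self, storage):
        self.storage = storage    # underlying provider (e.g. S3), assumed
        self.cache = {}           # key -> fully downloaded bytes
        self.partial_reads = {}   # key -> count of byte-range requests

    def get_bytes(self, key, start, end):
        # Serve from the cached full chunk if we already escalated.
        if key in self.cache:
            return self.cache[key][start:end]

        # Count partial reads; in the real change the count lives with the
        # cache entry, so it restarts if the file falls out of the cache.
        count = self.partial_reads.get(key, 0) + 1
        self.partial_reads[key] = count

        if count >= _CHUNK_PARTIAL_READ_THRESHOLD:
            # Escalate: download the whole chunk once, then slice locally.
            self.cache[key] = self.storage.get_bytes(key)
            del self.partial_reads[key]
            return self.cache[key][start:end]

        # Below the threshold: pass the byte-range request through.
        return self.storage.get_bytes(key, start, end)
```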

Things to be aware of

  • The "enough" constant of _CHUNK_PARTIAL_READ_THRESHOLD was set arbitrarily at 5. I didn't try to do any benchmarking with different sizes or how it imacts different workloads
  • I've not been able to see a wall-clock time difference when this is enabled, but I do see many less network requests. But maybe there is a small enough overhead on downloading pieces of s3 vs. the whole file that it overall doesn't matter in time?

Things to worry about

Is it correctly returning the requested byte portion after downloading the full chunk? I've not been able to test all the way through a view optimization because my sample dataset is so large. When I wasn't ignoring the header (LRUCache line 258), I got decoding errors, which suggests the current logic is right, but I'm not certain.
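
To make that worry concrete, here is a hedged self-check against the sketch above: slicing an escalated full download should match a direct range request byte for byte. `FakeStorage` and its `get_bytes` signature are assumptions made for this sketch, not deeplake's provider API.

```python
import os


class FakeStorage:
    """In-memory stand-in for a storage provider (assumed interface)."""

    def __init__(self, blobs):
        self.blobs = blobs

    def get_bytes(self, key, start=None, end=None):
        data = self.blobs[key]
        return data[start:end] if start is not None else data


blob = os.urandom(20_000)
storage = FakeStorage({"chunk0": blob})
cache = EscalatingCache(storage)

# Ten partial reads: the fifth triggers escalation to a full download,
# after which ranges are served by slicing the cached chunk. Every read
# must still match a direct slice of the original blob.
for start in range(0, 10_000, 1_000):
    got = cache.get_bytes("chunk0", start, start + 1_000)
    assert got == blob[start:start + 1_000]
```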

Commit: …er of reads to avoid the overhead of many small reads
@nvoxland nvoxland marked this pull request as draft August 7, 2023 18:27

@nvoxland (Contributor, Author) commented Sep 8, 2023

I'm going to close this as "probably not needed", and we can always find it again if we discover we do need it.

@nvoxland nvoxland closed this Sep 8, 2023
@activesoull activesoull deleted the escalate_chunk_downloads branch September 27, 2024 08:19