
[V1] VLM prefix caching: Add hashing of images #10497

Draft · wants to merge 1 commit into main
Conversation

@alexm-neuralmagic (Collaborator) commented Nov 20, 2024

As part of V1 VLM prefix caching, we need to support hashing of images. This PR adds logic to hash images and pipes the hashes down to the model runner (if needed). Currently it uses a cryptographic hash, so the match between an image and its hash is exact; it would also be possible to use a less precise (perceptual) hash to match "similar" images. The library used for hashing is blake3, which seems to be quite efficient.

As an example, hashing a 1770x1180 RGB PIL image takes 1.6 ms for image.tobytes() and 0.8 ms to hash all of the image bytes (1770 × 1180 × 3 = 6,265,800 bytes). Log print:

req.mm_data = {'image': <PIL.Image.Image image mode=RGB size=1770x1180 at 0x7E871606C1F0>}
tobytes time = 0.0016231536865234375
hash time = 0.0008261203765869141
hash value = 9de5230fb3052dee056a26e71a545552207495528a763a15ad98d31961f53578

For reference, running the HF mapper/preprocessor may take 10-50 ms per image.
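
For illustration, a minimal sketch of the timing experiment above, using the blake3 Python package over the raw PIL image bytes; the synthetic image and timing harness are assumptions, not code from this PR:

```python
import time

import blake3
from PIL import Image

# Synthetic stand-in for a request image of the size mentioned above.
image = Image.new("RGB", (1770, 1180))

start = time.perf_counter()
data = image.tobytes()  # 1770 * 1180 * 3 = 6,265,800 bytes for RGB
tobytes_time = time.perf_counter() - start

start = time.perf_counter()
digest = blake3.blake3(data).hexdigest()  # cryptographic, so matches are exact
hash_time = time.perf_counter() - start

print(f"tobytes time = {tobytes_time}")
print(f"hash time = {hash_time}")
print(f"hash value = {digest}")
```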


👋 Hi! Thank you for contributing to the vLLM project.
Just a reminder: PRs do not trigger a full CI run by default. Instead, they only run fastcheck CI, which runs a small and essential subset of CI tests to quickly catch errors. You can run other CI tests on top of those by going to your fastcheck build on the Buildkite UI (linked in the PR checks section) and unblocking them. If you do not have permission to unblock, ping simon-mo or khluu to add you to our Buildkite org.

Once the PR is approved and ready to go, your PR reviewer(s) can run CI to test the changes comprehensively before merging.

To run CI, PR reviewers can do one of these:

  • Add the ready label to the PR
  • Enable auto-merge.

🚀

@alexm-neuralmagic alexm-neuralmagic self-assigned this Nov 20, 2024
@alexm-neuralmagic alexm-neuralmagic marked this pull request as draft November 20, 2024 17:13
@comaniac (Collaborator)

cc @rickyyx @yue-anyscale

@@ -101,6 +131,9 @@ def add_request(self, request: EngineCoreRequest):
# take 10-50 ms, which can cause a spike in the latency. We should
# consider moving this to a separate thread.
if req.mm_data:

@rickyyx (Contributor)

Thoughts on doing this on the frontend engine process (i.e. v1/engine/processor.py::Processor) before sending to the EngineCore?

IIUC, this add_request is called in the EngineCore process, meaning the hashing would also synchronously block the model executor?

@ywang96 (Member) commented Nov 20, 2024

Yeah, this is already planned. Eventually the multimodal data processor will live on the frontend, together with the input token sequence processor. #10044 is working towards this direction.

@alexm-neuralmagic (Collaborator, Author)

@rickyyx I think it is a good idea, I can try it.
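
For illustration, a minimal sketch of what hashing on the frontend Processor (v1/engine/processor.py) could look like before the request is sent to EngineCore; the helper hash_mm_data and the mm_hashes field are hypothetical, not the actual vLLM API:

```python
import blake3


def hash_mm_data(mm_data):
    """Return a blake3 hex digest per multimodal item (assumes PIL images)."""
    if not mm_data:
        return None
    # Each item is assumed to expose tobytes(); other modalities would need
    # their own byte serialization before hashing.
    return {key: blake3.blake3(item.tobytes()).hexdigest()
            for key, item in mm_data.items()}


# Hypothetical call site in the frontend Processor, before the request is
# handed to EngineCore, so hashing does not block the model executor:
# request.mm_hashes = hash_mm_data(request.mm_data)
# engine_core_client.add_request(request)
```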

4 participants