Use vLLM to load LLMs #230

Open · wants to merge 2 commits into main
Conversation

kyriediculous (Contributor)

This PR upgrades the LLM pipeline to use vLLM for loading models and performing inference, taking advantage of optimised batching and other features that come with vLLM.
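
As a rough illustration of the vLLM API the pipeline builds on (the model name and parameter values below are assumptions for the example, not the pipeline's actual configuration):

```python
from vllm import LLM, SamplingParams

llm = LLM(
    model="meta-llama/Meta-Llama-3.1-8B-Instruct",  # illustrative model, not the pipeline's default
    dtype="float16",                # fp16 weights, as the pipeline already supports
    gpu_memory_utilization=0.90,    # fraction of GPU memory vLLM may reserve for weights + KV cache
)

params = SamplingParams(temperature=0.7, max_tokens=256)

# A single generate() call over multiple prompts lets vLLM batch them internally
# (continuous batching), which is the main throughput win over per-request inference.
prompts = ["What is Livepeer?", "Explain video transcoding in one sentence."]
for output in llm.generate(prompts, params):
    print(output.outputs[0].text)
```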

Dependencies have been upgraded to be compatible with vLLM 0.6.3. These new dependency versions are untested with other pipelines (though they could benefit from them as well).

  • Both fp16 and 8-bit quantization are still supported, but this could be further optimised by detecting the GPUs on the machine and adjusting the quantization method accordingly (a sketch of this follows the list).

  • Docker file has been updated to use newer pip and torch versions.

  • Docker file has been updated to respect CUDA_PCI_BUS_ORDER, ensuring the same developer experience as go-livepeer when specifying GPU IDs found in nvidia-smi.

  • Adds Top_P and Top_K parameters to the LLM route (see the sampling-parameter sketch after this list).
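
A hypothetical sketch of the GPU-aware quantization selection suggested in the first bullet; the compute-capability threshold and the fp8 choice are assumptions, not something this PR implements:

```python
import torch
from vllm import LLM

def pick_quantization() -> str | None:
    """Pick a vLLM quantization mode from the detected GPU (illustrative heuristic)."""
    major, minor = torch.cuda.get_device_capability(0)
    # Assumption: newer GPUs (compute capability 8.9+, i.e. Ada/Hopper) get fp8
    # weight quantization, everything else falls back to plain fp16 weights.
    if (major, minor) >= (8, 9):
        return "fp8"
    return None

llm = LLM(
    model="meta-llama/Meta-Llama-3.1-8B-Instruct",  # illustrative model
    dtype="float16",
    quantization=pick_quantization(),  # None means no quantization (fp16 only)
)
```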
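A small sketch of how the new Top_P and Top_K route parameters could map onto vLLM's SamplingParams; the field names and defaults here are assumptions, not the exact route schema:

```python
from vllm import SamplingParams

def build_sampling_params(
    temperature: float = 0.7,
    top_p: float = 1.0,   # nucleus-sampling cutoff; 1.0 effectively disables it
    top_k: int = -1,      # vLLM convention: -1 means "consider all tokens"
    max_tokens: int = 256,
) -> SamplingParams:
    # Forward the route-level knobs straight into vLLM's sampling configuration.
    return SamplingParams(
        temperature=temperature,
        top_p=top_p,
        top_k=top_k,
        max_tokens=max_tokens,
    )
```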
