This PR upgrades the LLM pipeline to use vLLM for model loading and inference, taking advantage of vLLM's optimised batching and other built-in features.
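For reference, a minimal sketch of how a pipeline might load a model and run batched inference with vLLM; the model id, prompts, and sampling values here are placeholders, not the pipeline's actual configuration:

```python
from vllm import LLM, SamplingParams

# Placeholder model id; the real pipeline model is configured elsewhere.
llm = LLM(model="meta-llama/Meta-Llama-3.1-8B-Instruct", dtype="float16")

# vLLM batches these requests internally (continuous batching).
prompts = [
    "Explain what a transcoder does.",
    "Summarize the benefits of GPU inference.",
]
params = SamplingParams(temperature=0.7, max_tokens=128)

for output in llm.generate(prompts, params):
    print(output.outputs[0].text)
```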
Dependencies have been upgraded to be compatible with vLLM 0.6.3. These new dependency versions are untested with the other pipelines (though they could benefit them as well).
Both fp16 and 8-bit quantization are still supported, but this could be further optimized by detecting the GPUs on the machine and adjusting the quantization method accordingly.
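As an illustration of that possible optimization (not part of this PR), the pipeline could inspect available VRAM with `torch.cuda` and pick engine settings before constructing the vLLM engine; the 16 GiB cutoff and the bitsandbytes load format below are assumptions, not values from this change:

```python
import torch

def pick_quantization() -> dict:
    """Choose engine kwargs from detected GPU memory (illustrative heuristic)."""
    if not torch.cuda.is_available():
        raise RuntimeError("LLM pipeline requires a CUDA GPU")
    total_gib = torch.cuda.get_device_properties(0).total_memory / 1024**3
    if total_gib < 16:  # assumed cutoff, not from this PR
        # 8-bit weights via bitsandbytes (available in vLLM 0.6.x).
        return {"quantization": "bitsandbytes", "load_format": "bitsandbytes"}
    return {"dtype": "float16"}
```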
Docker file has been updated to use newer `pip` and `torch`.
Docker file has been updated to respect `CUDA_PCI_BUS_ORDER`, ensuring the same development experience as go-livepeer when specifying the GPU IDs found in `nvidia-smi`.
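For context, the standard CUDA mechanism behind this is the `CUDA_DEVICE_ORDER=PCI_BUS_ID` environment variable; the sketch below shows how a pipeline process might honour it so that `CUDA_VISIBLE_DEVICES` indices line up with `nvidia-smi` (the exact wiring in the Dockerfile may differ):

```python
import os

# Must be set before CUDA is initialised (i.e. before importing torch/vllm)
# so device indices match the PCI bus order reported by nvidia-smi.
os.environ.setdefault("CUDA_DEVICE_ORDER", "PCI_BUS_ID")
os.environ.setdefault("CUDA_VISIBLE_DEVICES", "0")  # GPU id as shown by nvidia-smi

import torch  # noqa: E402  (imported after the env vars on purpose)

print(torch.cuda.get_device_name(0))
```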
Adds `Top_P` and `Top_K` parameters to the LLM route.
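These map directly onto vLLM's sampling options; below is a minimal sketch of how the new route parameters might be forwarded to the engine (the helper name, parameter names, and defaults are illustrative, not the route's exact schema):

```python
from vllm import SamplingParams

def build_sampling_params(top_p: float = 1.0, top_k: int = -1,
                          temperature: float = 0.7,
                          max_tokens: int = 256) -> SamplingParams:
    """Translate request fields into vLLM sampling settings (top_k=-1 disables it)."""
    return SamplingParams(
        temperature=temperature,
        top_p=top_p,
        top_k=top_k,
        max_tokens=max_tokens,
    )
```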