This PR upgrades the LLM pipeline to use vLLM for model loading and inference, taking advantage of vLLM's optimised batching and other built-in features.
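For reference, a minimal sketch of how a pipeline might load a model and run batched inference with vLLM; the model id, prompts, and sampling values here are placeholders, not the pipeline's actual configuration:

```python
from vllm import LLM, SamplingParams

# Placeholder model id; the real pipeline model is configured elsewhere.
llm = LLM(model="meta-llama/Meta-Llama-3.1-8B-Instruct", dtype="float16")

# vLLM batches these requests internally (continuous batching).
prompts = [
    "Explain what a transcoder does.",
    "Summarize the benefits of GPU inference.",
]
params = SamplingParams(temperature=0.7, max_tokens=128)

for output in llm.generate(prompts, params):
    print(output.outputs[0].text)
```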
Dependencies have been upgraded to be compatible with vLLM 0.6.3. These new dependency versions are untested with the other pipelines (though they could benefit them as well).
Both fp16 and 8-bit quantization are still supported, but this could be further optimized by detecting the GPUs on the machine and adjusting the quantization method accordingly.
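As an illustration of that possible optimization (not part of this PR), the pipeline could inspect available VRAM with `torch.cuda` and pick engine settings before constructing the vLLM engine; the 16 GiB cutoff and the bitsandbytes load format below are assumptions, not values from this change:

```python
import torch

def pick_quantization() -> dict:
    """Choose engine kwargs from detected GPU memory (illustrative heuristic)."""
    if not torch.cuda.is_available():
        raise RuntimeError("LLM pipeline requires a CUDA GPU")
    total_gib = torch.cuda.get_device_properties(0).total_memory / 1024**3
    if total_gib < 16:  # assumed cutoff, not from this PR
        # 8-bit weights via bitsandbytes (available in vLLM 0.6.x).
        return {"quantization": "bitsandbytes", "load_format": "bitsandbytes"}
    return {"dtype": "float16"}
```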
Docker file has been updated to use newer `pip` and `torch`.
Docker file has been updated to respect `CUDA_PCI_BUS_ORDER`, ensuring the same development experience as go-livepeer when specifying the GPU IDs found in `nvidia-smi`.
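For context, the standard CUDA mechanism behind this is the `CUDA_DEVICE_ORDER=PCI_BUS_ID` environment variable; the sketch below shows how a pipeline process might honour it so that `CUDA_VISIBLE_DEVICES` indices line up with `nvidia-smi` (the exact wiring in the Dockerfile may differ):

```python
import os

# Must be set before CUDA is initialised (i.e. before importing torch/vllm)
# so device indices match the PCI bus order reported by nvidia-smi.
os.environ.setdefault("CUDA_DEVICE_ORDER", "PCI_BUS_ID")
os.environ.setdefault("CUDA_VISIBLE_DEVICES", "0")  # GPU id as shown by nvidia-smi

import torch  # noqa: E402  (imported after the env vars on purpose)

print(torch.cuda.get_device_name(0))
```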
Adds `Top_P` and `Top_K` parameters to the LLM route.
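These map directly onto vLLM's sampling options; below is a minimal sketch of how the new route parameters might be forwarded to the engine (the helper name, parameter names, and defaults are illustrative, not the route's exact schema):

```python
from vllm import SamplingParams

def build_sampling_params(top_p: float = 1.0, top_k: int = -1,
                          temperature: float = 0.7,
                          max_tokens: int = 256) -> SamplingParams:
    """Translate request fields into vLLM sampling settings (top_k=-1 disables it)."""
    return SamplingParams(
        temperature=temperature,
        top_p=top_p,
        top_k=top_k,
        max_tokens=max_tokens,
    )
```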