
Consider support for vllm-hosted models? #140

Open
cboettig opened this issue Oct 30, 2024 · 10 comments · May be fixed by #148

Comments

@cboettig

Hi @hadley, thanks for sharing this, really exciting.

Very nice to see support for open models via ollama. I wonder if you would consider adding support for VLLM-hosted models as well, e.g. see https://docs.vllm.ai/en/latest/serving/openai_compatible_server.html#tool-calling-in-the-chat-completion-api

Overall these use the OpenAI API, and indeed we can already do basic chat with these models out of the box just by dropping a base_url into chat_openai(). But it looks like tool use may require a bit more tweaking? I'm not sure, since it works out of the box with the OpenAI interfaces in langchain, but when I try my tool-use examples in elmer I get this error:

OpenAIException - Error code: 400 - {'object': 'error', 'message': "[{'type': 'extra_forbidden', 'loc': ('body', 'tools', 0, 'function', 'strict'), 'msg': 'Extra inputs are not permitted', 'input': True}, {'type': 'extra_forbidden', 'loc': ('body', 'tools', 1, 'function', 'strict'), 'msg': 'Extra inputs are not permitted', 'input': True}]", 'type': 'BadRequestError', 'param': None, 'code': 400} 
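
For reference, plain chat does work with just the base_url swap; here's a minimal sketch (the endpoint is our LiteLLM/vllm proxy, and the env var name is just what I use for the token):

library(elmer)
chat <- chat_openai(
  base_url = "https://llm.nrp-nautilus.io/",
  api_key = Sys.getenv("VLLM_KEY"),
  model = "gorilla"
)
chat$chat("Hello! Which model are you?")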

Context: ollama is great for individual use, but expecting every student to have access to a local GPU is often difficult in large classroom settings, and energy-inefficient as well. A natural solution we've been exploring this semester is to host the model on a single node using vllm, which handles many concurrent requests better than ollama, and serve it to the students as an API, e.g. via LiteLLM. We're testing this with the fully open-source function-calling model Gorilla-LLM, developed here at Berkeley (https://gorilla.cs.berkeley.edu/). Would be really keen to do this in R as well.

@hadley (Member) commented Oct 30, 2024

@cboettig sure, happy to look into it. What's the easiest way for me to get access to a live vllm endpoint?

@cboettig (Author)

@hadley amazing! Can I just email you a token for this NAIRR test platform we're using?

For comparison here's a quick gist showing tool-use via langchain on the same platform: https://gist.github.com/cboettig/6359fc23a637d328e5fa25a40499abb2

@hadley (Member) commented Oct 30, 2024

Is there an easy way for you to see what JSON langchain is sending to the server? That would make it easier for me to figure out what elmer is doing wrong.

@hadley (Member) commented Oct 31, 2024

library(elmer)
chat <- chat_openai(
  base_url = "https://llm.nrp-nautilus.io/",
  api_key = Sys.getenv("VLLM_KEY"),
  model = "gorilla"
)
chat$register_tool(tool(function() "2024-01-01", "Gets the current date"))
chat$chat("What's the current date in YMD format?")
#> Error in `req_perform_connection()`:
#> ! HTTP 400 Bad Request.
#> • litellm.BadRequestError: OpenAIException - Error code: 400 - {'object': 'error', 'message': "[{'type': 'extra_forbidden', 'loc': ('body', 'tools', 0, 'function', 'strict'), 'msg': 'Extra inputs are not permitted', 'input': True}]", 'type': 'BadRequestError', 'param': None, 'code': 400}
#> Received Model Group=gorilla
#> Available Model Group Fallbacks=None

elmer:::last_request_json()
#> {
#>   "messages": [
#>     {
#>       "role": "user",
#>       "content": [
#>         {
#>           "type": "text",
#>           "text": "What's the current date in YMD format?"
#>         }
#>       ]
#>     }
#>   ],
#>   "model": "gorilla",
#>   "stream": true,
#>   "stream_options": {
#>     "include_usage": true
#>   },
#>   "tools": [
#>     {
#>       "type": "function",
#>       "function": {
#>         "name": "tool_001",
#>         "description": "Gets the current date",
#>         "strict": true,
#>         "parameters": {
#>           "type": "object",
#>           "description": "",
#>           "properties": {},
#>           "required": [],
#>           "additionalProperties": false
#>         }
#>       }
#>     }
#>   ]
#> }

Created on 2024-10-31 with reprex v2.1.0

@hadley (Member) commented Oct 31, 2024

Ooooh, I get it; I'm sending strict because I think that's generally what you want for OpenAI, but it's clearly not supported by vllm. I think the best approach is probably to make a specific vllm backend. (Especially since the default model will be different.)
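
For reference, here's a minimal sketch using httr2 directly (not elmer's internals) of the request shape vllm appears to accept: the same OpenAI-style tool definition, just with strict left out. The endpoint, model, and env var are taken from the reprex above; the tool name and the exact path (with or without /v1) are assumptions.

library(httr2)
req <- request("https://llm.nrp-nautilus.io/v1/chat/completions") |>
  req_auth_bearer_token(Sys.getenv("VLLM_KEY")) |>
  req_body_json(list(
    model = "gorilla",
    messages = list(
      list(role = "user", content = "What's the current date in YMD format?")
    ),
    tools = list(
      list(
        type = "function",
        `function` = list(
          name = "get_current_date",  # illustrative tool name
          description = "Gets the current date",
          parameters = list(
            type = "object",
            properties = structure(list(), names = character()),  # serialises to {}
            required = list()
          )
          # note: no `strict` field here
        )
      )
    )
  ))
# resp <- req_perform(req)
# resp_body_json(resp)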

@cboettig (Author)

@hadley nice! Yeah, it looks like langchain is just sending the JSON below, with no strict, like you said.

I'm a total novice at this, but here are some notes from what I've gleaned while setting things up with VLLM for tool parsing. Clearly the 'standards' are still shaking out here; it looks like OpenAI introduced strict in August.

Because VLLM supports a wide range of models that do function calling differently, it has configuration options for the --tool-call-parser. This gorilla model uses the llama3_json parser.

Anyway, I think a vllm-specific backend would be awesome, and a prototype based on the openai one but without strict would get us off the ground here! But down the road it might need pluggable options to support the different tool-call parsers of different vllm models?

{
    "tools": [
        {
            "type": "function",
            "function": {
                "name": "multiply",
                "description": "Multiply 'x' times 'y'.",
                "parameters": {
                    "type": "object",
                    "properties": {
                        "x": {"type": "number"},
                        "y": {"type": "number"}
                    },
                    "required": ["x", "y"]
                }
            }
        },
        {
            "type": "function",
            "function": {
                "name": "exponentiate",
                "description": "Raise 'x' to the 'y'.",
                "parameters": {
                    "type": "object",
                    "properties": {
                        "x": {"type": "number"},
                        "y": {"type": "number"}
                    },
                    "required": ["x", "y"]
                }
            }
        },
        {
            "type": "function",
            "function": {
                "name": "add",
                "description": "Add 'x' and 'y'.",
                "parameters": {
                    "type": "object",
                    "properties": {
                        "x": {"type": "number"},
                        "y": {"type": "number"}
                    },
                    "required": ["x", "y"]
                }
            }
        }
    ]
}

@hadley (Member) commented Oct 31, 2024

Yeah, I think once there's a vllm provider, it's easy enough to customise. I'll probably take a stab at it in the next week or two, unless you get to it first 😄

@hadley (Member) commented Oct 31, 2024

We can use https://platform.openai.com/docs/api-reference/models/list to list all available models (to provide an interface like ollama's).
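
For reference, a rough sketch of what that could look like with httr2 (the function name is illustrative and this isn't elmer's API; the base URL and env var are the ones used above):

library(httr2)
vllm_models <- function(base_url = "https://llm.nrp-nautilus.io",
                        api_key = Sys.getenv("VLLM_KEY")) {
  resp <- request(base_url) |>
    req_url_path_append("v1", "models") |>
    req_auth_bearer_token(api_key) |>
    req_perform() |>
    resp_body_json()
  # return just the model ids, like ollama's model listing
  vapply(resp$data, function(m) m$id, character(1))
}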

@cboettig (Author)

Yup!

Just for clarity: the web API here is actually LiteLLM (see the endpoints documented at https://litellm-api.up.railway.app/), which includes the models list.

curl https://llm.nrp-nautilus.io/v1/models -H "Authorization: Bearer $OPENAI_API_KEY"

{"data":[{"id":"phi3","object":"model","created":1677610602,"owned_by":"openai"},
{"id":"mixtral","object":"model","created":1677610602,"owned_by":"openai"},
{"id":"llava","object":"model","created":1677610602,"owned_by":"openai"},
{"id":"embed-mistral","object":"model","created":1677610602,"owned_by":"openai"},
{"id":"gemma2","object":"model","created":1677610602,"owned_by":"openai"},
{"id":"gorilla","object":"model","created":1677610602,"owned_by":"openai"},
{"id":"llama3","object":"model","created":1677610602,"owned_by":"openai"}],"object":"list"}

(Obviously the metadata here isn't very meaningful, and these models aren't actually owned by 'openai'.)

On the NRP platform, the LiteLLM API sits in front of the VLLMs: each model is hosted as its own separate vllm app, which the user can then select via the litellm API. All the endpoints in the LiteLLM docs should work given a token. (Sorry, I know this complicates whether we think of this as the 'vllm' backend or the 'litellm' backend.)

It's possible to use LiteLLM as an interface to non-VLLM-backed models, including commercial ones. VLLM is nice both because it supports many concurrent requests and because it deploys on a wide variety of GPU hardware; it can directly serve most models on huggingface (and elsewhere), instead of hoping someone packages them for Ollama. I believe this streamlines the process of adding new models and encourages development and competition among these models.

@cboettig (Author) commented Nov 6, 2024

Also another small clarification: while we have this hosted version set up to serve a large classroom of students without local installations, if you have a local (or remote) GPU you can pip install vllm and serve models just as you do with ollama. (It works nicely with nvidia and most other GPUs used by cloud providers, though at this time I don't think it supports apple's chips.)

e.g. for function calling this would just be:

vllm serve gorilla-llm/gorilla-openfunctions-v2 --dtype=half --trust-remote-code --enable-auto-tool-choice  --tool-call-parser llama3_json

(add --host and --port if necessary)

By default the model name is the huggingface repo id (e.g. https://huggingface.co/gorilla-llm/gorilla-openfunctions-v2). (Apologies if this was obvious or not relevant; I'm just learning as I go here.)
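
If it helps, a minimal sketch of pointing elmer at such a local server (the port and /v1 path are vllm's defaults as I understand them, and local servers typically accept a placeholder API key):

library(elmer)
chat <- chat_openai(
  base_url = "http://localhost:8000/v1",
  api_key = "EMPTY",  # placeholder; local vllm doesn't require a real key by default
  model = "gorilla-llm/gorilla-openfunctions-v2"
)
chat$chat("Hello!")
# tool registration would still hit the `strict` issue above until a
# vllm-specific provider drops that field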

hadley added a commit that referenced this issue on Nov 7, 2024
hadley linked a pull request on Nov 7, 2024 that will close this issue