
Consider support for vllm-hosted models? #140

Open
cboettig opened this issue Oct 30, 2024 · 10 comments · May be fixed by #148

Comments

@cboettig

Hi @hadley, thanks for sharing this, really exciting.

Very nice to see support for open models via ollama. I wonder if you would consider adding support for VLLM-hosted models as well, e.g. see https://docs.vllm.ai/en/latest/serving/openai_compatible_server.html#tool-calling-in-the-chat-completion-api

Overall these use the OpenAI API, and indeed we can already do basic chat with these models out of the box just by dropping a base_url into chat_openai(). But it looks like tool use may require a bit more tweaking? I'm not sure, since it works out of the box with the OpenAI interfaces in langchain, but when I try my tool-use examples in elmer I get this error:

OpenAIException - Error code: 400 - {'object': 'error', 'message': "[{'type': 'extra_forbidden', 'loc': ('body', 'tools', 0, 'function', 'strict'), 'msg': 'Extra inputs are not permitted', 'input': True}, {'type': 'extra_forbidden', 'loc': ('body', 'tools', 1, 'function', 'strict'), 'msg': 'Extra inputs are not permitted', 'input': True}]", 'type': 'BadRequestError', 'param': None, 'code': 400} 
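
For reference, plain chat does work with just the base_url swap; here's a minimal sketch (the endpoint is our LiteLLM/vllm proxy, and the env var name is just what I use for the token):

library(elmer)
chat <- chat_openai(
  base_url = "https://llm.nrp-nautilus.io/",
  api_key = Sys.getenv("VLLM_KEY"),
  model = "gorilla"
)
chat$chat("Hello! Which model are you?")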

Context: ollama is great for individual use, but expecting every student to have access to a local GPU is often difficult in large classroom settings, and energy-inefficient as well. A natural solution we've been exploring this semester is to host the model on a single node using vllm, which handles many concurrent requests better than ollama, and serve it to the students as an API, e.g. via LiteLLM. We're testing this with the fully open-source function-calling model Gorilla-LLM, developed here at Berkeley (https://gorilla.cs.berkeley.edu/). Would be really keen to do this in R as well.

@hadley (Member) commented Oct 30, 2024

@cboettig sure, happy to look into it. What's the easiest way for me to get access to a live vllm endpoint?

@cboettig (Author)

@hadley amazing! Can I just email you a token for this NAIRR test platform we're using?

For comparison here's a quick gist showing tool-use via langchain on the same platform: https://gist.github.com/cboettig/6359fc23a637d328e5fa25a40499abb2

@hadley (Member) commented Oct 30, 2024

Is there an easy way for you to see what JSON langchain is sending to the server? That would make it easier for me to figure out what elmer is doing wrong.

@hadley (Member) commented Oct 31, 2024

library(elmer)
chat <- chat_openai(
  base_url = "https://llm.nrp-nautilus.io/",
  api_key = Sys.getenv("VLLM_KEY"),
  model = "gorilla"
)
chat$register_tool(tool(function() "2024-01-01", "Gets the current date"))
chat$chat("What's the current date in YMD format?")
#> Error in `req_perform_connection()`:
#> ! HTTP 400 Bad Request.
#> • litellm.BadRequestError: OpenAIException - Error code: 400 - {'object': 'error', 'message': "[{'type': 'extra_forbidden', 'loc': ('body', 'tools', 0, 'function', 'strict'), 'msg': 'Extra inputs are not permitted', 'input': True}]", 'type': 'BadRequestError', 'param': None, 'code': 400}
#> Received Model Group=gorilla
#> Available Model Group Fallbacks=None

elmer:::last_request_json()
#> {
#>   "messages": [
#>     {
#>       "role": "user",
#>       "content": [
#>         {
#>           "type": "text",
#>           "text": "What's the current date in YMD format?"
#>         }
#>       ]
#>     }
#>   ],
#>   "model": "gorilla",
#>   "stream": true,
#>   "stream_options": {
#>     "include_usage": true
#>   },
#>   "tools": [
#>     {
#>       "type": "function",
#>       "function": {
#>         "name": "tool_001",
#>         "description": "Gets the current date",
#>         "strict": true,
#>         "parameters": {
#>           "type": "object",
#>           "description": "",
#>           "properties": {},
#>           "required": [],
#>           "additionalProperties": false
#>         }
#>       }
#>     }
#>   ]
#> }

Created on 2024-10-31 with reprex v2.1.0

@hadley (Member) commented Oct 31, 2024

Ooooh, I get it; I'm sending strict because I think that's generally what you want for OpenAI, but it's clearly not supported by vllm. I think the best approach is probably to make a specific vllm backend. (Especially since the default model will be different.)
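
For reference, here's a minimal sketch using httr2 directly (not elmer's internals) of the request shape vllm appears to accept: the same OpenAI-style tool definition, just with strict left out. The endpoint, model, and env var are taken from the reprex above; the tool name and the exact path (with or without /v1) are assumptions.

library(httr2)
req <- request("https://llm.nrp-nautilus.io/v1/chat/completions") |>
  req_auth_bearer_token(Sys.getenv("VLLM_KEY")) |>
  req_body_json(list(
    model = "gorilla",
    messages = list(
      list(role = "user", content = "What's the current date in YMD format?")
    ),
    tools = list(
      list(
        type = "function",
        `function` = list(
          name = "get_current_date",  # illustrative tool name
          description = "Gets the current date",
          parameters = list(
            type = "object",
            properties = structure(list(), names = character()),  # serialises to {}
            required = list()
          )
          # note: no `strict` field here
        )
      )
    )
  ))
# resp <- req_perform(req)
# resp_body_json(resp)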

@cboettig (Author)

@hadley nice! Yeah, it looks like langchain is just sending the JSON below, with no strict, like you said.

I'm a total novice at this, but here are some notes from what I've gleaned while setting things up with VLLM for tool parsing. Clearly the 'standards' are still shaking out here; it looks like OpenAI introduced strict in August.

Because VLLM supports a wide range of models that do function calling differently, it has configuration options for the --tool-call-parser. This gorilla model uses the llama3_json parser.

Anyway, I think a vllm-specific backend would be awesome, and a prototype based on the openai one but without strict would get us off the ground here! But down the road it might need pluggable options to support the different tool-call parsers of different vllm models?

{
    "tools": [
        {
            "type": "function",
            "function": {
                "name": "multiply",
                "description": "Multiply 'x' times 'y'.",
                "parameters": {
                    "type": "object",
                    "properties": {
                        "x": {"type": "number"},
                        "y": {"type": "number"}
                    },
                    "required": ["x", "y"]
                }
            }
        },
        {
            "type": "function",
            "function": {
                "name": "exponentiate",
                "description": "Raise 'x' to the 'y'.",
                "parameters": {
                    "type": "object",
                    "properties": {
                        "x": {"type": "number"},
                        "y": {"type": "number"}
                    },
                    "required": ["x", "y"]
                }
            }
        },
        {
            "type": "function",
            "function": {
                "name": "add",
                "description": "Add 'x' and 'y'.",
                "parameters": {
                    "type": "object",
                    "properties": {
                        "x": {"type": "number"},
                        "y": {"type": "number"}
                    },
                    "required": ["x", "y"]
                }
            }
        }
    ]
}

@hadley (Member) commented Oct 31, 2024

Yeah, I think once there's a vllm provider, it's easy enough to customise. I'll probably take a stab at it in the next week or two, unless you get to it first 😄

@hadley (Member) commented Oct 31, 2024

We can use https://platform.openai.com/docs/api-reference/models/list to list all available models (to provide an interface like ollama's).
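
For reference, a rough sketch of what that could look like with httr2 (the function name is illustrative and this isn't elmer's API; the base URL and env var are the ones used above):

library(httr2)
vllm_models <- function(base_url = "https://llm.nrp-nautilus.io",
                        api_key = Sys.getenv("VLLM_KEY")) {
  resp <- request(base_url) |>
    req_url_path_append("v1", "models") |>
    req_auth_bearer_token(api_key) |>
    req_perform() |>
    resp_body_json()
  # return just the model ids, like ollama's model listing
  vapply(resp$data, function(m) m$id, character(1))
}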

@cboettig (Author)

Yup!

Just for clarity: the web API here is actually LiteLLM (see the endpoints documented at https://litellm-api.up.railway.app/), which includes the models list.

curl https://llm.nrp-nautilus.io/v1/models -H "Authorization: Bearer $OPENAI_API_KEY"

{"data":[{"id":"phi3","object":"model","created":1677610602,"owned_by":"openai"},
{"id":"mixtral","object":"model","created":1677610602,"owned_by":"openai"},
{"id":"llava","object":"model","created":1677610602,"owned_by":"openai"},
{"id":"embed-mistral","object":"model","created":1677610602,"owned_by":"openai"},
{"id":"gemma2","object":"model","created":1677610602,"owned_by":"openai"},
{"id":"gorilla","object":"model","created":1677610602,"owned_by":"openai"},
{"id":"llama3","object":"model","created":1677610602,"owned_by":"openai"}],"object":"list"}

(Obviously the metadata here isn't very meaningful, and these models aren't actually owned by 'openai'.)

On the NRP platform, the LiteLLM API sits in front of the VLLMs: each model is hosted as its own separate vllm app, which the user can then select via the litellm API. All the endpoints in the LiteLLM docs should work given a token. (Sorry, I know this complicates whether we think of this as the 'vllm' backend or the 'litellm' backend.)

It's possible to use LiteLLM as an interface to non-VLLM-backed models, including commercial ones. VLLM is nice both because it supports many concurrent requests and because it deploys on a wide variety of GPU hardware; it can directly serve most models on huggingface (and elsewhere), instead of hoping someone packages them for Ollama. I believe this streamlines the process of adding new models and encourages development and competition among these models.

@cboettig (Author) commented Nov 6, 2024

Also another small clarification: while we have this hosted version set up to serve a large classroom of students without local installations, if you have a local (or remote) GPU you can pip install vllm and serve models just as you do with ollama. (It works nicely with nvidia and most other GPUs used by cloud providers, though at this time I don't think it supports apple's chips.)

e.g. for function calling this would just be:

vllm serve gorilla-llm/gorilla-openfunctions-v2 --dtype=half --trust-remote-code --enable-auto-tool-choice  --tool-call-parser llama3_json

(add --host and --port if necessary)

By default the model name is the huggingface repo id (e.g. https://huggingface.co/gorilla-llm/gorilla-openfunctions-v2). (Apologies if this was obvious or not relevant; I'm just learning as I go here.)
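
If it helps, a minimal sketch of pointing elmer at such a local server (the port and /v1 path are vllm's defaults as I understand them, and local servers typically accept a placeholder API key):

library(elmer)
chat <- chat_openai(
  base_url = "http://localhost:8000/v1",
  api_key = "EMPTY",  # placeholder; local vllm doesn't require a real key by default
  model = "gorilla-llm/gorilla-openfunctions-v2"
)
chat$chat("Hello!")
# tool registration would still hit the `strict` issue above until a
# vllm-specific provider drops that field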

hadley added a commit that referenced this issue on Nov 7, 2024
hadley linked a pull request on Nov 7, 2024 that will close this issue