Consider support for vllm-hosted models? #140
@cboettig sure, happy to look into it. What's the easiest way for me to get access to a live vllm endpoint?
@hadley amazing! Can I just email you a token for this NAIRR test platform we're using? For comparison, here's a quick gist showing tool use via langchain on the same platform: https://gist.github.com/cboettig/6359fc23a637d328e5fa25a40499abb2
Is there an easy way for you to see what JSON langchain is sending to the server? That would make it easier for me to figure out what elmer is doing wrong.
library(elmer)
chat <- chat_openai(
base_url = "https://llm.nrp-nautilus.io/",
api_key = Sys.getenv("VLLM_KEY"),
model = "gorilla"
)
chat$register_tool(tool(function() "2024-01-01", "Gets the current date"))
chat$chat("What's the current date in YMD format?")
#> Error in `req_perform_connection()`:
#> ! HTTP 400 Bad Request.
#> • litellm.BadRequestError: OpenAIException - Error code: 400 - {'object': 'error', 'message': "[{'type': 'extra_forbidden', 'loc': ('body', 'tools', 0, 'function', 'strict'), 'msg': 'Extra inputs are not permitted', 'input': True}]", 'type': 'BadRequestError', 'param': None, 'code': 400}
#> Received Model Group=gorilla
#> Available Model Group Fallbacks=None
elmer:::last_request_json()
#> {
#> "messages": [
#> {
#> "role": "user",
#> "content": [
#> {
#> "type": "text",
#> "text": "What's the current date in YMD format?"
#> }
#> ]
#> }
#> ],
#> "model": "gorilla",
#> "stream": true,
#> "stream_options": {
#> "include_usage": true
#> },
#> "tools": [
#> {
#> "type": "function",
#> "function": {
#> "name": "tool_001",
#> "description": "Gets the current date",
#> "strict": true,
#> "parameters": {
#> "type": "object",
#> "description": "",
#> "properties": {},
#> "required": [],
#> "additionalProperties": false
#> }
#> }
#> }
#> ]
#> }
Created on 2024-10-31 with reprex v2.1.0
Ooooh I get it; I'm sending "strict": true in the tool definition, and this server rejects it as an extra input.
@hadley nice! Yeah, it looks like langchain is just sending the tool definition without the strict field.

I'm a total novice at this, but here are some notes from what I've gleaned while setting things up with VLLM for tool parsing: because VLLM supports a wide range of models that do function calling differently, it has some server-side configuration options for the tool-call parser.

Anyway, I think a vllm-specific backend would be awesome, and I think a prototype based on the openai one, just without the strict field, would get most of the way there.
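(Purely as a hypothetical sketch, not an existing elmer function: a vllm provider could start as a thin wrapper around chat_openai() pointed at the server's OpenAI-compatible endpoint, with the real change being the provider method that builds the tools payload and drops strict.)

# Hypothetical sketch only -- chat_vllm() does not exist in elmer here, and
# the names are assumptions. The substantive work would be a provider method
# that omits the "strict" field from the tool definitions elmer sends.
chat_vllm <- function(base_url, api_key, model, ...) {
  elmer::chat_openai(
    base_url = base_url,
    api_key = api_key,
    model = model,
    ...
  )
}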
Yeah, I think once you get a vllm provider, it's easy enough to customise. I'll probably take a stab at it in the next week or two, unless you get to it first 😄
You can use https://platform.openai.com/docs/api-reference/models/list to list all available models (to provide an interface like ollama's).
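For example, a minimal sketch of hitting that endpoint with httr2, assuming the server exposes the standard /v1/models route and takes the same bearer token as in the reprex above:

library(httr2)

# List the models the server advertises; the URL path and VLLM_KEY are
# assumptions based on the reprex above and the OpenAI API convention.
models <- request("https://llm.nrp-nautilus.io/v1/models") |>
  req_auth_bearer_token(Sys.getenv("VLLM_KEY")) |>
  req_perform() |>
  resp_body_json()

# Each element of models$data carries an id field naming a model
vapply(models$data, function(m) m$id, character(1))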
Yup! Just for clarity: the web API here is actually LiteLLM, see endpoints: https://litellm-api.up.railway.app/. This includes the models list.
(Obviously the metadata here isn't great, and these models aren't actually owned by 'openai'.) On the NRP platform the LiteLLM API sits in front of the VLLMs: each model is hosted as its own separate vllm app, which the user can then select via the litellm api. All the endpoints in the LiteLLM docs should work given a token. (Sorry, I know this complicates whether we think of this as the 'vllm' backend or the 'litellm' backend.)

It's possible to use LiteLLM as an interface to non-VLLM-backed models, including commercial ones. VLLM is nice both for its ability to support many concurrent requests and for its ability to deploy on a wide variety of GPU hardware, and it can directly serve most models that are on huggingface (and elsewhere), instead of hoping someone packages them for Ollama. I believe this streamlines the process of adding new models and facilitates development & competition in these models.
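(In practice that means the client-side code doesn't change between models; something like the sketch below, where the model id is illustrative and would come from the models listing above.)

# Same LiteLLM proxy, different vLLM-backed deployment; only the model field
# changes. "llama3" is just an illustrative id here.
chat <- elmer::chat_openai(
  base_url = "https://llm.nrp-nautilus.io/",
  api_key = Sys.getenv("VLLM_KEY"),
  model = "llama3"
)
chat$chat("Hello!")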
Also, another small clarification -- while we have this hosted version set up to serve a large classroom of students without local installations, just to note that if you have a local (or remote) GPU you can serve any of these models yourself with vllm; for function calling that would look roughly like the sketch below:
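(A rough sketch only; the vllm serve flags come from the vLLM tool-calling docs linked earlier, and the port, dummy api_key, and parser choice are assumptions on my part.)

# Server side (shell, run on the GPU machine) -- flags per the vLLM
# tool-calling docs; the right --tool-call-parser depends on the model:
#   vllm serve gorilla-llm/gorilla-openfunctions-v2 \
#     --enable-auto-tool-choice --tool-call-parser <parser>

# Client side: point elmer at the local OpenAI-compatible endpoint.
# Port 8000 is vLLM's default; the api_key is a dummy value.
library(elmer)
chat <- chat_openai(
  base_url = "http://localhost:8000/v1",
  api_key = "EMPTY",
  model = "gorilla-llm/gorilla-openfunctions-v2"
)
chat$chat("What's the current date in YMD format?")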
(And note that by default the model name is the huggingface repo, e.g. https://huggingface.co/gorilla-llm/gorilla-openfunctions-v2.) (And apologies if this was obvious or not relevant; I'm just learning as I go along here.)
Hi @hadley, thanks for sharing this, really exciting.
Very nice to see support for open models via ollama. I wonder if you would consider adding support for VLLM-hosted models as well, e.g. see https://docs.vllm.ai/en/latest/serving/openai_compatible_server.html#tool-calling-in-the-chat-completion-api
Overall these use the OpenAI API, and indeed we can already do basic chat with these models out of the box by just dropping a base_url into chat_openai(), but it looks like tool use may require a bit more tweaking? I'm not sure, as it works out of the box with the openai interfaces in langchain, but when I try my tool-use examples in elmer I get the error shown in the reprex above.

Context: ollama is great for individual use, but expecting every student to have access to a local GPU is often difficult in large classroom settings, and energy-inefficient as well. A natural solution we've been exploring this semester is to host the model on a single node using vllm, which is better at paging many requests than ollama, and to serve it to the students as an API, e.g. via LiteLLM. We're testing this out this semester with the fully open-source function-calling model Gorilla-LLM, developed here at Berkeley: https://gorilla.cs.berkeley.edu/. We'd be really keen to do this in R as well.