📝 Added docs with instructions to set up ChatQnA with vLLM
# llm-vllm Microservice

Helm chart for deploying the microservice that manages connections to, and handles responses from, the OpenVINO vLLM microservice.

`llm-vllm-uservice` depends on OpenVINO vLLM. Set `vLLM_ENDPOINT` to the host URI of the vLLM microservice. If it is not set, the default value `http://<helm-release-name>-vllm-openvino:80` is used.

Since this service depends on the vLLM microservice, you can proceed in either of two ways:

- Install both microservices separately, one after another.
- Install the vLLM microservice as a dependency of the main `llm-vllm-uservice` chart.

## Option 1: Installing the charts separately

First, install the `vllm-openvino` chart; refer to the [vllm-openvino](../vllm-openvino) chart for more information.

After the `vllm-openvino` chart has been deployed successfully, run `kubectl get svc` to get the vLLM service name and port. Pass this to `llm-vllm-uservice` as the value of `vLLM_ENDPOINT` so it can discover and connect to the vLLM microservice.
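
For example, assuming both charts run in the same namespace and the vLLM service shows up under the name used later in this document (`vllm-openvino`), the endpoint could be derived as in the sketch below; adjust the service name and port to whatever `kubectl get svc` actually reports:

```bash
# List services and note the name and port of the vLLM service
# (its name typically follows the pattern <helm-release-name>-vllm-openvino).
kubectl get svc

# Build the endpoint from the reported service name and port, e.g.:
export vLLM_ENDPOINT="http://vllm-openvino:80"
```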

> **_NOTE:_** When installing the charts separately, if you don't provide a vLLM endpoint explicitly, the default endpoint `http://<helm-release-name>-vllm-openvino:80` is used. In that case, make sure to use the same Helm release name for both charts.

To install the chart, run the following:

```bash
cd GenAIInfra/helm-charts/common/llm-vllm-uservice
export HFTOKEN="insert-your-huggingface-token-here"
export vLLM_ENDPOINT="http://vllm-openvino"
export MODELNAME="bigscience/bloom-560m"

# If a proxy is required, export the appropriate proxy values.
export http_proxy=<your_http_proxy>
export https_proxy=<your_https_proxy>

helm dependency update
helm install llm-vllm-uservice . --set global.HUGGINGFACEHUB_API_TOKEN=${HFTOKEN} --set vLLM_ENDPOINT=${vLLM_ENDPOINT} --set global.LLM_MODEL_ID=${MODELNAME} --set global.http_proxy=${http_proxy} --set global.https_proxy=${https_proxy} --wait
```

## Option 2: Installing the chart with automatic installation of the dependency

```bash
cd GenAIInfra/helm-charts/common/llm-vllm-uservice
export HFTOKEN="insert-your-huggingface-token-here"
export MODELNAME="bigscience/bloom-560m"

# If a proxy is required, export the appropriate proxy values.
export http_proxy=<your_http_proxy>
export https_proxy=<your_https_proxy>

helm dependency update
helm install llm-vllm-uservice . --set global.HUGGINGFACEHUB_API_TOKEN=${HFTOKEN} --set global.LLM_MODEL_ID=${MODELNAME} --set autodependency.enabled=true --set global.http_proxy=${http_proxy} --set global.https_proxy=${https_proxy} --wait
```

The `--wait` flag in the installation commands above makes Helm wait until all dependencies are resolved and all services are deployed.
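
If the install command returns before everything is ready, or you want to re-check the deployment later, the release and its pods can be inspected with standard Helm and kubectl commands; this is a general sketch rather than anything specific to this chart:

```bash
# Show the status of the Helm release (deployed, pending, failed, ...)
helm status llm-vllm-uservice

# Watch the pods created for the release come up
kubectl get pods -w
```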

## Verify

To verify the installation, run the following command and make sure all pods are running:

```bash
kubectl get pod
```

Once the `llm-vllm-uservice` pod and the `llm-vllm-uservice-vllm-openvino` pod are in the ready and running state, run the following command:

```bash
kubectl port-forward svc/llm-vllm-uservice 9000:9000
```

This exposes port 9000, on which `llm-vllm-uservice` listens inside the pod, as port 9000 on the host.

Now you can access the service from the host machine. Open another terminal and run the following command to verify that `llm-vllm-uservice` is working:

```bash
curl http://localhost:9000/v1/chat/completions \
  -X POST \
  -d '{"query":"What is Deep Learning?","max_new_tokens":17,"top_k":10,"top_p":0.95,"typical_p":0.95,"temperature":0.01,"repetition_penalty":1.03,"streaming":true}' \
  -H 'Content-Type: application/json'
```
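
The request above uses `"streaming":true`, so the reply arrives as a stream of chunks. Assuming the service accepts the same payload with streaming disabled (an assumption based on the parameter shown, not something this chart documents), a single JSON response can be requested like this:

```bash
# Same request with streaming disabled ("streaming": false is assumed to be honored)
curl http://localhost:9000/v1/chat/completions \
  -X POST \
  -d '{"query":"What is Deep Learning?","max_new_tokens":17,"streaming":false}' \
  -H 'Content-Type: application/json'
```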

## Values

| Key                             | Type   | Default              | Description                                                                                                                                                                       |
| ------------------------------- | ------ | -------------------- | --------------------------------------------------------------------------------------------------------------------------------------------------------------------------------- |
| global.HUGGINGFACEHUB_API_TOKEN | string | `""`                 | Your own Hugging Face API token                                                                                                                                                    |
| global.modelUseHostPath         | string | `"/mnt/opea-models"` | Cached models directory; vLLM will not download the model if it is already cached here. The host path `modelUseHostPath` is mounted into the container as the `/data` directory   |
| image.repository                | string | `"opea/llm-vllm"`    |                                                                                                                                                                                    |
| service.port                    | string | `"9000"`             |                                                                                                                                                                                    |
| vLLM_ENDPOINT                   | string | `""`                 | OpenVINO vLLM service endpoint                                                                                                                                                     |
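
If you prefer not to pass a long list of `--set` flags, the same overrides can be kept in a values file and passed with `-f`; the file name below is arbitrary and the keys are taken from the install commands and the table above:

```bash
# Write the overrides to a (hypothetical) my-values.yaml and install with -f
cat > my-values.yaml <<'EOF'
global:
  HUGGINGFACEHUB_API_TOKEN: "insert-your-huggingface-token-here"
  LLM_MODEL_ID: "bigscience/bloom-560m"
vLLM_ENDPOINT: "http://vllm-openvino"
EOF

helm dependency update
helm install llm-vllm-uservice . -f my-values.yaml --wait
```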

# OpenVINO vLLM

Helm chart for deploying the OpenVINO-optimized vLLM inference service.

## Installing the Chart

To install the chart, run the following:

```bash
cd GenAIInfra/helm-charts/common
export MODELDIR=/mnt/opea-models
export MODELNAME="bigscience/bloom-560m"
export HFTOKEN="insert-your-huggingface-token-here"

# If a proxy is required, export the appropriate proxy values.
export http_proxy=<your_http_proxy>
export https_proxy=<your_https_proxy>

helm install vllm-openvino vllm-openvino --set global.modelUseHostPath=${MODELDIR} --set global.LLM_MODEL_ID=${MODELNAME} --set global.HUGGINGFACEHUB_API_TOKEN=${HFTOKEN} --set global.http_proxy=${http_proxy} --set global.https_proxy=${https_proxy} --wait
```

The `--wait` flag in the Helm installation command above makes the shell wait until `vllm-openvino` is completely up and ready.
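
If you install without `--wait`, readiness can also be polled explicitly; the label selector below follows common Helm labeling conventions and is an assumption, so adjust it to the labels your release actually sets:

```bash
# Wait up to 10 minutes for the vLLM pod to become Ready
# (app.kubernetes.io/instance=<release-name> is an assumed, conventional label)
kubectl wait --for=condition=Ready pod \
  -l app.kubernetes.io/instance=vllm-openvino \
  --timeout=600s
```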

> **_NOTE:_** Make sure `MODELDIR` exists on the node where your workload is scheduled so the downloaded model can be cached there and reused next time. Otherwise, set `global.modelUseHostPath` to `null` if you don't want to cache the model.
> If you have already cached the model locally, you can pass it to the container as in this example:

`MODELDIR=/mnt/opea-models`

`MODELNAME="/data/models--bigscience--bloom-560m"`
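
Putting that together, an install that reuses a locally cached copy might look like the sketch below; the `/data/models--bigscience--bloom-560m` path is the cache layout from the example above and assumes the model was downloaded by a previous run:

```bash
# Reuse a model already cached under ${MODELDIR} on the node;
# /data is where the host path is mounted inside the container.
export MODELDIR=/mnt/opea-models
export MODELNAME="/data/models--bigscience--bloom-560m"

helm install vllm-openvino vllm-openvino --set global.modelUseHostPath=${MODELDIR} --set global.LLM_MODEL_ID=${MODELNAME} --wait
```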

> **_NOTE:_** By default, the vLLM service downloads the **Intel/neural-chat-7b-v3-3** model from Hugging Face, which is around 4 GB in size. To use a smaller model, set the `LLM_MODEL_ID` value to your desired model, as shown above, while installing the chart.

## Verify

To verify the installation, run the following command and make sure all pods are running. Note that it may take a while for the vLLM pod to reach the ready state.

```bash
kubectl get pod
```

Once the `vllm-openvino` pod is in the ready and running state, run the following command:

```bash
kubectl port-forward svc/vllm-openvino 2080:80
```

This exposes port 80, on which the `vllm-openvino` service is running inside the pod, as port 2080 on the host.

Now you can access the service from the host machine. Open another terminal and run the following command to verify that the `vllm-openvino` service is working:

```bash
curl http://localhost:2080/v1/completions -sS --fail-with-body \
  -X POST \
  -d '{"prompt":"What is Deep Learning?", "model": "bigscience/bloom-560m", "max_tokens":17, "temperature": 0.5}' \
  -H 'Content-Type: application/json'
```
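
Because the request above goes to the OpenAI-compatible `/v1/completions` route, the server is expected to expose `/v1/models` as well; treating that as an assumption, it makes a quick sanity check that the intended model is loaded:

```bash
# List the models served by the endpoint (assumes the OpenAI-compatible
# /v1/models route is exposed alongside /v1/completions)
curl -sS http://localhost:2080/v1/models
```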

## Values

| Key                             | Type   | Default                              | Description                                                                                                                                                                                                                                     |
| ------------------------------- | ------ | ------------------------------------ | ------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------ |
| LLM_MODEL_ID                    | string | `"bigscience/bloom-560m"`            | Model ID from https://huggingface.co/, or a pre-downloaded model directory                                                                                                                                                                        |
| global.HUGGINGFACEHUB_API_TOKEN | string | `insert-your-huggingface-token-here` | Hugging Face API token                                                                                                                                                                                                                            |
| global.modelUseHostPath         | string | `"/mnt/opea-models"`                 | Cached models directory; vLLM will not download the model if it is already cached here. The host path `modelUseHostPath` is mounted into the container as the `/data` directory. Setting this to null/empty forces the model to be downloaded.   |
| image.repository                | string | `"vllm"`                             |                                                                                                                                                                                                                                                   |
| image.tag                       | string | `"openvino"`                         |                                                                                                                                                                                                                                                   |
| horizontalPodAutoscaler.enabled | bool   | false                                | Enable HPA autoscaling for the service deployment based on the metrics it provides. See the HPA section in ../../README.md before enabling!                                                                                                       |
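
As an illustration of the last row, autoscaling could be switched on at install time with a single additional flag; this is only a sketch of how the value is set, and the HPA section referenced above should be read before enabling it:

```bash
# Enable the chart's HPA support at install time
# (read the HPA section in ../../README.md first, as the table advises)
helm install vllm-openvino vllm-openvino \
  --set horizontalPodAutoscaler.enabled=true \
  --set global.LLM_MODEL_ID=${MODELNAME} \
  --set global.HUGGINGFACEHUB_API_TOKEN=${HFTOKEN} \
  --wait
```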