📝 Added docs for instruction to setup chatqna with vllm
krish918 committed Sep 6, 2024
1 parent cfd902f commit c06f613
Showing 3 changed files with 185 additions and 9 deletions.
39 changes: 30 additions & 9 deletions helm-charts/chatqna/README.md
Helm chart for deploying ChatQnA service. ChatQnA depends on the following services:
- [redis-vector-db](../common/redis-vector-db)
- [reranking-usvc](../common/reranking-usvc)
- [teirerank](../common/teirerank)

Apart from the above services, ChatQnA has the following conditional dependencies, exactly one of which is required:

1. If you want to use TGI as the inference service, the following two services are required:

- [llm-uservice](../common/llm-uservice)
- [tgi](../common/tgi)

2. If you want to use the OpenVINO vLLM inference service, the following two services are required (see the values-file sketch after this list):
- [llm-vllm-uservice](../common/llm-vllm-uservice)
- [vllm-openvino](../common/vllm-openvino)


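For illustration, the switch between the two inference backends can also be expressed as a values override file instead of `--set` flags. This is only a sketch: the keys `tags.tgi` and `vllm-openvino.enabled` mirror the flags used by the install command in the next section, and the exported variables are the ones defined there — verify both against your chart version.

```bash
# Sketch: enable the OpenVINO vLLM backend and disable TGI via a values file.
# The keys mirror the --set flags used in the install command below; the
# HFTOKEN/MODELDIR/MODELNAME variables are exported in the next section.
cat > vllm-override.yaml <<'EOF'
tags:
  tgi: false
vllm-openvino:
  enabled: true
EOF

helm install chatqna chatqna -f vllm-override.yaml \
  --set global.HUGGINGFACEHUB_API_TOKEN=${HFTOKEN} \
  --set global.modelUseHostPath=${MODELDIR} \
  --set global.LLM_MODEL_ID=${MODELNAME}
```
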
## Installing the Chart

To install the chart, run the following:

```bash
cd GenAIInfra/helm-charts/
./update_dependency.sh
helm dependency update chatqna
export HFTOKEN="insert-your-huggingface-token-here"
export MODELDIR="/mnt/opea-models"
export MODELNAME="Intel/neural-chat-7b-v3-3"
helm install chatqna chatqna --set global.HUGGINGFACEHUB_API_TOKEN=${HFTOKEN} --set global.modelUseHostPath=${MODELDIR} --set tgi.LLM_MODEL_ID=${MODELNAME}

# To use Gaudi device
helm install chatqna chatqna --set global.HUGGINGFACEHUB_API_TOKEN=${HFTOKEN} --set global.modelUseHostPath=${MODELDIR} --set tgi.LLM_MODEL_ID=${MODELNAME} -f chatqna/gaudi-values.yaml

# To use Nvidia GPU
helm install chatqna chatqna --set global.HUGGINGFACEHUB_API_TOKEN=${HFTOKEN} --set global.modelUseHostPath=${MODELDIR} --set tgi.LLM_MODEL_ID=${MODELNAME} -f chatqna/nv-values.yaml


# To use OpenVINO vLLM inference engine on Xeon device

helm install chatqna chatqna --set global.HUGGINGFACEHUB_API_TOKEN=${HFTOKEN} --set global.modelUseHostPath=${MODELDIR} --set global.LLM_MODEL_ID=${MODELNAME} --set tags.tgi=false --set vllm-openvino.enabled=true
```


### IMPORTANT NOTE

1. Make sure your `MODELDIR` exists on the node where your workload is scheduled so that the downloaded model can be cached for reuse. Otherwise, set `global.modelUseHostPath` to `null` if you don't want to cache the model.

2. If you are behind a proxy, please set the `http_proxy`, `https_proxy`, and `no_proxy` values while installing the chart (see the sketch below).

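As a sketch of the second point, the proxy settings can be passed through the chart's `global` values at install time. The `global.http_proxy`, `global.https_proxy`, and `global.no_proxy` keys follow the pattern used by the other charts in this repository; confirm they exist in your chart version.

```bash
# Sketch: passing proxy settings to the chart when installing behind a proxy.
# global.no_proxy is an assumption; check the chart's values.yaml for the exact keys.
export http_proxy=<your_http_proxy>
export https_proxy=<your_https_proxy>
export no_proxy=localhost,127.0.0.1

helm install chatqna chatqna \
  --set global.HUGGINGFACEHUB_API_TOKEN=${HFTOKEN} \
  --set global.http_proxy=${http_proxy} \
  --set global.https_proxy=${https_proxy} \
  --set global.no_proxy=${no_proxy}
```
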
## Verify

Run the command `kubectl port-forward svc/chatqna 8888:8888` to expose the service for access.

Open another terminal and run the following command to verify that the service is working:

```bash
curl http://localhost:8888/v1/chatqna \
-X POST \
-H "Content-Type: application/json" \
-d '{"messages": "What is the revenue of Nike in 2023?"}'
```
docker save -o ui.tar opea/chatqna-conversation-ui:latest
sudo ctr -n k8s.io image import ui.tar

# install UI using helm chart. Replace image tag if required
cd GenAIInfra/helm-charts/
helm install ui common/chatqna-ui --set BACKEND_SERVICE_ENDPOINT="http://${host_ip}:8888/v1/chatqna",DATAPREP_SERVICE_ENDPOINT="http://${host_ip}:6007/v1/dataprep",image.tag="latest"

Access `http://localhost:5174` to play with the ChatQnA workload through the UI.
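One way to reach the UI from outside the cluster is another port-forward. This is only a sketch: the service name below is an assumption about what the `ui` release generates, so check the actual name with `kubectl get svc` first.

```bash
# Sketch: forward the UI service port to the host so http://localhost:5174 works.
# "ui-chatqna-ui" is a guess at the generated service name for the "ui" release;
# use whatever name `kubectl get svc` actually shows.
kubectl port-forward svc/ui-chatqna-ui 5174:5174
```
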
| image.repository | string | `"opea/chatqna"` | |
| service.port | string | `"8888"` | |
| tgi.LLM_MODEL_ID | string | `"Intel/neural-chat-7b-v3-3"` | Models id from https://huggingface.co/, or predownloaded model directory |
| vllm-openvino.LLM_MODEL_ID | string | `"Intel/neural-chat-7b-v3-3"` | Models id from https://huggingface.co/, or predownloaded model directory |
| global.horizontalPodAutoscaler.enabled | bool | false | HPA autoscaling for the TGI and TEI service deployments based on metrics they provide. See HPA section in ../README.md before enabling! |
87 changes: 87 additions & 0 deletions helm-charts/common/llm-vllm-uservice/README.md
# llm-vllm Microservice

Helm chart for deploying the llm-vllm microservice, which connects to the OpenVINO vLLM inference service and handles its responses.

`llm-vllm-uservice` depends on OpenVINO vLLM. You should set `vLLM_ENDPOINT` to the host URI of the vLLM microservice. If it is not set, the default value `http://<helm-release-name>-vllm-openvino:80` is used.

As this service depends on the vLLM microservice, you can proceed in either of two ways:

- Install both microservices separately, one after another.
- Install the vLLM microservice as a dependency of the main `llm-vllm-uservice` chart.

## Option 1: Installing the charts separately

First, you need to install the `vllm-openvino` chart; please refer to the [vllm-openvino](../vllm-openvino) chart for more information.

After you've deployed the `vllm-openvino` chart successfully, run `kubectl get svc` to get the vLLM service name and port. Provide this to `llm-vllm-uservice` as the value of `vLLM_ENDPOINT` so that it can discover and connect to the vLLM microservice.
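For example, a minimal sketch of constructing the endpoint from the deployed service; the service name `vllm-openvino` is an assumption based on the default naming, so use whatever `kubectl get svc` reports.

```bash
# Sketch: build vLLM_ENDPOINT from the service discovered via kubectl.
# Replace "vllm-openvino" with the service name shown by `kubectl get svc`.
VLLM_SVC=vllm-openvino
VLLM_PORT=$(kubectl get svc ${VLLM_SVC} -o jsonpath='{.spec.ports[0].port}')
export vLLM_ENDPOINT="http://${VLLM_SVC}:${VLLM_PORT}"
echo ${vLLM_ENDPOINT}
```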


> **_NOTE:_** While installing the charts separately, if you don't provide a vLLM endpoint explicitly, the default endpoint `http://<helm-release-name>-vllm-openvino:80` is used. In that case, make sure to use the same Helm release name for both charts when installing.

To install the chart, run the following:

```bash
cd GenAIInfra/helm-charts/common/llm-vllm-uservice
export HFTOKEN="insert-your-huggingface-token-here"
export vLLM_ENDPOINT="http://vllm-openvino"
export MODELNAME="bigscience/bloom-560m"

# If proxy is required, please export the appropriate proxy values.
export http_proxy=<your_http_proxy>
export https_proxy=<your_https_proxy>

helm dependency update
helm install llm-vllm-uservice . --set global.HUGGINGFACEHUB_API_TOKEN=${HFTOKEN} --set vLLM_ENDPOINT=${vLLM_ENDPOINT} --set global.LLM_MODEL_ID=${MODELNAME} --set global.http_proxy=${http_proxy} --set global.https_proxy=${https_proxy} --wait
```

## Option 2: Installing the chart with automatic installation of the dependency

```bash
cd GenAIInfra/helm-charts/common/llm-vllm-uservice
export HFTOKEN="insert-your-huggingface-token-here"
export MODELNAME="bigscience/bloom-560m"

# If proxy is required, please export the appropriate proxy values.
export http_proxy=<your_http_proxy>
export https_proxy=<your_https_proxy>

helm dependency update
helm install llm-vllm-uservice . --set global.HUGGINGFACEHUB_API_TOKEN=${HFTOKEN} --set global.LLM_MODEL_ID=${MODELNAME} --set autodependency.enabled=true --set global.http_proxy=${http_proxy} --set global.https_proxy=${https_proxy} --wait
```

The `--wait` flag in the above installation command ensures that all the dependencies are resolved and all services are deployed before the command returns.
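If you want an explicit confirmation after the command returns, a quick check might look like the sketch below; the label selector assumes the standard `app.kubernetes.io/instance` label that Helm-generated charts usually set, so adjust it if your chart labels differ.

```bash
# Optional sanity check after `helm install ... --wait` returns.
helm status llm-vllm-uservice
# List the pods belonging to this release (label selector is an assumption).
kubectl get pod -l app.kubernetes.io/instance=llm-vllm-uservice
```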

## Verify

To verify the installation, run the following command to make sure all pods are running.

```bash
kubectl get pod
```

Once you see the `llm-vllm-uservice` and `llm-vllm-uservice-vllm-openvino` pods in the ready and running state, run the following command:

```bash
kubectl port-forward svc/llm-vllm-uservice 9000:9000
```

This exposes port 9000, on which `llm-vllm-uservice` is listening inside the cluster, as port 9000 on the host.

Now, we can access the service from the host machine. Open another terminal and run the following command to verify whether `llm-vllm-uservice` is working:

```bash
curl http://localhost:9000/v1/chat/completions \
-X POST \
-d '{"query":"What is Deep Learning?","max_new_tokens":17,"top_k":10,"top_p":0.95,"typical_p":0.95,"temperature":0.01,"repetition_penalty":1.03,"streaming":true}' \
-H 'Content-Type: application/json'
```

## Values

| Key | Type | Default | Description |
| ------------------------------- | ------ | -------------------- | ------------------------------------------------------------------------------------------------------------------------------------------------------------ |
| global.HUGGINGFACEHUB_API_TOKEN | string | `""` | Your own Hugging Face API token |
| global.modelUseHostPath | string | `"/mnt/opea-models"` | Cached models directory, vLLM will not download if the model is cached here. The host path "modelUseHostPath" will be mounted to container as /data directory |
| image.repository | string | `"opea/llm-vllm"` | |
| service.port | string | `"9000"` | |
| vLLM_ENDPOINT | string | `""` | OpenVINO vLLM service endpoint |
68 changes: 68 additions & 0 deletions helm-charts/common/vllm-openvino/README.md
# OpenVINO vLLM

Helm chart for deploying the OpenVINO-optimized vLLM inference service.

## Installing the Chart

To install the chart, run the following:

```bash
cd GenAIInfra/helm-charts/common
export MODELDIR=/mnt/opea-models
export MODELNAME="bigscience/bloom-560m"
export HFTOKEN="insert-your-huggingface-token-here"

# If proxy is required, please export the appropriate proxy values.
export http_proxy=<your_http_proxy>
export https_proxy=<your_https_proxy>

helm install vllm-openvino vllm-openvino --set global.modelUseHostPath=${MODELDIR} --set global.LLM_MODEL_ID=${MODELNAME} --set global.HUGGINGFACEHUB_API_TOKEN=${HFTOKEN} --set global.http_proxy=${http_proxy} --set global.https_proxy=${https_proxy} --wait
```

The `--wait` flag in the above Helm installation command makes the command wait until `vllm-openvino` is completely up and ready.

> **_NOTE:_** Make sure your `MODELDIR` exists on the node where your workload is scheduled so that the downloaded model can be cached for reuse. Otherwise, set `global.modelUseHostPath` to `null` if you don't want to cache the model.

If you have already cached the model locally, you can pass it to the container, for example with `MODELDIR=/mnt/opea-models` and `MODELNAME="/data/models--bigscience--bloom-560m"`.
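For illustration, a sketch of installing against such a locally cached snapshot, assuming the snapshot directory exists under `MODELDIR` on the node and that the host path is mounted into the container at `/data`, as described in the Values table below.

```bash
# Sketch: point the chart at a model that is already cached under ${MODELDIR}.
# ${MODELDIR} on the host is mounted into the container at /data.
export MODELDIR=/mnt/opea-models
export MODELNAME="/data/models--bigscience--bloom-560m"

helm install vllm-openvino vllm-openvino \
  --set global.modelUseHostPath=${MODELDIR} \
  --set global.LLM_MODEL_ID=${MODELNAME} \
  --set global.HUGGINGFACEHUB_API_TOKEN=${HFTOKEN}
```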

> **_NOTE:_** By default, the vLLM service downloads the **Intel/neural-chat-7b-v3-3** model from Hugging Face, which is around 4 GB in size. To use a smaller model, set the `LLM_MODEL_ID` value to your desired model, as shown above, while installing the chart.

## Verify

To verify the installation, run the following command to make sure all pods are running. Please note that it may take a while for the vLLM pod to reach the ready state.

```bash
kubectl get pod
```

Once you see the `vllm-openvino` pod in the ready and running state, run the following command:

```bash
kubectl port-forward svc/vllm-openvino 2080:80
```

This exposes port 80, on which the `vllm-openvino` service is listening inside the cluster, as port 2080 on the host.

Now, we can access the service from the host machine. Open another terminal and run the following command to verify whether the `vllm-openvino` service is working:

```bash
curl http://localhost:2080/v1/completions -sS --fail-with-body \
-X POST \
-d '{"prompt":"What is Deep Learning?", "model": "bigscience/bloom-560m", "max_tokens":17, "temperature": 0.5}' \
-H 'Content-Type: application/json'
```

## Values

| Key | Type | Default | Description |
| ------------------------------- | ------ | ------------------------------------------------- | --------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------- |
| LLM_MODEL_ID | string | `"bigscience/bloom-560m"` | Models id from https://huggingface.co/, or predownloaded model directory |
| global.HUGGINGFACEHUB_API_TOKEN | string | `insert-your-huggingface-token-here` | Hugging Face API token |
| global.modelUseHostPath | string | `"/mnt/opea-models"` | Cached models directory, vLLM will not download if the model is cached here. The host path "modelUseHostPath" will be mounted to container as /data directory. Setting this to null/empty will force it to download model. |
| image.repository | string | `"vllm"` | |
| image.tag | string | `"openvino"` | |
| horizontalPodAutoscaler.enabled | bool | false | Enable HPA autoscaling for the service deployment based on metrics it provides. See HPA section in ../../README.md before enabling! |
