📝 Added docs for instruction to setup chatqna with vllm
krish918 committed Sep 6, 2024
1 parent cfd902f commit c06f613
Showing 3 changed files with 185 additions and 9 deletions.
39 changes: 30 additions & 9 deletions helm-charts/chatqna/README.md
Helm chart for deploying ChatQnA service. ChatQnA depends on the following services:
- [redis-vector-db](../common/redis-vector-db)
- [reranking-usvc](../common/reranking-usvc)
- [teirerank](../common/teirerank)

Apart from the above services, ChatQnA has the following conditional dependencies, exactly one of which is required:

1. If you want to use TGI as the inference service, the following two services are required:

- [llm-uservice](../common/llm-uservice)
- [tgi](../common/tgi)

2. If you want to use the OpenVINO vLLM inference service, the following two services are required (see the values-file sketch after this list):
- [llm-vllm-uservice](../common/llm-vllm-uservice)
- [vllm-openvino](../common/vllm-openvino)


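For illustration, the switch between the two inference backends can also be expressed as a values override file instead of `--set` flags. This is only a sketch: the keys `tags.tgi` and `vllm-openvino.enabled` mirror the flags used by the install command in the next section, and the exported variables are the ones defined there — verify both against your chart version.

```bash
# Sketch: enable the OpenVINO vLLM backend and disable TGI via a values file.
# The keys mirror the --set flags used in the install command below; the
# HFTOKEN/MODELDIR/MODELNAME variables are exported in the next section.
cat > vllm-override.yaml <<'EOF'
tags:
  tgi: false
vllm-openvino:
  enabled: true
EOF

helm install chatqna chatqna -f vllm-override.yaml \
  --set global.HUGGINGFACEHUB_API_TOKEN=${HFTOKEN} \
  --set global.modelUseHostPath=${MODELDIR} \
  --set global.LLM_MODEL_ID=${MODELNAME}
```
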
## Installing the Chart

To install the chart, run the following:

```bash
cd GenAIInfra/helm-charts/
./update_dependency.sh
helm dependency update chatqna
export HFTOKEN="insert-your-huggingface-token-here"
export MODELDIR="/mnt/opea-models"
export MODELNAME="Intel/neural-chat-7b-v3-3"
helm install chatqna chatqna --set global.HUGGINGFACEHUB_API_TOKEN=${HFTOKEN} --set global.modelUseHostPath=${MODELDIR} --set tgi.LLM_MODEL_ID=${MODELNAME}

# To use Gaudi device
helm install chatqna chatqna --set global.HUGGINGFACEHUB_API_TOKEN=${HFTOKEN} --set global.modelUseHostPath=${MODELDIR} --set tgi.LLM_MODEL_ID=${MODELNAME} -f chatqna/gaudi-values.yaml

# To use Nvidia GPU
helm install chatqna chatqna --set global.HUGGINGFACEHUB_API_TOKEN=${HFTOKEN} --set global.modelUseHostPath=${MODELDIR} --set tgi.LLM_MODEL_ID=${MODELNAME} -f chatqna/nv-values.yaml


# To use OpenVINO vLLM inference engine on Xeon device

helm install chatqna chatqna --set global.HUGGINGFACEHUB_API_TOKEN=${HFTOKEN} --set global.modelUseHostPath=${MODELDIR} --set global.LLM_MODEL_ID=${MODELNAME} --set tags.tgi=false --set vllm-openvino.enabled=true
```


### IMPORTANT NOTE

1. Make sure your `MODELDIR` exists on the node where your workload is scheduled so that the downloaded model can be cached for reuse. Otherwise, set `global.modelUseHostPath` to `null` if you don't want to cache the model.

2. If you are behind a proxy, please set the `http_proxy`, `https_proxy`, and `no_proxy` values while installing the chart (see the sketch below).

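As a sketch of the second point, the proxy settings can be passed through the chart's `global` values at install time. The `global.http_proxy`, `global.https_proxy`, and `global.no_proxy` keys follow the pattern used by the other charts in this repository; confirm they exist in your chart version.

```bash
# Sketch: passing proxy settings to the chart when installing behind a proxy.
# global.no_proxy is an assumption; check the chart's values.yaml for the exact keys.
export http_proxy=<your_http_proxy>
export https_proxy=<your_https_proxy>
export no_proxy=localhost,127.0.0.1

helm install chatqna chatqna \
  --set global.HUGGINGFACEHUB_API_TOKEN=${HFTOKEN} \
  --set global.http_proxy=${http_proxy} \
  --set global.https_proxy=${https_proxy} \
  --set global.no_proxy=${no_proxy}
```
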
## Verify

Run the command `kubectl port-forward svc/chatqna 8888:8888` to expose the service for access.

Open another terminal and run the following command to verify that the service is working:

```bash
curl http://localhost:8888/v1/chatqna \
-X POST \
-H "Content-Type: application/json" \
-d '{"messages": "What is the revenue of Nike in 2023?"}'
```
docker save -o ui.tar opea/chatqna-conversation-ui:latest
sudo ctr -n k8s.io image import ui.tar

# install UI using helm chart. Replace image tag if required
cd GenAIInfra/helm-charts/
helm install ui common/chatqna-ui --set BACKEND_SERVICE_ENDPOINT="http://${host_ip}:8888/v1/chatqna",DATAPREP_SERVICE_ENDPOINT="http://${host_ip}:6007/v1/dataprep",image.tag="latest"

Access `http://localhost:5174` to play with the ChatQnA workload through the UI.
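One way to reach the UI from outside the cluster is another port-forward. This is only a sketch: the service name below is an assumption about what the `ui` release generates, so check the actual name with `kubectl get svc` first.

```bash
# Sketch: forward the UI service port to the host so http://localhost:5174 works.
# "ui-chatqna-ui" is a guess at the generated service name for the "ui" release;
# use whatever name `kubectl get svc` actually shows.
kubectl port-forward svc/ui-chatqna-ui 5174:5174
```
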
| image.repository | string | `"opea/chatqna"` | |
| service.port | string | `"8888"` | |
| tgi.LLM_MODEL_ID | string | `"Intel/neural-chat-7b-v3-3"` | Models id from https://huggingface.co/, or predownloaded model directory |
| vllm-openvino.LLM_MODEL_ID | string | `"Intel/neural-chat-7b-v3-3"` | Models id from https://huggingface.co/, or predownloaded model directory |
| global.horizontalPodAutoscaler.enabled | bool | false | HPA autoscaling for the TGI and TEI service deployments based on metrics they provide. See HPA section in ../README.md before enabling! |
87 changes: 87 additions & 0 deletions helm-charts/common/llm-vllm-uservice/README.md
# llm-vllm Microservice

Helm chart for deploying the llm-vllm microservice, which connects to the OpenVINO vLLM inference service and handles its responses.

`llm-vllm-uservice` depends on OpenVINO vLLM. You should set `vLLM_ENDPOINT` to the host URI of the vLLM microservice. If it is not set, the default value `http://<helm-release-name>-vllm-openvino:80` is used.

As this service depends on the vLLM microservice, you can proceed in either of two ways:

- Install both microservices separately, one after another.
- Install the vLLM microservice as a dependency of the main `llm-vllm-uservice` chart.

## Option 1: Installing the charts separately

First, you need to install the `vllm-openvino` chart; please refer to the [vllm-openvino](../vllm-openvino) chart for more information.

After you've deployed the `vllm-openvino` chart successfully, run `kubectl get svc` to get the vLLM service name and port. Provide this to `llm-vllm-uservice` as the value of `vLLM_ENDPOINT` so that it can discover and connect to the vLLM microservice.
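For example, a minimal sketch of constructing the endpoint from the deployed service; the service name `vllm-openvino` is an assumption based on the default naming, so use whatever `kubectl get svc` reports.

```bash
# Sketch: build vLLM_ENDPOINT from the service discovered via kubectl.
# Replace "vllm-openvino" with the service name shown by `kubectl get svc`.
VLLM_SVC=vllm-openvino
VLLM_PORT=$(kubectl get svc ${VLLM_SVC} -o jsonpath='{.spec.ports[0].port}')
export vLLM_ENDPOINT="http://${VLLM_SVC}:${VLLM_PORT}"
echo ${vLLM_ENDPOINT}
```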


> **_NOTE:_** While installing the charts separately, if you don't provide a vLLM endpoint explicitly, the default endpoint `http://<helm-release-name>-vllm-openvino:80` is used. In that case, make sure to use the same Helm release name for both charts when installing.

To install the chart, run the following:

```bash
cd GenAIInfra/helm-charts/common/llm-vllm-uservice
export HFTOKEN="insert-your-huggingface-token-here"
export vLLM_ENDPOINT="http://vllm-openvino"
export MODELNAME="bigscience/bloom-560m"

# If proxy is required, please export the appropriate proxy values.
export http_proxy=<your_http_proxy>
export https_proxy=<your_https_proxy>

helm dependency update
helm install llm-vllm-uservice . --set global.HUGGINGFACEHUB_API_TOKEN=${HFTOKEN} --set vLLM_ENDPOINT=${vLLM_ENDPOINT} --set global.LLM_MODEL_ID=${MODELNAME} --set global.http_proxy=${http_proxy} --set global.https_proxy=${https_proxy} --wait
```

## Option 2: Installing the chart with automatic installation of the dependency

```bash
cd GenAIInfra/helm-charts/common/llm-vllm-uservice
export HFTOKEN="insert-your-huggingface-token-here"
export MODELNAME="bigscience/bloom-560m"

# If proxy is required, please export the appropriate proxy values.
export http_proxy=<your_http_proxy>
export https_proxy=<your_https_proxy>

helm dependency update
helm install llm-vllm-uservice . --set global.HUGGINGFACEHUB_API_TOKEN=${HFTOKEN} --set global.LLM_MODEL_ID=${MODELNAME} --set autodependency.enabled=true --set global.http_proxy=${http_proxy} --set global.https_proxy=${https_proxy} --wait
```

The `--wait` flag in the above installation command ensures that all the dependencies are resolved and all services are deployed before the command returns.
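If you want an explicit confirmation after the command returns, a quick check might look like the sketch below; the label selector assumes the standard `app.kubernetes.io/instance` label that Helm-generated charts usually set, so adjust it if your chart labels differ.

```bash
# Optional sanity check after `helm install ... --wait` returns.
helm status llm-vllm-uservice
# List the pods belonging to this release (label selector is an assumption).
kubectl get pod -l app.kubernetes.io/instance=llm-vllm-uservice
```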

## Verify

To verify the installation, run the following command to make sure all pods are running.

```bash
kubectl get pod
```

Once you see the `llm-vllm-uservice` and `llm-vllm-uservice-vllm-openvino` pods in the ready and running state, run the following command:

```bash
kubectl port-forward svc/llm-vllm-uservice 9000:9000
```

This exposes port 9000, on which `llm-vllm-uservice` is listening inside the cluster, as port 9000 on the host.

Now, we can access the service from the host machine. Open another terminal and run the following command to verify whether `llm-vllm-uservice` is working:

```bash
curl http://localhost:9000/v1/chat/completions \
-X POST \
-d '{"query":"What is Deep Learning?","max_new_tokens":17,"top_k":10,"top_p":0.95,"typical_p":0.95,"temperature":0.01,"repetition_penalty":1.03,"streaming":true}' \
-H 'Content-Type: application/json'
```

## Values

| Key | Type | Default | Description |
| ------------------------------- | ------ | -------------------- | ------------------------------------------------------------------------------------------------------------------------------------------------------------ |
| global.HUGGINGFACEHUB_API_TOKEN | string | `""` | Your own Hugging Face API token |
| global.modelUseHostPath | string | `"/mnt/opea-models"` | Cached models directory, vLLM will not download if the model is cached here. The host path "modelUseHostPath" will be mounted to container as /data directory |
| image.repository | string | `"opea/llm-vllm"` | |
| service.port | string | `"9000"` | |
| vLLM_ENDPOINT | string | `""` | OpenVINO vLLM service endpoint |
68 changes: 68 additions & 0 deletions helm-charts/common/vllm-openvino/README.md
# OpenVINO vLLM

Helm chart for deploying the OpenVINO-optimized vLLM inference service.

## Installing the Chart

To install the chart, run the following:

```bash
cd GenAIInfra/helm-charts/common
export MODELDIR=/mnt/opea-models
export MODELNAME="bigscience/bloom-560m"
export HFTOKEN="insert-your-huggingface-token-here"

# If proxy is required, please export the appropriate proxy values.
export http_proxy=<your_http_proxy>
export https_proxy=<your_https_proxy>

helm install vllm-openvino vllm-openvino --set global.modelUseHostPath=${MODELDIR} --set global.LLM_MODEL_ID=${MODELNAME} --set global.HUGGINGFACEHUB_API_TOKEN=${HFTOKEN} --set global.http_proxy=${http_proxy} --set global.https_proxy=${https_proxy} --wait
```

The `--wait` flag in the above Helm installation command makes the command wait until `vllm-openvino` is completely up and ready.

> **_NOTE:_** Make sure your `MODELDIR` exists on the node where your workload is scheduled so that the downloaded model can be cached for reuse. Otherwise, set `global.modelUseHostPath` to `null` if you don't want to cache the model.

If you have already cached the model locally, you can pass it to the container, for example with `MODELDIR=/mnt/opea-models` and `MODELNAME="/data/models--bigscience--bloom-560m"`.
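For illustration, a sketch of installing against such a locally cached snapshot, assuming the snapshot directory exists under `MODELDIR` on the node and that the host path is mounted into the container at `/data`, as described in the Values table below.

```bash
# Sketch: point the chart at a model that is already cached under ${MODELDIR}.
# ${MODELDIR} on the host is mounted into the container at /data.
export MODELDIR=/mnt/opea-models
export MODELNAME="/data/models--bigscience--bloom-560m"

helm install vllm-openvino vllm-openvino \
  --set global.modelUseHostPath=${MODELDIR} \
  --set global.LLM_MODEL_ID=${MODELNAME} \
  --set global.HUGGINGFACEHUB_API_TOKEN=${HFTOKEN}
```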

> **_NOTE:_** By default, the vLLM service downloads the **Intel/neural-chat-7b-v3-3** model from Hugging Face, which is around 4 GB in size. To use a smaller model, set the `LLM_MODEL_ID` value to your desired model, as shown above, while installing the chart.

## Verify

To verify the installation, run the following command to make sure all pods are running. Please note that it may take a while for the vLLM pod to reach the ready state.

```bash
kubectl get pod
```

Once you see the `vllm-openvino` pod in the ready and running state, run the following command:

```bash
kubectl port-forward svc/vllm-openvino 2080:80
```

This exposes port 80, on which the `vllm-openvino` service is listening inside the cluster, as port 2080 on the host.

Now, we can access the service from the host machine. Open another terminal and run the following command to verify whether the `vllm-openvino` service is working:

```bash
curl http://localhost:2080/v1/completions -sS --fail-with-body \
-X POST \
-d '{"prompt":"What is Deep Learning?", "model": "bigscience/bloom-560m", "max_tokens":17, "temperature": 0.5}' \
-H 'Content-Type: application/json'
```

## Values

| Key | Type | Default | Description |
| ------------------------------- | ------ | ------------------------------------------------- | --------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------- |
| LLM_MODEL_ID | string | `"bigscience/bloom-560m"` | Models id from https://huggingface.co/, or predownloaded model directory |
| global.HUGGINGFACEHUB_API_TOKEN | string | `insert-your-huggingface-token-here` | Hugging Face API token |
| global.modelUseHostPath | string | `"/mnt/opea-models"` | Cached models directory, vLLM will not download if the model is cached here. The host path "modelUseHostPath" will be mounted to container as /data directory. Setting this to null/empty will force it to download model. |
| image.repository | string | `"vllm"` | |
| image.tag | string | `"openvino"` | |
| horizontalPodAutoscaler.enabled | bool | false | Enable HPA autoscaling for the service deployment based on metrics it provides. See HPA section in ../../README.md before enabling! |
