Thanks for the awesome implementation. However, I am running into several problems, and am not able to run the model successfully. Here is my full reproduction procedure and the corresponding issues I ran into.
Reproduction procedure
I started by cloning the repo. The first issue I ran into was with the docker command. The original command was:
docker run -it --gpus all --shm-size 1g -p 8000:8000 -e HF_HOME=~/data -v $cache_dir:~/data anyscale/ray-llm:latest bash
Here is the error I got:
docker: Error response from daemon: invalid volume specification: '/home/paperspace/.cache:~/data': invalid mount config for type "bind": invalid mount path: '~/data' mount path must be absolute.
See 'docker run --help'.
Here is the quick fix I applied:
docker run -it --gpus all --shm-size 1g -p 8000:8000 -e HF_HOME=~/data -v $cache_dir/data anyscale/ray-llm:latest bash
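For reference, the underlying constraint is that Docker requires the container-side mount target to be an absolute path and does not expand ~ inside the -v value. A variant that would likely satisfy both the mount and the cache setup, assuming the container user's home is /home/ray (as the tracebacks below suggest; I have not verified this path):

# Mount the host cache dir at an absolute path inside the container and
# point HF_HOME at that same path. /home/ray/data is an assumed location;
# any absolute, writable container path should work.
docker run -it --gpus all --shm-size 1g -p 8000:8000 \
  -e HF_HOME=/home/ray/data \
  -v "$cache_dir":/home/ray/data \
  anyscale/ray-llm:latest bash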
My quick-fix command ran, and I got a bash shell inside the container. However, I could not find the server_configs folder; these were the files present:
So, I cloned ray-llm inside the container, and then I tried to run the server with this command:
serve run serve_configs/amazon--LightGPT.yaml
This produced the following error:
(ServeController pid=589) INFO 2024-01-06 09:35:39,656 controller 589 deployment_state.py:1679 - Adding 2 replicas to deployment Router in application 'ray-llm'.
(autoscaler +12s) Tip: use `ray status` to view detailed cluster status. To disable these messages, set RAY_SCHEDULER_EVENTS=0.
(autoscaler +12s) Error: No available node types can fulfill resource request defaultdict(<class 'float'>, {'CPU': 9.0, 'accelerator_type_a10': 0.02, 'GPU': 1.0}). Add suitable node types to this cluster to resolve this issue.
(ServeController pid=589) WARNING 2024-01-06 09:36:09,717 controller 589 deployment_state.py:1987 - Deployment 'VLLMDeployment:amazon--LightGPT' in application 'ray-llm' has 1 replicas that have taken more than 30s to be scheduled. This may be due to waiting for the cluster to auto-scale or for a runtime environment to be installed. Resources required for each replica: [{"CPU": 1.0, "accelerator_type_a10": 0.01}, {"GPU": 1.0, "CPU": 8.0, "accelerator_type_a10": 0.01}], total resources available: {}. Use `ray status` for more details.
(ServeReplica:ray-llm:Router pid=708) There was a problem when trying to write in your cache folder (/home/paperspace/data/hub). You should set the environment variable TRANSFORMERS_CACHE to a writable directory. [repeated 2x across cluster] (Ray deduplicates logs by default. Set RAY_DEDUP_LOGS=0 to disable log deduplication, or see https://docs.ray.io/en/master/ray-observability/ray-logging.html#log-deduplication for more options.)
(ServeReplica:ray-llm:Router pid=708) [WARNING 2024-01-06 09:35:42,961] api.py: 382 DeprecationWarning: `route_prefix` in `@serve.deployment` has been deprecated. To specify a route prefix for an application, pass it into `serve.run` instead. [repeated 2x across cluster]
(autoscaler +47s) Error: No available node types can fulfill resource request defaultdict(<class 'float'>, {'CPU': 9.0, 'accelerator_type_a10': 0.02, 'GPU': 1.0}). Add suitable node types to this cluster to resolve this issue.
(ServeController pid=589) WARNING 2024-01-06 09:36:39,753 controller 589 deployment_state.py:1987 - Deployment 'VLLMDeployment:amazon--LightGPT' in application 'ray-llm' has 1 replicas that have taken more than 30s to be scheduled. This may be due to waiting for the cluster to auto-scale or for a runtime environment to be installed. Resources required for each replica: [{"CPU": 1.0, "accelerator_type_a10": 0.01}, {"GPU": 1.0, "CPU": 8.0, "accelerator_type_a10": 0.01}], total resources available: {}. Use `ray status` for more details.
A quick Google search led me to issue #101, where I found a quick fix. I edited parts of it based on my device config and then ran this command:
ray start --head --dashboard-host=0.0.0.0 --num-cpus 12 --num-gpus 1 --resources '{"accelerator_type_a10":1}'
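This presumably works because the serve config requests a custom accelerator_type_a10 resource that the container's auto-started Ray instance does not advertise, so registering it on the head node lets the placement group schedule. As the autoscaler tip above suggests, the registration can be checked before retrying:

# Confirm the head node now advertises the CPU, GPU, and custom
# accelerator_type_a10 resources that the deployment requests.
ray status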
After starting Ray this way, I ran the same serve command:
(ServeReplica:ray-llm:VLLMDeployment:amazon--LightGPT pid=1546) [INFO 2024-01-06 09:38:52,912] vllm_models.py: 218 Using existing placement group <ray.util.placement_group.PlacementGroup object at 0x7f04c6f0a850> PlacementGroupID(9e905cb0eba5cd4f2a25fea839a201000000). {'placement_group_id': '9e905cb0eba5cd4f2a25fea839a201000000', 'name': 'SERVE_REPLICA::ray-llm#VLLMDeployment:amazon--LightGPT#TSxkyG', 'bundles': {0: {'accelerator_type_a10': 0.01, 'CPU': 1.0}, 1: {'accelerator_type_a10': 0.01, 'GPU': 1.0, 'CPU': 8.0}}, 'bundles_to_node_id': {0: '5a6951a3f5dafbc1bb6ff8115ee842faf1177d5ca9a623a1708173f1', 1: '5a6951a3f5dafbc1bb6ff8115ee842faf1177d5ca9a623a1708173f1'}, 'strategy': 'STRICT_PACK', 'state': 'CREATED', 'stats': {'end_to_end_creation_latency_ms': 2.788, 'scheduling_latency_ms': 2.63, 'scheduling_attempt': 1, 'highest_retry_delay_ms': 0.0, 'scheduling_state': 'FINISHED'}}
(ServeReplica:ray-llm:VLLMDeployment:amazon--LightGPT pid=1546) [INFO 2024-01-06 09:38:52,913] vllm_models.py: 221 Using existing placement group <ray.util.placement_group.PlacementGroup object at 0x7f04c6f0a850>
(ServeReplica:ray-llm:VLLMDeployment:amazon--LightGPT pid=1546) [INFO 2024-01-06 09:38:52,913] vllm_node_initializer.py: 38 Starting initialize_node tasks on the workers and local node...
(pid=1133) There was a problem when trying to write in your cache folder (/home/paperspace/data/hub). You should set the environment variable TRANSFORMERS_CACHE to a writable directory. [repeated 4x across cluster] (Ray deduplicates logs by default. Set RAY_DEDUP_LOGS=0 to disable log deduplication, or see https://docs.ray.io/en/master/ray-observability/ray-logging.html#log-deduplication for more options.)
(ServeReplica:ray-llm:Router pid=1548) [WARNING 2024-01-06 09:38:53,032] api.py: 382 DeprecationWarning: `route_prefix`in`@serve.deployment` has been deprecated. To specify a route prefix for an application, pass it into `serve.run` instead. [repeated 3x across cluster]
(ServeReplica:ray-llm:VLLMDeployment:amazon--LightGPT pid=1546) [INFO 2024-01-06 09:38:55,653] utils.py: 299 Did not receive s3_mirror_config or gcs_error_config. Not downloading model from AWS S3 or Google Cloud Storage.
(ServeController pid=1431) ERROR 2024-01-06 09:38:55,751 controller 1431 deployment_state.py:617 - Exception in replica 'ray-llm#VLLMDeployment:amazon--LightGPT#TSxkyG', the replica will be stopped.
(ServeController pid=1431) Traceback (most recent call last):
(ServeController pid=1431) File "/home/ray/anaconda3/lib/python3.9/site-packages/ray/serve/_private/deployment_state.py", line 615, in check_ready
(ServeController pid=1431) _, self._version = ray.get(self._ready_obj_ref)
(ServeController pid=1431) File "/home/ray/anaconda3/lib/python3.9/site-packages/ray/_private/auto_init_hook.py", line 24, in auto_init_wrapper
(ServeController pid=1431) return fn(*args, **kwargs)
(ServeController pid=1431) File "/home/ray/anaconda3/lib/python3.9/site-packages/ray/_private/client_mode_hook.py", line 103, in wrapper
(ServeController pid=1431) return func(*args, **kwargs)
(ServeController pid=1431) File "/home/ray/anaconda3/lib/python3.9/site-packages/ray/_private/worker.py", line 2547, in get
(ServeController pid=1431) raise value.as_instanceof_cause()
(ServeController pid=1431) ray.exceptions.RayTaskError(RuntimeError): ray::ServeReplica:ray-llm:VLLMDeployment:amazon--LightGPT.initialize_and_get_metadata() (pid=1546, ip=172.17.0.2, actor_id=bf8c710a99c964b227d2ee6f01000000, repr=<ray.serve._private.replica.ServeReplica:ray-llm:VLLMDeployment:amazon--LightGPT object at 0x7f059a46a250>)
(ServeController pid=1431) File "/home/ray/anaconda3/lib/python3.9/concurrent/futures/_base.py", line 439, in result
(ServeController pid=1431) return self.__get_result()
(ServeController pid=1431) File "/home/ray/anaconda3/lib/python3.9/concurrent/futures/_base.py", line 391, in __get_result
(ServeController pid=1431) raise self._exception
(ServeController pid=1431) File "/home/ray/anaconda3/lib/python3.9/site-packages/ray/serve/_private/replica.py", line 442, in initialize_and_get_metadata
(ServeController pid=1431) raise RuntimeError(traceback.format_exc()) from None
(ServeController pid=1431) RuntimeError: Traceback (most recent call last):
(ServeController pid=1431) File "/home/ray/anaconda3/lib/python3.9/site-packages/ray/serve/_private/replica.py", line 430, in initialize_and_get_metadata
(ServeController pid=1431) await self._initialize_replica()
(ServeController pid=1431) File "/home/ray/anaconda3/lib/python3.9/site-packages/ray/serve/_private/replica.py", line 190, in initialize_replica
(ServeController pid=1431) await sync_to_async(_callable.__init__)(*init_args, **init_kwargs)
(ServeController pid=1431) File "/home/ray/ray-llm/rayllm/backend/server/vllm/vllm_deployment.py", line 37, in __init__
(ServeController pid=1431) await self.engine.start()
(ServeController pid=1431) File "/home/ray/ray-llm/rayllm/backend/llm/vllm/vllm_engine.py", line 78, in start
(ServeController pid=1431) pg, runtime_env = await self.node_initializer.initialize_node(self.llm_app)
(ServeController pid=1431) File "/home/ray/ray-llm/rayllm/backend/llm/vllm/vllm_node_initializer.py", line 52, in initialize_node
(ServeController pid=1431) await self._initialize_local_node(engine_config)
(ServeController pid=1431) File "/home/ray/anaconda3/lib/python3.9/concurrent/futures/thread.py", line 58, in run
(ServeController pid=1431) result = self.fn(*self.args, **self.kwargs)
(ServeController pid=1431) File "/home/ray/ray-llm/rayllm/backend/llm/vllm/vllm_node_initializer.py", line 72, in _initialize_local_node
(ServeController pid=1431) _ = AutoTokenizer.from_pretrained(engine_config.actual_hf_model_id)
(ServeController pid=1431) File "/home/ray/anaconda3/lib/python3.9/site-packages/transformers/models/auto/tokenization_auto.py", line 701, in from_pretrained
(ServeController pid=1431) tokenizer_config = get_tokenizer_config(pretrained_model_name_or_path, **kwargs)
(ServeController pid=1431) File "/home/ray/anaconda3/lib/python3.9/site-packages/transformers/models/auto/tokenization_auto.py", line 534, in get_tokenizer_config
(ServeController pid=1431) resolved_config_file = cached_file(
(ServeController pid=1431) File "/home/ray/anaconda3/lib/python3.9/site-packages/transformers/utils/hub.py", line 429, in cached_file
(ServeController pid=1431) resolved_file = hf_hub_download(
(ServeController pid=1431) File "/home/ray/anaconda3/lib/python3.9/site-packages/huggingface_hub/utils/_validators.py", line 118, in _inner_fn
(ServeController pid=1431) return fn(*args, **kwargs)
(ServeController pid=1431) File "/home/ray/anaconda3/lib/python3.9/site-packages/huggingface_hub/file_download.py", line 1195, in hf_hub_download
(ServeController pid=1431) os.makedirs(storage_folder, exist_ok=True)
(ServeController pid=1431) File "/home/ray/anaconda3/lib/python3.9/os.py", line 215, in makedirs
(ServeController pid=1431) makedirs(head, exist_ok=exist_ok)
(ServeController pid=1431) File "/home/ray/anaconda3/lib/python3.9/os.py", line 215, in makedirs
(ServeController pid=1431) makedirs(head, exist_ok=exist_ok)
(ServeController pid=1431) File "/home/ray/anaconda3/lib/python3.9/os.py", line 225, in makedirs
(ServeController pid=1431) mkdir(name, mode)
(ServeController pid=1431) PermissionError: [Errno 13] Permission denied: '/home/paperspace/data'
(ServeController pid=1431) INFO 2024-01-06 09:38:55,861 controller 1431 deployment_state.py:2027 - Replica ray-llm#VLLMDeployment:amazon--LightGPT#TSxkyG is stopped.
(ServeController pid=1431) INFO 2024-01-06 09:38:55,861 controller 1431 deployment_state.py:1679 - Adding 1 replica to deployment VLLMDeployment:amazon--LightGPT in application 'ray-llm'.
(ServeController pid=1431) ERROR 2024-01-06 09:38:59,054 controller 1431 deployment_state.py:617 - Exception in replica 'ray-llm#VLLMDeployment:amazon--LightGPT#Oywlbj', the replica will be stopped.
(The replacement replica failed with a traceback identical to the one above, ending in the same PermissionError: [Errno 13] Permission denied: '/home/paperspace/data'.)
Specifically, the permission-denied error points at '/home/paperspace/data'. I tried changing the cache folder to a different location but got the same error.
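My guess at the root cause: the host shell likely expanded ~ in -e HF_HOME=~/data to the host home (/home/paperspace) before docker even ran, so inside the container the Hugging Face cache points at /home/paperspace/data, which the container's ray user cannot create. A possible workaround inside the container, following the TRANSFORMERS_CACHE hint printed in the logs (the /home/ray/data location is an assumption; any writable absolute path should do):

# Point the Hugging Face caches at a directory the 'ray' user can write to,
# then retry the deployment. /home/ray/data is an assumed location.
mkdir -p /home/ray/data
export HF_HOME=/home/ray/data
export TRANSFORMERS_CACHE=/home/ray/data/hub
serve run serve_configs/amazon--LightGPT.yaml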
Additionally, after the above permission-denied error was printed, I got further error logs, very similar to issue #55.