You signed in with another tab or window. Reload to refresh your session.You signed out in another tab or window. Reload to refresh your session.You switched accounts on another tab or window. Reload to refresh your session.Dismiss alert
I searched the issues and found no similar issues.
KubeRay Component
ray-operator
What happened + What you expected to happen
When I execute ray-cluster in k8s, my worker pod keeps initializing, the log shows: Timed out while waiting for GCS to become available
log gcs_server.err in /tmp/ray/session_latest/logs/ is empty.
Reproduction script
apiVersion: ray.io/v1
kind: RayCluster
metadata:
labels:
controller-tools.k8s.io: "1.0"
# A unique identifier for the head node and workers of this cluster.
name: raycluster-complete
spec:
rayVersion: '2.38.0'
# Ray head pod template
headGroupSpec:
serviceType: ClusterIP
rayStartParams:
dashboard-host: '0.0.0.0'
# pod template
template:
metadata:
labels: {}
spec:
containers:
- name: ray-head
image: harbor.unijn.cn/sunjian/ray
resources:
limits:
cpu: 10
memory: 20Gi
requests:
cpu: 10
memory: 20Gi
ports:
- containerPort: 6379
name: gcs
- containerPort: 8265
name: dashboard
- containerPort: 10001
name: client
lifecycle:
preStop:
exec:
command: ["/bin/sh","-c","ray stop"]
volumeMounts:
- mountPath: /fs/nlp/sunjian
name: ray-logs
volumes:
- name: ray-logs
emptyDir: {}
workerGroupSpecs:
# the pod replicas in this group typed worker
- replicas: 1
minReplicas: 1
maxReplicas: 4
# logical group name, for this called large-group, also can be functional
groupName: large-group
rayStartParams: {}
#pod template
template:
spec:
containers:
- name: ray-worker
image: harbor.unijn.cn/sunjian/ray
resources:
limits:
cpu: 10
nvidia.com/gpu: 4
memory: 20Gi
requests:
cpu: 10
nvidia.com/gpu: 4
memory: 20Gi
lifecycle:
preStop:
exec:
command: ["/bin/sh","-c","ray stop"]
# use volumeMounts.Optional.
# Refer to https://kubernetes.io/docs/concepts/storage/volumes/
volumeMounts:
- mountPath: /fs/nlp/sunjian
name: ray-logs
# use volumes
# Refer to https://kubernetes.io/docs/concepts/storage/volumes/
volumes:
- name: ray-logs
emptyDir: {}
######################status#################################
Anything else
67 seconds elapsed: Waiting for GCS to be ready.
134 seconds elapsed: Waiting for GCS to be ready.
[2024-11-14 09:10:17,717 E 208 208] gcs_rpc_client.h:179: Failed to connect to GCS at address raycluster-complete-head-svc.sunjian.svc.cluster.local:6379 within 5 seconds.
[2024-11-14 09:10:51,721 W 208 208] gcs_client.cc:178: Failed to get cluster ID from GCS server: TimedOut: Timed out while waiting for GCS to become available.
Traceback (most recent call last):
File "/usr/local/bin/ray", line 8, in <module>
sys.exit(main())
File "/usr/local/lib/python3.10/dist-packages/ray/scripts/scripts.py", line 2614, in main
return cli()
File "/usr/local/lib/python3.10/dist-packages/click/core.py", line 1157, in __call__
return self.main(*args, **kwargs)
File "/usr/local/lib/python3.10/dist-packages/click/core.py", line 1078, in main
rv = self.invoke(ctx)
File "/usr/local/lib/python3.10/dist-packages/click/core.py", line 1688, in invoke
return _process_result(sub_ctx.command.invoke(sub_ctx))
File "/usr/local/lib/python3.10/dist-packages/click/core.py", line 1434, in invoke
return ctx.invoke(self.callback, **ctx.params)
File "/usr/local/lib/python3.10/dist-packages/click/core.py", line 783, in invoke
return __callback(*args, **kwargs)
File "/usr/local/lib/python3.10/dist-packages/ray/scripts/scripts.py", line 2377, in healthcheck
gcs_client = ray._raylet.GcsClient(address=address)
File "python/ray/_raylet.pyx", line 2705, in ray._raylet.GcsClient.__cinit__
File "python/ray/includes/gcs_client.pxi", line 64, in ray._raylet.NewGcsClient.standalone
File "python/ray/includes/common.pxi", line 114, in ray._raylet.check_status_timeout_as_rpc_error
ray.exceptions.RpcError: Timed out while waiting for GCS to become available.
203 seconds elapsed: Still waiting for GCS to be ready. For troubleshooting, refer to the FAQ at https://github.com/ray-project/kuberay/blob/master/docs/guidance/FAQ.md.
gcs_server.out:
[2024-11-14 17:16:38,071 I 75 75] (gcs_server) gcs_server_main.cc:52: Ray cluster metadata ray_version=2.38.0 ray_commit=385ee466260ef3cd218d5e372aef5d39338b7b94
[2024-11-14 17:16:38,074 I 75 75] (gcs_server) io_service_pool.cc:35: IOServicePool is running with 1 io_service.
[2024-11-14 17:16:38,078 I 75 75] (gcs_server) event.cc:491: Ray Event initialized for GCS
[2024-11-14 17:16:38,078 I 75 75] (gcs_server) event.cc:491: Ray Event initialized for EXPORT_NODE
[2024-11-14 17:16:38,078 I 75 75] (gcs_server) event.cc:491: Ray Event initialized for EXPORT_ACTOR
[2024-11-14 17:16:38,078 I 75 75] (gcs_server) event.cc:491: Ray Event initialized for EXPORT_DRIVER_JOB
[2024-11-14 17:16:38,078 I 75 75] (gcs_server) event.cc:322: Set ray event level to warning
[2024-11-14 17:16:38,080 I 75 75] (gcs_server) gcs_server.cc:76: GCS storage type is StorageType::IN_MEMORY
[2024-11-14 17:16:38,082 I 75 75] (gcs_server) gcs_init_data.cc:42: Loading job table data.
[2024-11-14 17:16:38,082 I 75 75] (gcs_server) gcs_init_data.cc:54: Loading node table data.
[2024-11-14 17:16:38,082 I 75 75] (gcs_server) gcs_init_data.cc:80: Loading actor table data.
[2024-11-14 17:16:38,082 I 75 75] (gcs_server) gcs_init_data.cc:93: Loading actor task spec table data.
[2024-11-14 17:16:38,082 I 75 75] (gcs_server) gcs_init_data.cc:66: Loading placement group table data.
[2024-11-14 17:16:38,082 I 75 75] (gcs_server) gcs_init_data.cc:46: Finished loading job table data, size = 0
[2024-11-14 17:16:38,082 I 75 75] (gcs_server) gcs_init_data.cc:58: Finished loading node table data, size = 0
[2024-11-14 17:16:38,082 I 75 75] (gcs_server) gcs_init_data.cc:84: Finished loading actor table data, size = 0
[2024-11-14 17:16:38,082 I 75 75] (gcs_server) gcs_init_data.cc:97: Finished loading actor task spec table data, size = 0
[2024-11-14 17:16:38,082 I 75 75] (gcs_server) gcs_init_data.cc:71: Finished loading placement group table data, size = 0
[2024-11-14 17:16:38,082 I 75 75] (gcs_server) gcs_server.cc:164: No existing server cluster ID found. Generating new ID: 4f7b90abd98f7cdc6d1076bfa86d2f5f978300a83f585ace064062bf
[2024-11-14 17:16:38,083 I 75 75] (gcs_server) gcs_server.cc:645: Autoscaler V2 enabled: 0
[2024-11-14 17:16:38,087 I 75 75] (gcs_server) grpc_server.cc:134: GcsServer server started, listening on port 6379.
[2024-11-14 17:16:38,311 I 75 75] (gcs_server) gcs_server.cc:247: Gcs Debug state:
GcsNodeManager:
- RegisterNode request count: 0
- DrainNode request count: 0
- GetAllNodeInfo request count: 0
GcsActorManager:
- RegisterActor request count: 0
- CreateActor request count: 0
- GetActorInfo request count: 0
- GetNamedActorInfo request count: 0
- GetAllActorInfo request count: 0
- KillActor request count: 0
- ListNamedActors request count: 0
- Registered actors count: 0
- Destroyed actors count: 0
- Named actors count: 0
- Unresolved actors count: 0
- Pending actors count: 0
- Created actors count: 0
- owners_: 0
- actor_to_register_callbacks_: 0
- actor_to_restart_callbacks_: 0
- actor_to_create_callbacks_: 0
- sorted_destroyed_actor_list_: 0
GcsResourceManager:
- GetAllAvailableResources request count: 0
- GetAllTotalResources request count: 0
- GetAllResourceUsage request count: 0
GcsPlacementGroupManager:
- CreatePlacementGroup request count: 0
- RemovePlacementGroup request count: 0
- GetPlacementGroup request count: 0
- GetAllPlacementGroup request count: 0
- WaitPlacementGroupUntilReady request count: 0
- GetNamedPlacementGroup request count: 0
- Scheduling pending placement group count: 0
- Registered placement groups count: 0
- Named placement group count: 0
- Pending placement groups count: 0
- Infeasible placement groups count: 0
Publisher:
[runtime env manager] ID to URIs table:
[runtime env manager] URIs reference table:
GcsTaskManager:
-Total num task events reported: 0
-Total num status task events dropped: 0
-Total num profile events dropped: 0
-Current num of task events stored: 0
-Total num of actor creation tasks: 0
-Total num of actor tasks: 0
-Total num of normal tasks: 0
-Total num of driver tasks: 0
[2024-11-14 17:16:38,311 I 75 75] (gcs_server) gcs_server.cc:843: main_service_ Event stats:
Global stats: 23 total (6 active)
Queueing time: mean = 88.973 ms, max = 228.266 ms, min = 5.591 us, total = 2.046 s
Execution time: mean = 9.936 ms, total = 228.523 ms
Event stats:
GcsInMemoryStore.Put - 9 total (0 active), Execution time: mean = 25.370 ms, total = 228.326 ms, Queueing time: mean = 176.614 ms, max = 227.521 ms, min = 5.591 us, total = 1.590 s
GcsInMemoryStore.GetAll - 5 total (0 active), Execution time: mean = 25.400 us, total = 126.999 us, Queueing time: mean = 101.508 us, max = 105.025 us, min = 95.198 us, total = 507.540 us
PeriodicalRunner.RunFnPeriodically - 4 total (2 active, 1 running), Execution time: mean = 1.814 us, total = 7.256 us, Queueing time: mean = 114.083 ms, max = 228.266 ms, min = 228.064 ms, total = 456.330 ms
NodeInfoGcsService.grpc_server.GetClusterId - 1 total (1 active), Execution time: mean = 0.000 s, total = 0.000 s, Queueing time: mean = 0.000 s, max = -0.000 s, min = 9223372036.855 s, total = 0.000 s
RayletLoadPulled - 1 total (1 active), Execution time: mean = 0.000 s, total = 0.000 s, Queueing time: mean = 0.000 s, max = -0.000 s, min = 9223372036.855 s, total = 0.000 s
GcsInMemoryStore.Get - 1 total (0 active), Execution time: mean = 62.770 us, total = 62.770 us, Queueing time: mean = 10.564 us, max = 10.564 us, min = 10.564 us, total = 10.564 us
NodeInfoGcsService.grpc_server.GetClusterId.HandleRequestImpl - 1 total (1 active), Execution time: mean = 0.000 s, total = 0.000 s, Queueing time: mean = 0.000 s, max = -0.000 s, min = 9223372036.855 s, total = 0.000 s
ClusterResourceManager.ResetRemoteNodeView - 1 total (1 active), Execution time: mean = 0.000 s, total = 0.000 s, Queueing time: mean = 0.000 s, max = -0.000 s, min = 9223372036.855 s, total = 0.000 s
[2024-11-14 17:16:38,311 I 75 75] (gcs_server) gcs_server.cc:845: pubsub_io_context_ Event stats:
Global stats: 2 total (1 active)
Queueing time: mean = 10.771 us, max = 21.543 us, min = 21.543 us, total = 21.543 us
Execution time: mean = 85.118 us, total = 170.237 us
Event stats:
Publisher.CheckDeadSubscribers - 1 total (1 active), Execution time: mean = 0.000 s, total = 0.000 s, Queueing time: mean = 0.000 s, max = -0.000 s, min = 9223372036.855 s, total = 0.000 s
PeriodicalRunner.RunFnPeriodically - 1 total (0 active), Execution time: mean = 170.237 us, total = 170.237 us, Queueing time: mean = 21.543 us, max = 21.543 us, min = 21.543 us, total = 21.543 us
[2024-11-14 17:16:38,311 I 75 75] (gcs_server) gcs_server.cc:847: task_io_context_ Event stats:
Global stats: 2 total (1 active)
Queueing time: mean = 8.081 us, max = 16.161 us, min = 16.161 us, total = 16.161 us
Execution time: mean = 46.675 us, total = 93.350 us
Event stats:
PeriodicalRunner.RunFnPeriodically - 1 total (0 active), Execution time: mean = 93.350 us, total = 93.350 us, Queueing time: mean = 16.161 us, max = 16.161 us, min = 16.161 us, total = 16.161 us
GcsTaskManager.GcJobSummary - 1 total (1 active), Execution time: mean = 0.000 s, total = 0.000 s, Queueing time: mean = 0.000 s, max = -0.000 s, min = 9223372036.855 s, total = 0.000 s
[2024-11-14 17:16:38,311 I 75 75] (gcs_server) gcs_server.cc:849: ray_syncer_io_context_ Event stats:
Global stats: 2 total (0 active)
Queueing time: mean = 27.348 us, max = 32.609 us, min = 22.087 us, total = 54.696 us
Execution time: mean = 4.326 us, total = 8.652 us
Event stats:
RaySyncerRegister - 2 total (0 active), Execution time: mean = 4.326 us, total = 8.652 us, Queueing time: mean = 27.348 us, max = 32.609 us, min = 22.087 us, total = 54.696 us
[2024-11-14 17:16:39,538 I 75 75] (gcs_server) gcs_node_manager.cc:85: Registering node info, address = 10.233.70.150, node name = 10.233.70.150 node_id=c429e1f1a07b5244e103af32116ac9528307ec62f91a13a56b3b9625
[2024-11-14 17:16:39,539 I 75 75] (gcs_server) gcs_node_manager.cc:91: Finished registering node info, address = 10.233.70.150, node name = 10.233.70.150 node_id=c429e1f1a07b5244e103af32116ac9528307ec62f91a13a56b3b9625
[2024-11-14 17:16:39,539 I 75 75] (gcs_server) gcs_placement_group_manager.cc:819: A new node: c429e1f1a07b5244e103af32116ac9528307ec62f91a13a56b3b9625 registered, will try to reschedule all the infeasible placement groups.
[2024-11-14 17:16:39,547 I 75 151] (gcs_server) ray_syncer.cc:377: Get connection node_id=c429e1f1a07b5244e103af32116ac9528307ec62f91a13a56b3b9625
Are you willing to submit a PR?
Yes I am willing to submit a PR!
The text was updated successfully, but these errors were encountered:
Search before asking
KubeRay Component
ray-operator
What happened + What you expected to happen
When I execute ray-cluster in k8s, my worker pod keeps initializing, the log shows:
Timed out while waiting for GCS to become available
log gcs_server.err in /tmp/ray/session_latest/logs/ is empty.
Reproduction script
Anything else
gcs_server.out:
Are you willing to submit a PR?
The text was updated successfully, but these errors were encountered: