
[Bug] ray.exceptions.RpcError: Timed out while waiting for GCS to become available. #2540

Open
BeerTai opened this issue Nov 14, 2024
Labels: bug, triage

Search before asking

  • I searched the issues and found no similar issues.

KubeRay Component

ray-operator

What happened + What you expected to happen

When I deploy the RayCluster in Kubernetes, the worker pod is stuck initializing and its log shows:
Timed out while waiting for GCS to become available
The gcs_server.err log in /tmp/ray/session_latest/logs/ is empty.
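
For reference, a minimal way to inspect the stuck worker (the namespace sunjian is taken from the service address in the logs below, and the init container name wait-gcs-ready is KubeRay's default and may differ by version):

# Show why the worker pod is stuck in Init (events, init container state).
kubectl -n sunjian describe pod <worker-pod-name>
# The GCS readiness check runs in KubeRay's injected init container;
# its log should contain the timeout shown under "Anything else".
kubectl -n sunjian logs <worker-pod-name> -c wait-gcs-ready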

Reproduction script

apiVersion: ray.io/v1
kind: RayCluster
metadata:
  labels:
    controller-tools.k8s.io: "1.0"
    # A unique identifier for the head node and workers of this cluster.
  name: raycluster-complete
spec:
  rayVersion: '2.38.0'
  # Ray head pod template
  headGroupSpec:
    serviceType: ClusterIP
    rayStartParams:
      dashboard-host: '0.0.0.0'
    # pod template
    template:
      metadata:
        labels: {}
      spec:
        containers:
        - name: ray-head
          image: harbor.unijn.cn/sunjian/ray
          resources:
            limits:
              cpu: 10
              memory: 20Gi
            requests:
              cpu: 10
              memory: 20Gi
          ports:
          - containerPort: 6379
            name: gcs
          - containerPort: 8265
            name: dashboard
          - containerPort: 10001
            name: client
          lifecycle:
            preStop:
              exec:
                command: ["/bin/sh","-c","ray stop"]
          volumeMounts:
            - mountPath: /fs/nlp/sunjian
              name: ray-logs
        volumes:
          - name: ray-logs
            emptyDir: {}
  workerGroupSpecs:
  # the number of worker pod replicas in this group
  - replicas: 1
    minReplicas: 1
    maxReplicas: 4
    # logical group name; here it is called large-group, but it can also be a functional name
    groupName: large-group
    rayStartParams: {}
    # pod template
    template:
      spec:
        containers:
        - name: ray-worker
          image: harbor.unijn.cn/sunjian/ray
          resources:
            limits:
              cpu: 10
              nvidia.com/gpu: 4
              memory: 20Gi 
            requests:
              cpu: 10
              nvidia.com/gpu: 4 
              memory: 20Gi 
          lifecycle:
            preStop:
              exec:
                command: ["/bin/sh","-c","ray stop"]
          # Optional: use volumeMounts.
          # Refer to https://kubernetes.io/docs/concepts/storage/volumes/
          volumeMounts:
            - mountPath:  /fs/nlp/sunjian
              name: ray-logs
        # use volumes
        # Refer to https://kubernetes.io/docs/concepts/storage/volumes/
        volumes:
          - name: ray-logs
            emptyDir: {}
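
For completeness, a sketch of how the manifest above is applied and the pods are checked (the file name raycluster-complete.yaml and namespace sunjian are assumptions; the ray.io/cluster label is set by the KubeRay operator):

# Apply the RayCluster manifest (file name assumed).
kubectl -n sunjian apply -f raycluster-complete.yaml
# Watch the head and worker pods; the worker never leaves the Init state.
kubectl -n sunjian get pods -l ray.io/cluster=raycluster-complete -w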

Anything else

67 seconds elapsed: Waiting for GCS to be ready.
134 seconds elapsed: Waiting for GCS to be ready.
[2024-11-14 09:10:17,717 E 208 208] gcs_rpc_client.h:179: Failed to connect to GCS at address raycluster-complete-head-svc.sunjian.svc.cluster.local:6379 within 5 seconds.
[2024-11-14 09:10:51,721 W 208 208] gcs_client.cc:178: Failed to get cluster ID from GCS server: TimedOut: Timed out while waiting for GCS to become available.
Traceback (most recent call last):
  File "/usr/local/bin/ray", line 8, in <module>
    sys.exit(main())
  File "/usr/local/lib/python3.10/dist-packages/ray/scripts/scripts.py", line 2614, in main
    return cli()
  File "/usr/local/lib/python3.10/dist-packages/click/core.py", line 1157, in __call__
    return self.main(*args, **kwargs)
  File "/usr/local/lib/python3.10/dist-packages/click/core.py", line 1078, in main
    rv = self.invoke(ctx)
  File "/usr/local/lib/python3.10/dist-packages/click/core.py", line 1688, in invoke
    return _process_result(sub_ctx.command.invoke(sub_ctx))
  File "/usr/local/lib/python3.10/dist-packages/click/core.py", line 1434, in invoke
    return ctx.invoke(self.callback, **ctx.params)
  File "/usr/local/lib/python3.10/dist-packages/click/core.py", line 783, in invoke
    return __callback(*args, **kwargs)
  File "/usr/local/lib/python3.10/dist-packages/ray/scripts/scripts.py", line 2377, in healthcheck
    gcs_client = ray._raylet.GcsClient(address=address)
  File "python/ray/_raylet.pyx", line 2705, in ray._raylet.GcsClient.__cinit__
  File "python/ray/includes/gcs_client.pxi", line 64, in ray._raylet.NewGcsClient.standalone
  File "python/ray/includes/common.pxi", line 114, in ray._raylet.check_status_timeout_as_rpc_error
ray.exceptions.RpcError: Timed out while waiting for GCS to become available.
203 seconds elapsed: Still waiting for GCS to be ready. For troubleshooting, refer to the FAQ at https://github.com/ray-project/kuberay/blob/master/docs/guidance/FAQ.md.

gcs_server.out:

[2024-11-14 17:16:38,071 I 75 75] (gcs_server) gcs_server_main.cc:52: Ray cluster metadata ray_version=2.38.0 ray_commit=385ee466260ef3cd218d5e372aef5d39338b7b94
[2024-11-14 17:16:38,074 I 75 75] (gcs_server) io_service_pool.cc:35: IOServicePool is running with 1 io_service.
[2024-11-14 17:16:38,078 I 75 75] (gcs_server) event.cc:491: Ray Event initialized for GCS
[2024-11-14 17:16:38,078 I 75 75] (gcs_server) event.cc:491: Ray Event initialized for EXPORT_NODE
[2024-11-14 17:16:38,078 I 75 75] (gcs_server) event.cc:491: Ray Event initialized for EXPORT_ACTOR
[2024-11-14 17:16:38,078 I 75 75] (gcs_server) event.cc:491: Ray Event initialized for EXPORT_DRIVER_JOB
[2024-11-14 17:16:38,078 I 75 75] (gcs_server) event.cc:322: Set ray event level to warning
[2024-11-14 17:16:38,080 I 75 75] (gcs_server) gcs_server.cc:76: GCS storage type is StorageType::IN_MEMORY
[2024-11-14 17:16:38,082 I 75 75] (gcs_server) gcs_init_data.cc:42: Loading job table data.
[2024-11-14 17:16:38,082 I 75 75] (gcs_server) gcs_init_data.cc:54: Loading node table data.
[2024-11-14 17:16:38,082 I 75 75] (gcs_server) gcs_init_data.cc:80: Loading actor table data.
[2024-11-14 17:16:38,082 I 75 75] (gcs_server) gcs_init_data.cc:93: Loading actor task spec table data.
[2024-11-14 17:16:38,082 I 75 75] (gcs_server) gcs_init_data.cc:66: Loading placement group table data.
[2024-11-14 17:16:38,082 I 75 75] (gcs_server) gcs_init_data.cc:46: Finished loading job table data, size = 0
[2024-11-14 17:16:38,082 I 75 75] (gcs_server) gcs_init_data.cc:58: Finished loading node table data, size = 0
[2024-11-14 17:16:38,082 I 75 75] (gcs_server) gcs_init_data.cc:84: Finished loading actor table data, size = 0
[2024-11-14 17:16:38,082 I 75 75] (gcs_server) gcs_init_data.cc:97: Finished loading actor task spec table data, size = 0
[2024-11-14 17:16:38,082 I 75 75] (gcs_server) gcs_init_data.cc:71: Finished loading placement group table data, size = 0
[2024-11-14 17:16:38,082 I 75 75] (gcs_server) gcs_server.cc:164: No existing server cluster ID found. Generating new ID: 4f7b90abd98f7cdc6d1076bfa86d2f5f978300a83f585ace064062bf
[2024-11-14 17:16:38,083 I 75 75] (gcs_server) gcs_server.cc:645: Autoscaler V2 enabled: 0
[2024-11-14 17:16:38,087 I 75 75] (gcs_server) grpc_server.cc:134: GcsServer server started, listening on port 6379.
[2024-11-14 17:16:38,311 I 75 75] (gcs_server) gcs_server.cc:247: Gcs Debug state:

GcsNodeManager:
- RegisterNode request count: 0
- DrainNode request count: 0
- GetAllNodeInfo request count: 0

GcsActorManager:
- RegisterActor request count: 0
- CreateActor request count: 0
- GetActorInfo request count: 0
- GetNamedActorInfo request count: 0
- GetAllActorInfo request count: 0
- KillActor request count: 0
- ListNamedActors request count: 0
- Registered actors count: 0
- Destroyed actors count: 0
- Named actors count: 0
- Unresolved actors count: 0
- Pending actors count: 0
- Created actors count: 0
- owners_: 0
- actor_to_register_callbacks_: 0
- actor_to_restart_callbacks_: 0
- actor_to_create_callbacks_: 0
- sorted_destroyed_actor_list_: 0

GcsResourceManager:
- GetAllAvailableResources request count: 0
- GetAllTotalResources request count: 0
- GetAllResourceUsage request count: 0

GcsPlacementGroupManager:
- CreatePlacementGroup request count: 0
- RemovePlacementGroup request count: 0
- GetPlacementGroup request count: 0
- GetAllPlacementGroup request count: 0
- WaitPlacementGroupUntilReady request count: 0
- GetNamedPlacementGroup request count: 0
- Scheduling pending placement group count: 0
- Registered placement groups count: 0
- Named placement group count: 0
- Pending placement groups count: 0
- Infeasible placement groups count: 0

Publisher:

[runtime env manager] ID to URIs table:
[runtime env manager] URIs reference table:

GcsTaskManager:
-Total num task events reported: 0
-Total num status task events dropped: 0
-Total num profile events dropped: 0
-Current num of task events stored: 0
-Total num of actor creation tasks: 0
-Total num of actor tasks: 0
-Total num of normal tasks: 0
-Total num of driver tasks: 0


[2024-11-14 17:16:38,311 I 75 75] (gcs_server) gcs_server.cc:843: main_service_ Event stats:


Global stats: 23 total (6 active)
Queueing time: mean = 88.973 ms, max = 228.266 ms, min = 5.591 us, total = 2.046 s
Execution time:  mean = 9.936 ms, total = 228.523 ms
Event stats:
        GcsInMemoryStore.Put - 9 total (0 active), Execution time: mean = 25.370 ms, total = 228.326 ms, Queueing time: mean = 176.614 ms, max = 227.521 ms, min = 5.591 us, total = 1.590 s
        GcsInMemoryStore.GetAll - 5 total (0 active), Execution time: mean = 25.400 us, total = 126.999 us, Queueing time: mean = 101.508 us, max = 105.025 us, min = 95.198 us, total = 507.540 us
        PeriodicalRunner.RunFnPeriodically - 4 total (2 active, 1 running), Execution time: mean = 1.814 us, total = 7.256 us, Queueing time: mean = 114.083 ms, max = 228.266 ms, min = 228.064 ms, total = 456.330 ms
        NodeInfoGcsService.grpc_server.GetClusterId - 1 total (1 active), Execution time: mean = 0.000 s, total = 0.000 s, Queueing time: mean = 0.000 s, max = -0.000 s, min = 9223372036.855 s, total = 0.000 s
        RayletLoadPulled - 1 total (1 active), Execution time: mean = 0.000 s, total = 0.000 s, Queueing time: mean = 0.000 s, max = -0.000 s, min = 9223372036.855 s, total = 0.000 s
        GcsInMemoryStore.Get - 1 total (0 active), Execution time: mean = 62.770 us, total = 62.770 us, Queueing time: mean = 10.564 us, max = 10.564 us, min = 10.564 us, total = 10.564 us
        NodeInfoGcsService.grpc_server.GetClusterId.HandleRequestImpl - 1 total (1 active), Execution time: mean = 0.000 s, total = 0.000 s, Queueing time: mean = 0.000 s, max = -0.000 s, min = 9223372036.855 s, total = 0.000 s
        ClusterResourceManager.ResetRemoteNodeView - 1 total (1 active), Execution time: mean = 0.000 s, total = 0.000 s, Queueing time: mean = 0.000 s, max = -0.000 s, min = 9223372036.855 s, total = 0.000 s


[2024-11-14 17:16:38,311 I 75 75] (gcs_server) gcs_server.cc:845: pubsub_io_context_ Event stats:


Global stats: 2 total (1 active)
Queueing time: mean = 10.771 us, max = 21.543 us, min = 21.543 us, total = 21.543 us
Execution time:  mean = 85.118 us, total = 170.237 us
Event stats:
        Publisher.CheckDeadSubscribers - 1 total (1 active), Execution time: mean = 0.000 s, total = 0.000 s, Queueing time: mean = 0.000 s, max = -0.000 s, min = 9223372036.855 s, total = 0.000 s
        PeriodicalRunner.RunFnPeriodically - 1 total (0 active), Execution time: mean = 170.237 us, total = 170.237 us, Queueing time: mean = 21.543 us, max = 21.543 us, min = 21.543 us, total = 21.543 us


[2024-11-14 17:16:38,311 I 75 75] (gcs_server) gcs_server.cc:847: task_io_context_ Event stats:


Global stats: 2 total (1 active)
Queueing time: mean = 8.081 us, max = 16.161 us, min = 16.161 us, total = 16.161 us
Execution time:  mean = 46.675 us, total = 93.350 us
Event stats:
        PeriodicalRunner.RunFnPeriodically - 1 total (0 active), Execution time: mean = 93.350 us, total = 93.350 us, Queueing time: mean = 16.161 us, max = 16.161 us, min = 16.161 us, total = 16.161 us
        GcsTaskManager.GcJobSummary - 1 total (1 active), Execution time: mean = 0.000 s, total = 0.000 s, Queueing time: mean = 0.000 s, max = -0.000 s, min = 9223372036.855 s, total = 0.000 s


[2024-11-14 17:16:38,311 I 75 75] (gcs_server) gcs_server.cc:849: ray_syncer_io_context_ Event stats:


Global stats: 2 total (0 active)
Queueing time: mean = 27.348 us, max = 32.609 us, min = 22.087 us, total = 54.696 us
Execution time:  mean = 4.326 us, total = 8.652 us
Event stats:
        RaySyncerRegister - 2 total (0 active), Execution time: mean = 4.326 us, total = 8.652 us, Queueing time: mean = 27.348 us, max = 32.609 us, min = 22.087 us, total = 54.696 us


[2024-11-14 17:16:39,538 I 75 75] (gcs_server) gcs_node_manager.cc:85: Registering node info, address = 10.233.70.150, node name = 10.233.70.150 node_id=c429e1f1a07b5244e103af32116ac9528307ec62f91a13a56b3b9625
[2024-11-14 17:16:39,539 I 75 75] (gcs_server) gcs_node_manager.cc:91: Finished registering node info, address = 10.233.70.150, node name = 10.233.70.150 node_id=c429e1f1a07b5244e103af32116ac9528307ec62f91a13a56b3b9625
[2024-11-14 17:16:39,539 I 75 75] (gcs_server) gcs_placement_group_manager.cc:819: A new node: c429e1f1a07b5244e103af32116ac9528307ec62f91a13a56b3b9625 registered, will try to reschedule all the infeasible placement groups.
[2024-11-14 17:16:39,547 I 75 151] (gcs_server) ray_syncer.cc:377: Get connection node_id=c429e1f1a07b5244e103af32116ac9528307ec62f91a13a56b3b9625
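
Since gcs_server.out shows GCS listening on 6379 and a node registering, the failure looks like the worker cannot resolve or reach the head Service. A minimal connectivity check from inside the namespace (busybox is just a throwaway debug image; the Ray image is the one from the manifest above):

# Confirm the head Service name resolves inside the cluster.
kubectl -n sunjian run dns-check --rm -it --restart=Never --image=busybox:1.36 -- \
  nslookup raycluster-complete-head-svc.sunjian.svc.cluster.local
# Try a raw TCP connection to the GCS port using the same Ray image.
kubectl -n sunjian run gcs-check --rm -it --restart=Never --image=harbor.unijn.cn/sunjian/ray -- \
  python -c "import socket; socket.create_connection(('raycluster-complete-head-svc.sunjian.svc.cluster.local', 6379), timeout=5); print('reachable')"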

Are you willing to submit a PR?

  • Yes I am willing to submit a PR!