Cronjob Name too long #425

Open
didlawowo opened this issue Oct 20, 2024 · 4 comments

Comments

@didlawowo

Hello,
it would be nice to be able to set the cronjob name, or to provide a fullnameOverride, to prevent this error from the Helm chart when persistence is enabled 👍

forbidden,CronJob.batch "in-cluster-kube-image-keeper-registry-garbage-collection" is invalid: metadata.name: Invalid value: "in-cluster-kube-image-keeper-registry-garbage-collection": must be no more than 52 characters
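For reference, Kubernetes caps CronJob names at 52 characters (the controller appends an 11-character suffix to the Jobs it creates, and the result must still fit in 63 characters), and "in-cluster-kube-image-keeper-registry-garbage-collection" is 56 characters. A purely hypothetical sketch of what the requested override could look like, assuming the usual enix Helm repo and a kuik-system namespace; the chart does not expose fullnameOverride today, which is the point of this issue:

# Hypothetical: with a fullnameOverride, the CronJob would be named
# "<override>-registry-garbage-collection", well under the 52-character limit.
helm upgrade --install in-cluster enix/kube-image-keeper \
  --namespace kuik-system \
  --set fullnameOverride=kuik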

@Nicolasgouze
Contributor

Hi @didlawowo, we'll fix it shortly. In the meantime, using a shorter release name will unblock you.
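For anyone else hitting this before the fix lands, a minimal sketch of that interim workaround (release name, repository alias and namespace are assumptions, adjust to your setup):

# A short release name keeps "<release>-kube-image-keeper-registry-garbage-collection"
# within the 52-character CronJob name limit.
helm upgrade --install kuik enix/kube-image-keeper --namespace kuik-system --create-namespace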

@didlawowo
Author

Yes, I did that.
Also, I have disabled kuik for now, because I often get an "image not found" error on the localhost:<kuik port> registry, which then crashes the deployment.
I'll be happy to show that.

@Nicolasgouze
Contributor

Hi @didlawowo, yes, it would be great to get the logs associated with that "image not found" error on the local kuik registry!

@didlawowo
Author

didlawowo commented Nov 16, 2024

Hi @didlawowo, yes, it would be great to get the logs associated with that "image not found" error on the local kuik registry!

It looks like this:

Name:             raycluster-kuberay-workergroup-worker-69rsq
Namespace:        ray
Priority:         0
Service Account:  default
Node:             rtx/192.168.1.29
Start Time:       Sat, 16 Nov 2024 09:11:19 +0100
Labels:           app.kubernetes.io/created-by=kuberay-operator
                  app.kubernetes.io/instance=raycluster
                  app.kubernetes.io/managed-by=Helm
                  app.kubernetes.io/name=kuberay
                  helm.sh/chart=ray-cluster-1.2.2
                  kuik.enix.io/managed=true
                  ray.io/cluster=raycluster-kuberay
                  ray.io/group=workergroup
                  ray.io/identifier=raycluster-kuberay-worker
                  ray.io/is-ray-node=yes
                  ray.io/node-type=worker
Annotations:      ad.datadoghq.com/ray.checks:
                    {
                      "ray": {
                        "instances": [
                          {
                            "openmetrics_endpoint": "http://%%host%%:8080"
                          }
                        ]
                      }
                    }
                  kuik.enix.io/rewrite-images: true
                  original-image-ray-worker: rayproject/ray:2.38.0-py311-gpu
                  original-init-image-wait-gcs-ready: rayproject/ray:2.38.0-py311-gpu
                  ray.io/ft-enabled: false
Status:           Pending
IP:               10.0.3.94
IPs:
  IP:           10.0.3.94
Controlled By:  RayCluster/raycluster-kuberay
Init Containers:
  wait-gcs-ready:
    Container ID:  
    Image:         localhost:7439/rayproject/ray:2.38.0-py311-gpu
    Image ID:      
    Port:          <none>
    Host Port:     <none>
    Command:
      /bin/bash
      -lc
      --
    Args:
      
                            SECONDS=0
                            while true; do
                              if (( SECONDS <= 120 )); then
                                if ray health-check --address raycluster-kuberay-head-svc.ray.svc.cluster.local:6379 > /dev/null 2>&1; then
                                  echo "GCS is ready."
                                  break
                                fi
                                echo "$SECONDS seconds elapsed: Waiting for GCS to be ready."
                              else
                                if ray health-check --address raycluster-kuberay-head-svc.ray.svc.cluster.local:6379; then
                                  echo "GCS is ready. Any error messages above can be safely ignored."
                                  break
                                fi
                                echo "$SECONDS seconds elapsed: Still waiting for GCS to be ready. For troubleshooting, refer to the FAQ at https://github.com/ray-project/kuberay/blob/master/docs/guidance/FAQ.md."
                              fi
                              sleep 5
                            done
                          
    State:          Waiting
      Reason:       ImagePullBackOff
    Ready:          False
    Restart Count:  0
    Limits:
      cpu:     200m
      memory:  256Mi
    Requests:
      cpu:     200m
      memory:  256Mi
    Environment:
      FQ_RAY_IP:  raycluster-kuberay-head-svc.ray.svc.cluster.local
      RAY_IP:     raycluster-kuberay-head-svc
    Mounts:
      /tmp/ray from log-volume (rw)
      /var/run/secrets/kubernetes.io/serviceaccount from kube-api-access-pftj7 (ro)
Containers:
  ray-worker:
    Container ID:  
    Image:         localhost:7439/rayproject/ray:2.38.0-py311-gpu
    Image ID:      
    Port:          8080/TCP
    Host Port:     0/TCP
    Command:
      /bin/bash
      -lc
      --
    Args:
      ulimit -n 65536; ray start  --num-cpus=1  --memory=5000000000  --num-gpus=1  --address=raycluster-kuberay-head-svc.ray.svc.cluster.local:6379  --metrics-export-port=8080  --block  --dashboard-agent-listen-port=52365 
    State:          Waiting
      Reason:       PodInitializing
    Ready:          False
    Restart Count:  0
    Limits:
      cpu:             1
      memory:          5G
      nvidia.com/gpu:  1
    Requests:
      cpu:             1
      memory:          1G
      nvidia.com/gpu:  1
    Liveness:          exec [bash -c wget -T 2 -q -O- http://localhost:52365/api/local_raylet_healthz | grep success] delay=30s timeout=2s period=5s #success=1 #failure=120
    Readiness:         exec [bash -c wget -T 2 -q -O- http://localhost:52365/api/local_raylet_healthz | grep success] delay=10s timeout=2s period=5s #success=1 #failure=10
    Environment:
      FQ_RAY_IP:                            raycluster-kuberay-head-svc.ray.svc.cluster.local
      RAY_IP:                               raycluster-kuberay-head-svc
      RAY_CLUSTER_NAME:                      (v1:metadata.labels['ray.io/cluster'])
      RAY_CLOUD_INSTANCE_ID:                raycluster-kuberay-workergroup-worker-69rsq (v1:metadata.name)
      RAY_NODE_TYPE_NAME:                    (v1:metadata.labels['ray.io/group'])
      KUBERAY_GEN_RAY_START_CMD:            ray start  --num-cpus=1  --memory=5000000000  --num-gpus=1  --address=raycluster-kuberay-head-svc.ray.svc.cluster.local:6379  --metrics-export-port=8080  --block  --dashboard-agent-listen-port=52365 
      RAY_PORT:                             6379
      RAY_ADDRESS:                          raycluster-kuberay-head-svc.ray.svc.cluster.local:6379
      RAY_USAGE_STATS_KUBERAY_IN_USE:       1
      REDIS_PASSWORD:                       
      RAY_DASHBOARD_ENABLE_K8S_DISK_USAGE:  1
    Mounts:
      /dev/shm from shared-mem (rw)
      /tmp/ray from log-volume (rw)
      /var/run/secrets/kubernetes.io/serviceaccount from kube-api-access-pftj7 (ro)
Conditions:
  Type                        Status
  PodReadyToStartContainers   True 
  Initialized                 False 
  Ready                       False 
  ContainersReady             False 
  PodScheduled                True 
Volumes:
  log-volume:
    Type:       EmptyDir (a temporary directory that shares a pod's lifetime)
    Medium:     
    SizeLimit:  <unset>
  shared-mem:
    Type:       EmptyDir (a temporary directory that shares a pod's lifetime)
    Medium:     Memory
    SizeLimit:  1G
  kube-api-access-pftj7:
    Type:                    Projected (a volume that contains injected data from multiple sources)
    TokenExpirationSeconds:  3607
    ConfigMapName:           kube-root-ca.crt
    ConfigMapOptional:       <nil>
    DownwardAPI:             true
QoS Class:                   Burstable
Node-Selectors:              kubernetes.io/hostname=rtx
Tolerations:                 node.kubernetes.io/not-ready:NoExecute op=Exists for 300s
                             node.kubernetes.io/unreachable:NoExecute op=Exists for 300s
Events:
  Type     Reason     Age                  From               Message
  ----     ------     ----                 ----               -------
  Normal   Scheduled  27m                  default-scheduler  Successfully assigned ray/raycluster-kuberay-workergroup-worker-69rsq to rtx
  Warning  Failed     26m                  kubelet            Failed to pull image "localhost:7439/rayproject/ray:2.38.0-py311-gpu": failed to pull and unpack image "localhost:7439/rayproject/ray:2.38.0-py311-gpu": failed to copy: httpReadSeeker: failed open: unexpected status code https://localhost:7439/v2/rayproject/ray/blobs/sha256:4bc954eb910af405a9ee95a0d504f14370aa97e028f19f50779604c01b4ea00b: 401 Unauthorized
  Warning  Failed     24m                  kubelet            Failed to pull image "localhost:7439/rayproject/ray:2.38.0-py311-gpu": failed to pull and unpack image "localhost:7439/rayproject/ray:2.38.0-py311-gpu": failed to copy: httpReadSeeker: failed open: unexpected status code https://localhost:7439/v2/rayproject/ray/blobs/sha256:a14a8a8a6ebc3813d37a448205bf2c059e7b0dde5dda741babfffc327f32638c: 401 Unauthorized
  Normal   Pulling    24m (x3 over 27m)    kubelet            Pulling image "localhost:7439/rayproject/ray:2.38.0-py311-gpu"
  Warning  Failed     21m (x4 over 26m)    kubelet            Error: ErrImagePull
  Warning  Failed     21m                  kubelet            Failed to pull image "localhost:7439/rayproject/ray:2.38.0-py311-gpu": failed to pull and unpack image "localhost:7439/rayproject/ray:2.38.0-py311-gpu": failed to resolve reference "localhost:7439/rayproject/ray:2.38.0-py311-gpu": unexpected status from HEAD request to http://localhost:7439/v2/rayproject/ray/manifests/2.38.0-py311-gpu: 504 Gateway Timeout
  Warning  Failed     20m (x8 over 26m)    kubelet            Error: ImagePullBackOff
  Normal   BackOff    111s (x70 over 26m)  kubelet            Back-off pulling image "localhost:7439/rayproject/ray:2.38.0-py311-gpu"

These errors show up in the events of the pod.

There are no relevant logs in kube-image-keeper itself.
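If it helps, a hedged sketch of the commands that could gather kuik-side context for these 401/504 pull errors (namespace, label selector and CRD name are assumptions and may need adjusting to the actual install):

# Logs from all kube-image-keeper components (proxy, controllers, registry)
kubectl logs -n kuik-system -l app.kubernetes.io/name=kube-image-keeper --all-containers --prefix --since=1h
# Check whether the Ray image was actually cached
kubectl get cachedimages.kuik.enix.io | grep rayproject/ray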
