-
Is this normal for RKE2 Restoring a Snapshot to Existing Nodes #6261
Replies: 2 comments · 9 replies
-
You should see the etcd pod start up after a short period of time. Can you attach the full logs from the restore, instead of just a screenshot of the last few errors? You might also try running rke2-killall.sh prior to restoring.
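A rough sketch of that sequence on the server node being restored (the snapshot path is a placeholder, substitute your own):
# stop the service, then clean up any processes and mounts it left behind
systemctl stop rke2-server
rke2-killall.sh
# run the reset/restore against the snapshot you want
rke2 server --cluster-reset --cluster-reset-restore-path=/var/lib/rancher/rke2/server/db/snapshots/<snapshot-name>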
-
[root@rke21 snapshots]# rke2 server --cluster-reset --cluster-reset-restore-path=/var/lib/rancher/rke2/server/db/snapshots/etcd-snapshot-rke21-1719435602
WARN[0000] Unknown flag --containerd-config-file found in config.yaml, skipping
WARN[0000] Unknown flag --use-service-account-credentials found in config.yaml, skipping
WARN[0000] not running in CIS mode
INFO[0000] Applying Pod Security Admission Configuration
INFO[0000] Static pod cleanup in progress
INFO[0000] Logging temporary containerd to /var/lib/rancher/rke2/agent/containerd/containerd.log
INFO[0000] Running temporary containerd /var/lib/rancher/rke2/bin/containerd -c /var/lib/rancher/rke2/agent/etc/containerd/config.toml -a /run/k3s/containerd/containerd.sock --state /run/k3s/containerd --root /var/lib/rancher/rke2/agent/containerd
INFO[0010] Removing pod etcd-rke21
INFO[0010] Removing pod kube-apiserver-rke21
INFO[0020] Static pod cleanup completed successfully
INFO[0020] Starting rke2 v1.30.0+rke2r1 (60e06c4dbccff996f717af8f4c532971f57264b4)
INFO[0020] Managed etcd cluster bootstrap already complete and initialized
INFO[0020] Pre-restore etcd database moved to /var/lib/rancher/rke2/server/db/etcd-old-1719477870
{"level":"info","ts":"2024-06-27T11:44:30.585987+0300","caller":"snapshot/v3_snapshot.go:248","msg":"restoring snapshot","path":"/var/lib/rancher/rke2/server/db/snapshots/etcd-snapshot-rke21-1719435602","wal-dir":"/var/lib/rancher/rke2/server/db/etcd/member/wal","data-dir":"/var/lib/rancher/rke2/server/db/etcd","snap-dir":"/var/lib/rancher/rke2/server/db/etcd/member/snap","stack":"go.etcd.io/etcd/etcdutl/v3/snapshot.(*v3Manager).Restore\n\t/go/pkg/mod/github.com/k3s-io/etcd/etcdutl/[email protected]/snapshot/v3_snapshot.go:254\ngithub.com/k3s-io/k3s/pkg/etcd.(*ETCD).Restore\n\t/go/pkg/mod/github.com/k3s-io/[email protected]/pkg/etcd/etcd.go:1486\ngithub.com/k3s-io/k3s/pkg/etcd.(*ETCD).Reset\n\t/go/pkg/mod/github.com/k3s-io/[email protected]/pkg/etcd/etcd.go:410\ngithub.com/k3s-io/k3s/pkg/cluster.(*Cluster).start\n\t/go/pkg/mod/github.com/k3s-io/[email protected]/pkg/cluster/managed.go:71\ngithub.com/k3s-io/k3s/pkg/cluster.(*Cluster).Start\n\t/go/pkg/mod/github.com/k3s-io/[email protected]/pkg/cluster/cluster.go:91\ngithub.com/k3s-io/k3s/pkg/daemons/control.prepare\n\t/go/pkg/mod/github.com/k3s-io/[email protected]/pkg/daemons/control/server.go:297\ngithub.com/k3s-io/k3s/pkg/daemons/control.Server\n\t/go/pkg/mod/github.com/k3s-io/[email protected]/pkg/daemons/control/server.go:36\ngithub.com/k3s-io/k3s/pkg/server.StartServer\n\t/go/pkg/mod/github.com/k3s-io/[email protected]/pkg/server/server.go:56\ngithub.com/k3s-io/k3s/pkg/cli/server.run\n\t/go/pkg/mod/github.com/k3s-io/[email protected]/pkg/cli/server/server.go:495\ngithub.com/k3s-io/k3s/pkg/cli/server.RunWithControllers\n\t/go/pkg/mod/github.com/k3s-io/[email protected]/pkg/cli/server/server.go:48\ngithub.com/rancher/rke2/pkg/rke2.Server\n\t/source/pkg/rke2/rke2.go:123\ngithub.com/rancher/rke2/pkg/cli/cmds.ServerRun\n\t/source/pkg/cli/cmds/server.go:167\ngithub.com/urfave/cli.HandleAction\n\t/go/pkg/mod/github.com/urfave/[email protected]/app.go:524\ngithub.com/urfave/cli.Command.Run\n\t/go/pkg/mod/github.com/urfave/[email protected]/command.go:175\ngithub.com/urfave/cli.(*App).Run\n\t/go/pkg/mod/github.com/urfave/[email protected]/app.go:277\nmain.main\n\t/source/main.go:23\nruntime.main\n\t/usr/local/go/src/runtime/proc.go:271"}
{"level":"info","ts":"2024-06-27T11:44:30.816859+0300","caller":"membership/store.go:141","msg":"Trimming membership information from the backend..."}
{"level":"info","ts":"2024-06-27T11:44:30.838616+0300","caller":"membership/cluster.go:421","msg":"added member","cluster-id":"6300d4a278043cb","local-member-id":"0","added-peer-id":"86e02918fafb02ec","added-peer-peer-urls":["https://10.0.0.2:2380"]}
{"level":"info","ts":"2024-06-27T11:44:30.861646+0300","caller":"snapshot/v3_snapshot.go:269","msg":"restored snapshot","path":"/var/lib/rancher/rke2/server/db/snapshots/etcd-snapshot-rke21-1719435602","wal-dir":"/var/lib/rancher/rke2/server/db/etcd/member/wal","data-dir":"/var/lib/rancher/rke2/server/db/etcd","snap-dir":"/var/lib/rancher/rke2/server/db/etcd/member/snap"}
INFO[0020] Starting etcd for new cluster, cluster-reset=true
INFO[0020] Server node token is available at /var/lib/rancher/rke2/server/token
INFO[0020] Waiting for cri connection: rpc error: code = Unavailable desc = connection error: desc = "transport: Error while dialing: dial unix /run/k3s/containerd/containerd.sock: connect: connection refused"
INFO[0020] To join server node to cluster: rke2 server -s https://192.168.8.79:9345 -t ${SERVER_NODE_TOKEN}
INFO[0020] Agent node token is available at /var/lib/rancher/rke2/server/agent-token
INFO[0020] To join agent node to cluster: rke2 agent -s https://192.168.8.79:9345 -t ${AGENT_NODE_TOKEN}
INFO[0020] Wrote kubeconfig /etc/rancher/rke2/rke2.yaml
ERRO[0020] Failed to set /etc/rancher/rke2/rke2.yaml to mode ----------: strconv.ParseInt: parsing "384": invalid syntax
INFO[0020] Run: rke2 kubectl
WARN[0020] remove /var/lib/rancher/rke2/agent/etc/rke2-agent-load-balancer.json: no such file or directory
INFO[0020] Adding server to load balancer rke2-agent-load-balancer: 127.0.0.1:9345
INFO[0020] Running load balancer rke2-agent-load-balancer 127.0.0.1:6444 -> [127.0.0.1:9345] [default: 127.0.0.1:9345]
INFO[0020] Running load balancer rke2-api-server-agent-load-balancer 127.0.0.1:6443 -> [] [default: ]
INFO[0020] Handling backend connection request [rke2agent1]
INFO[0021] Handling backend connection request [rke2agent2]
INFO[0021] Password verified locally for node rke21
INFO[0021] certificate CN=rke21 signed by CN=rke2-server-ca@1715943396: notBefore=2024-05-17 10:56:36 +0000 UTC notAfter=2025-06-27 08:44:32 +0000 UTC
INFO[0022] certificate CN=system:node:rke21,O=system:nodes signed by CN=rke2-client-ca@1715943396: notBefore=2024-05-17 10:56:36 +0000 UTC notAfter=2025-06-27 08:44:32 +0000 UTC
INFO[0022] Using private registry config file at /etc/rancher/rke2/registries.yaml
INFO[0022] Module overlay was already loaded
INFO[0022] Module nf_conntrack was already loaded
INFO[0022] Module br_netfilter was already loaded
INFO[0022] Module iptable_nat was already loaded
INFO[0022] Module iptable_filter was already loaded
INFO[0022] Runtime image 10.0.0.10:5000/rancher/rke2-runtime:v1.30.0-rke2r1 bin and charts directories already exist; skipping extract
INFO[0022] Updated manifest /var/lib/rancher/rke2/server/manifests/harvester-csi-driver.yaml to set cluster configuration values
INFO[0022] Updated manifest /var/lib/rancher/rke2/server/manifests/rke2-calico-crd.yaml to set cluster configuration values
INFO[0022] Updated manifest /var/lib/rancher/rke2/server/manifests/rke2-ingress-nginx.yaml to set cluster configuration values
INFO[0022] Updated manifest /var/lib/rancher/rke2/server/manifests/rke2-snapshot-validation-webhook.yaml to set cluster configuration values
INFO[0022] Updated manifest /var/lib/rancher/rke2/server/manifests/harvester-cloud-provider.yaml to set cluster configuration values
INFO[0022] Updated manifest /var/lib/rancher/rke2/server/manifests/rke2-calico.yaml to set cluster configuration values
INFO[0022] Updated manifest /var/lib/rancher/rke2/server/manifests/rke2-coredns.yaml to set cluster configuration values
INFO[0022] Updated manifest /var/lib/rancher/rke2/server/manifests/rke2-metrics-server.yaml to set cluster configuration values
INFO[0022] Updated manifest /var/lib/rancher/rke2/server/manifests/rancher-vsphere-csi.yaml to set cluster configuration values
INFO[0022] Updated manifest /var/lib/rancher/rke2/server/manifests/rke2-canal.yaml to set cluster configuration values
INFO[0022] No cluster configuration value changes necessary for manifest /var/lib/rancher/rke2/server/manifests/rke2-ingress-nginx-config.yaml
INFO[0022] Updated manifest /var/lib/rancher/rke2/server/manifests/rke2-multus.yaml to set cluster configuration values
INFO[0022] Updated manifest /var/lib/rancher/rke2/server/manifests/rke2-snapshot-controller.yaml to set cluster configuration values
INFO[0022] Updated manifest /var/lib/rancher/rke2/server/manifests/rancher-vsphere-cpi.yaml to set cluster configuration values
INFO[0022] Updated manifest /var/lib/rancher/rke2/server/manifests/rke2-cilium.yaml to set cluster configuration values
INFO[0022] Updated manifest /var/lib/rancher/rke2/server/manifests/rke2-flannel.yaml to set cluster configuration values
INFO[0022] Updated manifest /var/lib/rancher/rke2/server/manifests/rke2-snapshot-controller-crd.yaml to set cluster configuration values
INFO[0022] Using containerd template at /var/lib/rancher/rke2/agent/etc/containerd/config.toml.tmpl
INFO[0022] Logging containerd to /var/lib/rancher/rke2/agent/containerd/containerd.log
INFO[0022] Running containerd -c /var/lib/rancher/rke2/agent/etc/containerd/config.toml -a /run/k3s/containerd/containerd.sock --state /run/k3s/containerd --root /var/lib/rancher/rke2/agent/containerd
INFO[0023] Waiting for containerd startup: rpc error: code = Unavailable desc = connection error: desc = "transport: Error while dialing: dial unix /run/k3s/containerd/containerd.sock: connect: connection refused"
INFO[0024] containerd is now running
INFO[0024] Importing images from /var/lib/rancher/rke2/agent/images/rke2-images.linux-amd64.tar.zst
{"level":"warn","ts":"2024-06-27T11:44:36.168221+0300","logger":"etcd-client","caller":"[email protected]/retry_interceptor.go:62","msg":"retrying of unary invoker failed","target":"etcd-endpoints://0xc00097c000/127.0.0.1:2379","attempt":0,"error":"rpc error: code = DeadlineExceeded desc = latest balancer error: last connection error: connection error: desc = \"transport: Error while dialing: dial tcp 127.0.0.1:2379: connect: connection refused\""}
WARN[0025] Failed to get apiserver address from etcd: context deadline exceeded
{"level":"warn","ts":"2024-06-27T11:44:41.168497+0300","logger":"etcd-client","caller":"[email protected]/retry_interceptor.go:62","msg":"retrying of unary invoker failed","target":"etcd-endpoints://0xc00103b880/127.0.0.1:2379","attempt":0,"error":"rpc error: code = DeadlineExceeded desc = latest balancer error: last connection error: connection error: desc = \"transport: Error while dialing: dial tcp 127.0.0.1:2379: connect: connection refused\""}
WARN[0030] Failed to get apiserver address from etcd: context deadline exceeded
{"level":"warn","ts":"2024-06-27T11:44:46.168755+0300","logger":"etcd-client","caller":"[email protected]/retry_interceptor.go:62","msg":"retrying of unary invoker failed","target":"etcd-endpoints://0xc000bf1500/127.0.0.1:2379","attempt":0,"error":"rpc error: code = DeadlineExceeded desc = latest balancer error: last connection error: connection error: desc = \"transport: Error while dialing: dial tcp 127.0.0.1:2379: connect: connection refused\""}
WARN[0035] Failed to get apiserver address from etcd: context deadline exceeded
INFO[0040] Pod for etcd not synced (pod sandbox not found), retrying
{"level":"warn","ts":"2024-06-27T11:44:51.169411+0300","logger":"etcd-client","caller":"[email protected]/retry_interceptor.go:62","msg":"retrying of unary invoker failed","target":"etcd-endpoints://0xc0008f6000/127.0.0.1:2379","attempt":0,"error":"rpc error: code = DeadlineExceeded desc = latest balancer error: last connection error: connection error: desc = \"transport: Error while dialing: dial tcp 127.0.0.1:2379: connect: connection refused\""}
WARN[0040] Failed to get apiserver address from etcd: context deadline exceeded
{"level":"warn","ts":"2024-06-27T11:44:56.172243+0300","logger":"etcd-client","caller":"[email protected]/retry_interceptor.go:62","msg":"retrying of unary invoker failed","target":"etcd-endpoints://0xc0008aa000/127.0.0.1:2379","attempt":0,"error":"rpc error: code = DeadlineExceeded desc = latest balancer error: last connection error: connection error: desc = \"transport: Error while dialing: dial tcp 127.0.0.1:2379: connect: connection refused\""}
WARN[0045] Failed to get apiserver address from etcd: context deadline exceeded
{"level":"warn","ts":"2024-06-27T11:45:00.867443+0300","logger":"etcd-client","caller":"[email protected]/retry_interceptor.go:62","msg":"retrying of unary invoker failed","target":"etcd-endpoints://0xc000503a40/127.0.0.1:2379","attempt":0,"error":"rpc error: code = DeadlineExceeded desc = latest balancer error: last connection error: connection error: desc = \"transport: Error while dialing: dial tcp 127.0.0.1:2379: connect: connection refused\""}
INFO[0050] Failed to test data store connection: context deadline exceeded
INFO[0050] Waiting for etcd server to become available
-
@brandond I have waited about 15 minutes and the etcd server still has not become available.
-
@brandond I am sharing the kube-apiserver logs too.
-
@brandond This is my server config.yaml for the rke2 servers:
cat << EOF >> /etc/rancher/rke2/config.yaml
secrets-encryption: true
token: fuzzybunnyslippers
tls-san:
- rancher.exemple.com
node-ip: "10.0.0.2"
write-kubeconfig-mode: 0600
etcd-expose-metrics: true
use-service-account-credentials: true
kube-controller-manager-arg:
- bind-address=127.0.0.1
- use-service-account-credentials=true
- tls-min-version=VersionTLS12
- tls-cipher-suites=TLS_ECDHE_ECDSA_WITH_AES_128_GCM_SHA256,TLS_ECDHE_RSA_WITH_AES_128_GCM_SHA256,TLS_ECDHE_ECDSA_WITH_CHACHA20_POLY1305,TLS_ECDHE_RSA_WITH_AES_256_GCM_SHA384,TLS_ECDHE_RSA_WITH_CHACHA20_POLY1305,TLS_ECDHE_ECDSA_WITH_AES_256_GCM_SHA384
kube-scheduler-arg:
- tls-min-version=VersionTLS12
- tls-cipher-suites=TLS_ECDHE_ECDSA_WITH_AES_128_GCM_SHA256,TLS_ECDHE_RSA_WITH_AES_128_GCM_SHA256,TLS_ECDHE_ECDSA_WITH_CHACHA20_POLY1305,TLS_ECDHE_RSA_WITH_AES_256_GCM_SHA384,TLS_ECDHE_RSA_WITH_CHACHA20_POLY1305,TLS_ECDHE_ECDSA_WITH_AES_256_GCM_SHA384
kube-apiserver-arg:
- tls-min-version=VersionTLS12
- tls-cipher-suites=TLS_ECDHE_ECDSA_WITH_AES_128_GCM_SHA256,TLS_ECDHE_RSA_WITH_AES_128_GCM_SHA256,TLS_ECDHE_ECDSA_WITH_CHACHA20_POLY1305,TLS_ECDHE_RSA_WITH_AES_256_GCM_SHA384,TLS_ECDHE_RSA_WITH_CHACHA20_POLY1305,TLS_ECDHE_ECDSA_WITH_AES_256_GCM_SHA384
- authorization-mode=RBAC,Node
- anonymous-auth=false
- audit-policy-file=/etc/rancher/rke2/audit-policy.yaml
- audit-log-mode=blocking-strict
- audit-log-maxage=30
- audit-log-path=/var/lib/rancher/rke2/server/logs/audit.log
kubelet-arg:
- protect-kernel-defaults=true
- read-only-port=0
- authorization-mode=Webhook
- streaming-connection-idle-timeout=5m
EOF
### Configure RKE2 Audit Policy
cat << EOF >> /etc/rancher/rke2/audit-policy.yaml
apiVersion: audit.k8s.io/v1
kind: Policy
metadata:
name: rke2-audit-policy
rules:
- level: Metadata
resources:
- group: ""
resources: ["secrets"]
- level: RequestResponse
resources:
- group: ""
resources: ["*"]
EOF
cat << EOF >> /etc/rancher/rke2/registries.yaml
mirrors:
docker.io:
endpoint:
- "http://10.0.0.10:5000"
EOF
-
You need to quote the write-kubeconfig-mode: 0600 value to make it a string, not an octal integer. If you fix that, do things work better?
There should be additional messages in the log when the import finishes processing this file, assuming it actually contains any images. Is something going on that is causing this import to fail or take an exceptionally long time to complete? You might check the log at
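For example, the corrected line in /etc/rancher/rke2/config.yaml (a sketch of just that one setting):
write-kubeconfig-mode: "0600"
The containerd log path that appears earlier in the restore output can be followed while the import is running, e.g.:
tail -f /var/lib/rancher/rke2/agent/containerd/containerd.log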
-
containerdlogs.txt
-
@brandond This is the log from --cluster-reset and restoring from the etcd snapshot, thank you for your interest.
-
What is going on with this node that it takes over a minute just to import the core images? This should take a few seconds at most. It seems like other things that normally occur quickly also take much longer. What is the disk/cpu/memory situation on this host?
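For example, a rough picture from standard tools while the restore is running (install sysstat if iostat is missing):
free -m                  # memory and swap usage
df -h /var/lib/rancher   # free space on the rke2 data volume
top -b -n 1 | head -20   # current cpu load and the busiest processes
iostat -x 5 3            # disk utilization and latency over a short window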
-
@brandond thank you very much for your help. In my opinion it may be related to Longhorn. I have 2 clusters.
-
RKE2 Restoring a Snapshot to Existing Nodes: I applied the commands below. The rke2 version is v1.30.0+rke2r1.
rke2 etcd-snapshot save --name before-delete-deployment
rke2 etcd-snapshot ls
systemctl stop rke2-server   # for all server nodes
rke2 server --cluster-reset --cluster-reset-restore-path=/var/lib/rancher/rke2/server/db/snapshots/before-delete-deployment-rke21-1719250944
I am getting these errors after applying the cluster-reset command.