
VPA updater errors with messages ~"fail to get pod controller: pod=kube-scheduler-XYZ err=Unhandled targetRef v1 / Node / XYZ, last error node is not a valid owner" #7378

Open
apilny-akamai opened this issue Oct 10, 2024 · 21 comments
Labels
area/vertical-pod-autoscaler kind/bug triage/needs-information

Comments

@apilny-akamai

apilny-akamai commented Oct 10, 2024

Which component are you using?: vertical-pod-autoscaler

What version of the component are you using?: 1.1.2


What k8s version are you using (kubectl version)?: kubectl 1.25

What did you expect to happen?: The VPA updater should not log errors such as "fail to get pod controller: pod=kube-scheduler-XYZ err=Unhandled targetRef v1 / Node / XYZ, last error node is not a valid owner"

What happened instead?: The vpa-updater log contains

E1010 12:38:44.476232 1 api.go:153] fail to get pod controller: pod=kube-apiserver-x-master-1 err=Unhandled targetRef v1 / Node / x-master-1, last error node is not a valid owner
E1010 12:38:44.477788 1 api.go:153] fail to get pod controller: pod=kube-controller-manager-master-1 err=Unhandled targetRef v1 / Node / x-master-1, last error node is not a valid owner
E1010 12:38:44.547767 1 api.go:153] fail to get pod controller: pod=etcd-x-master-1 err=Unhandled targetRef v1 / Node / x-master-1, last error node is not a valid owner
E1010 12:38:44.554646 1 api.go:153] fail to get pod controller: pod=kube-scheduler-x-master-1 err=Unhandled targetRef v1 / Node / x-master-1, last error node is not a valid owner

How to reproduce it (as minimally and precisely as possible):
Update the VPA from 0.4 to 1.1.2 and observe the vpa-updater log.

Anything else we need to know?: I've also tried updating to 1.2.1 and the error appears in the log again. It did not happen with VPA 0.4. I can see this error message in an already-fixed issue about a panic/SIGSEGV problem, but nowhere else.

kube-controller-manager Pod spec (generated by kubeadm, with only a small patch to the IPs):

spec:
  containers:
  - command:
    - kube-controller-manager
    - --allocate-node-cidrs=true
    - --authentication-kubeconfig=/etc/kubernetes/controller-manager.conf
    - --authorization-kubeconfig=/etc/kubernetes/controller-manager.conf
    - --bind-address=127.0.0.1
    - --client-ca-file=/etc/kubernetes/pki/ca.crt
    - --cloud-provider=external
    - --cluster-cidr=10.1.0.0/16
    - --cluster-name=kubernetes
    - --cluster-signing-cert-file=/etc/kubernetes/pki/ca.crt
    - --cluster-signing-key-file=/etc/kubernetes/pki/ca.key
    - --controllers=*,bootstrapsigner,tokencleaner
    - --kubeconfig=/etc/kubernetes/controller-manager.conf
    - --leader-elect=true
    - --requestheader-client-ca-file=/etc/kubernetes/pki/front-proxy-ca.crt
    - --root-ca-file=/etc/kubernetes/pki/ca.crt
    - --service-account-private-key-file=/etc/kubernetes/pki/sa.key
    - --service-cluster-ip-range=10.254.0.0/16
    - --use-service-account-credentials=true
    image: registry.k8s.io/kube-controller-manager:v1.25.16
    imagePullPolicy: IfNotPresent
    livenessProbe:
      failureThreshold: 8
      httpGet:
        host: 127.0.0.1
        path: /healthz
        port: 10257
        scheme: HTTPS
      initialDelaySeconds: 10
      periodSeconds: 10
      timeoutSeconds: 15
    name: kube-controller-manager
    resources:
      requests:
        cpu: 200m
    startupProbe:
      failureThreshold: 24
      httpGet:
        host: 127.0.0.1
        path: /healthz
        port: 10257
        scheme: HTTPS
      initialDelaySeconds: 10
      periodSeconds: 10
      timeoutSeconds: 15
    volumeMounts:
    - mountPath: /etc/ssl/certs
      name: ca-certs
      readOnly: true
    - mountPath: /etc/ca-certificates
      name: etc-ca-certificates
      readOnly: true
    - mountPath: /etc/pki
      name: etc-pki
      readOnly: true
    - mountPath: /usr/libexec/kubernetes/kubelet-plugins/volume/exec
      name: flexvolume-dir
    - mountPath: /etc/kubernetes/pki
      name: k8s-certs
      readOnly: true
    - mountPath: /etc/kubernetes/controller-manager.conf
      name: kubeconfig
      readOnly: true
    - mountPath: /usr/local/share/ca-certificates
      name: usr-local-share-ca-certificates
      readOnly: true
    - mountPath: /usr/share/ca-certificates
      name: usr-share-ca-certificates
      readOnly: true
  hostNetwork: true
  priority: 2000001000
  priorityClassName: system-node-critical
  securityContext:
    seccompProfile:
      type: RuntimeDefault
  volumes:
  - hostPath:
      path: /etc/ssl/certs
      type: DirectoryOrCreate
    name: ca-certs
  - hostPath:
      path: /etc/ca-certificates
      type: DirectoryOrCreate
    name: etc-ca-certificates
  - hostPath:
      path: /etc/pki
      type: DirectoryOrCreate
    name: etc-pki
  - hostPath:
      path: /usr/libexec/kubernetes/kubelet-plugins/volume/exec
      type: DirectoryOrCreate
    name: flexvolume-dir
  - hostPath:
      path: /etc/kubernetes/pki
      type: DirectoryOrCreate
    name: k8s-certs
  - hostPath:
      path: /etc/kubernetes/controller-manager.conf
      type: FileOrCreate
    name: kubeconfig
  - hostPath:
      path: /usr/local/share/ca-certificates
      type: DirectoryOrCreate
    name: usr-local-share-ca-certificates
  - hostPath:
      path: /usr/share/ca-certificates
      type: DirectoryOrCreate
    name: usr-share-ca-certificates
@apilny-akamai apilny-akamai added the kind/bug label Oct 10, 2024
@apilny-akamai apilny-akamai changed the title VPA updater errors with "fail to get pod controller: pod=kube-scheduler-XYZ err=Unhandled targetRef v1 / Node / XYZ, last error node is not a valid owner" VPA updater errors with messages ~"fail to get pod controller: pod=kube-scheduler-XYZ err=Unhandled targetRef v1 / Node / XYZ, last error node is not a valid owner" Oct 10, 2024
@adrianmoisey
Member

/area vertical-pod-autoscaler

@adrianmoisey
Member

Would it be possible to see the spec of the Pod that this is failing on?
Which variant of Kubernetes are you running this on?

@adrianmoisey
Member

/triage needs-information

@k8s-ci-robot k8s-ci-robot added the triage/needs-information label Oct 10, 2024
@apilny-akamai
Author

We use standard kubeadm, K8s rev v1.25.16. I've updated the description with an example Pod spec.

@adrianmoisey
Member

Hi. It seems like you added the VPA spec. I'm looking for the spec of the Pod kube-controller-manager-master-1

@apilny-akamai
Author

Hi. It seems like you added the VPA spec. I'm looking for the spec of the Pod kube-controller-manager-master-1

Thank you and sorry, fixed in description.

@adrianmoisey
Member

Sorry, I need the metadata too.
I need to see the owner of this Pod, since that is what the VPA seems to be erroring about.

@apilny-akamai
Author

apilny-akamai commented Oct 15, 2024

No problem, here is the metadata:

apiVersion: v1
kind: Pod
metadata:
  creationTimestamp: null
  labels:
    component: kube-controller-manager
    tier: control-plane
  name: kube-controller-manager
  namespace: kube-system

@adrianmoisey
Member

The problem here is that this Pod doesn't have an ownerReferences field.
For example:

$ kubectl get pod local-metrics-server-7d8c48bbd8-v5sp5 -o yaml
apiVersion: v1
kind: Pod
metadata:
  creationTimestamp: "2024-09-26T10:07:15Z"
  generateName: local-metrics-server-7d8c48bbd8-
  labels:
    app.kubernetes.io/instance: local-metrics-server
    app.kubernetes.io/name: metrics-server
    pod-template-hash: 7d8c48bbd8
  name: local-metrics-server-7d8c48bbd8-v5sp5
  namespace: default
  ownerReferences:
  - apiVersion: apps/v1
    blockOwnerDeletion: true
    controller: true
    kind: ReplicaSet
    name: local-metrics-server-7d8c48bbd8
    uid: 4381b7b3-4206-4ece-aab4-f91b3beceb71
  resourceVersion: "570"
  uid: 0281b5a4-d7dc-4b4a-b59e-f561f3207b31

The VPA requires a Pod to have an owner.
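(For illustration only, and not the VPA's actual code path: a minimal Go sketch of how a controlling owner can be looked up with the apimachinery helper metav1.GetControllerOf. A Pod with no ownerReferences entry, like the manifest pasted above, returns nil here.)

package main

import (
    "fmt"

    corev1 "k8s.io/api/core/v1"
    metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
)

func main() {
    pod := &corev1.Pod{} // imagine this Pod was fetched from the API server
    if owner := metav1.GetControllerOf(pod); owner != nil {
        // e.g. "controlled by apps/v1 ReplicaSet local-metrics-server-7d8c48bbd8"
        fmt.Printf("controlled by %s %s %s\n", owner.APIVersion, owner.Kind, owner.Name)
    } else {
        // an empty Pod, or one with no controller ownerReference, lands here
        fmt.Println("no controlling owner found for this Pod")
    }
}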

@adrianmoisey
Member

/close

@k8s-ci-robot
Contributor

@adrianmoisey: Closing this issue.

In response to this:

/close

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes-sigs/prow repository.

@adrianmoisey
Member

/assign

@Michkov

Michkov commented Nov 18, 2024

We are getting this error with static pods:

  - apiVersion: v1
    controller: true
    kind: Node
    name: test-master-1
    uid: ff9885c0-8c3d-4c59-998e-f8aa7213e65f

It's handled in the code here -

if wellKnownController(groupKind.Kind) == node {
    // Some pods specify nodes as their owners. This causes performance problems
    // in big clusters when VPA tries to get all nodes. We know nodes aren't
    // valid controllers so we can skip trying to fetch them.
    return nil, fmt.Errorf("node is not a valid owner")
}

Based on the comment, the node owner is skipped on purpose. In that case the condition could be reported as an info message at a higher log verbosity, or ignored completely; reporting it as an error is confusing.
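(A hypothetical sketch of the suggestion above, not an actual VPA patch: the node-owner case could be exposed as a sentinel error so the caller logs it at a verbose level instead of reporting a failure. The names errNodeIsNotValidOwner and reportOwnerLookup are invented for illustration.)

package main

import (
    "errors"

    "k8s.io/klog/v2"
)

// errNodeIsNotValidOwner is a hypothetical sentinel for the skipped node-owner case.
var errNodeIsNotValidOwner = errors.New("node is not a valid owner")

// reportOwnerLookup logs owner-lookup failures, treating the node-owner case
// as an expected condition rather than an error.
func reportOwnerLookup(pod string, err error) {
    if errors.Is(err, errNodeIsNotValidOwner) {
        // Static pods are "owned" by their Node; skipping them is expected.
        klog.V(4).InfoS("Skipping pod whose owner is a Node", "pod", pod)
        return
    }
    klog.ErrorS(err, "Failed to get pod controller", "pod", pod)
}

func main() {
    reportOwnerLookup("kube-system/kube-scheduler-kind-control-plane", errNodeIsNotValidOwner)
}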

@adrianmoisey
Member

We are getting this error with static pods:

  - apiVersion: v1
    controller: true
    kind: Node
    name: test-master-1
    uid: ff9885c0-8c3d-4c59-998e-f8aa7213e65f

It's handled in the code here -

if wellKnownController(groupKind.Kind) == node {
    // Some pods specify nodes as their owners. This causes performance problems
    // in big clusters when VPA tries to get all nodes. We know nodes aren't
    // valid controllers so we can skip trying to fetch them.
    return nil, fmt.Errorf("node is not a valid owner")
}

Based on the comment, the node owner is skipped on purpose. In that case the condition could be reported as an info message at a higher log verbosity, or ignored completely; reporting it as an error is confusing.

Correct me if I'm wrong, but the error message is only produced when a VPA object exists that targets Pods that are owned by the Node?
If that's the case, I think the error message is valid, since it's saying that there's a problem.

@adrianmoisey
Member

Also, would it be possible for someone to create steps to reproduce this using kind?

@Michkov

Michkov commented Nov 18, 2024

This error is produced when any VPA object exists, even one not pointing to the static pods.

I was unable to reproduce it with kind, but it is easy to reproduce with kubeadm. Example of how to install: https://blog.radwell.codes/2022/07/single-node-kubernetes-cluster-via-kubeadm-on-ubuntu-22-04/ (that guide's kubeadm installation uses old, no-longer-existing repos; use https://v1-30.docs.kubernetes.io/docs/setup/production-environment/tools/kubeadm/install-kubeadm/#installing-kubeadm-kubelet-and-kubectl instead).

@adrianmoisey
Member

/reopen

@k8s-ci-robot
Contributor

@adrianmoisey: Reopened this issue.

In response to this:

/reopen

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes-sigs/prow repository.

@k8s-ci-robot k8s-ci-robot reopened this Nov 18, 2024
@Michkov

Michkov commented Nov 18, 2024

With kubeadm I can see that ownerReference pointing to the Node, but the error is not there. I'm trying to find a reproducer.

@adrianmoisey
Member

I can reproduce it in kind.

  1. Start kind cluster
  2. Apply VPA example hamster.yaml
  3. Delete kube-scheduler-kind-control-plane pod in kube-system namespace

I get the following error in the admission-controller logs:

E1118 13:45:09.044165       1 api.go:153] fail to get pod controller: pod=kube-system/kube-scheduler-kind-control-plane err=Unhandled targetRef v1 / Node / kind-control-plane, last error node is not a valid owner

@adrianmoisey
Member

I agree that this shouldn't be bubbled up as an error.
