Need better conntrack management for UDP services (especially DNS) #127259
This issue is currently awaiting triage. If a SIG or subproject determines this is a relevant issue, they will accept it by applying the triage/accepted label. Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes-sigs/prow repository.
/sig network
@matteriben do you have any network policy in your cluster?
This was tested using only the defaults with |
@matteriben do we know for sure whether the problem is that coredns is not able to answer, or whether there is a networking problem dropping packets? https://kubernetes.io/docs/tasks/administer-cluster/dns-debugging-resolution/#are-dns-queries-being-received-processed
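For reference, the linked page suggests checks along these lines (the dnsutils test pod and the kube-dns label come from that documentation; adjust names to your cluster):

```sh
# Check whether queries resolve at all from a test pod (per the linked doc).
kubectl exec -ti dnsutils -- nslookup kubernetes.default

# Turn on CoreDNS query logging by adding the "log" plugin to the Corefile,
# then check the CoreDNS pods to see whether the test queries actually arrive.
kubectl -n kube-system edit configmap coredns
kubectl -n kube-system logs -l k8s-app=kube-dns
```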
@aojea
I would take some metrics, especially the ones for conntrack and the UDP stats from netstat; the prometheus node_exporter gives them by default, last time I checked: https://github.com/prometheus/node_exporter. You have an example of deploying the Prometheus stack in kind here https://github.com/aojea/kind-networking-plugins/tree/main/monitoring, and a more involved example here https://gist.github.com/aojea/f9ca1a51e2afd03621744c95bfdab5b8. I would look at the conntrack and UDP error counters.
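For example, assuming a default node_exporter deployment listening on port 9100, these are the counters of interest (metric names are node_exporter defaults; adjust the endpoint to your setup):

```sh
# Conntrack table usage vs. its limit, plus UDP error counters from netstat.
curl -s http://localhost:9100/metrics | grep -E \
  'node_nf_conntrack_entries(_limit)?|node_netstat_Udp_(InErrors|NoPorts|RcvbufErrors|SndbufErrors)'
```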
I ran the node_exporter following the example for kind, I'm able to access the web-ui for it, but I'm not sure how to find those metrics in it. I can see dropped packets reported on the node though. dmesg:
conntrack:
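The checks behind those two snippets, for anyone reproducing this (illustrative commands; assumes conntrack-tools is installed on the node):

```sh
# Kernel log: "nf_conntrack: table full, dropping packet" is the telltale sign.
dmesg | grep -i conntrack

# Current entry count vs. the configured ceiling.
cat /proc/sys/net/netfilter/nf_conntrack_count /proc/sys/net/netfilter/nf_conntrack_max

# Per-CPU conntrack stats; non-zero drop/early_drop columns indicate pressure.
conntrack -S
```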
I see, so it seems you are saturating the conntrack table; there is a lot of literature about it. Check your actual values and update accordingly, these are mine:
https://adil.medium.com/how-to-fix-conntrack-table-full-dropping-packets-in-kubernetes-07f561a432aa
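A minimal sketch of the kind of tuning the article describes (values below are placeholders, not recommendations, and not the values referred to above; note also that kube-proxy manages nf_conntrack_max through its own conntrack settings, so it is usually better to adjust it there rather than behind kube-proxy's back):

```sh
# Raise the conntrack table ceiling (placeholder value; size to RAM and workload).
sysctl -w net.netfilter.nf_conntrack_max=262144

# Optionally grow the hash table to match (a common rule of thumb is max / 4).
echo 65536 > /sys/module/nf_conntrack/parameters/hashsize

# Shorten UDP entry lifetimes so short-lived flows such as DNS are reaped sooner.
sysctl -w net.netfilter.nf_conntrack_udp_timeout=10
sysctl -w net.netfilter.nf_conntrack_udp_timeout_stream=60
```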
This seems like an opportunity to do better with conntrack - either automatically or with some API. For example, we know DNS/UDP will only get one response (I think?) per request, so maybe we can tell the kernel it's OK to clean up conntrack records after 1 reply? We know there are some send-only UDP protocols (logs, metrics) so maybe we could even avoid conntrack entirely for those? It's not clear if we can do it automatically (e.g. any time we see port 53 on UDP assume it is DNS) or need some API clue (e.g. appProtocol) or some more direct API (e.g. a new service.spec.ports[].udpTimeout). @danwinship since a lot of the mechanism would fall into your turf :) Should we open a different issue to discuss that? Edit: We could also reopen the older NOTRACK issue, to mitigate some pointless use of conntrack (e.g. DNS to upstream resolvers)
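To make the NOTRACK idea concrete, the kind of raw-table rule being discussed looks like the following. This is purely illustrative (nothing kube-proxy installs today); a blanket rule like this would break DNAT to a DNS ClusterIP, since NAT depends on conntrack, so in practice it only fits traffic that is not NATted (e.g. queries to upstream resolvers or a node-local cache):

```sh
# Skip connection tracking for DNS in the raw table, which runs before conntrack.
iptables -t raw -A PREROUTING -p udp --dport 53 -j NOTRACK
iptables -t raw -A OUTPUT     -p udp --dport 53 -j NOTRACK
```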
53 is an IANA reserved port; you really shouldn't be using it for anything other than DNS. (And so I don't think
I don't think we want to get that precise in the API; that immediately reminds me of
Is there a specific bug you're thinking of? I didn't find it in a quick search. FTR, I was confused about what OpenShift does; we add NOTRACK rules for vxlan/geneve, but not for DNS. (Maybe there used to be DNS rules?) It appears that NodeLocalDNS uses NOTRACK, though I'd also really like to kill-and-replace NodeLocalDNS, because it's basically a bad reimplementation of the Service API using link-local IPv4 IPs in a non-portable way.
NOTRACK: #109672
The conntrack table is used by the NAT functionality; the statement is about local and non-NATted traffic.
This ... I think the problem here is DNS. I do think DNS traffic should be special and resolved within the node, with NOTRACK added there. I started playing this week with a local dns cache aojea/kindnet#64, it is very simple and lightweight ... is there interest in adding this option as a kubernetes-sigs project?
It's not realistic to expect EVERYONE to have node-local DNS resolution. It was never a requirement before. So I think this becomes a series of smaller problems:
Should we retitle this to cover the second? |
Discussed briefly at sig-net today. Retitling this issue. We all seem to agree that we SHOULD do better wrt conntrack. That might include NOTRACK (#109672 has context) for connections that provably do not need it. That might include auto-configuration of conntrack to have faster timeouts for some classes of traffic (e.g. UDP/53). That might include explicit config of service ports (e.g. add API surface to describe "connection" semantics). To make forward progress on this, it probably needs a champion who can drive it, collect ideas, POC some things, and see what's practical and what's not.
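As a sketch of the "faster timeouts for some classes of traffic" option, nftables ct timeout objects (kernel >= 4.19) can already express per-port conntrack timeout policies. The table/object names and values below are made up for illustration; this is not something kube-proxy does today:

```sh
# Attach an aggressive UDP timeout policy to port-53 traffic only.
nft -f - <<'EOF'
table ip dns_ct {
    ct timeout dns-udp {
        protocol udp;
        l3proto ip;
        policy = { unreplied: 5, replied: 5 };
    }
    chain prerouting {
        type filter hook prerouting priority mangle; policy accept;
        udp dport 53 ct timeout set "dns-udp"
    }
}
EOF
```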
There are a bunch of use cases for this...
AFAIK if you use DNAT or any sort of NAT in the kernel you need conntrack (see section 4.5 of https://arthurchiao.art/blog/conntrack-design-and-implementation/#3-implementation-netfilter-conntrack), so any implementation of DNS that uses Services will need conntrack with kube-proxy.
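That is easy to see on a node: a pod's query to the DNS ClusterIP shows up as a DNATed UDP conntrack entry, with the original destination being the ClusterIP and the reply source being a CoreDNS pod IP. Assuming conntrack-tools is installed, something like:

```sh
# List UDP conntrack entries for DNS; the DNAT is visible in the mismatch
# between the original destination (the Service ClusterIP) and the reply
# source (a CoreDNS pod IP).
conntrack -L -p udp | grep 'dport=53'
```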
What happened?
I created deployments with hostname, subdomain and headless service. I had the pods query their DNS records and log the results.
It typically took ~30 seconds for name resolution to be correct, though in some cases it could be much faster. Name resolution seems to fail occasionally returning NXDOMAIN. When the pod is deleted and recreated results are similar though the previous A record may also be returned.
I initially ran very short duration tests that stopped after observing a single intermittent failure. I later ran some longer duration tests and observed that these intermittent failures continue to occur and vary in duration.
First are some results of short duration tests after creating resources with reproducer-1.yml
Fairly typical, took ~30 seconds for initial successful response.
Note the intermittent failure ~30 seconds after the pod started:
Note the intermittent failure ~60 seconds after the pod started:
Correct name resolution nearly ten times faster than the typical ~30 seconds.
Next are some short duration examples after deleting the pod created with reproducer-1.yml
Resolved to the previous IP address until ~30 seconds, when name resolution became correct.
In this case name resolution failed briefly ~90 seconds after pod started.
Here name resolution alternated between the old & new address, and failed briefly before correctly providing the new address.
Finally an example of a longer duration test with reproducer-2.yml
Failures continue to occur and there appears to be a failure lasting ~5 seconds from 00:58:06-00:58:11.
What did you expect to happen?
I expected that once name resolution provided the correct address it would not fail intermittently or subsequently return the wrong (old) address.
It would also be nice if name resolution could start working correctly sooner more often, and ideally never return the wrong (old) address.
How can we reproduce it (as minimally and precisely as possible)?
I initially used this configuration and script:
reproducer-1.yml.txt
script-1.sh.txt
The script above stopped after the first failure, to see more failures I used these:
reproducer-2.yml.txt
script-2.sh.txt
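For readers who cannot open the attachments, here is a minimal sketch of the kind of setup and query loop described above. It is not the attached reproducer; the names, image, namespace (default) and interval are made up for illustration:

```sh
# A Deployment whose pod has hostname/subdomain set, plus a matching headless
# Service, with the container resolving its own FQDN in a loop and logging it.
kubectl apply -f - <<'EOF'
apiVersion: v1
kind: Service
metadata:
  name: sub                  # Service name must equal the pod's subdomain
spec:
  clusterIP: None            # headless: gives the pod its own A record
  selector:
    app: dns-repro
  ports:
  - port: 53
    protocol: UDP
---
apiVersion: apps/v1
kind: Deployment
metadata:
  name: dns-repro
spec:
  replicas: 1
  selector:
    matchLabels:
      app: dns-repro
  template:
    metadata:
      labels:
        app: dns-repro
    spec:
      hostname: web-0        # illustrative hostname
      subdomain: sub         # matches the headless Service name above
      containers:
      - name: lookup
        image: busybox:1.36
        # Resolve this pod's own record (web-0.sub.default.svc.cluster.local,
        # assuming the default namespace) once a second and log the result.
        command:
        - sh
        - -c
        - |
          while true; do
            date
            nslookup web-0.sub.default.svc.cluster.local
            sleep 1
          done
EOF
```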
Anything else we need to know?
This may be related to some existing issues.
The mostly ~30 second delay: #92559
The intermittent failures: coredns/coredns#6518
Assuming I need to work around the intermittent failure, are there any downsides to modifying the hosts file directly (not through HostAliases) other than losing changes when the container exits?
Kubernetes version
Cloud provider
OS version
Install tools
Container runtime (CRI) and version (if applicable)
Related plugins (CNI, CSI, ...) and versions (if applicable)