Need better conntrack management for UDP services (especially DNS) #127259
This issue is currently awaiting triage. If a SIG or subproject determines this is a relevant issue, they will accept it by applying the triage/accepted label. Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes-sigs/prow repository.
/sig network
@matteriben do you have any network policy in your cluster?
This was tested using only the defaults with |
@matteriben do we know for sure whether the problem is that coredns is not able to answer, or whether there is a networking problem dropping packets? https://kubernetes.io/docs/tasks/administer-cluster/dns-debugging-resolution/#are-dns-queries-being-received-processed
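For reference, the linked page suggests checks along these lines (the dnsutils test pod and the kube-dns label come from that documentation; adjust names to your cluster):

```sh
# Check whether queries resolve at all from a test pod (per the linked doc).
kubectl exec -ti dnsutils -- nslookup kubernetes.default

# Turn on CoreDNS query logging by adding the "log" plugin to the Corefile,
# then check the CoreDNS pods to see whether the test queries actually arrive.
kubectl -n kube-system edit configmap coredns
kubectl -n kube-system logs -l k8s-app=kube-dns
```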
@aojea
I would take some metrics, especially the ones for conntrack and the UDP stats from netstat; the prometheus node_exporter gives them by default, last time I checked: https://github.com/prometheus/node_exporter. You have an example of deploying the Prometheus stack in kind here https://github.com/aojea/kind-networking-plugins/tree/main/monitoring, and a more involved example here https://gist.github.com/aojea/f9ca1a51e2afd03621744c95bfdab5b8. I would look at the conntrack and UDP error counters.
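For example, assuming a default node_exporter deployment listening on port 9100, these are the counters of interest (metric names are node_exporter defaults; adjust the endpoint to your setup):

```sh
# Conntrack table usage vs. its limit, plus UDP error counters from netstat.
curl -s http://localhost:9100/metrics | grep -E \
  'node_nf_conntrack_entries(_limit)?|node_netstat_Udp_(InErrors|NoPorts|RcvbufErrors|SndbufErrors)'
```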
I ran the node_exporter following the example for kind, I'm able to access the web-ui for it, but I'm not sure how to find those metrics in it. I can see dropped packets reported on the node though. dmesg:
conntrack:
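The checks behind those two snippets, for anyone reproducing this (illustrative commands; assumes conntrack-tools is installed on the node):

```sh
# Kernel log: "nf_conntrack: table full, dropping packet" is the telltale sign.
dmesg | grep -i conntrack

# Current entry count vs. the configured ceiling.
cat /proc/sys/net/netfilter/nf_conntrack_count /proc/sys/net/netfilter/nf_conntrack_max

# Per-CPU conntrack stats; non-zero drop/early_drop columns indicate pressure.
conntrack -S
```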
I see, so it seems you are saturating the conntrack table; there is a lot of literature about it. Check your actual values and update accordingly, these are mine:
https://adil.medium.com/how-to-fix-conntrack-table-full-dropping-packets-in-kubernetes-07f561a432aa
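A minimal sketch of the kind of tuning the article describes (values below are placeholders, not recommendations, and not the values referred to above; note also that kube-proxy manages nf_conntrack_max through its own conntrack settings, so it is usually better to adjust it there rather than behind kube-proxy's back):

```sh
# Raise the conntrack table ceiling (placeholder value; size to RAM and workload).
sysctl -w net.netfilter.nf_conntrack_max=262144

# Optionally grow the hash table to match (a common rule of thumb is max / 4).
echo 65536 > /sys/module/nf_conntrack/parameters/hashsize

# Shorten UDP entry lifetimes so short-lived flows such as DNS are reaped sooner.
sysctl -w net.netfilter.nf_conntrack_udp_timeout=10
sysctl -w net.netfilter.nf_conntrack_udp_timeout_stream=60
```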
This seems like an opportunity to do better with conntrack - either automatically or with some API. For example, we know DNS/UDP will only get one response (I think?) per request, so maybe we can tell the kernel it's OK to clean up conntrack records after 1 reply? We know there are some send-only UDP protocols (logs, metrics) so maybe we could even avoid conntrack entirely for those? It's not clear if we can do it automatically (e.g. any time we see port 53 on UDP assume it is DNS) or need some API clue (e.g. appProtocol) or some more direct API (e.g. a new service.spec.ports[].udpTimeout). @danwinship since a lot of the mechanism would fall into your turf :) Should we open a different issue to discuss that? Edit: We could also reopen the older NOTRACK issue, to mitigate some pointless use of conntrack (e.g. DNS to upstream resolvers)
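To make the NOTRACK idea concrete, the kind of raw-table rule being discussed looks like the following. This is purely illustrative (nothing kube-proxy installs today); a blanket rule like this would break DNAT to a DNS ClusterIP, since NAT depends on conntrack, so in practice it only fits traffic that is not NATted (e.g. queries to upstream resolvers or a node-local cache):

```sh
# Skip connection tracking for DNS in the raw table, which runs before conntrack.
iptables -t raw -A PREROUTING -p udp --dport 53 -j NOTRACK
iptables -t raw -A OUTPUT     -p udp --dport 53 -j NOTRACK
```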
53 is an IANA reserved port; you really shouldn't be using it for anything other than DNS. (And so I don't think
I don't think we want to get that precise in the API; that immediately reminds me of
Is there a specific bug you're thinking of? I didn't find it in a quick search. FTR, I was confused about what OpenShift does; we add NOTRACK rules for vxlan/geneve, but not for DNS. (Maybe there used to be DNS rules?) It appears that NodeLocalDNS uses NOTRACK, though I'd also really like to kill-and-replace NodeLocalDNS, because it's basically a bad reimplementation of the Service API using link-local IPv4 IPs in a non-portable way.
NOTRACK: #109672
The conntrack table is used by the NAT functionality; the statement is about local and non-NATted traffic.
This ... I think the problem here is DNS. I do think DNS traffic should be special and resolved within the node, with NOTRACK added there. I started playing this week with a local dns cache aojea/kindnet#64, it is very simple and lightweight ... is there interest in adding this option as a kubernetes-sigs project?
It's not realistic to expect EVERYONE to have node-local DNS resolution. It was never a requirement before. So I think this becomes a series of smaller problems:
Should we retitle this to cover the second? |
Discussed briefly at sig-net today. Retitling this issue. We all seem to agree that we SHOULD do better wrt conntrack. That might include NOTRACK (#109672 has context) for connections that provably do not need it. That might include auto-configuration of conntrack to have faster timeouts for some classes of traffic (e.g. UDP/53). That might include explicit config of service ports (e.g. add API surface to describe "connection" semantics). To make forward progress on this, it probably needs a champion who can drive it, collect ideas, POC some things, and see what's practical and what's not.
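As a sketch of the "faster timeouts for some classes of traffic" option, nftables ct timeout objects (kernel >= 4.19) can already express per-port conntrack timeout policies. The table/object names and values below are made up for illustration; this is not something kube-proxy does today:

```sh
# Attach an aggressive UDP timeout policy to port-53 traffic only.
nft -f - <<'EOF'
table ip dns_ct {
    ct timeout dns-udp {
        protocol udp;
        l3proto ip;
        policy = { unreplied: 5, replied: 5 };
    }
    chain prerouting {
        type filter hook prerouting priority mangle; policy accept;
        udp dport 53 ct timeout set "dns-udp"
    }
}
EOF
```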
There are a bunch of use cases for this...
AFAIK if you use DNAT or any sort of NAT in the kernel you need conntrack (see section 4.5 of https://arthurchiao.art/blog/conntrack-design-and-implementation/#3-implementation-netfilter-conntrack), so any implementation of DNS that uses Services will need conntrack with kube-proxy.
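That is easy to see on a node: a pod's query to the DNS ClusterIP shows up as a DNATed UDP conntrack entry, with the original destination being the ClusterIP and the reply source being a CoreDNS pod IP. Assuming conntrack-tools is installed, something like:

```sh
# List UDP conntrack entries for DNS; the DNAT is visible in the mismatch
# between the original destination (the Service ClusterIP) and the reply
# source (a CoreDNS pod IP).
conntrack -L -p udp | grep 'dport=53'
```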
What happened?
I created deployments with hostname, subdomain and headless service. I had the pods query their DNS records and log the results.
It typically took ~30 seconds for name resolution to be correct, though in some cases it could be much faster. Name resolution seems to fail occasionally returning NXDOMAIN. When the pod is deleted and recreated results are similar though the previous A record may also be returned.
I initially ran very short duration tests that stopped after observing a single intermittent failure. I later ran some longer duration tests and observed that these intermittent failures continue to occur and vary in duration.
First are some results of short duration tests after creating resources with reproducer-1.yml
Fairly typical, took ~30 seconds for initial successful response.
Note the intermittent failure ~30 seconds after the pod started:
Note the intermittent failure ~60 seconds after the pod started:
Correct name resolution nearly ten times faster than the typical ~30 seconds.
Next are some short duration examples after deleting the pod created with reproducer-1.yml
Resolved to the previous IP address until ~30 seconds, when name resolution became correct.
In this case name resolution failed briefly ~90 seconds after pod started.
Here name resolution alternated between the old & new address, and failed briefly before correctly providing the new address.
Finally an example of a longer duration test with reproducer-2.yml
Failures continue to occur and there appears to be a failure lasting ~5 seconds from 00:58:06-00:58:11.
What did you expect to happen?
I expected that once name resolution provided the correct address it would not fail intermittently or subsequently return the wrong (old) address.
It would also be nice if name resolution could start working correctly sooner more often, and ideally never return the wrong (old) address.
How can we reproduce it (as minimally and precisely as possible)?
I initially used this configuration and script:
reproducer-1.yml.txt
script-1.sh.txt
The script above stopped after the first failure, to see more failures I used these:
reproducer-2.yml.txt
script-2.sh.txt
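For readers who cannot open the attachments, here is a minimal sketch of the kind of setup and query loop described above. It is not the attached reproducer; the names, image, namespace (default) and interval are made up for illustration:

```sh
# A Deployment whose pod has hostname/subdomain set, plus a matching headless
# Service, with the container resolving its own FQDN in a loop and logging it.
kubectl apply -f - <<'EOF'
apiVersion: v1
kind: Service
metadata:
  name: sub                  # Service name must equal the pod's subdomain
spec:
  clusterIP: None            # headless: gives the pod its own A record
  selector:
    app: dns-repro
  ports:
  - port: 53
    protocol: UDP
---
apiVersion: apps/v1
kind: Deployment
metadata:
  name: dns-repro
spec:
  replicas: 1
  selector:
    matchLabels:
      app: dns-repro
  template:
    metadata:
      labels:
        app: dns-repro
    spec:
      hostname: web-0        # illustrative hostname
      subdomain: sub         # matches the headless Service name above
      containers:
      - name: lookup
        image: busybox:1.36
        # Resolve this pod's own record (web-0.sub.default.svc.cluster.local,
        # assuming the default namespace) once a second and log the result.
        command:
        - sh
        - -c
        - |
          while true; do
            date
            nslookup web-0.sub.default.svc.cluster.local
            sleep 1
          done
EOF
```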
Anything else we need to know?
This may be related to some existing issues.
The mostly ~30 second delay: #92559
The intermittent failures: coredns/coredns#6518
Assuming I need to work around the intermittent failure, are there any downsides to modifying the hosts file directly (not through HostAliases) other than losing changes when the container exits?
Kubernetes version
Cloud provider
OS version
Install tools
Container runtime (CRI) and version (if applicable)
Related plugins (CNI, CSI, ...) and versions (if applicable)