
[Bug]: Increased and erratic memory in the Nginx Pods leading to OOM Kills - appears to be introduced by v3.7.x #6860

Open
MarkTopping opened this issue Nov 25, 2024 · 6 comments
Labels
bug (An issue reporting a potential bug), needs triage (An issue that needs to be triaged)

Comments

@MarkTopping

Version

3.7.0

What Kubernetes platforms are you running on?

AKS Azure

Steps to reproduce

I believe that changes in version 3.7.0 or 3.7.1 have introduced a memory consumption issue.

We had to roll back a version bump from v3.6.2 to v3.7.1 today after our Nginx IC Pods all crashed due to OOM Kills. To make matters worse, due to Bug 4604 the Pods then failed to restart without manual intervention, leading to obvious impact.

Our investigation after the outage revealed that memory consumption on the Nginx Pods changed quite dramatically after the release, as shown by the following two charts.

1st Example
In our least used environment we didn't incur any OOM Kills, but today's investigation revealed how memory usage has both increased and become more 'spiky' since we performed the upgrade:

Image

2nd Example
This screenshot shows the IC Pods' memory consumption after a release of v3.7.1 into a busier environment, and the subsequent rollback this morning.

Image

What this graph doesn't capture is that memory went above the 1500MiB line for all Pods in the deployment, and they were therefore OOM Killed. This isn't shown because the metrics are exported every minute, so we only have the last datapoint that happened to be collected before each OOM Kill.
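For completeness, one way to at least record the kills themselves despite the once-a-minute sampling is to alert on the last-terminated reason exposed by kube-state-metrics. A rough sketch, assuming kube-state-metrics is installed and the controller container is named nginx-ingress (the container label and group name are assumptions, not our exact setup):

```yaml
# Sketch of a Prometheus alerting rule; adjust the label matchers to your deployment.
groups:
  - name: nginx-ic-oom
    rules:
      - alert: NginxIngressControllerOOMKilled
        # kube-state-metrics sets this series to 1 when a container's most
        # recent termination reason was OOMKilled.
        expr: |
          kube_pod_container_status_last_terminated_reason{container="nginx-ingress", reason="OOMKilled"} == 1
        labels:
          severity: warning
        annotations:
          summary: "An NGINX IC container was OOM killed"
```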

It's worth noting that we also bumped our Helm Chart (not just the image version) with our release. The only notable change in that chart was the explicit creation of the Leader Election resource, which I believe Nginx previously created by itself after deployment.
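For anyone comparing setups: I haven't dug into the chart template itself, but a leader-election lock of this kind is typically a standard coordination.k8s.io Lease along the following lines. The kind, name, and namespace here are assumptions for illustration, not copied from the chart:

```yaml
# Illustrative only; the actual kind/name/namespace come from the Helm chart and values.
apiVersion: coordination.k8s.io/v1
kind: Lease
metadata:
  name: nginx-ingress-leader-election   # hypothetical lock name
  namespace: nginx-ingress              # hypothetical release namespace
```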

Some environment notes:

  • Azure AKS - 1.30.5
  • Using feature: Mergeable Ingress Types
  • Ingress resource count: 516
  • IC Pod Count: 6
  • Memory Request & Limit: 1500MiB per pod (see the sketch after this list)
  • ReadOnlyRootFilesystem: true
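
For reference, the memory and filesystem settings above map onto a container spec roughly like this (the container name is an assumption; the values are the ones listed above):

```yaml
# Sketch of the relevant part of the IC pod spec; memory request and limit are
# equal, as described above, and the root filesystem is read-only.
containers:
  - name: nginx-ingress   # hypothetical container name
    resources:
      requests:
        memory: "1500Mi"
      limits:
        memory: "1500Mi"
    securityContext:
      readOnlyRootFilesystem: true
```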
MarkTopping added the bug and needs triage labels on Nov 25, 2024

Hi @MarkTopping thanks for reporting!

Be sure to check out the docs and the Contributing Guidelines while you wait for a human to take a look at this 🙂

Cheers!

MarkTopping changed the title from "[Bug]:" to "[Bug]: Increased and erratic memory in the Nginx Pods leading to OOM Kills - appears to be introduced by v3.7.x" on Nov 25, 2024
@vepatel
Contributor

vepatel commented Nov 25, 2024

Hi @MarkTopping, thanks for opening the issue. The 3.7.1 release uses Nginx 1.27.2 (https://forum.nginx.org/read.php?27,300237,300237), which now caches SSL certificates, secret keys, and CRLs on start or during reconfiguration.
Can you please confirm:

  • you see the same issue in 3.7.0
  • you're using limits in your deployments

thanks

@MarkTopping
Author

> The 3.7.1 release uses Nginx 1.27.2, which now caches SSL certificates, secret keys, and CRLs on start or during reconfiguration. Can you please confirm you see the same issue in 3.7.0 and that you're using limits in your deployments.

Hi @vepatel

Thank you for your response.

I haven't tested 3.7.0, and sadly it's not just a matter of having a go for you; today's outage caused quite a bit of disruption, so it isn't something I can replicate as and when I see fit. Unless you have a particular reason to believe that 3.7.0 would address an issue introduced specifically in 3.7.1?

Re limits: yes, indeed we are. The graphs somewhat hide it, but requests and limits are both set, and for memory they are equal to one another. We have set that limit to be 4x higher than what we typically see each Nginx Pod consuming, hence a lot of headroom.

A question for you, please: thanks for the link, but it doesn't state the implications of the changes. Would I be right in assuming that an increase in memory consumption is expected due to the caching behaviour introduced? Certs aren't exactly big, so I'd assume that would only result in a fairly small memory increase anyway?

@MarkTopping
Author

I'm just following up with another chart showing how the memory usage has adversely changed in v3.7.

I've redeployed version 3.7.1 with a Request and Limit of 3000MiB, in the hope of seeing just how high the memory would spike without incurring the OOM Kills that happened earlier in the week.

Here is the result and a view over the past 5 days:
Image

The blue line doesn't worry me. It shows an approximate 40% increase in memory, which consumers might need to account for, but it's at least fairly stable.

The orange line, however, shows just how spiky the memory usage has become between versions 3.6 and 3.7.

In my case those spikes have become ~3x larger and surpassed the memory limits (which were quite generous, IMO). It's worth understanding whether this behaviour is known and by design from the contributors; that would at least confirm whether or not this should be considered a bug.

I for one certainly find the memory profile of 3.6 far more pleasing and easier to right-size my environment for.

@jjngx
Contributor

jjngx commented Nov 27, 2024

Thank you @MarkTopping for providing details. We are investigating the memory spikes.

@jjngx
Contributor

jjngx commented Nov 29, 2024

> memory consumption after a release of v3.7.1 into a busier environment

@MarkTopping could you please provide more detailed information about the busy environment? What traffic pattern do you observe in the affected cluster? Do changes in traffic trigger autoscaling?
