
[Bug]: Increased and erratic memory in the Nginx Pods leading to OOM Kills - appears to be introduced by v3.7.x #6860

Open
MarkTopping opened this issue Nov 25, 2024 · 6 comments
Labels
bug (An issue reporting a potential bug), needs triage (An issue that needs to be triaged)

Comments

@MarkTopping

Version

3.7.0

What Kubernetes platforms are you running on?

AKS Azure

Steps to reproduce

I believe that changes in version 3.7.0 or 3.7.1 have introduced a memory consumption issue.

We had to roll back a version bump from v3.6.2 to v3.7.1 today after our Nginx IC Pods all crashed due to OOM Kills. To make matters worse, due to Bug 4604 the Pods then failed to restart without manual intervention, leading to obvious impact.

Our investigation after the outage revealed that memory consumption on the Nginx Pods changed quite dramatically after the release, as shown by the following two charts.

1st Example
In our least used environment we didn't incur any OOM Kills, but today's investigation revealed how memory usage has both increased and become more 'spiky' since we performed the upgrade:

Image

2nd Example
This screenshot shows the IC Pods' memory consumption after a release of v3.7.1 into a busier environment, and the subsequent rollback this morning.

Image

What this graph doesn't capture is that memory went above the 1500MiB line for all Pods in the deployment, and they were therefore OOM Killed. This isn't shown because the metrics are exported every minute, so we only have the last datapoint that happened to be collected before each OOM Kill.
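For completeness, one way to at least record the kills themselves despite the once-a-minute sampling is to alert on the last-terminated reason exposed by kube-state-metrics. A rough sketch, assuming kube-state-metrics is installed and the controller container is named nginx-ingress (the container label and group name are assumptions, not our exact setup):

```yaml
# Sketch of a Prometheus alerting rule; adjust the label matchers to your deployment.
groups:
  - name: nginx-ic-oom
    rules:
      - alert: NginxIngressControllerOOMKilled
        # kube-state-metrics sets this series to 1 when a container's most
        # recent termination reason was OOMKilled.
        expr: |
          kube_pod_container_status_last_terminated_reason{container="nginx-ingress", reason="OOMKilled"} == 1
        labels:
          severity: warning
        annotations:
          summary: "An NGINX IC container was OOM killed"
```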

It's worth noting that we also bumped our Helm Chart (not just the image version) with our release. The only notable change in that chart was the explicit creation of the Leader Election resource, which I believe Nginx previously created by itself after deployment.
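For anyone comparing setups: I haven't dug into the chart template itself, but a leader-election lock of this kind is typically a standard coordination.k8s.io Lease along the following lines. The kind, name, and namespace here are assumptions for illustration, not copied from the chart:

```yaml
# Illustrative only; the actual kind/name/namespace come from the Helm chart and values.
apiVersion: coordination.k8s.io/v1
kind: Lease
metadata:
  name: nginx-ingress-leader-election   # hypothetical lock name
  namespace: nginx-ingress              # hypothetical release namespace
```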

Some environment notes:

  • Azure AKS - 1.30.5
  • Using feature: Mergeable Ingress Types
  • Ingress resource count: 516
  • IC Pod Count: 6
  • Memory Request & Limit: 1500MiB per pod (see the sketch after this list)
  • ReadOnlyRootFilesystem: true
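
For reference, the memory and filesystem settings above map onto a container spec roughly like this (the container name is an assumption; the values are the ones listed above):

```yaml
# Sketch of the relevant part of the IC pod spec; memory request and limit are
# equal, as described above, and the root filesystem is read-only.
containers:
  - name: nginx-ingress   # hypothetical container name
    resources:
      requests:
        memory: "1500Mi"
      limits:
        memory: "1500Mi"
    securityContext:
      readOnlyRootFilesystem: true
```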
MarkTopping added the bug and needs triage labels on Nov 25, 2024

Hi @MarkTopping thanks for reporting!

Be sure to check out the docs and the Contributing Guidelines while you wait for a human to take a look at this 🙂

Cheers!

MarkTopping changed the title from "[Bug]:" to "[Bug]: Increased and erratic memory in the Nginx Pods leading to OOM Kills - appears to be introduced by v3.7.x" on Nov 25, 2024
@vepatel
Contributor

vepatel commented Nov 25, 2024

Hi @MarkTopping, thanks for opening the issue. The 3.7.1 release uses Nginx 1.27.2 (https://forum.nginx.org/read.php?27,300237,300237), which now caches SSL certificates, secret keys, and CRLs on start or during reconfiguration.
Can you please confirm:

  • you see the same issue in 3.7.0
  • you're using limits in your deployments

thanks

@MarkTopping
Author

> The 3.7.1 release uses Nginx 1.27.2, which now caches SSL certificates, secret keys, and CRLs on start or during reconfiguration. Can you please confirm you see the same issue in 3.7.0 and that you're using limits in your deployments.

Hi @vepatel

Thank you for your response.

I haven't tested 3.7.0, and sadly it's not just a matter of having a go for you; today's outage caused quite a bit of disruption, so it isn't something I can replicate as and when I see fit. Unless you have a particular reason to believe that 3.7.0 would address an issue introduced specifically in 3.7.1?

Re limits: yes, indeed we are. The graphs somewhat hide it, but requests and limits are both set, and for memory they are equal to one another. We have set that limit to be 4x higher than what we typically see each Nginx Pod consuming, hence a lot of headroom.

A question for you, please: thanks for the link, but it doesn't state the implications of the changes. Would I be right in assuming that an increase in memory consumption is expected due to the caching behaviour introduced? Certs aren't exactly big, so I'd assume that would only result in a fairly small memory increase anyway?

@MarkTopping
Author

I'm just following up with another chart showing how the memory usage has adversely changed in v3.7.

I've redeployed version 3.7.1 with a Request and Limit of 3000MiB, in the hope of seeing just how high the memory would spike without incurring the OOM Kills that happened earlier in the week.

Here is the result and a view over the past 5 days:
Image

The blue line doesn't worry me. It shows an approximate 40% increase in memory, which consumers might need to account for, but it's at least fairly stable.

The orange line, however, shows just how spiky the memory usage has become between versions 3.6 and 3.7.

In my case those spikes have become ~3x larger and surpassed the memory limits (which were quite generous, IMO). It's worth understanding whether this behaviour is known and by design from the contributors; that would at least confirm whether or not this should be considered a bug.

I for one certainly find the memory profile of 3.6 far more pleasing and easier to right-size my environment for.

@jjngx
Contributor

jjngx commented Nov 27, 2024

Thank you @MarkTopping for providing details. We are investigating the memory spikes.

@jjngx
Contributor

jjngx commented Nov 29, 2024

> memory consumption after a release of v3.7.1 into a busier environment

@MarkTopping could you please provide more detailed information about the busy environment? What traffic pattern do you observe in the affected cluster? Do changes in traffic trigger autoscaling?
