-
Notifications
You must be signed in to change notification settings - Fork 90
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Fixes the requeueing bug which requeues a reconcile request in a wrong workqueue #259
Conversation
…g workqueue Signed-off-by: Alper Rifat Ulucinar <[email protected]>
@@ -35,11 +36,12 @@ func Setup(mgr ctrl.Manager, o tjcontroller.Options) error { | |||
if o.SecretStoreConfigGVK != nil { | |||
cps = append(cps, connection.NewDetailsManager(mgr.GetClient(), *o.SecretStoreConfigGVK, connection.WithTLSConfig(o.ESSOptions.TLSConfig))) | |||
} | |||
eventHandler := handler.NewEventHandler(handler.WithLogger(o.Logger.WithValues("gvk", {{ .TypePackageAlias }}{{ .CRD.Kind }}_GroupVersionKind))) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I am writing this comment to make sure that what I understand is correct.
As far as I understand, before, we were using a universal event handler for each resource controller that belonged to the provider. With this PR, we allow each resource to use its event handler. This solves the error in the PR description.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Yes, correct. Previously, the event handler was unintentionally per API group (an API group was sharing an event handler) or per-provider. We are now fixing this so that an event handler is per GVK.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thanks @ulucinar LGTM!
@ulucinar Could share the tests to do load tests? |
@csantanapr we use the tool here: https://github.com/upbound/uptest/tree/main/cmd/perf As we move Upjet and all the related providers and tooling upstream, we will also document them. No documentation exists currently for this tool, however, here's an example of how to call it:
|
Description of your changes
During some load tests with
upbound/[email protected]
, we've spotted a bug where a reconcile request might be requeued in a wrong workqueue. The bug manifests itself in the provider debug logs with messages similar to the following:The issue here is that the reconciler is attempting to observe a resource using a wrong kind and this results in a not-found error. The root cause of the issue is that we are unintentionally sharing the event handlers across all the reconcilers in a given API group. The fix is to have separate event handlers per GVK.
This bug causes the reconciler to do unnecessary work. Reconciliation opportunities are lost for a given (GVK,name) pair because another pair is observed. This bug does not affect the periodic reconciles done at each poll period and hence an affected resource is eventually successfully reconciled. But successful reconciliation needs more attempts. This results in, for instance, increased TTR, and similar.
This PR also improves some debug logging that proved to be useful in interpreting the results we observed in the load tests. We are also planning to expose some further custom metrics reporting these measurements to improve monitoring.
I have:
make reviewable
to ensure this PR is ready for review.backport release-x.y
labels to auto-backport this PR if necessary.How has this code been tested
Performed load tests with 3000 MRs. Based on just one observation, initially it took ~2h for a resource affected by this bug to transition to the ready state under heavy load (99% total CPU (over all available CPU) utilization). With the fix, it took ~1857 s on average for 23 tries (experiments) with a single resource on top of 3000 MRs.