Fixes the requeueing bug which requeues a reconcile request in a wrong workqueue #259

ulucinar · 2023-08-18T08:53:39Z

Description of your changes

During some load tests with upbound/[email protected], we've spotted a bug where a reconcile request might be requeued in a wrong workqueue. The bug manifests itself in the provider debug logs with messages similar to the following:

2023-08-10T14:47:56Z    DEBUG   provider-aws    Reconciling     {"controller": "managed/iam.aws.upbound.io/v1beta1, kind=rolepolicyattachment", "request": {"name":"rachel-sweeney-auth-service-default-policy-h5ck6"}}
2023-08-10T14:47:56Z    DEBUG   provider-aws    Cannot get managed resource     {"controller": "managed/iam.aws.upbound.io/v1beta1, kind=rolepolicyattachment", "request": {"name":"rachel-sweeney-auth-service-default-policy-h5ck6"}, "error": "RolePolicyAttachment.iam.aws.upbound.io \"rachel-sweeney-auth-service-default-policy-h5ck6\" not found"}
2023-08-10T14:48:10Z    DEBUG   provider-aws    Reconciling     {"controller": "managed/iam.aws.upbound.io/v1beta1, kind=policy", "request": {"name":"rachel-sweeney-auth-service-default-policy-h5ck6"}}
2023-08-10T14:48:10Z    DEBUG   provider-aws    External resource is up to date {"controller": "managed/iam.aws.upbound.io/v1beta1, kind=policy", "request": {"name":"rachel-sweeney-auth-service-default-policy-h5ck6"}, "uid": "7f8b9e51-34c4-475e-8653-53cb2adb364d", "version": "51370046", "external-name": "rachel-sweeney-auth-service-default-policy-h5ck6", "requeue-after": "2023-08-10T14:49:10Z"}
2023-08-10T14:50:04Z    DEBUG   provider-aws    Reconciling     {"controller": "managed/iam.aws.upbound.io/v1beta1, kind=policy", "request": {"name":"rachel-sweeney-auth-service-default-policy-h5ck6"}}
2023-08-10T14:50:04Z    DEBUG   provider-aws    External resource is up to date {"controller": "managed/iam.aws.upbound.io/v1beta1, kind=policy", "request": {"name":"rachel-sweeney-auth-service-default-policy-h5ck6"}, "uid": "7f8b9e51-34c4-475e-8653-53cb2adb364d", "version": "51370046", "external-name": "rachel-sweeney-auth-service-default-policy-h5ck6", "requeue-after": "2023-08-10T14:51:04Z"}

The issue here is that the reconciler is attempting to observe a resource using a wrong kind and this results in a not-found error. The root cause of the issue is that we are unintentionally sharing the event handlers across all the reconcilers in a given API group. The fix is to have separate event handlers per GVK.

This bug causes the reconciler to do unnecessary work. Reconciliation opportunities are lost for a given (GVK,name) pair because another pair is observed. This bug does not affect the periodic reconciles done at each poll period and hence an affected resource is eventually successfully reconciled. But successful reconciliation needs more attempts. This results in, for instance, increased TTR, and similar.

This PR also improves some debug logging that proved to be useful in interpreting the results we observed in the load tests. We are also planning to expose some further custom metrics reporting these measurements to improve monitoring.

I have:

Read and followed Crossplane's contribution process.
Run make reviewable to ensure this PR is ready for review.
Added backport release-x.y labels to auto-backport this PR if necessary.

How has this code been tested

Performed load tests with 3000 MRs. Based on just one observation, initially it took ~2h for a resource affected by this bug to transition to the ready state under heavy load (99% total CPU (over all available CPU) utilization). With the fix, it took ~1857 s on average for 23 tries (experiments) with a single resource on top of 3000 MRs.

…g workqueue Signed-off-by: Alper Rifat Ulucinar <[email protected]>

sergenyalcin · 2023-08-18T09:20:45Z

pkg/pipeline/templates/controller.go.tmpl

@@ -35,11 +36,12 @@ func Setup(mgr ctrl.Manager, o tjcontroller.Options) error {
 	if o.SecretStoreConfigGVK != nil {
 		cps = append(cps, connection.NewDetailsManager(mgr.GetClient(), *o.SecretStoreConfigGVK, connection.WithTLSConfig(o.ESSOptions.TLSConfig)))
 	}
+	eventHandler := handler.NewEventHandler(handler.WithLogger(o.Logger.WithValues("gvk", {{ .TypePackageAlias }}{{ .CRD.Kind }}_GroupVersionKind)))


I am writing this comment to make sure that what I understand is correct.

As far as I understand, before, we were using a universal event handler for each resource controller that belonged to the provider. With this PR, we allow each resource to use its event handler. This solves the error in the PR description.

Yes, correct. Previously, the event handler was unintentionally per API group (an API group was sharing an event handler) or per-provider. We are now fixing this so that an event handler is per GVK.

sergenyalcin

Thanks @ulucinar LGTM!

csantanapr · 2023-09-22T10:36:34Z

@ulucinar Could share the tests to do load tests?
What kind of MR do you create? Real cloud resource ?

jeanduplessis · 2023-09-22T12:15:35Z

@csantanapr we use the tool here: https://github.com/upbound/uptest/tree/main/cmd/perf

As we move Upjet and all the related providers and tooling upstream, we will also document them. No documentation exists currently for this tool, however, here's an example of how to call it:

go run main.go --mrs ./cmd/perf/manifests/ecr.yaml=<number-of-mrs-to-provision> --provider-pod <pod-name> --provider-namespace upbound-system --node <node-ip> --step-duration 1s

# For 10 MRs:
go run main.go --mrs ./cmd/perf/manifests/ecr.yaml=10 --provider-pod <pod-name> --provider-namespace upbound-system --node <node-ip> --step-duration 1s

Fixes the requeueing bug which requeues a reconcile request in a wron…

4cfa964

…g workqueue Signed-off-by: Alper Rifat Ulucinar <[email protected]>

ulucinar requested a review from sergenyalcin as a code owner August 18, 2023 08:53

ulucinar requested review from sergenyalcin and removed request for sergenyalcin August 18, 2023 08:53

sergenyalcin reviewed Aug 18, 2023

View reviewed changes

sergenyalcin approved these changes Aug 18, 2023

View reviewed changes

ulucinar merged commit bd528e4 into crossplane:main Aug 18, 2023
4 checks passed

ulucinar deleted the fix-requeue branch August 18, 2023 11:52

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Fixes the requeueing bug which requeues a reconcile request in a wrong workqueue #259

Fixes the requeueing bug which requeues a reconcile request in a wrong workqueue #259

ulucinar commented Aug 18, 2023 •

edited

Loading

sergenyalcin Aug 18, 2023

ulucinar Aug 18, 2023 •

edited

Loading

sergenyalcin left a comment

csantanapr commented Sep 22, 2023

jeanduplessis commented Sep 22, 2023

Fixes the requeueing bug which requeues a reconcile request in a wrong workqueue #259

Fixes the requeueing bug which requeues a reconcile request in a wrong workqueue #259

Conversation

ulucinar commented Aug 18, 2023 • edited Loading

Description of your changes

How has this code been tested

sergenyalcin Aug 18, 2023

Choose a reason for hiding this comment

ulucinar Aug 18, 2023 • edited Loading

Choose a reason for hiding this comment

sergenyalcin left a comment

Choose a reason for hiding this comment

csantanapr commented Sep 22, 2023

jeanduplessis commented Sep 22, 2023

ulucinar commented Aug 18, 2023 •

edited

Loading

ulucinar Aug 18, 2023 •

edited

Loading