-
Notifications
You must be signed in to change notification settings - Fork 2
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Fix restart batchjob infinite loop #1228
Conversation
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
lgtm
@@ -83,7 +88,7 @@ func (s *syncer) reconcile(ctx context.Context) error { | |||
return fmt.Errorf("batchjob %s: failed to reconcile kubejob: %w", batchJob.Name, err) | |||
} | |||
|
|||
if i%syncStatusForEveryNumberOfBatchJobsReconciled == 0 { | |||
if i%syncStatusForEveryNumberOfBatchJobsReconciled == (syncStatusForEveryNumberOfBatchJobsReconciled - 1) { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Why are we only syncing after 10 jobs, and not on every iteration? And what happens if a job has less than 10, eg. 5, when will they get synced?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The batch controller will always perform a status update as the final step in OnSync:
radix-operator/pkg/apis/batch/syncer.go
Line 63 in 3a31375
return s.syncStatus(ctx, s.reconcile(ctx)) |
The reason we update status for every 10 jobs is to give some kind of feedback about the sync progress when a batch contains many jobe. If a batch has 100-200 jobs, it can easily take more than a minute to iterate and process them all, and instead of updating status at the end, we update it for every 10 jobs to give status feedback faster just to let the user know that "things are in progress"
A bug in RadixBatch status update function caused jobStatus.Phase to be set to empty string. Since this is not an allowed values, the update failed causing the RB to be resynced over and over because the jobStatus.Restart was not updated either.