remove deletedJobs queue in cache model #3686
Open
To troubleshoot this issue, my colleague and I worked until 3 AM. I sincerely hope that this fix will be merged into the community repository.
Background
In the controller component, the cache module has a separate deletedJobs queue dedicated to handling job deletions. The job-controller also has its own queue for processing pod and job events. Because the two queues are processed independently, in edge cases a job can be deleted from etcd while still remaining in the cache. The job-controller then keeps creating pods for that job, and the gc-controller immediately deletes them. This loop repeats until the controller is restarted.
Proposed Solution
Deprecate the `deletedJobs` queue and use the queue within the job-controller uniformly. Add an `IsDeleteJobAction` field to the `Request` struct to flag job delete events. When the job-controller's processing function detects `IsDeleteJobAction=true`, it will directly delete the job from the cache.
Fixes #3601
Fixes #3357
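The proposed flow can be illustrated with a minimal, self-contained sketch. It is not the code in this PR: only the `Request` struct name and the `IsDeleteJobAction` field come from the description above, while `jobCache`, `processRequest`, the other `Request` fields, and the returned status strings are hypothetical names chosen for illustration. The point it shows is that delete events travel through the same queue as every other event, so a deletion is applied to the cache in order relative to the events around it.

```go
package main

import (
	"fmt"
	"sync"
)

// Request is a simplified stand-in for the job-controller's work item.
// Only IsDeleteJobAction is taken from the PR description; the other
// fields are assumptions made for this sketch.
type Request struct {
	Namespace         string
	JobName           string
	IsDeleteJobAction bool // true when the event is a job deletion
}

// jobCache is a hypothetical, minimal cache keyed by "namespace/name".
type jobCache struct {
	mu   sync.Mutex
	jobs map[string]struct{}
}

func newJobCache() *jobCache {
	return &jobCache{jobs: make(map[string]struct{})}
}

func cacheKey(ns, name string) string { return ns + "/" + name }

func (c *jobCache) Add(ns, name string) {
	c.mu.Lock()
	defer c.mu.Unlock()
	c.jobs[cacheKey(ns, name)] = struct{}{}
}

func (c *jobCache) Delete(ns, name string) {
	c.mu.Lock()
	defer c.mu.Unlock()
	delete(c.jobs, cacheKey(ns, name))
}

func (c *jobCache) Has(ns, name string) bool {
	c.mu.Lock()
	defer c.mu.Unlock()
	_, ok := c.jobs[cacheKey(ns, name)]
	return ok
}

// processRequest handles one item taken from the single shared queue.
// A delete request removes the job from the cache directly; any later
// request for the same job finds nothing in the cache and is skipped,
// so no pods would be created for a job that no longer exists.
func processRequest(c *jobCache, req Request) string {
	if req.IsDeleteJobAction {
		c.Delete(req.Namespace, req.JobName)
		return "deleted"
	}
	if !c.Has(req.Namespace, req.JobName) {
		return "skipped"
	}
	return "synced"
}

func main() {
	c := newJobCache()
	c.Add("default", "job-a")

	// A slice stands in for the work queue: a sync, the delete, then a
	// stale event that arrives after the job is gone.
	queue := []Request{
		{Namespace: "default", JobName: "job-a"},
		{Namespace: "default", JobName: "job-a", IsDeleteJobAction: true},
		{Namespace: "default", JobName: "job-a"},
	}
	for _, req := range queue {
		fmt.Println(processRequest(c, req))
	}
}
```

In this sketch the stale third request is skipped because the delete was applied to the cache earlier in the same queue, which is the ordering guarantee the separate `deletedJobs` queue could not provide.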