Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

remove deletedJobs queue in cache model #3686

Open
wants to merge 1 commit into
base: master
Choose a base branch
from

Conversation

Wang-Kai
Copy link
Contributor

@Wang-Kai Wang-Kai commented Aug 21, 2024

To troubleshoot this issue, my colleague and I worked until 3 AM. I sincerely hope that this fix will be merged into the community repository.

Background

In the controller component, the cache module has a separate deletedJobs queue specifically for handling job deletions. The job-controller also has a queue to process pod and job events. In edge cases, a situation may occur where a job is deleted from etcd but not from the cache. This leads to pods being created by the job-controller and immediately deleted by the gc-controller, causing a repetitive loop until the controller is restarted.

  1. Received a pod update event and started processing the pod update (processing thread thread_1).
  2. Received a job delete event and started processing the job deletion (processing thread thread_2).
  3. Time 3: thread_2 sets jobInfo.Job to nil and adds jobInfo to the deletedJobs queue.
  4. Time 4: thread_1 completes the job status update and assigns a value to jobInfo.Job.
  5. Time 5: A worker picks up jobInfo from the deletedJobs queue and finds that jobTerminated(job) == false, leading to repeated retries without success.
  6. Time 6: Eventually, the job no longer exists in etcd but still exists in the cache.
  7. Time 7: The gc-controller notices that the job no longer exists, so it cascades and deletes the pod.
  8. Time 8: The job-controller detects that the pod was deleted, but since the job still exists in the cache, it proceeds to recreate the pod.

Proposed Solution

Deprecate the deletedJobs queue and use the queue within the job-controller uniformly. Add an IsDeleteJobAction field to the Request struct to flag job delete events. When the job-controller's processing function detects IsDeleteJobAction=true, it will directly delete the job from the cache.

Fixes #3601
Fixes #3357

@volcano-sh-bot
Copy link
Contributor

[APPROVALNOTIFIER] This PR is NOT APPROVED

This pull-request has been approved by:
To complete the pull request process, please assign shinytang6
You can assign the PR to them by writing /assign @shinytang6 in a comment when ready.

The full list of commands accepted by this bot can be found here.

Needs approval from an approver in each of these files:

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

@volcano-sh-bot volcano-sh-bot added the size/L Denotes a PR that changes 100-499 lines, ignoring generated files. label Aug 21, 2024
@volcano-sh-bot volcano-sh-bot added the needs-rebase Indicates a PR cannot be merged because it has merge conflicts with HEAD. label Oct 25, 2024
@volcano-sh-bot
Copy link
Contributor

@Wang-Kai: PR needs rebase.

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
needs-rebase Indicates a PR cannot be merged because it has merge conflicts with HEAD. size/L Denotes a PR that changes 100-499 lines, ignoring generated files.
Projects
None yet
Development

Successfully merging this pull request may close these issues.

Pod are repeatedly created and deleted volcano controller repeatedly create pod after job had deleted
2 participants