docs: proposal for async pod delete handling
Signed-off-by: Evan Baker <[email protected]>
Showing 3 changed files with 96 additions and 0 deletions.
@@ -0,0 +1,24 @@
# CNI Async Delete

```mermaid
sequenceDiagram
    participant CRI
    participant CNI
    participant CNS
    CRI->>+CNI: Delete Pod
    CNI->>+CNS: Release IP
    alt CNS Responds
        alt IP Released
            CNS->>CNI: Released IP
            CNI->>CRI: Clean up Pod
        else Error response
            CNS->>CNI: Error
            CNI->>CRI: Delete failed, retry
        end
    else CNS unresponsive
        CNS->>-CNI: [No response]
        CNI->>Filesystem queue: Write delete Pod intent
        Filesystem queue->>CNI:
        CNI->>-CRI: Clean up Pod
    end
```
@@ -0,0 +1,50 @@
# CNS Async Delete

#### Components

```mermaid
sequenceDiagram
    participant CNI
    participant Filesystem queue
    participant CNS
    loop
        CNS->>Filesystem queue: List-watch for Pod deletes
        Filesystem queue->>CNS:
    end
    CNI->>+CNS: Release IP
    alt CNS Responds
        alt IP Released
            CNS->>CNI: Released IP
        else Error response
            CNS->>CNI: Error
        end
    else CNS unresponsive
        CNS->>-CNI: [No response]
        CNI->>Filesystem queue: Write delete Pod intent
    end
```

#### CNS Internals

```mermaid
sequenceDiagram
    participant CNI
    participant FS Watcher
    participant Release IP API
    participant IPAM
    loop
        FS Watcher->>FS Watcher: List-watch for Pod deletes
    end
    alt Async delete events
        FS Watcher->>+Release IP API: Release IP
    else Sync delete events
        CNI->>Release IP API: Release IP
    end
    Release IP API->>+IPAM: Release IP
    alt IP Released
        IPAM->>Release IP API: Released IP
    else Error response
        IPAM->>-Release IP API: Error
    end
```
@@ -0,0 +1,22 @@
## Async Delete

### Introduction

In AKS with Azure CNI, the Azure CNS service manages the CNI IPAM. The `azure-vnet` CNI plugin (and any CNI using delegated IPAM through `azure-ipam`) calls the CNS API to request an IP during Pod creation or to release an IP during Pod deletion. The CNS API is synchronous: an IP request does not complete until the IP is allocated or released in the CNS internal IPAM state.

A deadlock is possible when the CNS API is unavailable (due to a daemonset rollout or for any other reason):
If the Node is fully saturated with Pods (scheduled Pods == maxPods) and CNS is not running (a CNS daemonset rollout _deletes_ the existing Pod, then schedules the upgraded Pod), the scheduler will attempt to preempt a low-priority Pod to make room for the CNS Pod. However, with no CNS Pod running, the CNI delete call will fail, and the preempted Pod will be stuck in the `Terminating` state because the CRI cannot clean up the netns. The scheduler will not be able to schedule the CNS Pod, and the Node will stay deadlocked until someone manually reduces the Pod pressure.

### Proposal

To address this deadlock, the CNI calls to CNS that release a Pod's IP address need to be made asynchronously, with a failsafe: if CNS is unavailable, it can recover these release events when it eventually starts.

### Design

The CNI plugins (`azure-vnet`, `azure-ipam`) will be modified to treat a non-response from CNS during IP release as a non-fatal error, and execution will proceed. An explicit error response from CNS will still be treated as a real error and returned to the CRI for retry.

If CNS did not acknowledge the Pod IP release, the CNI plugins will fall back to a filesystem-backed queue to save these events. When the CNI does not get a response, it will write the Pod ID to a "release queue" directory/file and proceed with cleaning up the Pod netns.

When CNS starts, it will create a watch on the "release queue" directory/file and process the Pod IDs in the queue. IPs for those Pods will then be released in the CNS IPAM state.

This will allow the CNI to recover from CNS unavailability, unwedging the Pod deletion process and allowing the scheduler to start the CNS Pod and return the Node to steady state.