docs: proposal for async pod delete handling
Signed-off-by: Evan Baker <[email protected]>
rbtr authored Aug 11, 2023
1 parent c633bf5 commit b35bfd7
Showing 3 changed files with 96 additions and 0 deletions.
24 changes: 24 additions & 0 deletions docs/feature/async-delete/cni.md
@@ -0,0 +1,24 @@
# CNI Async Delete

```mermaid
sequenceDiagram
participant CRI
participant CNI
participant CNS
CRI->>+CNI: Delete Pod
CNI->>+CNS: Release IP
alt CNS Responds
alt IP Released
CNS->>CNI: Released IP
CNI->>CRI: Clean up Pod
else Error response
CNS->>CNI: Error
CNI->>CRI: Delete failed, retry
end
else CNS unresponsive
CNS->>-CNI: [No response]
CNI->>Filesystem queue: Write delete Pod intent
Filesystem queue->>CNI:
CNI->>-CRI: Clean up Pod
end
```
50 changes: 50 additions & 0 deletions docs/feature/async-delete/cns.md
@@ -0,0 +1,50 @@
# CNS Async Delete

#### Components

```mermaid
sequenceDiagram
participant CNI
participant Filesystem queue
participant CNS
loop
CNS->>Filesystem queue: List-watch for Pod deletes
Filesystem queue->>CNS:
end
CNI->>+CNS: Release IP
alt CNS Responds
alt IP Released
CNS->>CNI: Released IP
else Error response
CNS->>CNI: Error
end
else CNS unresponsive
CNS->>-CNI: [No response]
CNI->>Filesystem queue: Write delete Pod intent
end
```

#### CNS Internals

```mermaid
sequenceDiagram
participant CNI
participant FS Watcher
participant Release IP API
participant IPAM
loop
FS Watcher->>FS Watcher: List-watch for Pod deletes
end
alt Async delete events
FS Watcher->>+Release IP API: Release IP
else Sync delete events
CNI->>Release IP API: Release IP
end
Release IP API->>+IPAM: Release IP
alt IP Released
IPAM->>Release IP API: Released IP
else Error response
IPAM->>-Release IP API: Error
end
```
22 changes: 22 additions & 0 deletions docs/feature/async-delete/readme.md
@@ -0,0 +1,22 @@
## Async Delete

### Introduction

In AKS with Azure CNI, the Azure CNS service manages the CNI IPAM. The `azure-vnet` CNI plugin (and any CNI using delegated IPAM through `azure-ipam`) calls the CNS API to request an IP during Pod creation or to release an IP during Pod deletion. The CNS API is synchronous: an IP request does not complete until the IP is allocated or released in the CNS internal IPAM state.

A deadlock is possible when the CNS API is unavailable (due to daemonset rollouts or for other reasons):
If the Node is fully saturated with Pods (scheduled pods == maxPods) and CNS is not running (a CNS daemonset rollout _deletes_ the existing Pod, then schedules the upgraded Pod), the scheduler will attempt to preempt a low-priority Pod to make room for the CNS Pod. However, with no CNS Pod running, the CNI delete call will fail, and the preempted Pod will be stuck in the `Terminating` state because the CRI cannot clean up the netns. The scheduler will then be unable to schedule the CNS Pod, and the Node will stay deadlocked until manual intervention reduces the Pod pressure.

### Proposal

To address this deadlock, the CNI calls to CNS that release a Pod's IP address need to be made asynchronously, with a failsafe so that if CNS is unavailable, it can recover these events when it eventually starts.

### Design

The CNI plugins (`azure-vnet`, `azure-ipam`) will be modified to treat a non-response from CNS during IP release as a non-fatal error and continue execution. An explicit error response from CNS will still be treated as a real error and returned to the CRI for retry.
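
As a rough illustration of this branching, the Go sketch below maps the outcome of a release call to one of three actions. The sentinel errors `errCNSUnreachable` and `errIPNotAllocated`, the `deleteAction` type, and the `decide` function are hypothetical names chosen for this example; the proposal does not specify how the plugins detect a non-response.

```go
package main

import (
	"errors"
	"fmt"
)

// Hypothetical sentinel errors, not part of any real CNS client.
var (
	// errCNSUnreachable stands for "CNS never answered" (refused connection, timeout).
	errCNSUnreachable = errors.New("cns: no response")
	// errIPNotAllocated stands for any explicit error response returned by CNS.
	errIPNotAllocated = errors.New("cns: ip not allocated")
)

// deleteAction names what the CNI plugin does after attempting the release.
type deleteAction string

const (
	cleanUp         deleteAction = "clean up Pod"
	queueAndCleanUp deleteAction = "write delete intent, then clean up Pod"
	failAndRetry    deleteAction = "return error to CRI for retry"
)

// decide maps the result of the release call to the action in the design:
// success and non-response both let the delete proceed, while an explicit
// error response is surfaced to the CRI.
func decide(releaseErr error) deleteAction {
	switch {
	case releaseErr == nil:
		return cleanUp
	case errors.Is(releaseErr, errCNSUnreachable):
		return queueAndCleanUp
	default:
		return failAndRetry
	}
}

func main() {
	// A wrapped "no response" error is still recognized through errors.Is.
	noResponse := fmt.Errorf("release IP: %w", errCNSUnreachable)
	for _, err := range []error{nil, noResponse, errIPNotAllocated} {
		fmt.Printf("%v -> %s\n", err, decide(err))
	}
}
```

Keeping the decision in one pure function makes the three outcomes easy to test without a running CNS.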

If the Pod IP release is not acknowledged by CNS, the CNI plugins will fall back to a filesystem-backed queue to save these events. When the CNI does not get a response, it will write that Pod ID to a "release queue" directory/file and proceed with cleaning up the Pod netns.
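
One possible shape for the write side of the queue is sketched below in Go. It assumes a directory with one empty marker file per Pod ID; the proposal leaves the on-disk format open ("directory/file"), so the layout, the function names `enqueueRelease` and `isQueued`, and the example Pod ID are all illustrative.

```go
package main

import (
	"fmt"
	"os"
	"path/filepath"
)

// enqueueRelease records the intent to release the IP of podID by creating an
// empty marker file named after the Pod in queueDir.
// Assumption: podID is already safe to use as a file name.
func enqueueRelease(queueDir, podID string) error {
	if err := os.MkdirAll(queueDir, 0o755); err != nil {
		return fmt.Errorf("create queue dir: %w", err)
	}
	// Write under a temporary name and rename, so a reader listing the
	// directory never sees a half-created entry under the final name.
	tmp := filepath.Join(queueDir, "."+podID+".tmp")
	if err := os.WriteFile(tmp, nil, 0o644); err != nil {
		return fmt.Errorf("write intent: %w", err)
	}
	if err := os.Rename(tmp, filepath.Join(queueDir, podID)); err != nil {
		return fmt.Errorf("publish intent: %w", err)
	}
	return nil
}

// isQueued reports whether a release intent exists for podID.
func isQueued(queueDir, podID string) bool {
	_, err := os.Stat(filepath.Join(queueDir, podID))
	return err == nil
}

func main() {
	dir, err := os.MkdirTemp("", "release-queue")
	if err != nil {
		panic(err)
	}
	defer os.RemoveAll(dir)

	if err := enqueueRelease(dir, "pod-1234"); err != nil {
		panic(err)
	}
	fmt.Println("queued:", isQueued(dir, "pod-1234"))
}
```

Because the entry is keyed by Pod ID, writing the same intent twice is harmless, which matters if the CRI retries the delete.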

When CNS starts, it will create a watch on the "release queue" directory/file, and process the Pod IDs in the queue. IPs for those Pods will then be released in CNS IPAM state.
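
The read side could look roughly like the following Go sketch, which performs a single pass over the queue directory. It assumes the same one-file-per-Pod layout, and `drainQueue`, `runDemo`, and the `release` callback are hypothetical stand-ins for the CNS Release IP path. The continuous watch is omitted here; it could be layered on top by polling or by a filesystem notification library.

```go
package main

import (
	"errors"
	"fmt"
	"os"
	"path/filepath"
	"strings"
)

// drainQueue calls release for every Pod ID found in queueDir and removes each
// entry once its release succeeds. Entries whose release fails are left in
// place so that a later pass can retry them.
func drainQueue(queueDir string, release func(podID string) error) (int, error) {
	entries, err := os.ReadDir(queueDir)
	if errors.Is(err, os.ErrNotExist) {
		return 0, nil // nothing was ever queued
	}
	if err != nil {
		return 0, fmt.Errorf("list queue: %w", err)
	}
	released := 0
	for _, e := range entries {
		if e.IsDir() || strings.HasPrefix(e.Name(), ".") {
			continue // skip subdirectories and in-progress temp files
		}
		if err := release(e.Name()); err != nil {
			continue // keep the entry for the next pass
		}
		if err := os.Remove(filepath.Join(queueDir, e.Name())); err != nil {
			return released, fmt.Errorf("remove entry: %w", err)
		}
		released++
	}
	return released, nil
}

// runDemo queues three Pods in a temporary directory, lets the release of one
// of them fail, and reports how many were released and how many remain queued.
func runDemo() (released, remaining int, err error) {
	dir, err := os.MkdirTemp("", "release-queue")
	if err != nil {
		return 0, 0, err
	}
	defer os.RemoveAll(dir)
	for _, id := range []string{"pod-a", "pod-b", "pod-c"} {
		if err := os.WriteFile(filepath.Join(dir, id), nil, 0o644); err != nil {
			return 0, 0, err
		}
	}
	released, err = drainQueue(dir, func(podID string) error {
		if podID == "pod-b" {
			return errors.New("simulated IPAM error")
		}
		return nil
	})
	if err != nil {
		return released, 0, err
	}
	left, err := os.ReadDir(dir)
	return released, len(left), err
}

func main() {
	fmt.Println(runDemo())
}
```

In the demo, two of the three queued Pods are released and the failing one stays in the queue. Removing an entry only after a successful release means a CNS crash mid-pass leaves the remaining intents on disk.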

This will allow the CNI to recover from CNS unavailability, unwedging the Pod deletion process and allowing the scheduler to start the CNS Pod and return the Node to steady state.
