DAOS-16276 doc: Address engine unavailability (#15456) #15496

Open: wants to merge 1 commit into base: release/2.6
11 changes: 3 additions & 8 deletions docs/admin/pool_operations.md
@@ -930,15 +930,10 @@ Administrator can set the default pool redundancy factor by environment variable
dead and the number of failed fault domains exceeds or is going to exceed the pool
redundancy factor, it will not change the pool map immediately. Instead, it will log a
critical message:
```
intolerable unavailability: engine rank x
In this case, the system administrator should check and try to recover those
failed engines and bring them back with:
dmg system start --ranks=x
one by one. A reintegrate call is not needed.

For true unrecoverable failures, the administrator can still exclude engines.
However, data loss is expected as the number of unrecoverable failures exceeds
the pool redundancy factor.
```
To recover, see [Servers or engines become unavailable](troubleshooting.md#engines-become-unavailable).

## Recovering Container Ownership

15 changes: 15 additions & 0 deletions docs/admin/troubleshooting.md
@@ -554,6 +554,21 @@ To resolve the issue:

Alternately, the administrator may erase and re-format the DAOS system to start over fresh using the new addresses.

### Engines become unavailable

Engines may become unavailable due to server power losses and reboots, network switch failures, and similar events. After remaining unavailable for a certain period of time, these engines may be reported as "excluded" or "errored" in `dmg system query` output. Once the states of all engines stabilize (see [`CRT_EVENT_DELAY`](env_variables.md)), each pool checks whether it has enough redundancy (see [Pool RF](pool_operations.md#pool-redundancy-factor)) to tolerate the unavailability of the "excluded" or "errored" engines. If it does, these engines are excluded from the pool (shown as "disabled ranks" in `dmg pool query --health-only` output); otherwise, the pool performs no exclusion and may become temporarily unavailable (seen as timeouts of `dmg pool query`, `dmg pool list`, etc.). Similarly, when engines become available again, each pool repeats this check for any engines that remain unavailable once the states of all engines stabilize.
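
The per-pool check described above can be pictured with a small model. This is only an illustrative sketch, not the DAOS implementation: the function name `pool_exclusion_decision` and its two arguments are hypothetical, and it assumes the rule stated in the pool operations guide, namely that a pool does not change its pool map when the number of failed fault domains exceeds the pool redundancy factor.

```shell
#!/bin/sh
# Simplified model of the per-pool redundancy check (hypothetical helper,
# not part of DAOS or dmg).
#   $1 = number of fault domains that contain unavailable engines
#   $2 = pool redundancy factor (RF)
pool_exclusion_decision() {
    failed_domains=$1
    pool_rf=$2
    if [ "$failed_domains" -le "$pool_rf" ]; then
        # Enough redundancy: the pool excludes the engines,
        # which then appear as disabled ranks.
        echo "exclude"
    else
        # Not enough redundancy: no exclusion is performed and
        # the pool may be temporarily unavailable.
        echo "wait"
    fi
}

pool_exclusion_decision 1 2   # prints "exclude"
pool_exclusion_decision 3 2   # prints "wait"
```

In the "wait" case the pool keeps its map unchanged, which is why starting the failed engines again can restore availability without a rebuild.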

To restore availability as well as capacity and performance, try to start all "excluded" or "errored" engines. Starting all of them at the same time minimizes the chance of triggering rebuild jobs. In many cases, the following command suffices:
```
$ dmg system start
```
If some pools remain unavailable (e.g., `dmg pool list` keeps timing out) after the previous step, restart the whole system:
```
$ dmg system stop --force
$ dmg system start
```
If some engines were excluded from certain pools and are available again, reintegrate them into those pools.
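
The escalation above (start everything, then fall back to a full restart if pools stay unreachable) could be scripted roughly as follows. This is a hedged sketch under several assumptions: the `DMG` and `POOL_LIST_TIMEOUT` variables and both function names are introduced here for illustration, the 60-second limit is arbitrary, and `timeout` is the GNU coreutils utility. Only the `dmg` subcommands already shown in this section are used, and handling of `dmg` exit codes is left out.

```shell
#!/bin/sh
# Hypothetical knobs for this sketch; they are not DAOS settings.
DMG=${DMG:-dmg}
POOL_LIST_TIMEOUT=${POOL_LIST_TIMEOUT:-60}   # seconds, arbitrary choice

# Succeeds only if `dmg pool list` returns successfully within the time limit.
pools_reachable() {
    timeout "$POOL_LIST_TIMEOUT" "$DMG" pool list >/dev/null 2>&1
}

recover_engines() {
    # Step 1: start all engines at the same time.
    "$DMG" system start
    if pools_reachable; then
        echo "pools reachable after system start"
        return 0
    fi
    # Step 2: pools are still unavailable, so restart the whole system.
    "$DMG" system stop --force
    "$DMG" system start
    if pools_reachable; then
        echo "pools reachable after full restart"
        return 0
    fi
    echo "pools still unavailable" >&2
    return 1
}
```

Calling `recover_engines` runs the sequence once. Reintegration of engines that were excluded from pools is deliberately not automated here, since it depends on which pools and ranks are affected.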

## Diagnostic and Recovery Tools

!!! WARNING : Please be careful and use this tool under supervision of DAOS support team.