Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

[Feature]: Improve NVMe controller init robustness #515

Closed
makubacki opened this issue Aug 9, 2023 · 0 comments · Fixed by #516
Closed

[Feature]: Improve NVMe controller init robustness #515

makubacki opened this issue Aug 9, 2023 · 0 comments · Fixed by #516
Assignees
Labels
state:needs-triage Needs to triaged to determine next steps type:feature-request A new feature proposal urgency:low Little to no impact

Comments

@makubacki
Copy link
Member

Feature Overview

Some SSD devices have been observed to stop responding during boot after initially succeeding during initialization. When the failure occurs, the following message is observed from NvmeExpressPassThru:

NvmExpressPassThru: Timeout occurs for an NVMe command.
[ref]

Consequently, NvmeControllerInit() is called and the following assert is encountered:

ASSERT ((Private->Cap.Mpsmin + 12) <= EFI_PAGE_SHIFT);
[ref]

It is suspected that the Cap.Mpsmin value is invalid. In any case, this can cause an assertion to stop DEBUG builds and reset on RELEASE builds preventing recovery from UEFI shell or USB boot.

Solution Overview

  • Perform a cfg read of DID/VID in NvmeControllerInit() to verify the device is accessible
  • Return EFI_DEVICE_ERROR instead of asserting if the Cap.Mpsmin value is invalid

Alternatives Considered

No response

Urgency

Low

Are you going to implement the feature request?

I will implement the feature

Do you need maintainer feedback?

No maintainer feedback needed

Anything else?

No response

@makubacki makubacki added state:needs-triage Needs to triaged to determine next steps type:feature-request A new feature proposal labels Aug 9, 2023
@makubacki makubacki self-assigned this Aug 9, 2023
@github-actions github-actions bot added the urgency:low Little to no impact label Aug 9, 2023
makubacki added a commit that referenced this issue Aug 14, 2023
## Description

Resolves #515

The main change the first commit:

```
MdeModulePkg/NvmExpressDxe: Improve NVMe controller init robustness

It has been observed that some faulty NVMe devices may return
invalid values. The code in NvmExpressDxe recognizes the controller
is not responding correctly and issues an ASSERT() often in DEBUG
builds or a reset in RELEASE builds.

The following changes are made to NvmeControllerInit() to prevent a
faulty NVMe device from halting the overall boot:

1. Check the Vendor ID and Device ID to verify they are read properly
   and if not, return EFI_DEVICE_ERROR
2. Replace the ASSERT() when Memory Page Size Minimum (Cap.Mpsmin)
   with an error message and return EFI_DEVICE_ERROR
```

The additional changes are added due to improvements found in nearby
code.

2:

```
MdeModulePkg/NvmExpressDxe: Correct function parameter modifer

Updates the `Cap` parameter for `ReadNvmeControllerCapabilities()`
to be `OUT` since the buffer pointed to is written within the
function.
```

3:

```
MdePkg/Nvme.h: Add missing NVMe capability descriptions

Adds missing structure member documentation.
```

This will also be sent to the edk2 mailing list, however, it is being
checked
in here first due to some devices that may be able to benefit from the
change
and the upcoming 202308 freeze.

- [ ] Impacts functionality?
- **Functionality** - Does the change ultimately impact how firmware
functions?
- Examples: Add a new library, publish a new PPI, update an algorithm,
...
- [ ] Impacts security?
- **Security** - Does the change have a direct security impact on an
application,
    flow, or firmware?
  - Examples: Crypto algorithm change, buffer overflow fix, parameter
    validation improvement, ...
- [ ] Breaking change?
- **Breaking change** - Will anyone consuming this change experience a
break
    in build or boot behavior?
- Examples: Add a new library class, move a module to a different repo,
call
    a function in a new library class in a pre-existing module, ...
- [ ] Includes tests?
  - **Tests** - Does the change include any explicit test code?
  - Examples: Unit tests, integration tests, robot tests, ...
- [ ] Includes documentation?
- **Documentation** - Does the change contain explicit documentation
additions
    outside direct code modifications (and comments)?
- Examples: Update readme file, add feature readme file, link to
documentation
    on an a separate Web page, ...

## How This Was Tested

- Verified NVMe boot on a physical system
- Verified NVMe driver binding on QEMU Q35 with an emulated NVMe device

## Integration Instructions

N/A

---------

Signed-off-by: Michael Kubacki <[email protected]>
kenlautner pushed a commit that referenced this issue Oct 17, 2023
## Description

Resolves #515

The main change the first commit:

```
MdeModulePkg/NvmExpressDxe: Improve NVMe controller init robustness

It has been observed that some faulty NVMe devices may return
invalid values. The code in NvmExpressDxe recognizes the controller
is not responding correctly and issues an ASSERT() often in DEBUG
builds or a reset in RELEASE builds.

The following changes are made to NvmeControllerInit() to prevent a
faulty NVMe device from halting the overall boot:

1. Check the Vendor ID and Device ID to verify they are read properly
   and if not, return EFI_DEVICE_ERROR
2. Replace the ASSERT() when Memory Page Size Minimum (Cap.Mpsmin)
   with an error message and return EFI_DEVICE_ERROR
```

The additional changes are added due to improvements found in nearby
code.

2:

```
MdeModulePkg/NvmExpressDxe: Correct function parameter modifer

Updates the `Cap` parameter for `ReadNvmeControllerCapabilities()`
to be `OUT` since the buffer pointed to is written within the
function.
```

3:

```
MdePkg/Nvme.h: Add missing NVMe capability descriptions

Adds missing structure member documentation.
```

This will also be sent to the edk2 mailing list, however, it is being
checked
in here first due to some devices that may be able to benefit from the
change
and the upcoming 202308 freeze.

- [ ] Impacts functionality?
- **Functionality** - Does the change ultimately impact how firmware
functions?
- Examples: Add a new library, publish a new PPI, update an algorithm,
...
- [ ] Impacts security?
- **Security** - Does the change have a direct security impact on an
application,
    flow, or firmware?
  - Examples: Crypto algorithm change, buffer overflow fix, parameter
    validation improvement, ...
- [ ] Breaking change?
- **Breaking change** - Will anyone consuming this change experience a
break
    in build or boot behavior?
- Examples: Add a new library class, move a module to a different repo,
call
    a function in a new library class in a pre-existing module, ...
- [ ] Includes tests?
  - **Tests** - Does the change include any explicit test code?
  - Examples: Unit tests, integration tests, robot tests, ...
- [ ] Includes documentation?
- **Documentation** - Does the change contain explicit documentation
additions
    outside direct code modifications (and comments)?
- Examples: Update readme file, add feature readme file, link to
documentation
    on an a separate Web page, ...

## How This Was Tested

- Verified NVMe boot on a physical system
- Verified NVMe driver binding on QEMU Q35 with an emulated NVMe device

## Integration Instructions

N/A

---------

Signed-off-by: Michael Kubacki <[email protected]>
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
state:needs-triage Needs to triaged to determine next steps type:feature-request A new feature proposal urgency:low Little to no impact
Projects
None yet
Development

Successfully merging a pull request may close this issue.

1 participant