error injection

Mauro Carvalho Chehab edited this page Jul 10, 2024 · 14 revisions

Firmware error injection using QEMU

Setting up an error injection environment on QEMU using firmware-first mode is not hard. The main requirement is a QEMU build with QMP support for the error injection command.

ARM processor QEMU error injection

1. Build QEMU

There's a QEMU patch adding support for error injection, which was based on this patch series. An enhanced version of it is at:

https://gitlab.com/mchehab_kernel/qemu/-/tree/arm-error-inject-v2?ref_type=heads

Compiling QEMU with this patch applied adds a QMP extension for error injection that is compatible with the UEFI 2.9A errata.

To build QEMU with just arm support, you can do:

git clone https://gitlab.com/mchehab_kernel/qemu -b arm-error-inject
mkdir qemu/build
cd qemu/build
../configure --target-list=aarch64-softmmu --enable-slirp
make

Alternatively, you can run configure without arguments, but in that case it will build support for multiple architectures.

2. Download an arm64 filesystem

An arm64 image is needed. For instance, you can download one from http://cdimage.debian.org/cdimage/cloud/sid/daily/.

3. Generate an arm64 UEFI bios for QEMU

Please see section EDK2 in https://people.kernel.org/jic23/howto-test-cxl-enablement-on-arm64-using-qemu for some instructions.

4. Compile the Linux Kernel with RAS enabled

To test it, the Linux kernel should be built for ARM64 from its default config, with the RAS-related features enabled on top of it, e.g.:

 make defconfig
 ./scripts/config  -e CONFIG_FTRACE -e CONFIG_FTRACE_SYSCALLS -e CONFIG_TRACEPOINTS -e CONFIG_TRACING -e CONFIG_ENABLE_DEFAULT_TRACERS \
     -e CONFIG_FUNCTION_TRACER -e CONFIG_BRANCH_PROFILE_NONE -e CONFIG_PROBE_EVENTS -e CONFIG_TRACEPOINT_BENCHMARK -e CONFIG_STACK_TRACER
 make olddefconfig
 make all
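
As a quick sanity check after `make olddefconfig`, the resulting `.config` can be scanned for the options enabled above, since Kconfig may drop an option whose dependencies are not met. The following is a minimal sketch: the helper name `missing_options` is hypothetical, and the option list is simply copied from the `scripts/config` command above.

```python
# Options copied from the scripts/config invocation above.
REQUIRED_OPTIONS = [
    "CONFIG_FTRACE", "CONFIG_FTRACE_SYSCALLS", "CONFIG_TRACEPOINTS",
    "CONFIG_TRACING", "CONFIG_ENABLE_DEFAULT_TRACERS",
    "CONFIG_FUNCTION_TRACER", "CONFIG_BRANCH_PROFILE_NONE",
    "CONFIG_PROBE_EVENTS", "CONFIG_TRACEPOINT_BENCHMARK",
    "CONFIG_STACK_TRACER",
]


def missing_options(config_text, required=REQUIRED_OPTIONS):
    """Return the required options that are not set to 'y' in a .config text."""
    enabled = set()
    for line in config_text.splitlines():
        line = line.strip()
        # Skip comments such as "# CONFIG_FOO is not set" and blank lines.
        if line.startswith("#") or "=" not in line:
            continue
        name, value = line.split("=", 1)
        if value == "y":
            enabled.add(name)
    return [opt for opt in required if opt not in enabled]
```

It could be used as `missing_options(open(".config").read())` from the kernel build directory; an empty list means all the listed options survived `olddefconfig`.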

5. Place rasdaemon in the image

For this step, you can build rasdaemon on your local machine (don't forget to cross-compile to arm64 if your machine has another architecture) and copy it into the image.

Alternatively, boot your image with QEMU and build rasdaemon natively inside the guest.

Booting the image with QEMU

Running QEMU and booting the image can be done with the help of the arm_einj.py utility.

Simple error injection with default values:

$ arm_einj.py
     {"QMP": {"version": {"qemu": {"micro": 50, "minor": 0, "major": 9},
      "package": "v9.0.0-1742-g0431f23ae43d-dirty"}, "capabilities": ["oob"]}}
{ "execute": "qmp_capabilities" }
     {"return": {}}
{ "execute": "arm-inject-error", "arguments": {"error": [{"type": ["cache-error"]}]} }
     {"return": {}}
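
The exchange above is plain JSON over QMP, so the same messages can be built by hand. The sketch below is not how arm_einj.py is implemented; it only reproduces the two messages shown in the transcript. The function names are hypothetical, and `send_commands` assumes QEMU was started with its QMP monitor on a UNIX socket at a path of your choosing.

```python
import json
import socket


def build_inject_command(error_types):
    """Build the arm-inject-error message for a single error entry."""
    return {
        "execute": "arm-inject-error",
        "arguments": {"error": [{"type": list(error_types)}]},
    }


def qmp_session_lines(error_types):
    """Return the JSON lines a client sends after reading the QMP greeting."""
    return [
        json.dumps({"execute": "qmp_capabilities"}),
        json.dumps(build_inject_command(error_types)),
    ]


def send_commands(sock_path, lines):
    """Send each line to a QMP UNIX socket and collect one reply line per command.

    Assumption: sock_path points at a QMP socket exposed by QEMU. Asynchronous
    QMP events may be interleaved with replies; this sketch does not filter them.
    """
    with socket.socket(socket.AF_UNIX, socket.SOCK_STREAM) as sock:
        sock.connect(sock_path)
        stream = sock.makefile("rw")
        replies = [stream.readline()]  # QMP greeting banner
        for line in lines:
            stream.write(line + "\n")
            stream.flush()
            replies.append(stream.readline())
        return replies
```

Calling `qmp_session_lines(["cache-error"])` yields the `qmp_capabilities` and `arm-inject-error` messages from the transcript above.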

Kernel output for this event (without running rasdaemon):

[  110.323068] {1}[Hardware Error]: Hardware error from APEI Generic Hardware Error Source: 1
[  110.323473] {1}[Hardware Error]: event severity: recoverable
[  110.323759] {1}[Hardware Error]:  Error 0, type: recoverable
[  110.324079] {1}[Hardware Error]:   section_type: ARM processor error
[  110.324393] {1}[Hardware Error]:   MIDR: 0x0000000000000000
[  110.324645] {1}[Hardware Error]:   Multiprocessor Affinity Register (MPIDR): 0x0000000080000000
[  110.325077] {1}[Hardware Error]:   running state: 0x0
[  110.325377] {1}[Hardware Error]:   Power State Coordination Interface state: 0
[  110.325807] {1}[Hardware Error]:   Error info structure 0:
[  110.326176] {1}[Hardware Error]:   num errors: 1
[  110.326400] {1}[Hardware Error]:    first error captured
[  110.326630] {1}[Hardware Error]:    propagated error captured
[  110.326954] {1}[Hardware Error]:    error_type: 0x02: cache error
[  110.327227] {1}[Hardware Error]:    error_info: 0x000000000054007f
[  110.327506] {1}[Hardware Error]:     transaction type: Instruction
[  110.327776] {1}[Hardware Error]:     cache error, operation type: Instruction fetch
[  110.328092] {1}[Hardware Error]:     cache level: 1
[  110.328316] {1}[Hardware Error]:     processor context not corrupted
[  110.328573] {1}[Hardware Error]:     the error has not been corrected
[  110.328860] {1}[Hardware Error]:     PC is imprecise
[  110.329624] [Firmware Warn]: GHES: Unhandled processor error type 0x02: cache error
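
When checking many injections, it can help to pull the reported error types out of the kernel log automatically. The following is a minimal sketch that assumes the log format shown above; the regular expression and the function name `extract_error_types` are illustrative and not part of rasdaemon.

```python
import re

# Matches lines such as:
# [  110.326954] {1}[Hardware Error]:    error_type: 0x02: cache error
ERROR_TYPE_RE = re.compile(
    r"\{(\d+)\}\[Hardware Error\]:\s+error_type: (0x[0-9a-fA-F]+): (.+)$"
)


def extract_error_types(log_text):
    """Return (record number, error_type value, description) per matching line."""
    results = []
    for line in log_text.splitlines():
        match = ERROR_TYPE_RE.search(line)
        if match:
            record = int(match.group(1))
            value = int(match.group(2), 16)
            results.append((record, value, match.group(3).strip()))
    return results
```

The text to parse could come, for example, from the output of `dmesg` inside the guest.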

More complex error injection data can be filled in with the help of the same utility:

$ arm_einj.py --psci 0x49455020 --no-r -t tlb,bus,vendor vendor tlb bus cache -m 2 \
          --arm mpidr,affinity,running,vendor \
          --ctx-array 0xdead,0xbeef,0xabba,0xbaab \
          --vendor 12,23,53,52 3 123 243 0xff
     {"QMP": {"version": {"qemu": {"micro": 50, "minor": 0, "major": 9},
      "package": "v9.0.0-1742-gb7b88e1da609"}, "capabilities": ["oob"]}}
{ "execute": "qmp_capabilities" }
     {"return": {}}
{ "execute": "arm-inject-error", "arguments": {
    "validation": ["mpidr-valid", "affinity-valid",
                   "running-state-valid", "vendor-specific-valid"],
    "running-state": [], "psci-state": 1229279264, "error": [
        {"type": ["tlb-error", "bus-error", "micro-arch-error"], "multiple-error": 2},
        {"type": ["micro-arch-error"]},
        {"type": ["tlb-error"]},
        {"type": ["bus-error"]},
        {"type": ["cache-error"]
    }],
    "context": [{"register": [57005, 48879, 43962, 47787]}],
    "vendor-specific": [12, 23, 53, 52, 3, 123, 243, 255]} }
     {"return": {}}
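
Note how the hexadecimal values given on the command line show up as decimal integers in the QMP message, and how the vendor bytes later appear in the kernel hex dump as little-endian 32-bit words. The sketch below reproduces both conversions; the helper names are hypothetical and it does not reflect how arm_einj.py parses its arguments.

```python
import struct


def to_qmp_int(text):
    """Parse a decimal or 0x-prefixed string into a plain integer."""
    return int(text, 0)


def dump_words(byte_values):
    """Format bytes as little-endian 32-bit hex words, as in the kernel hex dump."""
    data = bytes(byte_values)
    count = len(data) // 4
    words = struct.unpack("<%dI" % count, data[:count * 4])
    return " ".join("%08x" % word for word in words)


# Values taken from the command line above.
psci = to_qmp_int("0x49455020")
context = [to_qmp_int(v) for v in "0xdead,0xbeef,0xabba,0xbaab".split(",")]
vendor = [to_qmp_int(v) for v in "12 23 53 52 3 123 243 0xff".split()]

print(psci)                # 1229279264, the "psci-state" value
print(context)             # [57005, 48879, 43962, 47787], the "register" list
print(dump_words(vendor))  # 3435170c fff37b03, as in the vendor hex dump
```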

Kernel output for this error injection:

[  914.821165] {3}[Hardware Error]: Hardware error from APEI Generic Hardware Error Source: 1
[  914.821639] {3}[Hardware Error]: event severity: recoverable
[  914.821945] {3}[Hardware Error]:  Error 0, type: recoverable
[  914.822232] {3}[Hardware Error]:   section_type: ARM processor error
[  914.822548] {3}[Hardware Error]:   MIDR: 0x0000000000000000
[  914.822833] {3}[Hardware Error]:   Multiprocessor Affinity Register (MPIDR): 0x0000000080000000
[  914.823247] {3}[Hardware Error]:   error affinity level: 0
[  914.823524] {3}[Hardware Error]:   running state: 0x0
[  914.823793] {3}[Hardware Error]:   Power State Coordination Interface state: 1229279264
[  914.824265] {3}[Hardware Error]:   Error info structure 0:
[  914.824570] {3}[Hardware Error]:   num errors: 3
[  914.824834] {3}[Hardware Error]:    first error captured
[  914.825133] {3}[Hardware Error]:    propagated error captured
[  914.825433] {3}[Hardware Error]:    error_type: 0x1c: TLB error|bus error|micro-architectural error
[  914.825880] {3}[Hardware Error]:   Error info structure 1:
[  914.826167] {3}[Hardware Error]:   num errors: 1
[  914.826421] {3}[Hardware Error]:    first error captured
[  914.826706] {3}[Hardware Error]:    propagated error captured
[  914.827002] {3}[Hardware Error]:    error_type: 0x10: micro-architectural error
[  914.827385] {3}[Hardware Error]:   Error info structure 2:
[  914.827687] {3}[Hardware Error]:   num errors: 1
[  914.827953] {3}[Hardware Error]:    first error captured
[  914.828236] {3}[Hardware Error]:    propagated error captured
[  914.828539] {3}[Hardware Error]:    error_type: 0x04: TLB error
[  914.828868] {3}[Hardware Error]:    error_info: 0x00000080d6460fff
[  914.829204] {3}[Hardware Error]:     transaction type: Generic
[  914.829504] {3}[Hardware Error]:     TLB error, operation type: Generic read (type of instruction or data request cannot be determined)
[  914.830080] {3}[Hardware Error]:     TLB level: 1
[  914.830332] {3}[Hardware Error]:     processor context corrupted
[  914.830647] {3}[Hardware Error]:     the error has been corrected
[  914.830987] {3}[Hardware Error]:     PC is imprecise
[  914.831217] {3}[Hardware Error]:     Program execution can be restarted reliably at the PC associated with the error.
[  914.831618] {3}[Hardware Error]:   Error info structure 3:
[  914.831858] {3}[Hardware Error]:   num errors: 1
[  914.832077] {3}[Hardware Error]:    first error captured
[  914.832305] {3}[Hardware Error]:    propagated error captured
[  914.832549] {3}[Hardware Error]:    error_type: 0x08: bus error
[  914.832814] {3}[Hardware Error]:    error_info: 0x0000000078da03ff
[  914.833084] {3}[Hardware Error]:     transaction type: Generic
[  914.833333] {3}[Hardware Error]:     bus error, operation type: Prefetch
[  914.833612] {3}[Hardware Error]:     affinity level at which the bus error occurred: 3
[  914.833933] {3}[Hardware Error]:     processor context not corrupted
[  914.834193] {3}[Hardware Error]:     the error has not been corrected
[  914.834454] {3}[Hardware Error]:     PC is precise
[  914.834667] {3}[Hardware Error]:     Program execution can be restarted reliably at the PC associated with the error.
[  914.835071] {3}[Hardware Error]:     participation type: Generic
[  914.835329] {3}[Hardware Error]:     address space: External Memory Access
[  914.835614] {3}[Hardware Error]:   Error info structure 4:
[  914.835853] {3}[Hardware Error]:   num errors: 1
[  914.836138] {3}[Hardware Error]:    first error captured
[  914.836380] {3}[Hardware Error]:    propagated error captured
[  914.836630] {3}[Hardware Error]:    error_type: 0x02: cache error
[  914.836900] {3}[Hardware Error]:    error_info: 0x000000000054007f
[  914.837170] {3}[Hardware Error]:     transaction type: Instruction
[  914.837432] {3}[Hardware Error]:     cache error, operation type: Instruction fetch
[  914.837742] {3}[Hardware Error]:     cache level: 1
[  914.837967] {3}[Hardware Error]:     processor context not corrupted
[  914.838220] {3}[Hardware Error]:     the error has not been corrected
[  914.838466] {3}[Hardware Error]:     PC is imprecise
[  914.838701] {3}[Hardware Error]:   Context info structure 0:
[  914.838967] {3}[Hardware Error]:    register context type: AArch64 EL1 context registers
[  914.839335] {3}[Hardware Error]:    00000000: 0000dead 00000000 0000beef 00000000
[  914.839688] {3}[Hardware Error]:    00000010: 0000abba 00000000 0000baab 00000000
[  914.840018] {3}[Hardware Error]:    00000020: 00000000 00000000
[  914.840294] {3}[Hardware Error]:   Vendor specific error info has 8 bytes:
[  914.840591] {3}[Hardware Error]:    00000000: 3435170c fff37b03                    ..54.{..
[  914.840996] [Firmware Warn]: GHES: Unhandled processor error type 0x1c: TLB error|bus error|micro-architectural error
[  914.841412] [Firmware Warn]: GHES: Unhandled processor error type 0x10: micro-architectural error
[  914.841769] [Firmware Warn]: GHES: Unhandled processor error type 0x04: TLB error
[  914.842084] [Firmware Warn]: GHES: Unhandled processor error type 0x08: bus error
[  914.842395] [Firmware Warn]: GHES: Unhandled processor error type 0x02: cache error
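
The `error_type` values in this output behave as a bitmask: the first error entry combines three types into 0x1c, while the others carry a single bit each. The sketch below uses only the bit values that can be read off the log above (0x02, 0x04, 0x08, 0x10); the mapping to the QMP type names is inferred from this example, and the function names are hypothetical.

```python
# Bit values read off the kernel output above, keyed by the QMP type names
# used in the injection command.
ERROR_TYPE_BITS = {
    "cache-error": 0x02,
    "tlb-error": 0x04,
    "bus-error": 0x08,
    "micro-arch-error": 0x10,
}


def error_type_mask(names):
    """OR together the bits for the given QMP error type names."""
    mask = 0
    for name in names:
        mask |= ERROR_TYPE_BITS[name]
    return mask


def decode_mask(mask):
    """Return the QMP error type names whose bits are set in mask."""
    return [name for name, bit in ERROR_TYPE_BITS.items() if mask & bit]


print(hex(error_type_mask(["tlb-error", "bus-error", "micro-arch-error"])))  # 0x1c
print(decode_mask(0x02))  # ['cache-error']
```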

Hardware error injection

Some machines can optionally do firmware (and/or hardware) error injection. This is usually done by enabling EINJ support through special BIOS settings. Those settings are hardware-specific and may require a special BIOS build of the kind OEM vendors use during hardware development.