error injection
Setting up an error injection environment on QEMU using firmware-first mode is not hard. It requires QMP support in QEMU to perform the error injection.
There is a QEMU patch adding support for error injection, which was based on an earlier patch series. An enhanced version of it is at:
https://gitlab.com/mchehab_kernel/qemu/-/tree/arm-error-inject-v2?ref_type=heads
Building QEMU with this patch applied adds a QMP extension for error injection that is compatible with the UEFI 2.9A errata.
In order to build QEMU with just arm support, you can do:
    git clone https://gitlab.com/mchehab_kernel/qemu -b arm-error-inject
    mkdir qemu/build
    cd qemu/build
    ../configure --target-list=aarch64-softmmu --enable-slirp
    make

Alternatively, you can just run configure, but in that case it will build support for multiple architectures.
An arm64 image is needed. For instance, you can download one from http://cdimage.debian.org/cdimage/cloud/sid/daily/.
Please see section EDK2 in https://people.kernel.org/jic23/howto-test-cxl-enablement-on-arm64-using-qemu for some instructions.
To test it, the Linux kernel should be built for ARM64 with its default config plus RAS features enabled, e.g.:
    make defconfig
    ./scripts/config -e CONFIG_FTRACE -e CONFIG_FTRACE_SYSCALLS \
        -e CONFIG_TRACEPOINTS -e CONFIG_TRACING \
        -e CONFIG_ENABLE_DEFAULT_TRACERS -e CONFIG_FUNCTION_TRACER \
        -e CONFIG_BRANCH_PROFILE_NONE -e CONFIG_PROBE_EVENTS \
        -e CONFIG_TRACEPOINT_BENCHMARK -e CONFIG_STACK_TRACER
    make olddefconfig
    make all
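Since `make olddefconfig` can silently drop options whose dependencies are not met, it may be worth checking the resulting `.config` before building. The following is a minimal sketch, not part of rasdaemon or the kernel tree: the helper name `missing_options` is hypothetical, and the option list is simply copied from the `./scripts/config` invocation above.

```python
import re

# Options copied from the ./scripts/config command above.
REQUIRED = [
    "CONFIG_FTRACE",
    "CONFIG_FTRACE_SYSCALLS",
    "CONFIG_TRACEPOINTS",
    "CONFIG_TRACING",
    "CONFIG_ENABLE_DEFAULT_TRACERS",
    "CONFIG_FUNCTION_TRACER",
    "CONFIG_BRANCH_PROFILE_NONE",
    "CONFIG_PROBE_EVENTS",
    "CONFIG_TRACEPOINT_BENCHMARK",
    "CONFIG_STACK_TRACER",
]


def missing_options(config_text, required=REQUIRED):
    """Return the options from `required` that are not set to 'y'.

    `config_text` is the content of a kernel .config file. Lines such as
    '# CONFIG_FOO is not set' do not match and count as missing.
    """
    enabled = set(re.findall(r"^(CONFIG_\w+)=y$", config_text, flags=re.MULTILINE))
    return [opt for opt in required if opt not in enabled]


def check_config_file(path=".config"):
    """Read a .config file (hypothetical default path) and list missing options."""
    with open(path, encoding="utf-8") as f:
        return missing_options(f.read())
```

Calling `check_config_file()` from the kernel source directory after `make olddefconfig` returns an empty list when every option survived.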
rasdaemon can be built on your local machine (don't forget to cross-compile to arm64 if your machine has a different architecture).
Alternatively, boot your image with QEMU, and build rasdaemon locally there.
Executing QEMU and booting the image can be done with the help of the arm_einj.py utility.
Simple error injection with default values:
    $ arm_einj.py
    {"QMP": {"version": {"qemu": {"micro": 50, "minor": 0, "major": 9}, "package": "v9.0.0-1742-g0431f23ae43d-dirty"}, "capabilities": ["oob"]}}
    { "execute": "qmp_capabilities" }
    {"return": {}}
    { "execute": "arm-inject-error", "arguments": {"error": [{"type": ["cache-error"]}]} }
    {"return": {}}

Kernel output for this event (without running rasdaemon):
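The exchange above follows the usual QMP pattern: QEMU sends a greeting, the client negotiates with `qmp_capabilities`, and then sends the `arm-inject-error` command. The sketch below illustrates that sequence in Python. It is not the implementation of arm_einj.py; the function names and the socket path are assumptions, and it presumes QEMU was started with a Unix-socket QMP server (for example `-qmp unix:/tmp/qmp.sock,server=on,wait=off`).

```python
import json
import socket


def build_inject_command(error_types=("cache-error",)):
    """Build the arm-inject-error command seen in the session above."""
    return {
        "execute": "arm-inject-error",
        "arguments": {"error": [{"type": list(error_types)}]},
    }


def _read_reply(stream):
    """Read messages until a command reply arrives, skipping async events."""
    while True:
        line = stream.readline()
        if not line:
            raise ConnectionError("QMP socket closed")
        msg = json.loads(line)
        if "return" in msg or "error" in msg:
            return msg


def qmp_session(sock_path, commands):
    """Send `commands` over a QMP Unix socket and return the replies.

    `sock_path` is an assumption: it must match the path given to
    QEMU's -qmp option.
    """
    replies = []
    with socket.socket(socket.AF_UNIX, socket.SOCK_STREAM) as sock:
        sock.connect(sock_path)
        stream = sock.makefile("rw", encoding="utf-8")
        json.loads(stream.readline())  # greeting: {"QMP": {...}}
        # Capabilities negotiation must come before any other command.
        for cmd in [{"execute": "qmp_capabilities"}, *commands]:
            stream.write(json.dumps(cmd) + "\n")
            stream.flush()
            replies.append(_read_reply(stream))
    return replies
```

With a running guest, `qmp_session("/tmp/qmp.sock", [build_inject_command()])` would send the same two commands as the session shown above.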
    [ 110.323068] {1}[Hardware Error]: Hardware error from APEI Generic Hardware Error Source: 1
    [ 110.323473] {1}[Hardware Error]: event severity: recoverable
    [ 110.323759] {1}[Hardware Error]: Error 0, type: recoverable
    [ 110.324079] {1}[Hardware Error]: section_type: ARM processor error
    [ 110.324393] {1}[Hardware Error]: MIDR: 0x0000000000000000
    [ 110.324645] {1}[Hardware Error]: Multiprocessor Affinity Register (MPIDR): 0x0000000080000000
    [ 110.325077] {1}[Hardware Error]: running state: 0x0
    [ 110.325377] {1}[Hardware Error]: Power State Coordination Interface state: 0
    [ 110.325807] {1}[Hardware Error]: Error info structure 0:
    [ 110.326176] {1}[Hardware Error]: num errors: 1
    [ 110.326400] {1}[Hardware Error]: first error captured
    [ 110.326630] {1}[Hardware Error]: propagated error captured
    [ 110.326954] {1}[Hardware Error]: error_type: 0x02: cache error
    [ 110.327227] {1}[Hardware Error]: error_info: 0x000000000054007f
    [ 110.327506] {1}[Hardware Error]: transaction type: Instruction
    [ 110.327776] {1}[Hardware Error]: cache error, operation type: Instruction fetch
    [ 110.328092] {1}[Hardware Error]: cache level: 1
    [ 110.328316] {1}[Hardware Error]: processor context not corrupted
    [ 110.328573] {1}[Hardware Error]: the error has not been corrected
    [ 110.328860] {1}[Hardware Error]: PC is imprecise
    [ 110.329624] [Firmware Warn]: GHES: Unhandled processor error type 0x02: cache error

More complex error injection data can be filled in with the help of the same utility:
    $ arm_einj.py --psci 0x49455020 --no-r -t tlb,bus,vendor vendor tlb bus cache -m 2 \
        --arm mpidr,affinity,running,vendor \
        --ctx-array 0xdead,0xbeef,0xabba,0xbaab \
        --vendor 12,23,53,52 3 123 243 0xff
    {"QMP": {"version": {"qemu": {"micro": 50, "minor": 0, "major": 9}, "package": "v9.0.0-1742-gb7b88e1da609"}, "capabilities": ["oob"]}}
    { "execute": "qmp_capabilities" }
    {"return": {}}
    { "execute": "arm-inject-error", "arguments": { "validation": ["mpidr-valid", "affinity-valid", "running-state-valid", "vendor-specific-valid"], "running-state": [], "psci-state": 1229279264, "error": [ {"type": ["tlb-error", "bus-error", "micro-arch-error"], "multiple-error": 2}, {"type": ["micro-arch-error"]}, {"type": ["tlb-error"]}, {"type": ["bus-error"]}, {"type": ["cache-error"] }], "context": [{"register": [57005, 48879, 43962, 47787]}], "vendor-specific": [12, 23, 53, 52, 3, 123, 243, 255]} }
    {"return": {}}
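The numbers in the QMP payload are the decimal forms of the command-line values: `0x49455020` becomes `1229279264`, and `0xdead` becomes `57005`. The sketch below reproduces that conversion for the `--psci`, `--ctx-array` and `--vendor` values shown above. It only illustrates the mapping and is not the actual argument parser of arm_einj.py; `parse_int_tokens` is a hypothetical helper.

```python
def parse_int_tokens(tokens):
    """Flatten comma- and space-separated tokens into a list of integers.

    Each item may be decimal ("12") or 0x-prefixed hexadecimal ("0xff");
    int(item, 0) infers the base from the prefix.
    """
    values = []
    for token in tokens:
        for item in token.split(","):
            if item:
                values.append(int(item, 0))
    return values


# Values taken from the command line shown above.
psci_state = int("0x49455020", 0)
registers = parse_int_tokens(["0xdead,0xbeef,0xabba,0xbaab"])
vendor = parse_int_tokens(["12,23,53,52", "3", "123", "243", "0xff"])

# Fragment of the "arguments" object, using the field names from the QMP output.
arguments = {
    "psci-state": psci_state,
    "context": [{"register": registers}],
    "vendor-specific": vendor,
}
```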
Error injection Kernel output:
    [ 914.821165] {3}[Hardware Error]: Hardware error from APEI Generic Hardware Error Source: 1
    [ 914.821639] {3}[Hardware Error]: event severity: recoverable
    [ 914.821945] {3}[Hardware Error]: Error 0, type: recoverable
    [ 914.822232] {3}[Hardware Error]: section_type: ARM processor error
    [ 914.822548] {3}[Hardware Error]: MIDR: 0x0000000000000000
    [ 914.822833] {3}[Hardware Error]: Multiprocessor Affinity Register (MPIDR): 0x0000000080000000
    [ 914.823247] {3}[Hardware Error]: error affinity level: 0
    [ 914.823524] {3}[Hardware Error]: running state: 0x0
    [ 914.823793] {3}[Hardware Error]: Power State Coordination Interface state: 1229279264
    [ 914.824265] {3}[Hardware Error]: Error info structure 0:
    [ 914.824570] {3}[Hardware Error]: num errors: 3
    [ 914.824834] {3}[Hardware Error]: first error captured
    [ 914.825133] {3}[Hardware Error]: propagated error captured
    [ 914.825433] {3}[Hardware Error]: error_type: 0x1c: TLB error|bus error|micro-architectural error
    [ 914.825880] {3}[Hardware Error]: Error info structure 1:
    [ 914.826167] {3}[Hardware Error]: num errors: 1
    [ 914.826421] {3}[Hardware Error]: first error captured
    [ 914.826706] {3}[Hardware Error]: propagated error captured
    [ 914.827002] {3}[Hardware Error]: error_type: 0x10: micro-architectural error
    [ 914.827385] {3}[Hardware Error]: Error info structure 2:
    [ 914.827687] {3}[Hardware Error]: num errors: 1
    [ 914.827953] {3}[Hardware Error]: first error captured
    [ 914.828236] {3}[Hardware Error]: propagated error captured
    [ 914.828539] {3}[Hardware Error]: error_type: 0x04: TLB error
    [ 914.828868] {3}[Hardware Error]: error_info: 0x00000080d6460fff
    [ 914.829204] {3}[Hardware Error]: transaction type: Generic
    [ 914.829504] {3}[Hardware Error]: TLB error, operation type: Generic read (type of instruction or data request cannot be determined)
    [ 914.830080] {3}[Hardware Error]: TLB level: 1
    [ 914.830332] {3}[Hardware Error]: processor context corrupted
    [ 914.830647] {3}[Hardware Error]: the error has been corrected
    [ 914.830987] {3}[Hardware Error]: PC is imprecise
    [ 914.831217] {3}[Hardware Error]: Program execution can be restarted reliably at the PC associated with the error.
    [ 914.831618] {3}[Hardware Error]: Error info structure 3:
    [ 914.831858] {3}[Hardware Error]: num errors: 1
    [ 914.832077] {3}[Hardware Error]: first error captured
    [ 914.832305] {3}[Hardware Error]: propagated error captured
    [ 914.832549] {3}[Hardware Error]: error_type: 0x08: bus error
    [ 914.832814] {3}[Hardware Error]: error_info: 0x0000000078da03ff
    [ 914.833084] {3}[Hardware Error]: transaction type: Generic
    [ 914.833333] {3}[Hardware Error]: bus error, operation type: Prefetch
    [ 914.833612] {3}[Hardware Error]: affinity level at which the bus error occurred: 3
    [ 914.833933] {3}[Hardware Error]: processor context not corrupted
    [ 914.834193] {3}[Hardware Error]: the error has not been corrected
    [ 914.834454] {3}[Hardware Error]: PC is precise
    [ 914.834667] {3}[Hardware Error]: Program execution can be restarted reliably at the PC associated with the error.
    [ 914.835071] {3}[Hardware Error]: participation type: Generic
    [ 914.835329] {3}[Hardware Error]: address space: External Memory Access
    [ 914.835614] {3}[Hardware Error]: Error info structure 4:
    [ 914.835853] {3}[Hardware Error]: num errors: 1
    [ 914.836138] {3}[Hardware Error]: first error captured
    [ 914.836380] {3}[Hardware Error]: propagated error captured
    [ 914.836630] {3}[Hardware Error]: error_type: 0x02: cache error
    [ 914.836900] {3}[Hardware Error]: error_info: 0x000000000054007f
    [ 914.837170] {3}[Hardware Error]: transaction type: Instruction
    [ 914.837432] {3}[Hardware Error]: cache error, operation type: Instruction fetch
    [ 914.837742] {3}[Hardware Error]: cache level: 1
    [ 914.837967] {3}[Hardware Error]: processor context not corrupted
    [ 914.838220] {3}[Hardware Error]: the error has not been corrected
    [ 914.838466] {3}[Hardware Error]: PC is imprecise
    [ 914.838701] {3}[Hardware Error]: Context info structure 0:
    [ 914.838967] {3}[Hardware Error]: register context type: AArch64 EL1 context registers
    [ 914.839335] {3}[Hardware Error]: 00000000: 0000dead 00000000 0000beef 00000000
    [ 914.839688] {3}[Hardware Error]: 00000010: 0000abba 00000000 0000baab 00000000
    [ 914.840018] {3}[Hardware Error]: 00000020: 00000000 00000000
    [ 914.840294] {3}[Hardware Error]: Vendor specific error info has 8 bytes:
    [ 914.840591] {3}[Hardware Error]: 00000000: 3435170c fff37b03 ..54.{..
    [ 914.840996] [Firmware Warn]: GHES: Unhandled processor error type 0x1c: TLB error|bus error|micro-architectural error
    [ 914.841412] [Firmware Warn]: GHES: Unhandled processor error type 0x10: micro-architectural error
    [ 914.841769] [Firmware Warn]: GHES: Unhandled processor error type 0x04: TLB error
    [ 914.842084] [Firmware Warn]: GHES: Unhandled processor error type 0x08: bus error
    [ 914.842395] [Firmware Warn]: GHES: Unhandled processor error type 0x02: cache error
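Two details of this log can be cross-checked against the injected data. The `error_type` field is a bitmask (`0x1c` combines the TLB, bus and micro-architectural bits), and the hex dumps print the injected bytes as little-endian 32-bit words. The sketch below shows both relationships. The bit values are read off the log lines above, the helper names are hypothetical, and the 64-bit register width is an assumption that is consistent with the context dump.

```python
import struct

# Bit values as they appear in the error_type lines of the kernel log.
ERROR_TYPE_BITS = {
    0x02: "cache error",
    0x04: "TLB error",
    0x08: "bus error",
    0x10: "micro-architectural error",
}


def decode_error_type(value):
    """Return the names of all error-type bits set in `value`."""
    return [name for bit, name in ERROR_TYPE_BITS.items() if value & bit]


def dump_words(data):
    """Format bytes as little-endian 32-bit hex words.

    `data` must have a length that is a multiple of 4.
    """
    words = struct.unpack("<%dI" % (len(data) // 4), data)
    return " ".join("%08x" % word for word in words)


# The eight vendor-specific bytes from the QMP command.
vendor_dump = dump_words(bytes([12, 23, 53, 52, 3, 123, 243, 255]))

# The first context register, packed as a 64-bit little-endian value (assumed width).
register_dump = dump_words(struct.pack("<Q", 0xDEAD))
```

Here `decode_error_type(0x1c)` yields the three names joined by `|` in the log, `vendor_dump` matches the `3435170c fff37b03` line, and `register_dump` matches the first two words of the context dump.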
Some machines can optionally do firmware (and/or hardware) error injection. This is usually done by enabling EINJ features at the BIOS level. Those settings are hardware-specific and may require a special BIOS used by OEM vendors during hardware development.