We use EDAC to monitor errors on our DIMMs.
I would like to know how EDAC gets notified about these errors.
I came across the term "firmware first mode".
Is this configured in the BIOS or in the memory controller?
dmesg log:
dmesg | grep -i edac
[ 0.346813] EDAC MC: Ver: 3.0.0
[ 97.989717] EDAC sbridge: Seeking for: PCI ID 8086:2fa0
[ 97.989727] EDAC sbridge: Seeking for: PCI ID 8086:2fa0
[ 97.989738] EDAC sbridge: Seeking for: PCI ID 8086:2f60
[ 97.989742] EDAC sbridge: Seeking for: PCI ID 8086:2fa8
[ 97.989745] EDAC sbridge: Seeking for: PCI ID 8086:2fa8
[ 97.989748] EDAC sbridge: Seeking for: PCI ID 8086:2f71
[ 97.989751] EDAC sbridge: Seeking for: PCI ID 8086:2f71
[ 97.989754] EDAC sbridge: Seeking for: PCI ID 8086:2faa
[ 97.989757] EDAC sbridge: Seeking for: PCI ID 8086:2faa
...
...
[ 97.989927] EDAC MC0: Giving out device to module sb_edac controller Haswell SrcID#0_Ha#0: DEV 0000:ff:12.0 (INTERRUPT)
[ 97.989927] EDAC sbridge: Ver: 1.1.2
It depends; both are possible.
Memory controllers report errors once; reading them clears them. Thus, if both the system firmware and the operating system try to handle EDAC reporting, races ensue and errors can be missed. So in a correctly configured system, the operating system either handles EDAC reporting directly, keeping the system firmware out of the loop, or uses GHES (the appropriate ACPI driver) to receive errors from the system firmware (this is “firmware first” mode).
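To make the race concrete, here is a toy model in Python. It is only an illustration of the read-to-clear behaviour described above, not a model of any real memory controller: the class `ReadToClearRegister`, the function `simulate`, and the error names are all made up for this sketch.

```python
class ReadToClearRegister:
    """Toy error register: reading returns the pending errors and clears them."""

    def __init__(self):
        self._pending = []

    def record(self, error):
        self._pending.append(error)

    def read(self):
        errors, self._pending = self._pending, []
        return errors


def simulate(first_readers):
    """Record one error per entry; the named party ("firmware" or "os") reads first.

    Both parties poll after every error, but whoever reads second finds
    the register already cleared.
    """
    reg = ReadToClearRegister()
    seen = {"firmware": [], "os": []}
    for i, first in enumerate(first_readers):
        reg.record(f"err{i}")
        second = "os" if first == "firmware" else "firmware"
        seen[first].extend(reg.read())
        seen[second].extend(reg.read())  # empty: the first read cleared it
    return seen


if __name__ == "__main__":
    print(simulate(["firmware", "os", "firmware"]))
```

With two readers, each one ends up with only part of the error history, which is why only one side should own the reporting.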
With a “direct” EDAC driver, the operating system handles the machine check exception (MCE) from the memory controller, and does whatever is appropriate. With a GHES driver, the system firmware handles the MCE, and informs the operating system.
You’ll see different logs depending on the scenario. The boot logs should tell you what the configuration is, and if a memory error occurs, you’ll see a “software event” in the logs in firmware first mode, a “hardware event” otherwise.
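As a rough sketch of reading the boot logs, the following Python snippet looks for the "Giving out device to module" lines that EDAC prints (the format is taken from the log in the question) and reports which module owns each memory controller. The function names are hypothetical, and the rule that a module name starting with "ghes" means firmware first mode is an assumption based on the GHES driver name; check it against your own kernel's messages.

```python
import re

# Matches e.g. "EDAC MC0: Giving out device to module sb_edac controller ..."
_GIVING_OUT = re.compile(r"EDAC MC(\d+): Giving out device to module (\S+)")


def edac_owners(dmesg_text):
    """Map each memory controller index to the module that registered it."""
    return {int(m.group(1)): m.group(2) for m in _GIVING_OUT.finditer(dmesg_text)}


def reporting_mode(module_name):
    """Guess the reporting mode from the owning module's name.

    Assumption: a name starting with "ghes" is the ACPI GHES driver
    (firmware first); anything else is treated as a direct EDAC driver.
    """
    return "firmware first" if module_name.startswith("ghes") else "direct"


if __name__ == "__main__":
    import sys

    # Usage: dmesg | python3 this_script.py
    for mc, module in sorted(edac_owners(sys.stdin.read()).items()):
        print(f"MC{mc}: {module} ({reporting_mode(module)})")
```

For the log in the question, this reports MC0 as owned by sb_edac, which under the assumption above is a direct driver, so the operating system is handling the errors itself.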
The settings can be a combination of firmware settings and operating system configuration. On most “low-end” ECC-capable systems, there’s no corresponding firmware configuration (just access to the logs), and it’s all up to the operating system. Higher-end servers will have settings in their firmware configuration and a description of how to configure them (and the operating system) in their manual.