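Decoding MCE Logs on x86 Servers Running Linux

The Linux log output consists of two main parts:

  1. The part decoded by the APEI GHES (Generic Hardware Error Source) code, i.e. the lines in the messages log that contain [Hardware Error]
  2. The part decoded by mcelog, i.e. the lines in the messages log that contain mcelog[2402]: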

Apr 19 06:08:04 S08 kernel: core: [Hardware Error]: Machine check events logged

Apr 19 06:08:04 S08 kernel: {1}[Hardware Error]: Hardware error from APEI Generic Hardware Error Source: 0

Apr 19 06:08:04 S08 kernel: {1}[Hardware Error]: It has been corrected by h/w and requires no further action

Apr 19 06:08:04 S08 kernel: {1}[Hardware Error]: event severity: corrected

Apr 19 06:08:04 S08 kernel: {1}[Hardware Error]: Error 0, type: corrected

Apr 19 06:08:04 S08 kernel: {1}[Hardware Error]: section_type: memory error

Apr 19 06:08:04 S08 kernel: {1}[Hardware Error]: error_status: 0x0000000000000400

Apr 19 06:08:04 S08 kernel: {1}[Hardware Error]: physical_address: 0x00000033196956c0

Apr 19 06:08:04 S08 kernel: {1}[Hardware Error]: physical_address_mask: 0x00003fffffffffc0

Apr 19 06:08:04 S08 kernel: {1}[Hardware Error]: node: 0 card: 1 module: 0 rank: 3 bank: 8 device: 2 row: 51429 column: 304

Apr 19 06:08:04 S08 kernel: {1}[Hardware Error]: error_type: 2, single-bit ECC

Apr 19 06:08:04 S08 kernel: {1}[Hardware Error]: DIMM location: not present. DMI handle: 0x0000

Apr 19 06:08:04 S08 kernel: core: [Hardware Error]: Machine check events logged

Apr 19 06:08:04 S08 mcelog[2402]: mcelog: Family 6 Model 143 CPU: only decoding architectural errors

Apr 19 06:08:04 S08 mcelog[2402]: Hardware event. This is not a software error.

Apr 19 06:08:04 S08 mcelog[2402]: MCE 0

Apr 19 06:08:04 S08 mcelog[2402]: CPU 0 BANK 14 TSC 3474b9ffda1b

Apr 19 06:08:04 S08 mcelog[2402]: MISC b00104647289886 ADDR 33196956c0

Apr 19 06:08:04 S08 mcelog[2402]: TIME 1650319684 Tue Apr 19 06:08:04 2022

Apr 19 06:08:04 S08 mcelog[2402]: MCG status:

Apr 19 06:08:04 S08 mcelog[2402]: MCi status:

Apr 19 06:08:04 S08 mcelog[2402]: Corrected error

Apr 19 06:08:04 S08 mcelog[2402]: MCi_MISC register valid

Apr 19 06:08:04 S08 mcelog[2402]: MCi_ADDR register valid

Apr 19 06:08:04 S08 mcelog[2402]: MCA: MEMORY CONTROLLER RD_CHANNEL1_ERR

Apr 19 06:08:04 S08 mcelog[2402]: Transaction: Memory read error

Apr 19 06:08:04 S08 mcelog[2402]: STATUS 8c00004200800091 MCGSTATUS 0

Apr 19 06:08:04 S08 mcelog[2402]: MCGCAP f000c15 APICID 0 SOCKETID 0

Apr 19 06:08:04 S08 mcelog[2402]: MICROCODE 8d0004a0

Apr 19 06:08:04 S08 mcelog[2402]: CPUID Vendor Intel Family 6 Model 143 Step 3

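The GHES-decoded part

GHES (Generic Hardware Error Source) is part of APEI (ACPI Platform Error Interfaces) within the ACPI framework. Platform firmware can use GHES to report hardware error information to OSPM (Operating System-directed configuration and Power Management).

When a hardware error occurs, OSPM can read the Error Status Block in the GHES structure through an error handler to obtain the hardware error information. The relevant code in the Linux kernel then decodes this information and writes it to the messages log.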
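As a rough illustration of how the decoded GHES output can be consumed, the sketch below pulls the `key: value` pairs out of a few of the [Hardware Error] lines shown earlier. The helper name `parse_ghes_fields` and the regular expression are assumptions made for this example, not part of the kernel or any tool. The final step applies `physical_address_mask` to `physical_address`; since the mask clears the low 6 bits, the reported address appears to be valid at 64-byte granularity.

```python
import re

# Matches "<key>: <value>" right after the "[Hardware Error]:" tag.
# The key is matched non-greedily, so only the first "key:" on a line is split off.
GHES_LINE = re.compile(r"\[Hardware Error\]:\s*([A-Za-z_ ]+?):\s*(.*)$")


def parse_ghes_fields(lines):
    """Collect key/value pairs from GHES log lines (hypothetical helper)."""
    fields = {}
    for line in lines:
        match = GHES_LINE.search(line)
        if match:
            fields[match.group(1).strip()] = match.group(2).strip()
    return fields


# Lines copied from the sample log in this article.
sample = [
    "Apr 19 06:08:04 S08 kernel: {1}[Hardware Error]: event severity: corrected",
    "Apr 19 06:08:04 S08 kernel: {1}[Hardware Error]: section_type: memory error",
    "Apr 19 06:08:04 S08 kernel: {1}[Hardware Error]: physical_address: 0x00000033196956c0",
    "Apr 19 06:08:04 S08 kernel: {1}[Hardware Error]: physical_address_mask: 0x00003fffffffffc0",
    "Apr 19 06:08:04 S08 kernel: {1}[Hardware Error]: error_type: 2, single-bit ECC",
]

fields = parse_ghes_fields(sample)
address = int(fields["physical_address"], 16)
mask = int(fields["physical_address_mask"], 16)

print(fields["event severity"], "/", fields["section_type"])
print(hex(address & mask))  # address bits that the mask marks as valid
```

The masked address matches the ADDR value that mcelog prints for the same event (33196956c0), which is a quick way to confirm that the two parts of the log describe the same error.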

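The struct definitions in the Linux kernel code correspond to the definitions in the ACPI/UEFI specifications as shown in the figure below:
[Figure: mapping between the Linux kernel structs and the ACPI/UEFI spec definitions]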

The mcelog-decoded part

See http://www.mcelog.org/

mcelog logs and decodes machine check errors, such as memory, I/O, and CPU errors.

In the output for this memory CE (corrected error):

  • Apr 19 06:08:04 S08 mcelog[2402]: CPU 0 BANK 14 TSC 3474b9ffda1b
    The IMC and channel that BANK 14 maps to can be looked up in the EDS manual for the CPU in question

  • Apr 19 06:08:04 S08 mcelog[2402]: MISC b00104647289886 ADDR 33196956c0

    These correspond to the MISC and ADDR registers of the MCA bank

  • Apr 19 06:08:04 S08 mcelog[2402]: STATUS 8c00004200800091 MCGSTATUS 0

    This corresponds to the STATUS register of the MCA bank
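To make the STATUS value more concrete, here is a minimal sketch that splits it into fields, assuming the architectural IA32_MCi_STATUS layout described in the Intel SDM (VAL at bit 63, UC at bit 61, MISCV at bit 59, ADDRV at bit 58, corrected error count in bits 52:38, MCA error code in bits 15:0) and the memory controller error code format `0000 0000 1MMM CCCC`. The function name `decode_mci_status` and the dictionary keys are made up for this example; model-specific fields are left undecoded.

```python
# MMM field of a memory controller error code (0000 0000 1MMM CCCC).
MEM_TRANSACTIONS = {
    0b000: "generic undefined request",
    0b001: "memory read error",
    0b010: "memory write error",
    0b011: "address/command error",
    0b100: "memory scrubbing error",
}


def decode_mci_status(status: int) -> dict:
    """Split an IA32_MCi_STATUS value into architectural fields (sketch)."""

    def bit(n: int) -> bool:
        return bool((status >> n) & 1)

    mca_code = status & 0xFFFF
    decoded = {
        "VAL": bit(63),    # register contents are valid
        "OVER": bit(62),   # an earlier error was overwritten
        "UC": bit(61),     # uncorrected error
        "EN": bit(60),     # error reporting was enabled
        "MISCV": bit(59),  # MCi_MISC register valid
        "ADDRV": bit(58),  # MCi_ADDR register valid
        "PCC": bit(57),    # processor context corrupt
        "corrected_count": (status >> 38) & 0x7FFF,
        "model_specific_code": (status >> 16) & 0xFFFF,
        "mca_code": mca_code,
    }
    # Memory controller errors have bits 15:7 equal to 0000 0000 1.
    if (mca_code & 0xFF80) == 0x0080:
        decoded["memory_transaction"] = MEM_TRANSACTIONS.get(
            (mca_code >> 4) & 0x7, "reserved"
        )
        channel = mca_code & 0xF
        decoded["channel"] = None if channel == 0xF else channel  # 0xF: not specified
    return decoded


# STATUS value taken from the mcelog output in this article.
result = decode_mci_status(0x8C00004200800091)
for name, value in result.items():
    print(f"{name}: {value}")
```

For this value the sketch reports a valid, corrected error (UC clear) with both MISC and ADDR valid, and a memory read error on channel 1, which lines up with the "Corrected error", "MCi_MISC register valid", "MCi_ADDR register valid" and "MEMORY CONTROLLER RD_CHANNEL1_ERR" lines printed by mcelog.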

References

  1. Intel® 64 and IA-32 Architectures Software Developer's Manual, Volume 3B, Chapter 15
  2. http://www.mcelog.org/
  3. Advanced Configuration and Power Interface (ACPI) Specification
  4. Unified Extensible Firmware Interface (UEFI) Specification