Not sure why I got an MCE error related to ECC when my RAM is not ECC

The error messages started with a “no action required”… not sure if that means the kernel knows the system isn’t using ECC RAM or not… but either way it’s weird that I got this error when I don’t use ECC RAM.

$ journalctl -k --priority err --boot 0
Jan 07 12:24:08 AM4-5600X-Linux kernel: [Hardware Error]: Deferred error, no action required.
Jan 07 12:24:08 AM4-5600X-Linux kernel: [Hardware Error]: CPU:1 (19:21:0) MC17_STATUS[Over|-|-|-|-|SyndV|Deferred|-|-]: 0xc1e692b0c646fd27
Jan 07 12:24:08 AM4-5600X-Linux kernel: [Hardware Error]: IPID: 0x0000000000000000, Syndrome: 0x0000000000000000
Jan 07 12:24:08 AM4-5600X-Linux kernel: [Hardware Error]: Unified Memory Controller Ext. Error Code: 6, DCQ SRAM ECC error.
Jan 07 12:24:08 AM4-5600X-Linux kernel: [Hardware Error]: cache level: L3/GEN, tx: DATA
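
For anyone who wants to check the decoded flags by hand, here is a minimal Python sketch that pulls a few fields out of the raw MC17_STATUS value shown in the log above. The bit positions are my reading of the MCI_STATUS_* definitions in the Linux kernel source (arch/x86/include/asm/mce.h), and the 6-bit extended error code mask is an assumption for Zen-era CPUs, so treat both as something to verify. The names decode_status and STATUS_BITS are made up for this example.

```python
# Assumed bit positions, taken from the kernel's MCI_STATUS_* definitions.
# Only a subset of the register is decoded here.
STATUS_BITS = {
    "Val": 63,       # register holds a valid error
    "Over": 62,      # an earlier error was overwritten
    "UC": 61,        # uncorrected error
    "En": 60,        # error reporting enabled
    "MiscV": 59,     # MISC register valid
    "AddrV": 58,     # ADDR register valid
    "PCC": 57,       # processor context corrupt
    "SyndV": 53,     # AMD: syndrome register valid
    "Deferred": 44,  # AMD: deferred error
    "Poison": 43,    # AMD: poisoned data consumed
    "Scrub": 40,     # AMD: error found by the scrubber
}


def decode_status(status):
    """Return the set flags and error-code fields of an MCA status value."""
    flags = [name for name, bit in STATUS_BITS.items() if (status >> bit) & 1]
    return {
        "flags": flags,
        # Assumption: extended error code lives in bits 21:16 (mask 0x3f).
        "ext_error_code": (status >> 16) & 0x3F,
        # Low 16 bits hold the architectural error code.
        "error_code": status & 0xFFFF,
    }


if __name__ == "__main__":
    # Value copied from the MC17_STATUS line in the journal output.
    print(decode_status(0xC1E692B0C646FD27))
```

For the value in the log this reports Val, Over, SyndV and Deferred with an extended error code of 6, which matches the bracketed flags and the "Ext. Error Code: 6" text the kernel printed. Neither UC nor PCC is set, which fits the “no action required” wording.
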
full inxi
$ inxi -Fazym
System:
  Kernel: 5.15.12-1-MANJARO x86_64 bits: 64 compiler: gcc v: 11.1.0
    parameters: BOOT_IMAGE=/boot/vmlinuz-5.15-x86_64
    root=UUID=5d67a7c6-6cdf-446d-92f6-b7be1f0fb13d rw apparmor=1
    security=apparmor udev.log_priority=3 sysrq_always_enabled=1
  Desktop: KDE Plasma 5.23.4 tk: Qt 5.15.2 wm: kwin_x11 vt: 1 dm: SDDM
    Distro: Manjaro Linux base: Arch Linux
Machine:
  Type: Desktop Mobo: Micro-Star model: MEG X570 UNIFY (MS-7C35) v: 2.0
    serial: <superuser required> UEFI: American Megatrends LLC. v: A.90
    date: 05/17/2021
Memory:
  RAM: total: 31.33 GiB used: 10.18 GiB (32.5%)
  RAM Report:
    permissions: Unable to run dmidecode. Root privileges required.
CPU:
  Info: model: AMD Ryzen 5 5600X bits: 64 type: MT MCP arch: Zen 3
    family: 0x19 (25) model-id: 0x21 (33) stepping: 0 microcode: 0xA201009
  Topology: cpus: 1x cores: 6 tpc: 2 threads: 12 smt: enabled cache:
    L1: 384 KiB desc: d-6x32 KiB; i-6x32 KiB L2: 3 MiB desc: 6x512 KiB
    L3: 32 MiB desc: 1x32 MiB
  Speed (MHz): avg: 2912 high: 3841 min/max: 2200/4650 boost: enabled
    scaling: driver: acpi-cpufreq governor: schedutil cores: 1: 3841 2: 3057
    3: 3178 4: 2840 5: 2214 6: 2461 7: 2969 8: 2879 9: 2879 10: 2876 11: 2879
    12: 2879 bogomips: 88836
  Flags: avx avx2 ht lm nx pae sse sse2 sse3 sse4_1 sse4_2 sse4a ssse3 svm
  Vulnerabilities:
  Type: itlb_multihit status: Not affected
  Type: l1tf status: Not affected
  Type: mds status: Not affected
  Type: meltdown status: Not affected
  Type: spec_store_bypass
    mitigation: Speculative Store Bypass disabled via prctl and seccomp
  Type: spectre_v1
    mitigation: usercopy/swapgs barriers and __user pointer sanitization
  Type: spectre_v2 mitigation: Full AMD retpoline, IBPB: conditional,
    IBRS_FW, STIBP: always-on, RSB filling
  Type: srbds status: Not affected
  Type: tsx_async_abort status: Not affected
Graphics:
  Device-1: AMD Navi 21 [Radeon RX 6800/6800 XT / 6900 XT]
    vendor: XFX Limited XFX Speedster MERC 319 driver: amdgpu v: kernel
    bus-ID: 2f:00.0 chip-ID: 1002:73bf class-ID: 0300
  Display: x11 server: X.Org 1.21.1.2 compositor: kwin_x11 driver:
    loaded: amdgpu,ati unloaded: modesetting,radeon alternate: fbdev,vesa
    display-ID: :0 screens: 1
  Screen-1: 0 s-res: 5120x1440 s-dpi: 96 s-size: 1354x381mm (53.3x15.0")
    s-diag: 1407mm (55.4")
  Monitor-1: DisplayPort-0 res: 2560x1440 dpi: 93
    size: 698x392mm (27.5x15.4") diag: 801mm (31.5")
  Monitor-2: DisplayPort-1 res: 2560x1440 hz: 144 dpi: 93
    size: 697x392mm (27.4x15.4") diag: 800mm (31.5")
  OpenGL: renderer: AMD Radeon RX 6800 XT (SIENNA_CICHLID DRM 3.42.0
    5.15.12-1-MANJARO LLVM 13.0.0)
    v: 4.6 Mesa 21.3.2 direct render: Yes
Audio:
  Device-1: AMD Navi 21 HDMI Audio [Radeon RX 6800/6800 XT / 6900 XT]
    driver: snd_hda_intel v: kernel bus-ID: 2f:00.1 chip-ID: 1002:ab28
    class-ID: 0403
  Device-2: AMD Starship/Matisse HD Audio vendor: Micro-Star MSI
    driver: snd_hda_intel v: kernel bus-ID: 31:00.4 chip-ID: 1022:1487
    class-ID: 0403
  Device-3: Corsair CORSAIR VIRTUOSO SE USB Gaming Headset type: USB
    driver: hid-generic,snd-usb-audio,usbhid bus-ID: 3-4:3 chip-ID: 1b1c:0a3d
    class-ID: 0300 serial: <filter>
  Sound Server-1: ALSA v: k5.15.12-1-MANJARO running: yes
  Sound Server-2: sndio v: N/A running: no
  Sound Server-3: JACK v: 1.9.19 running: no
  Sound Server-4: PulseAudio v: 15.0 running: no
  Sound Server-5: PipeWire v: 0.3.42 running: yes
Network:
  Device-1: Realtek RTL8125 2.5GbE vendor: Micro-Star MSI driver: r8169
    v: kernel port: f000 bus-ID: 27:00.0 chip-ID: 10ec:8125 class-ID: 0200
  IF: enp39s0 state: up speed: 1000 Mbps duplex: full mac: <filter>
Bluetooth:
  Device-1: Intel AX200 Bluetooth type: USB driver: btusb v: 0.8 bus-ID: 1-4:2
    chip-ID: 8087:0029 class-ID: e001
  Report: rfkill ID: hci0 rfk-id: 0 state: up address: see --recommends
RAID:
  Supported mdraid levels: raid1
  Device-1: md127 maj-min: 9:127 type: mdraid level: mirror status: active
    size: 7.28 TiB
  Info: report: 2/2 UU blocks: 7813893120 chunk-size: N/A super-blocks: 1.2
  Components: Online:
  0: sdb1 maj-min: 8:17 size: 7.28 TiB
  1: sdc1 maj-min: 8:33 size: 7.28 TiB
Drives:
  Local Storage: total: 19.33 TiB used: 9 TiB (46.5%)
  SMART Message: Unable to run smartctl. Root privileges required.
  ID-1: /dev/nvme0n1 maj-min: 259:0 vendor: Western Digital
    model: WDS100T1X0E-00AFY0 size: 931.51 GiB block-size: physical: 512 B
    logical: 512 B speed: 63.2 Gb/s lanes: 4 type: SSD serial: <filter>
    rev: 613200WD temp: 38.9 C scheme: GPT
  ID-2: /dev/nvme1n1 maj-min: 259:3 vendor: Western Digital
    model: WDS100T3X0C-00SJG0 size: 931.51 GiB block-size: physical: 512 B
    logical: 512 B speed: 31.6 Gb/s lanes: 4 type: SSD serial: <filter>
    rev: 102000WD temp: 35.9 C scheme: GPT
  ID-3: /dev/nvme2n1 maj-min: 259:1 vendor: Western Digital
    model: WDS100T1X0E-00AFY0 size: 931.51 GiB block-size: physical: 512 B
    logical: 512 B speed: 63.2 Gb/s lanes: 4 type: SSD serial: <filter>
    rev: 613200WD temp: 41.9 C scheme: GPT
  ID-4: /dev/nvme3n1 maj-min: 259:7 vendor: Western Digital
    model: WDS200T2B0C-00PXH0 size: 1.82 TiB block-size: physical: 512 B
    logical: 512 B speed: 31.6 Gb/s lanes: 4 type: SSD serial: <filter>
    rev: 21705000 temp: 30.9 C scheme: GPT
  ID-5: /dev/sda maj-min: 8:0 vendor: Samsung model: SSD 840 EVO 250GB
    size: 232.89 GiB block-size: physical: 512 B logical: 512 B speed: 6.0 Gb/s
    type: SSD serial: <filter> rev: DB6Q scheme: GPT
  SMART Message: Unknown smartctl error. Unable to generate data.
  ID-6: /dev/sdb maj-min: 8:16 vendor: Western Digital
    model: WD80EFAX-68KNBN0 size: 7.28 TiB block-size: physical: 4096 B
    logical: 512 B speed: 6.0 Gb/s type: HDD rpm: 5400 serial: <filter>
    rev: 0A81 scheme: GPT
  ID-7: /dev/sdc maj-min: 8:32 vendor: Western Digital
    model: WD80EFAX-68KNBN0 size: 7.28 TiB block-size: physical: 4096 B
    logical: 512 B speed: 6.0 Gb/s type: HDD rpm: 5400 serial: <filter>
    rev: 0A81 scheme: GPT
Partition:
  ID-1: / raw-size: 931.22 GiB size: 915.53 GiB (98.32%)
    used: 460.84 GiB (50.3%) fs: ext4 dev: /dev/nvme2n1p2 maj-min: 259:5
  ID-2: /boot/efi raw-size: 300 MiB size: 299.4 MiB (99.80%)
    used: 288 KiB (0.1%) fs: vfat dev: /dev/nvme2n1p1 maj-min: 259:4
Swap:
  Kernel: swappiness: 10 (default 60) cache-pressure: 75 (default 100)
  ID-1: swap-1 type: file size: 38 GiB used: 117.8 MiB (0.3%) priority: -2
    file: /swapfile
Sensors:
  System Temperatures: cpu: N/A mobo: N/A gpu: amdgpu temp: 51.0 C mem: 50.0 C
  Fan Speeds (RPM): N/A gpu: amdgpu fan: 0
Info:
  Processes: 363 Uptime: 2d 13h 11m wakeups: 0 Init: systemd v: 250
  tool: systemctl Compilers: gcc: 11.1.0 Packages: pacman: 1463 lib: 404
  flatpak: 0 Shell: Bash v: 5.1.12 running-in: konsole inxi: 3.3.11
sudo inxi targeted to just Memory
Memory:
  RAM: total: 31.33 GiB used: 10.15 GiB (32.4%)
  Array-1: capacity: 128 GiB slots: 4 EC: None max-module-size: 32 GiB
    note: est.
  Device-1: DIMM 0 size: No Module Installed
  Device-2: DIMM 1 size: 16 GiB speed: 3600 MT/s type: DDR4
    detail: synchronous unbuffered (unregistered) bus-width: 64 bits
    total: 64 bits manufacturer: G.Skill part-no: F4-3600C16-16GTZNC
    serial: N/A
  Device-3: DIMM 0 size: No Module Installed
  Device-4: DIMM 1 size: 16 GiB speed: 3600 MT/s type: DDR4
    detail: synchronous unbuffered (unregistered) bus-width: 64 bits
    total: 64 bits manufacturer: G.Skill part-no: F4-3600C16-16GTZNC
    serial: N/A

I’m wondering if the new 5.15.12-1 kernel requires the recent AGESA 1.2.0.5 update. There was a BIOS update released for my motherboard that includes it in late Dec… but I thought I’d wait to see if others have experienced this issue before I dove into that rabbit hole.

Note: The only system change I made today was installing the new manjaro-pipewire package… maybe I should have rebooted?

Try removing one of the two RAM sticks. Does the same error still appear?

If there is no error, swap in the other stick and test that one on its own.

Thank you for the reply @Zesko

There are only 2 sticks of RAM in the system, and they aren’t ECC.

Hmm, I’m wondering if the CPU uses ECC for its cache… maybe this is purely an internal CPU error? If so, it might strengthen my AGESA 1.2.0.5 BIOS update thought.

Try to downgrade the BIOS.

I’ve started with a reboot for now and will monitor to see if the system triggers this error again. If it does, I will try these options in sequence:

  1. Keep my BIOS as is (AGESA 1.2.0.2) and roll back to an earlier kernel. The BIOS with AGESA 1.2.0.2 never triggered these errors with kernel 5.15.6 or 5.15.7.
  2. Keep kernel 5.15.12 and update my BIOS so I move from AGESA 1.2.0.2 => 1.2.0.5

If the errors persist across all these options… then it just might be time to RMA something… once I figure out if this is CPU or RAM.
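
To make the monitoring step less manual, something like the following Python sketch could pick the decoded status lines out of the kernel log. It assumes the line format stays exactly as in the journal output I posted above; find_mce_events and STATUS_LINE are hypothetical names for this example, and the regular expression has only been checked against that one sample line.

```python
import re

# Matches the decoded status line printed in the kernel log, for example:
# "... [Hardware Error]: CPU:1 (19:21:0) MC17_STATUS[Over|-|...]: 0xc1e6..."
STATUS_LINE = re.compile(
    r"\[Hardware Error\]: CPU:(?P<cpu>\d+) .*?"
    r"MC(?P<bank>\d+)_STATUS\[(?P<flags>[^\]]*)\]: (?P<status>0x[0-9a-fA-F]+)"
)


def find_mce_events(log_text):
    """Return one dict per MCn_STATUS line found in the given log text."""
    events = []
    for line in log_text.splitlines():
        match = STATUS_LINE.search(line)
        if match is None:
            continue
        events.append({
            "cpu": int(match["cpu"]),
            "bank": int(match["bank"]),
            # Drop the "-" placeholders the kernel prints for unset flags.
            "flags": [f for f in match["flags"].split("|") if f != "-"],
            "status": int(match["status"], 16),
        })
    return events
```

The text to scan could come from the same command as before, for example by capturing the output of journalctl -k --priority err --boot 0 with subprocess.run(..., capture_output=True, text=True) and passing its stdout to find_mce_events. An empty list after a few days of uptime would suggest the error has not come back.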

EDIT: Interesting post I read @ Linux on 3700x: spontaneous reboots caued by MCE - AMD Community makes me think two things:

  1. This new ECC error is CPU focused
  2. It could very well be related to my other issue that I thought was resolved @ System auto-rebooted... mce: [Hardware Error] in dmesg related to CPU

I have a similar issue, but with a different error log.

My Ryzen 3600 system crashes very rarely, forcing an immediate reboot, when I open or watch YouTube in the Vivaldi browser with hardware acceleration enabled.

Jan 05 16:39:33 zesko kernel: mce: [Hardware Error]: CPU 11: Machine Check: 0 Bank 5: bea0000000000108
Jan 05 16:39:33 zesko kernel: mce: [Hardware Error]: TSC 0 ADDR 1ffff9cf29d14 MISC d012000100000000 SYND 4d000000 IPID 500b000000000 
Jan 05 16:39:33 zesko kernel: mce: [Hardware Error]: PROCESSOR 2:870f10 TIME 1641400770 SOCKET 0 APIC d microcode 8701021

Jan 05 16:41:17 zesko kernel: [drm:amdgpu_job_timedout [amdgpu]] *ERROR* ring vcn_dec timeout, signaled seq=6293, emitted seq=6294
Jan 05 16:41:17 zesko kernel: [drm:amdgpu_job_timedout [amdgpu]] *ERROR* Process information: process vivaldi-bin pid 2084 thread vivaldi-bi:cs0 pid 2114

This started after the stable update to Linux kernel 5.15.12 and Mesa.
My guess is that the RX 5700 GPU triggered the CPU to reboot the system.
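
One way to see how serious the raw record is: a small Python sketch that parses the "Machine Check" line from my log and checks two status bits. The UC (bit 61) and PCC (bit 57) positions are assumptions taken from the kernel's MCI_STATUS_* definitions, and classify_raw_mce is a made-up name. As far as I understand, an uncorrected error with PCC set leaves the kernel little choice but to stop, which would be consistent with the instant reboot, but that is my interpretation and not a diagnosis.

```python
import re

# Matches the undecoded record line, for example:
# "... mce: [Hardware Error]: CPU 11: Machine Check: 0 Bank 5: bea0000000000108"
RAW_LINE = re.compile(
    r"mce: \[Hardware Error\]: CPU (?P<cpu>\d+): Machine Check: \d+ "
    r"Bank (?P<bank>\d+): (?P<status>[0-9a-fA-F]+)"
)

UC_BIT = 61   # assumed position of the "uncorrected error" flag
PCC_BIT = 57  # assumed position of the "processor context corrupt" flag


def classify_raw_mce(line):
    """Parse one raw mce line; return None if the line does not match."""
    match = RAW_LINE.search(line)
    if match is None:
        return None
    status = int(match["status"], 16)
    return {
        "cpu": int(match["cpu"]),
        "bank": int(match["bank"]),
        "uncorrected": bool((status >> UC_BIT) & 1),
        "context_corrupt": bool((status >> PCC_BIT) & 1),
    }
```

For the Bank 5 line above, both uncorrected and context_corrupt come out as true.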

I tried turning off PBO in the BIOS settings of my ASUS Prime B450 motherboard; so far it has not crashed again. I will wait and see.


Before the stable Manjaro update, the log looked different, with no MCE error, but the behavior was the same (it crashed while watching YouTube):

Okt 17 20:22:31 zesko kernel: [drm:amdgpu_job_timedout [amdgpu]] *ERROR* ring vcn_dec timeout, signaled seq=2315893, emitted seq=2315895
Okt 17 20:22:31 zesko kernel: [drm:amdgpu_job_timedout [amdgpu]] *ERROR* Process information: process vivaldi-bin pid 45233 thread vivaldi-bi:cs0 pid 45253
Okt 17 20:22:40 zesko kernel: [drm:amdgpu_cs_ioctl [amdgpu]] *ERROR* Failed to initialize parser -125!
Okt 17 20:22:40 zesko kernel: [drm:amdgpu_cs_ioctl [amdgpu]] *ERROR* Failed to initialize parser -125!
Okt 17 20:22:40 zesko kernel: [drm:amdgpu_cs_ioctl [amdgpu]] *ERROR* Failed to initialize parser -125!

It looks like your issue: DE froze with graphic glitches... lots of kernel, drm, and amdgpu entries in journal


I read on a few other forums that some people have had more problems with MSI motherboards than with ASUS motherboards.

hmm, that would be annoying if it’s true… as I felt the MSI-Unify was pretty much the perfect spec/cost ratio for my move to AM4-x570 for the 5600X CPU… but any company can build bad/good boards; sometimes it’s its own lottery :wink:

For me, the MCE error is perhaps related to the RAM frequency being too low: the ASUS BIOS sets it to 2133 MHz by default.
A week ago I changed it from 2133 MHz to “D.O.C.P” (similar to XMP), which runs the RAM at 3200 MHz, its optimized frequency. So far no MCE error has come up.