Ryzen mce Issue at Boot

I’ve been looking into the issue of Hardware Errors occurring with Ryzen CPUs running Linux. I have a new PC running Manjaro XFCE on a Ryzen 9 5950X CPU. I’ve checked RAM extensively, and it’s fine. The system is now basically running at XMP settings, though I have down-clocked the RAM from its rated 4000MHz to 3800MHz to sync MCLK, UCLK and FCLK at 1900MHz. I’ve noticed the following:

  1. On a cold boot (computer off, turn it on), the Hardware Error does not occur.
  2. On a restart, the Hardware Error does occur.
  3. The boot process starts with a single core. It is at the point where Linux enables all cores that the error occurs. It usually points to a random memory location. Here is an example:

[ 0.588103] smpboot: CPU0: AMD Ryzen 9 5950X 16-Core Processor (family: 0x19, model: 0x21, stepping: 0x0)
[ 0.588145] mce: [Hardware Error]: Machine check events logged
[ 0.588146] mce: [Hardware Error]: CPU 0: Machine Check: 0 Bank 27: d82000000002080b
[ 0.588148] mce: [Hardware Error]: TSC 0 MISC d01205c100000000 SYND 5a020001 IPID 1002e00000500
[ 0.588151] mce: [Hardware Error]: PROCESSOR 2:a20f10 TIME 1615416808 SOCKET 0 APIC 0 microcode a201009

  4. After the error, boot proceeds normally as if nothing has happened. HOWEVER, I sometimes notice strange behavior on the desktop. For example, a toast notification pops up telling me “Sticky keys are enabled” even though I have all assistive modes turned off. I have NOT experienced any freezes.
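
To make it easier to compare reports, here is a rough Python sketch that pulls the CPU, bank and status value out of a "Machine Check" line like the one in the example log and decodes the top flag bits of the status register. The flag positions (bits 63 to 57) are the architectural MCi_STATUS bits; treating bits 21:16 as AMD's extended error code is my assumption, and the names `parse_mce_line` and `decode_status` are made up for this post. Take it as a starting point, not a full decoder.

```python
import re

# Architectural MCi_STATUS flag bits (same positions on AMD and Intel).
STATUS_FLAGS = {
    63: "VAL",    # register holds a valid error
    62: "OVER",   # an earlier error was overwritten
    61: "UC",     # uncorrected error
    60: "EN",     # error reporting was enabled
    59: "MISCV",  # MISC register is valid
    58: "ADDRV",  # ADDR register is valid
    57: "PCC",    # processor context corrupt
}

# Matches the kernel line "CPU 0: Machine Check: 0 Bank 27: d82000000002080b".
MCE_LINE = re.compile(r"CPU (\d+): Machine Check: \d+ Bank (\d+): ([0-9a-fA-F]+)")


def decode_status(status):
    """Split a 64-bit MCi_STATUS value into flag names and error codes."""
    flags = [
        name
        for bit, name in sorted(STATUS_FLAGS.items(), reverse=True)
        if (status >> bit) & 1
    ]
    return {
        "flags": flags,
        "error_code": status & 0xFFFF,
        # Assumption: AMD keeps the extended error code in bits 21:16.
        "extended_error_code": (status >> 16) & 0x3F,
    }


def parse_mce_line(line):
    """Return cpu, bank and decoded status for one dmesg line, or None."""
    match = MCE_LINE.search(line)
    if match is None:
        return None
    status = int(match.group(3), 16)
    return {
        "cpu": int(match.group(1)),
        "bank": int(match.group(2)),
        **decode_status(status),
    }
```

Fed the "Bank 27" line from my log, this reports VAL, OVER, EN and MISCV set, with UC and PCC clear, so the logged event is at least not flagged as uncorrected. If you see the same error, posting the decoded bank and flags along with whether it was a cold boot or a restart would help establish the pattern.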

This problem goes back to the first generation Ryzen processors, may affect all Linux distributions, and AMD apparently doesn’t think it’s a problem because they haven’t addressed it. I have tried dozens of different settings in the BIOS, but nothing makes any difference. My solution for now is to cold boot the machine, get a clean boot, and just leave it running. It only uses about 55 watts at idle.

I would like to ask if anyone else can confirm this pattern for me. If there is a consistent pattern, we may be able to find a solution. Right now there are many suggested “fixes” on the web, but none of them seem to be definitive solutions.

Have you tried these kernel boot options?

processor.max_cstate=5 
iommu=pt 
rcu_nocbs=0-31
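
After rebooting, it is worth confirming the options actually reached the kernel, since a typo in the bootloader config is easy to miss. A minimal Python sketch, assuming the running command line is readable from /proc/cmdline; the helper names `parse_cmdline` and `read_cmdline` are just made up here, and values containing quoted spaces are not handled:

```python
def parse_cmdline(cmdline):
    """Turn a kernel command line into a dict of parameter -> value.

    Parameters without '=' (such as 'quiet') map to None.
    """
    params = {}
    for token in cmdline.split():
        key, sep, value = token.partition("=")
        params[key] = value if sep else None
    return params


def read_cmdline(path="/proc/cmdline"):
    """Read the command line of the running kernel (Linux only)."""
    with open(path) as f:
        return f.read().strip()


if __name__ == "__main__":
    active = parse_cmdline(read_cmdline())
    for name in ("processor.max_cstate", "iommu", "rcu_nocbs"):
        if name in active:
            print(f"{name} = {active[name]}")
        else:
            print(f"{name} is NOT set")
```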

I’ve tried processor.max_cstate=5 before, but it didn’t help. IOMMU is disabled in BIOS. I’ll try the set, but right now I’m mainly interested in determining if there is a consistent pattern to these errors.

Okay, I enabled IOMMU in BIOS and tried all three kernel parameters together. The error still occurs.