System auto-rebooted... mce: [Hardware Error] in dmesg related to CPU

The Troubleshooting => Random Reboots section of Ryzen - ArchWiki suggests…

With the Ryzen 5000 series, particularly the enthusiast 5950X and 5900X models, there seem to be some slight instability issues under Linux, possibly related to the 5.11+ kernel, as shown by this kernel bug. After investigating and reading reports on the Internet, I discovered that out of the box Windows seems to run the CPUs at higher voltage and lower peak frequencies than the stock Linux kernel, which, depending on your draw from the silicon lottery, could cause a host of random application crashes or hardware errors that lead to reboots. You will recognise those by dmesg logs that look like:

kernel: mce: [Hardware Error]: Machine check events logged
kernel: mce: [Hardware Error]: CPU 22: Machine Check: 0 Bank 1: bc800800060c0859
lightbringer kernel: mce: [Hardware Error]: TSC 0 ADDR 7ea8f5b00 MISC d012000000000000 IPID 100b000000000
lightbringer kernel: mce: [Hardware Error]: PROCESSOR 2:a20f10 TIME 1636645367 SOCKET 0 APIC d microcode a201016

The CPU ID and the Processor number may vary. To solve this problem you need to supply higher voltage to your CPU so that it is stable when running at peak frequencies. The easiest way to achieve this is to use the AMD Curve Optimizer, which is accessible via your motherboard’s BIOS. Access it and set a positive offset of 4 points, which will increase the voltage your CPU gets at higher loads. This will limit overclocking potential due to higher heat dissipation requirements, but the CPU will run stable. For more details check this forum post. When I did this for my 5950X, my processor stabilised and the frequency and voltage ranges were more similar to those observed under Windows.
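To make sense of log entries like the ones quoted above, here is a minimal Python sketch that pulls the CPU, bank and status word out of a "Machine Check" line and decodes the top status flags. The bit positions are the standard x86 machine-check status bits (63 down to 57); the function name `decode_mce_line` and the regular expression are my own and only assume the line layout shown in this post. It does not decode the vendor-specific lower bits.

```python
import re

# Top bits of the 64-bit machine-check status word (standard x86 MCA layout).
STATUS_FLAGS = {
    63: "VAL",    # register holds a valid error
    62: "OVER",   # an earlier error was overwritten
    61: "UC",     # uncorrected error
    60: "EN",     # error reporting was enabled
    59: "MISCV",  # MISC register holds valid data
    58: "ADDRV",  # ADDR register holds valid data
    57: "PCC",    # processor context corrupt
}

# Assumed line layout, taken from the dmesg sample in this post.
MCE_LINE = re.compile(
    r"CPU (?P<cpu>\d+): Machine Check: \d+ "
    r"Bank (?P<bank>\d+): (?P<status>[0-9a-fA-F]+)"
)


def decode_mce_line(line):
    """Return cpu, bank and set status flags, or None if the line does not match."""
    match = MCE_LINE.search(line)
    if match is None:
        return None
    status = int(match.group("status"), 16)
    flags = [name for bit, name in STATUS_FLAGS.items() if (status >> bit) & 1]
    return {
        "cpu": int(match.group("cpu")),
        "bank": int(match.group("bank")),
        "flags": flags,
    }


sample = (
    "kernel: mce: [Hardware Error]: CPU 22: Machine Check: 0 "
    "Bank 1: bc800800060c0859"
)
print(decode_mce_line(sample))
```

For the sample line, the UC flag is set and PCC is not, i.e. the status word reports an uncorrected error without flagging the processor context as corrupt.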

My 5600X is also a Ryzen 5000-series CPU… I will look into this BIOS setting after my RAID scrubbing completes… I am feeling a bit more comfortable with this after reviewing Voltage Curve Optimizer Overclocking for Zen 3 – Explained and seeing how it applies the desired (small) voltage increase:

It should also be kept in mind that “entering 10” means an offset of 30-50 mV in either direction, as each “count” is equal to + or – 3 to 5 mV. It is quite a complicated overclocking procedure, but at the end of the day this is the best method to overclock a Ryzen 5000 series CPU.

As with any CPU overclock, testing is crucial and requires a lot of patience. Since we are dealing with automatic voltage adjustments while undervolting, the CPU might crash frequently at idle because the undervolt is too aggressive there, even though stress testing might show that your CPU is completely stable.
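The count-to-millivolt arithmetic quoted above can be worked through with a tiny Python sketch. It only assumes the 3 to 5 mV per count figure from the quoted article; the function name `offset_range_mv` is made up for illustration, and real boards may behave differently.

```python
def offset_range_mv(counts, mv_per_count=(3, 5)):
    """Approximate (min, max) voltage offset in mV for a Curve Optimizer count.

    Positive counts raise the voltage, negative counts lower it.
    mv_per_count is the assumed 3 to 5 mV per count from the quoted article.
    """
    low, high = mv_per_count
    a, b = counts * low, counts * high
    return (min(a, b), max(a, b))


print(offset_range_mv(10))   # the "entering 10" case from the article
print(offset_range_mv(4))    # the positive offset of 4 points suggested by ArchWiki
print(offset_range_mv(-10))  # the same magnitude as an undervolt
```

So the positive offset of 4 points suggested by ArchWiki would be roughly +12 to +20 mV, which matches the "small" increase I was hoping for.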

I may also explore this BIOS option from Ryzen - ArchWiki as a preventative measure…

Freeze on shutdown, reboot and suspend
Note: With the latest AGESA firmware version 1.2.0.2 this problem might no longer occur.

This seems to be related to the C6 C-state, which does not seem to be well supported (if at all) in Linux.

To fix this issue, go into your BIOS settings for your motherboard and search for an option labeled something like this: “Power idle control”. Change its value to “Typical current idle”. Note that these names are dependent on what the motherboard manufacturer calls them, so they may be a little different in your particular case.

Other, less ideal solutions include disabling C-states in the BIOS or adding processor.max_cstate=1 to your kernel command line arguments.
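If you go the kernel command line route, it helps to confirm after a reboot that the option actually took effect. Below is a small Python sketch that parses a kernel command line string and reports the C-state limit. It assumes the parameter is spelled `processor.max_cstate`, as in the kernel's parameter documentation; the helper names and the sample command line are made up for illustration. On a running Linux system the real string can be read from /proc/cmdline.

```python
def cmdline_params(cmdline):
    """Split a kernel command line into a {name: value} dict (flags map to None)."""
    params = {}
    for token in cmdline.split():
        key, sep, value = token.partition("=")
        params[key] = value if sep else None
    return params


def max_cstate_limit(cmdline):
    """Return the processor.max_cstate value as an int, or None if absent or invalid."""
    value = cmdline_params(cmdline).get("processor.max_cstate")
    if value and value.isdigit():
        return int(value)
    return None


# Hypothetical command line; on a real system, read /proc/cmdline instead.
sample_cmdline = "BOOT_IMAGE=/vmlinuz-linux root=/dev/sda2 rw quiet processor.max_cstate=1"
print(max_cstate_limit(sample_cmdline))
```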

Note: If you are just reading this solution and want to see more details/links and screenshots showing where I found the options in my BIOS… keep reading the thread. I think it is also important to note that I updated the BIOS as well, to ensure I had “AGESA 1.2.0.2” (as noted above).

Note 2: Unfortunately/fortunately, so far I’ve only had a random reboot trigger once (between July 11, 2021 and today, Dec 9, 2021)… so it’s difficult to validate that “these BIOS changes were 100% my fix”… other than putting some faith in the Arch post being correct, until it proves not to be.
