System auto-rebooted... mce: [Hardware Error] in dmesg related to CPU

Yes, AMD Ryzen 5 5600X. I also found that large, hard-to-follow kernel bug thread, and another similar issue reported here @ Ryzen mce Issue at Boot

I was somewhat glad to read in your thread that this happens very infrequently for you, considering some people reported very frequent issues. I quite honestly hope this first occurrence for me is my last.

One of the comments in the kernel thread suggested some 3000 series CPU users felt changes AMD made for the 5000 series (my series) introduced the problem… but who knows? With so much packed into a constantly evolving kernel, it’s not outside the realm of possibility that there is a hard-to-nail-down issue in flight.

Did you happen to read the (year old) attached “possible fix” diff on that kernel bug @ https://bugzilla.kernel.org/attachment.cgi?id=290035&action=diff ? Found it interesting that the focus was on AMDGPU… and there have been a lot of ongoing AMDGPU changes.

The “Ryzen mce Issue at Boot” turned out to be an error caused by too aggressive RAM timings. I have no recollection of what they were, but I eventually worked it out.

Hey, how’d you know I referenced your thread? :slight_smile:

It’s good to hear you found a resolution @javaman . Sounds like a memtest might be a good test for that, although I know my G.Skill Trident Neo’s (with AMP/XMP enabled) passed 27hrs of memtest when I initially installed them.

And then that also makes me think of other newer technologies like “Resizable BAR”, which I believe I also have enabled for my 6800XT GPU.

But I think before I start adjusting things, I’m going to wait until I get a frequency of occurrence baseline greater than one. Hard to know if any change/test meant anything if you don’t at least have a time range as a frame of reference to know if things improved or got worse.

For right now, I’m going to stick with my original thought: stop running the Steam “idle game” minimized 24/7. Anything more aggressive at this point is random haystack diving.

I got the earlier post in email. I clicked the link referencing my post, and I said, “Oh, that’s me!”

I ran Memtest for days with no errors. I ran OCCT under Windows and got an error within 5 minutes. I thought I had a bad CPU. I THINK what I did was go back to JEDEC settings and gradually tune RAM from there. Or at least I backed off FCLK to 1800. It turned out that despite what Memtest said, FCLK 1900 was NOT stable. It is stable now after much messing around. Mostly. The mce error is gone. If you overclocked your CPU or RAM, I would go back to stock settings and see if that fixes your problem. You can always get a tuning/overclocking guide and start improving things from there.


The closest thing I have to an OC is running the AMP/XMP profile for my RAM… but I’ll definitely keep your suggestion in mind.

I’m going to keep this previous post of mine in mind with this issue.

Commonality between them being that 5.14.10 is my first non 5.13.x kernel… and I had originally delayed using 5.14 with all the AMDGPU issues/fixes I was finding related to it. So I think this may be “kernel growing pains”.

The Troubleshooting => Random Reboots section of Ryzen - ArchWiki suggests…

With Ryzen 5, particularly the enthusiast models of 5950X and 5900X there seem to be some slight instability issues under Linux, related possibly to the 5.11+ kernel, as shown by this kernel bug. After investigating and reading reports on the Internet I discovered that out of the box, Windows seems to run the CPUs at higher voltage and lower peak frequencies, compared to the stock Linux kernel, which depending on your draw from the silicon lottery could cause a host of random application crashes or hardware errors that lead to reboots. You will recognise those by dmesg logs that look like:

kernel: mce: [Hardware Error]: Machine check events logged
kernel: mce: [Hardware Error]: CPU 22: Machine Check: 0 Bank 1: bc800800060c0859
lightbringer kernel: mce: [Hardware Error]: TSC 0 ADDR 7ea8f5b00 MISC d012000000000000 IPID 100b000000000 
lightbringer kernel: mce: [Hardware Error]: PROCESSOR 2:a20f10 TIME 1636645367 SOCKET 0 APIC d microcode a201016

The CPU ID and the Processor number may vary. To solve this problem you need to supply higher voltage to your CPU so that it is stable when running at peak frequencies. The easiest way to achieve this is to use the AMD curve optimiser which is accessible via your motherboard’s bios. Access it and put a positive offset of 4 points, which will increase the voltage your CPU is getting at higher loads. It will limit overclocking potential due to higher heat dissipation requirements, but it will run stable. For more details check this forum post. When I did this for my 5950X, my processor stabilised and the frequency and voltage ranges were more similar to those observed under windows.

My 5600X is also a Ryzen 5… will look into this BIOS setting after my RAID Scrubbing completes… feeling a bit more comfortable with this after reviewing Voltage Curve Optimizer Overclocking for Zen 3 – Explained - Appuals.com and seeing how this applies the desired (small) voltage increase

It should also be kept in mind that “entering 10” means an offset of 30-50 mV in either direction, as each “count” is equal to + or – 3 to 5 mV. It is quite a complicated overclocking procedure but at the end of the day, this is the best method to overclock a Ryzen 5000 series CPU.

As with any CPU overclock, testing is extremely crucial and requires a lot of patience. Since we are dealing with automatic voltage adjustments while undervolting, the CPU might crash under idle conditions a lot due to aggressive undervolting while idle. On the contrary, stress testing might show that your CPU is completely stable.

I may also explore this BIOS option from Ryzen - ArchWiki as a preventative measure…

Freeze on shutdown, reboot and suspend
Note: With the latest AGESA firmware version 1.2.0.2 this problem might no longer occur.

This seems to be related to the C6 c-state, that does not seem to be well supported (if at all) in Linux.

To fix this issue, go into your BIOS settings for your motherboard and search for an option labeled something like this: “Power idle control”. Change its value to “Typical current idle”. Note that these names are dependent on what the motherboard manufacturer calls them, so they may be a little different in your particular case.

Other, less ideal solutions include disabling C-states in the BIOS or adding processor.max_cstate=1 to your kernel command line arguments.

Note: If you are just reading this solution and want to see more details/links and screenshots showing where I found the options in my BIOS… keep reading the thread. I think it is also important to note that a BIOS update was also performed to ensure I also had “AGESA 1.2.0.2” (as noted above)

Note 2: Unfortunately/Fortunately so far… I’ve only had a random reboot trigger once (between July 11, 2021 and today Dec 9, 2021)… so it’s difficult to validate that “these BIOS changes were 100% my fix”… other than putting some faith in the Arch post being correct; until it proves not to be.


Yes, the C6 c-state is a problem… here is an old post of mine:

There I recommend restricting the C-state to a maximum of 5.


I can not reproduce this issue when using Ryzen 3600 without X. :man_shrugging:

please explain what “x” means… xfce? x11?

I mean Ryzen 3600X

ah, gotcha… 3600 versus 3600X

The issue is probably in the BIOS of the mainboard. What is your mainboard?

Did you update to the latest BIOS version?

Ah

Here is a new BIOS version (release: 2021-09-27).

I’ll be experimenting with the two BIOS tweaks noted above from Ryzen - ArchWiki.

My BIOS is the latest (from earlier this year) “non alpha/beta AGESA” (no letter code after the version)… 7C35vA9 provides Update to AMD ComboAM4PIV2 1.2.0.2… when a newer BIOS with the final/official ‘AGESA 1.2.0.3’ (no a/b/c suffix) is released is when I’ll update my BIOS next.

EDIT:
Oops, just re-reviewed my inxi details and noticed I’m on 7C35vA8 (AGESA 1.2.0.0) and not 7C35vA9 (AGESA 1.2.0.2) like I thought… so definitely a BIOS update to 7C35vA9 is in order!

Okay, BIOS updated to 7C35vA9 with AGESA 1.2.0.2.

Wasn’t able to find the two BIOS settings I wanted, but…

  1. Regarding “AMD curve optimiser” - it might be gated behind the PBO = Auto default setting that I have never played with.

  2. Regarding “Power idle control” - like the Ryzen - ArchWiki mentioned… having AGESA 1.2.0.2 might help here now… although I found in my settings that I had actually disabled C-States at one point… hmm :thinking: I do hate anything that smells like hibernation

There were a couple noteworthy RAM settings I found set to Auto that may be worth looking at in the future… more so the second “Power Down” one:

What is your kernel boot command line?
I run a 3600XT on a Gigabyte X570 board.
You can add these parameters:

processor.max_cstate=5 amd_iommu=on rcu_nocbs=0-11

The processor scheduler under Linux is based on ACPI, not on CPPC optimization.
That will come in time.

So try to get the best out of the DDR4 memory bus.
Avoid overclocking the CPU if possible; Linux still uses fixed frequencies (mine are 2200 min, 3800, and 4500 max).
For the higher models (5800X, 5900X, 5950X), better to try a lower CPU working frequency, with no undervolt.

Hi @stephane , and thank you for chiming in.

My grub is fairly default except for 2 changes I made:

  • removed quiet because I wanted to see more than the root file system check @ boot
  • added sysrq_always_enabled=1 to support REISUB; just in case
GRUB_CMDLINE_LINUX_DEFAULT="apparmor=1 security=apparmor udev.log_priority=3 sysrq_always_enabled=1"
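
To double-check which parameters are actually on a command line like the one above, a small parser can help. This is only a minimal sketch: `parse_cmdline` and `missing_params` are hypothetical helper names, and the sample string is the GRUB line from this post. On a live system the same text could be read from `/proc/cmdline`.

```python
import shlex


def parse_cmdline(cmdline: str) -> dict:
    """Split a kernel command line into {name: value}; bare flags map to None."""
    params = {}
    for token in shlex.split(cmdline):
        name, sep, value = token.partition("=")
        params[name] = value if sep else None
    return params


def missing_params(cmdline: str, wanted: list) -> list:
    """Return the names from `wanted` that are not present on the command line."""
    have = parse_cmdline(cmdline)
    return [name for name in wanted if name not in have]


# Sample taken from the GRUB_CMDLINE_LINUX_DEFAULT line in this post.
sample = "apparmor=1 security=apparmor udev.log_priority=3 sysrq_always_enabled=1"
print(parse_cmdline(sample))
print(missing_params(sample, ["sysrq_always_enabled", "processor.max_cstate"]))
```

Parsing into a dictionary rather than grepping for substrings avoids false hits when one parameter name is a prefix of another.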

I currently have C-States disabled in the BIOS. Looking at [SOLVED] Ryzen 5800X + x570 Aorus Pro on Linux: freeze on poweroff/reboot - Linux - Level1Techs Forums, the poster had tried playing with C-States, Cool&Quiet, and variations with a couple of the GRUB options you shared… but ultimately found that “Power idle control” (which I tried to find per Ryzen - ArchWiki) was their fix… I’m going to try to find that setting again… well, I’m glad I took BIOS pics, because I think I just found it right under the C-State setting! :laughing:


Going to update it to “Typical current idle” and re-enable C-States.

My BIOS has IOMMU set to AUTO… should I force that to Enabled if I’m adding amd_iommu=on in GRUB? Not sure what AUTO picks. BTW… Is it normal to find no results for lsmod | grep amd_iommu, modinfo amd_iommu, and find /lib/modules/$(uname -r) -type f | grep amd_iommu? Maybe it’s inside the kernel and not an external module?
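
On the module question: `amd_iommu` is a kernel command line parameter, and the AMD IOMMU driver is, as far as I know, normally compiled into the kernel rather than built as a loadable module, which would explain the empty `lsmod` and `modinfo` results. One way to check is to look up `CONFIG_AMD_IOMMU` in the running kernel's configuration. The sketch below assumes the kernel exposes its configuration at `/proc/config.gz` (not every distribution does); `config_value`, `describe` and `read_running_config` are hypothetical helper names.

```python
import gzip


def config_value(config_text: str, option: str):
    """Return the value of a kernel config option ('y', 'm', ...) or None if unset."""
    prefix = option + "="
    for line in config_text.splitlines():
        line = line.strip()
        if line.startswith(prefix):
            return line[len(prefix):]
    return None


def describe(config_text: str, option: str = "CONFIG_AMD_IOMMU") -> str:
    """Turn the raw config value into a short human-readable verdict."""
    value = config_value(config_text, option)
    if value == "y":
        return "built in (no module for lsmod/modinfo to find)"
    if value == "m":
        return "loadable module"
    return "not enabled" if value is None else f"set to {value}"


def read_running_config(path: str = "/proc/config.gz") -> str:
    """Read the running kernel's config; assumes the kernel provides this file."""
    with gzip.open(path, "rt") as handle:
        return handle.read()


# Usage on a live system (not run here):
#   print(describe(read_running_config()))
```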

Other than that, my CPU is not overclocked… all AUTO. My TridentZ Neo RAM is the closest thing I have to an OC by enabling A-XMP (profile 1)… DDR4-3600 16GBx2 CL16-19-19-39 @ 1.35V
Screenshot_20211205_171954

EDIT: BIOS updated…

re-enabled 'C-States' and moved 'Power Supply Idle Control' off AUTO

Now if only I could find an alternative (maybe a small bump in Load Line Calibration?) to the Curve Optimizer 4 point positive adjustment (I don’t want to enable PBO [Pandora’s Box Opening] just for that) then I’d be equal to the Ryzen - ArchWiki recommendations.

I figured out on my motherboard how to apply the “4 point positive curve optimizer” stated as the resolution to Random Reboots (with MCE errors) in the Ryzen - ArchWiki

  1. First I adjusted PBO from “Auto” => “Advanced”, which opened up 4 more sub-options “left on AUTO” (which was comforting), plus exposed the “Curve Optimizer” I was looking for.

  2. Then once inside “Curve Optimizer” I was able to change the CO from “Disabled” => “All Cores”, and magnitude from “0” => “4”… for my mobo, the sign defaulted to “positive”
    Screenshot_20211209_144516

BIOS save recap...

Screenshot_20211209_144859

I do not believe I have seen a temp increase either with the higher voltage (4pts is ~12-20 mV) at idle… memory told me my lowest CPU temp was around 36C, and I have observed 35.88C as the “lowest temp” so far.
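
The 12-20 mV figure follows from the 3 to 5 mV per count quoted from the Appuals article earlier. As a minimal sketch of that arithmetic (the per-count range is the article's approximation, not a measured value, and `co_offset_mv` is a hypothetical helper name):

```python
# Approximate millivolts per Curve Optimizer count, per the quoted article.
MV_PER_COUNT_RANGE = (3, 5)


def co_offset_mv(counts: int) -> tuple:
    """Return the (low, high) voltage offset in mV for a signed count value."""
    low, high = MV_PER_COUNT_RANGE
    a, b = counts * low, counts * high
    return (min(a, b), max(a, b))


print(co_offset_mv(4))    # (12, 20), the +4 setting used here
print(co_offset_mv(8))    # (24, 40), the fallback step mentioned in this thread
print(co_offset_mv(-10))  # (-50, -30), an undervolt of 10 counts
```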

Unfortunately/Fortunately so far… I’ve only had a random reboot trigger once (between July 11, 2021 and today Dec 9, 2021)… so it’s going to be difficult for me to validate that “this was 100% my fix”… other than putting some faith in the Arch post being correct.

I also have some confidence in both the updated firmware (AGESA 1.2.0.2) and “Power Idle Control = Typical current idle” working to prevent any freezing (had some odd freezing in the past)… so hopefully the trifecta of changes proves to combine and provide stability over time.

  1. BIOS @ AGESA 1.2.0.2 (will explore AGESA 1.2.0.3 once officially released… i.e. non-beta)
  2. Power Idle Control = Typical current idle
  3. Curve Optimizer = +4 on all cores (noting if +4 doesn’t yield good results, step up slowly/incrementally to +8 if required)

Note:

  • leaving “Global C-State Control = AUTO” until I experience a freeze with #1 or #2
  • switched IOMMU in the BIOS from “AUTO” => “enabled”… although I have not yet explored adding “amd_iommu=on” in grub… or if it’s really necessary as lspci -vvv seems to find it…
00:00.2 IOMMU: Advanced Micro Devices, Inc. [AMD] Starship/Matisse IOMMU
        Subsystem: Micro-Star International Co., Ltd. [MSI] Device 7c35
        Control: I/O- Mem- BusMaster- SpecCycle- MemWINV- VGASnoop- ParErr+ Stepping- SERR- FastB2B- DisINTx+
        Status: Cap+ 66MHz- UDF- FastB2B- ParErr- DEVSEL=fast >TAbort- <TAbort- <MAbort- >SERR- <PERR- INTx-
        Interrupt: pin ? routed to IRQ 27
        Capabilities: <access denied>
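
Seeing the device in `lspci` shows the hardware is present, but not necessarily that the kernel is using it. A possible cross-check, sketched below, is to count the group directories under `/sys/kernel/iommu_groups`, which the kernel populates when an IOMMU is active. `count_iommu_groups` is a hypothetical helper name, and treating "zero groups" as "IOMMU not in use" is an assumption of this sketch.

```python
import os


def count_iommu_groups(base: str = "/sys/kernel/iommu_groups") -> int:
    """Count IOMMU group directories; 0 means none found or the path is missing."""
    try:
        return sum(1 for entry in os.scandir(base) if entry.is_dir())
    except OSError:
        return 0


groups = count_iommu_groups()
if groups:
    print(f"IOMMU appears active: {groups} groups")
else:
    print("No IOMMU groups found")
```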

Daniel, I am here also to bother you again :slight_smile:

While your method is good for checking errors, I suggest a more powerful one: base it on the message priority that the OS (or the apps) assigned to the messages themselves. Other important messages, such as crashes or segmentation faults, may not contain the text “error” at all, or may use only a synonym (“broken”, “misbehaving detected”), so grep will filter them out of the output you see.

So to make your error filtering more flexible you can:

  1. Add the “ignore case” option (so that “error” in any letter case is matched too):
    sudo dmesg | grep -i Error
    
  2. As the next step up, switch to the built-in filtering by message priority:
    sudo dmesg --level err
    
  3. As the step after that, switch to listing messages from any boot, current or earlier, instead of only the recent messages of the current boot (Why `journalctl -k` usage preference could be better over `dmesg` while listing kernel events):
    journalctl -k --priority err --boot 0
    
  4. When you see no errors but suspect some misbehavior, switch to a lower priority:
    journalctl -k --priority warning --boot 0
    
  5. Since --priority warning includes all higher priorities, you can split the output into priority ranges, both for yourself and when reporting an issue:
    show warnings and everything of higher priority (--priority warning), or only warnings (--priority warning..warning), alongside a separate --priority err listing, so that you and your readers can tell which priority level each line has.
    This matters because --priority warning is the same as --priority emerg..warning, and high-priority messages can get lost in a wall of plain warnings.
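
The range behaviour described in step 5 can be modelled in a few lines. This is an illustrative sketch of the documented semantics using the standard syslog level names, not journalctl's own code; `expand_priority` is a hypothetical helper name.

```python
# Syslog priority names in order, from most severe (0) to least severe (7).
PRIORITIES = ["emerg", "alert", "crit", "err", "warning", "notice", "info", "debug"]


def expand_priority(spec: str) -> list:
    """Expand 'warning' or 'err..warning' into the list of levels it selects."""
    if ".." in spec:
        first, last = spec.split("..", 1)
    else:
        # A single level selects that level and everything more severe.
        first, last = "emerg", spec
    lo, hi = sorted((PRIORITIES.index(first), PRIORITIES.index(last)))
    return PRIORITIES[lo:hi + 1]


print(expand_priority("warning"))           # emerg through warning
print(expand_priority("warning..warning"))  # only warning
print(expand_priority("err"))               # emerg through err
```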

I think you (as an enthusiast) could be interested in this upgrade of “your firmware” to use more flexible error search methods :slight_smile:


Always willing to learn, so many thanks for the tips @alven !

Just to make sure I understand… --boot=0 is just the current boot, right? If I wanted the previous one, I’d need --boot=-1… or could I do something like --boot=0..-10 for the current and previous 10 boots?


man journalctl is my friend: if I wanted all boots, it looks like I can just specify --boot=all… other than that it appears to have no range option, but it can target a specific boot… i.e. --boot=-1

       -b [[ID][±offset]|all], --boot[=[ID][±offset]|all]
           Show messages from a specific boot. This will add a match for "_BOOT_ID=".

           The argument may be empty, in which case logs for the current boot will be shown.

           If the boot ID is omitted, a positive offset will look up the boots starting from the beginning of the journal, and an equal-or-less-than zero
           offset will look up boots starting from the end of the journal. Thus, 1 means the first boot found in the journal in chronological order, 2 the
           second and so on; while -0 is the last boot, -1 the boot before last, and so on. An empty offset is equivalent to specifying -0, except when the
           current boot is not the last boot (e.g. because --directory was specified to look at logs from a different machine).

           If the 32-character ID is specified, it may optionally be followed by offset which identifies the boot relative to the one given by boot ID.
           Negative values mean earlier boots and positive values mean later boots. If offset is not specified, a value of zero is assumed, and the logs for
           the boot given by ID are shown.

           The special argument all can be used to negate the effect of an earlier use of -b.
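
The offset rules in this excerpt are easy to misread, so here is a small model of them. It is a sketch of the documented behaviour only: `resolve_boot` works on a plain chronological list instead of a real journal, and `last_n_boot_args` is one assumed way to emulate the missing range option by generating one `--boot=` argument per boot. Both names are hypothetical.

```python
def resolve_boot(boots: list, offset: int = 0):
    """Pick a boot from a chronological list (oldest first) by journalctl-style offset.

    Positive offsets count from the start of the journal (1 is the first boot);
    zero and negative offsets count back from the end (0 is the last boot).
    Returns None when the offset points outside the list.
    """
    if offset > 0:
        index = offset - 1
    else:
        index = len(boots) - 1 + offset
    return boots[index] if 0 <= index < len(boots) else None


def last_n_boot_args(n: int) -> list:
    """Build --boot= arguments for the current boot and the n boots before it."""
    return [f"--boot={-k}" for k in range(n + 1)]


boots = ["boot-a", "boot-b", "boot-c"]   # placeholder names, oldest first
print(resolve_boot(boots, 0))            # boot-c (the last boot)
print(resolve_boot(boots, -1))           # boot-b
print(resolve_boot(boots, 1))            # boot-a
print(last_n_boot_args(2))               # ['--boot=0', '--boot=-1', '--boot=-2']
```

Each generated argument would then go into its own `journalctl -k --priority err` call, since a single call takes one boot selector.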

I can’t imagine a time I’d want to use the boot id… but that’s pretty cool the offset works with it… so if I had a baseline boot I was focused on with a boot id of 33a33dcf74e249e3b3105aa38c89e12a, I could do a…

  • --boot=33a33dcf74e249e3b3105aa38c89e12a for its details
  • --boot=33a33dcf74e249e3b3105aa38c89e12a-1 for the boot previous to it, for comparison
  • --boot=33a33dcf74e249e3b3105aa38c89e12a+1 for the boot after it, for comparison

Hmm… I think I just thought of a use for boot id… so just like I only had the out of the blue reboot once… instead of having to remember an ever changing offset … I just note the boot id and always be able to reference it…

$ journalctl -k --priority err --boot 5c9af688fbef4d82af93ec5dc7b786cf
-- Journal begins at Tue 2021-07-13 15:47:15 CDT, ends at Thu 2021-12-09 19:01:39 CST. --
Nov 08 19:05:25 AM4-5600X-Linux kernel: mce: [Hardware Error]: CPU 8: Machine Check: 0 Bank 5: bea0000001000108
Nov 08 19:05:25 AM4-5600X-Linux kernel: mce: [Hardware Error]: TSC 0 ADDR ffffffc0ecab9c MISC d0130fff00000000 SYND 4d000000 IPID 500b000000000 
Nov 08 19:05:25 AM4-5600X-Linux kernel: mce: [Hardware Error]: PROCESSOR 2:a20f10 TIME 1636419924 SOCKET 0 APIC 5 microcode a201009

That’s cool, I can always reference that one boot by id.
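
For anyone curious what the long hex value after "Bank 5:" means: it is the machine check status register, and its top bits are architectural flags. The sketch below decodes only those well-known top bits (63 down to 57); the lower bits are vendor and bank specific and would need AMD's documentation to interpret. `parse_mce` and the regular expression are hypothetical helpers written for the line format shown in this thread.

```python
import re

# Architectural flag bits at the top of the MCA status register.
STATUS_FLAGS = {
    63: "VAL",    # register contents are valid
    62: "OVER",   # an earlier error was overwritten
    61: "UC",     # error was not corrected by hardware
    60: "EN",     # reporting was enabled for this error
    59: "MISCV",  # MISC register holds valid data
    58: "ADDRV",  # ADDR register holds valid data
    57: "PCC",    # processor context may be corrupt
}

MCE_RE = re.compile(r"CPU (\d+): Machine Check: \d+ Bank (\d+): ([0-9a-fA-F]{16})")


def parse_mce(line: str):
    """Extract CPU, bank and decoded status flags from an mce log line, or None."""
    match = MCE_RE.search(line)
    if not match:
        return None
    status = int(match.group(3), 16)
    flags = [name for bit, name in sorted(STATUS_FLAGS.items(), reverse=True)
             if status >> bit & 1]
    return {"cpu": int(match.group(1)), "bank": int(match.group(2)),
            "status": status, "flags": flags}


line = ("Nov 08 19:05:25 AM4-5600X-Linux kernel: mce: [Hardware Error]: "
        "CPU 8: Machine Check: 0 Bank 5: bea0000001000108")
print(parse_mce(line))
```

For the Nov 08 line this reports an uncorrected error (UC) with the PCC flag set, which would be consistent with the machine rebooting rather than carrying on.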


Well look at that… using $ journalctl -k --priority err --boot all | grep -i mce: I can see that my 1st mce error was actually back on August 11 (Nov 8 was my second)…

$ journalctl -k --priority err --boot all | grep -i mce:
Aug 11 01:19:12 AM4-x5600-Linux kernel: mce: [Hardware Error]: CPU 2: Machine Check: 0 Bank 5: bea0000000000108
Aug 11 01:19:12 AM4-x5600-Linux kernel: mce: [Hardware Error]: TSC 0 ADDR ffffffc0f0873e MISC d012000100000000 SYND 4d000000 IPID 500b000000000 
Aug 11 01:19:12 AM4-x5600-Linux kernel: mce: [Hardware Error]: PROCESSOR 2:a20f10 TIME 1628662751 SOCKET 0 APIC 4 microcode a201009
Nov 08 19:05:25 AM4-5600X-Linux kernel: mce: [Hardware Error]: CPU 8: Machine Check: 0 Bank 5: bea0000001000108
Nov 08 19:05:25 AM4-5600X-Linux kernel: mce: [Hardware Error]: TSC 0 ADDR ffffffc0ecab9c MISC d0130fff00000000 SYND 4d000000 IPID 500b000000000 
Nov 08 19:05:25 AM4-5600X-Linux kernel: mce: [Hardware Error]: PROCESSOR 2:a20f10 TIME 1636419924 SOCKET 0 APIC 5 microcode a201009
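
With two events on record, the "frequency of occurrence baseline" mentioned earlier in the thread can finally be put into numbers. The TIME field in the PROCESSOR lines is a Unix timestamp, so the gap can be computed directly from the log. A minimal sketch, with `mce_times` and `days_between` as hypothetical helper names:

```python
import re
from datetime import datetime, timezone

TIME_RE = re.compile(r"\bTIME (\d+)\b")


def mce_times(lines) -> list:
    """Collect the TIME fields of mce PROCESSOR lines as UTC datetimes."""
    times = []
    for line in lines:
        match = TIME_RE.search(line)
        if match:
            times.append(datetime.fromtimestamp(int(match.group(1)), tz=timezone.utc))
    return times


def days_between(times) -> list:
    """Return the gaps, in days, between consecutive events."""
    ordered = sorted(times)
    return [(b - a).total_seconds() / 86400 for a, b in zip(ordered, ordered[1:])]


log = [
    "Aug 11 01:19:12 AM4-x5600-Linux kernel: mce: [Hardware Error]: PROCESSOR 2:a20f10 TIME 1628662751 SOCKET 0 APIC 4 microcode a201009",
    "Nov 08 19:05:25 AM4-5600X-Linux kernel: mce: [Hardware Error]: PROCESSOR 2:a20f10 TIME 1636419924 SOCKET 0 APIC 5 microcode a201009",
]
for gap in days_between(mce_times(log)):
    print(f"{gap:.1f} days between events")
```

For the two events above the gap comes out at just under 90 days, so any BIOS change would need to hold for noticeably longer than that before it says much.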

Yah… I was just under 1 month in on GNU/Linux back then… so I’m not surprised that went straight over my head… probably didn’t know how to look beyond the 1,000 entry limit of KSystemLog yet.

Don’t get me wrong… lots still goes over my head, maybe just a bit less now :rofl:


extra brownie points

I really liked seeing the -- Journal begins at... line that journalctl outputs before the grep… and couldn’t figure out how to do an “OR” to keep it in as well, but found (via DDG university) that egrep can do it easily…

$ journalctl -k --priority emerg --boot all | egrep -i 'mce:|-- Journal'
-- Journal begins at Tue 2021-07-13 15:47:15 CDT, ends at Thu 2021-12-09 19:56:03 CST. --
Aug 11 01:19:12 AM4-x5600-Linux kernel: mce: [Hardware Error]: CPU 2: Machine Check: 0 Bank 5: bea0000000000108
Aug 11 01:19:12 AM4-x5600-Linux kernel: mce: [Hardware Error]: TSC 0 ADDR ffffffc0f0873e MISC d012000100000000 SYND 4d000000 IPID 500b000000000 
Aug 11 01:19:12 AM4-x5600-Linux kernel: mce: [Hardware Error]: PROCESSOR 2:a20f10 TIME 1628662751 SOCKET 0 APIC 4 microcode a201009
Nov 08 19:05:25 AM4-5600X-Linux kernel: mce: [Hardware Error]: CPU 8: Machine Check: 0 Bank 5: bea0000001000108
Nov 08 19:05:25 AM4-5600X-Linux kernel: mce: [Hardware Error]: TSC 0 ADDR ffffffc0ecab9c MISC d0130fff00000000 SYND 4d000000 IPID 500b000000000 
Nov 08 19:05:25 AM4-5600X-Linux kernel: mce: [Hardware Error]: PROCESSOR 2:a20f10 TIME 1636419924 SOCKET 0 APIC 5 microcode a201009
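
For reference, `egrep` is the older spelling of `grep -E`, and the `|` in the pattern is plain regular expression alternation. The same "OR" filter can be sketched in Python for anyone post-processing saved journal output; `keep_matching` is a hypothetical helper name and the sample lines are shortened from the output above.

```python
import re

# Same alternation as the egrep pattern, matched case-insensitively like -i.
PATTERN = re.compile(r"mce:|-- Journal", re.IGNORECASE)


def keep_matching(lines) -> list:
    """Keep only lines that contain either alternative of the pattern."""
    return [line for line in lines if PATTERN.search(line)]


sample = [
    "-- Journal begins at Tue 2021-07-13 15:47:15 CDT, ends at Thu 2021-12-09 19:56:03 CST. --",
    "Aug 11 01:19:12 AM4-x5600-Linux kernel: mce: [Hardware Error]: CPU 2: Machine Check: 0 Bank 5: bea0000000000108",
    "Aug 11 01:20:00 AM4-x5600-Linux kernel: some unrelated message",
]
for line in keep_matching(sample):
    print(line)
```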

I also duplicated this command swapping the priority with “emerg” (0), “alert” (1), and “crit” (2) to see how high these errors were actually flagged… and they survived all the way to “emerg”, so pretty important.

And I also learned from a DistroTube video that when I’m in a man page I can type “/” to enter a search term, which helps me find what I am looking for.
