System auto-rebooted... mce: [Hardware Error] in dmesg related to CPU

Daniel-I · 9 November 2021 01:48

My system rebooted while I was mousing around in Firefox (reading a forum thread), and after the reboot I found these errors, which I am hoping someone can help me understand and/or dig deeper into…

$ sudo dmesg | grep Error
[    0.342642] mce: [Hardware Error]: Machine check events logged
[    0.342643] mce: [Hardware Error]: CPU 8: Machine Check: 0 Bank 5: bea0000001000108
[    0.342650] mce: [Hardware Error]: TSC 0 ADDR ffffffc0ecab9c MISC d0130fff00000000 SYND 4d000000 IPID 500b000000000 
[    0.342654] mce: [Hardware Error]: PROCESSOR 2:a20f10 TIME 1636419924 SOCKET 0 APIC 5 microcode a201009
[    0.346709] RAS: Correctable Errors collector initialized.

This is the first time I have experienced a reboot while using my system. I migrated to Manjaro KDE Plasma back in July, and use the stable branch. Actually, that isn’t 100% true… I triggered a reboot once when I was first building/testing my first conky script… and learned I wasn’t alone in having that happen when working through the conky learning curve.

I sat on 5.13.x kernels ever since they were available, and just transitioned to 5.14.10 with the last 2021-10-16 Stable branch update. I thought I’d mention this since swapping kernel branches is a “recent” change… but I’m not sure if that’s significant since the 3-ish weeks since have gone smoothly? But then again, I’m not sure what/how a kernel issue manifests once triggered, so I can’t rule it out.

The only other “recent change” I can think of in the last week or so is that I have been letting a Steam “Idle game” run minimized in the background. Conky typically lists it as a process using ~7% CPU… but that’s split across the cores as no thread was pinned at/near 100%; considering that with SMT on a 6 core CPU, one 100% pinned thread (of 12) would be 8% CPU utilization. So it’s not like I was running with one thread pinned (or very near-pinned) 24/7 for about a week or more.
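The 8% figure can be checked with a quick calculation. This is only a sketch of the arithmetic, assuming the overall CPU percentage is averaged evenly across all hardware threads; the function name is made up for illustration.

```python
def pinned_share(pinned_threads: int, total_threads: int) -> float:
    """Overall CPU utilization (%) if `pinned_threads` each run at 100%."""
    return 100.0 * pinned_threads / total_threads


cores, threads_per_core = 6, 2  # 5600X with SMT enabled
total = cores * threads_per_core

# One fully pinned thread out of 12 hardware threads.
print(f"{pinned_share(1, total):.1f}%")
```

So a process showing ~7% overall could in principle be close to one saturated thread, which is why checking the per-thread load matters here.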

I learned a bit @ What are Machine Check Exceptions (or MCE)? - Advanced Clustering Technologies, and thought I could learn more from looking at the mcelog file it mentioned… but then learned at Machine-check exception - ArchWiki that feature has been deprecated… so not sure where to go from here.

inxi -F details

$ inxi -Fx
System:    Host: AM4-5600X-Linux Kernel: 5.14.10-1-MANJARO x86_64 bits: 64 compiler: gcc v: 11.1.0 Desktop: KDE Plasma 5.22.5
           Distro: Manjaro Linux base: Arch Linux
Machine:   Type: Desktop System: Micro-Star product: MS-7C35 v: 2.0 serial: <superuser required>
           Mobo: Micro-Star model: MEG X570 UNIFY (MS-7C35) v: 2.0 serial: <superuser required> UEFI: American Megatrends LLC.
           v: A.80 date: 01/22/2021
CPU:       Info: 6-Core model: AMD Ryzen 5 5600X bits: 64 type: MT MCP arch: Zen 3 rev: 0 cache: L2: 3 MiB
           flags: avx avx2 lm nx pae sse sse2 sse3 sse4_1 sse4_2 sse4a ssse3 svm bogomips: 88825
           Speed: 3452 MHz min/max: 2200/3700 MHz boost: enabled Core speeds (MHz): 1: 3452 2: 2729 3: 2815 4: 3030 5: 4246
           6: 4582 7: 3717 8: 3719 9: 3636 10: 2895 11: 4291 12: 4198
Graphics:  Device-1: Advanced Micro Devices [AMD/ATI] Navi 21 [Radeon RX 6800/6800 XT / 6900 XT]
           vendor: XFX Limited XFX Speedster MERC 319 driver: amdgpu v: kernel bus-ID: 2f:00.0
           Display: x11 server: X.Org 1.20.13 driver: loaded: amdgpu,ati unloaded: modesetting,radeon resolution:
           1: 2560x1440~144Hz 2: 2560x1440~144Hz
           OpenGL: renderer: AMD Radeon RX 6800 XT (SIENNA_CICHLID DRM 3.42.0 5.14.10-1-MANJARO LLVM 12.0.1)
           v: 4.6 Mesa 21.2.3 direct render: Yes
Audio:     Device-1: AMD Navi 21 HDMI Audio [Radeon RX 6800/6800 XT / 6900 XT] driver: snd_hda_intel v: kernel bus-ID: 2f:00.1
           Device-2: Advanced Micro Devices [AMD] Starship/Matisse HD Audio vendor: Micro-Star MSI driver: snd_hda_intel
           v: kernel bus-ID: 31:00.4
           Device-3: Corsair CORSAIR VIRTUOSO SE USB Gaming Headset type: USB driver: hid-generic,snd-usb-audio,usbhid
           bus-ID: 3-4:3
           Sound Server-1: ALSA v: k5.14.10-1-MANJARO running: yes
           Sound Server-2: sndio v: N/A running: no
           Sound Server-3: JACK v: 1.9.19 running: no
           Sound Server-4: PulseAudio v: 15.0 running: yes
           Sound Server-5: PipeWire v: 0.3.38 running: yes
Network:   Device-1: Realtek RTL8125 2.5GbE vendor: Micro-Star MSI driver: r8169 v: kernel port: f000 bus-ID: 27:00.0
           IF: enp39s0 state: up speed: 1000 Mbps duplex: full mac: 2c:f0:5d:ae:5e:89
Bluetooth: Device-1: Intel AX200 Bluetooth type: USB driver: btusb v: 0.8 bus-ID: 1-4:2
           Report: rfkill ID: hci0 rfk-id: 0 state: up address: see --recommends
RAID:      Device-1: md127 type: mdraid level: mirror status: active size: 7.28 TiB
           Info: report: 2/2 UU blocks: 7813893120 chunk-size: N/A
           Components: Online: 0: sdb1 1: sdc1
Drives:    Local Storage: total: 19.33 TiB used: 8.15 TiB (42.1%)
           ID-1: /dev/nvme0n1 vendor: Western Digital model: WDS100T1X0E-00AFY0 size: 931.51 GiB temp: 42.9 C
           ID-2: /dev/nvme1n1 vendor: Western Digital model: WDS100T3X0C-00SJG0 size: 931.51 GiB temp: 38.9 C
           ID-3: /dev/nvme2n1 vendor: Western Digital model: WDS100T1X0E-00AFY0 size: 931.51 GiB temp: 43.9 C
           ID-4: /dev/nvme3n1 vendor: Western Digital model: WDS200T2B0C-00PXH0 size: 1.82 TiB temp: 32.9 C
           ID-5: /dev/sda vendor: Samsung model: SSD 840 EVO 250GB size: 232.89 GiB
           ID-6: /dev/sdb vendor: Western Digital model: WD80EFAX-68KNBN0 size: 7.28 TiB
           ID-7: /dev/sdc vendor: Western Digital model: WD80EFAX-68KNBN0 size: 7.28 TiB
Partition: ID-1: / size: 915.53 GiB used: 459.93 GiB (50.2%) fs: ext4 dev: /dev/nvme2n1p2
           ID-2: /boot/efi size: 299.4 MiB used: 288 KiB (0.1%) fs: vfat dev: /dev/nvme2n1p1
Swap:      ID-1: swap-1 type: file size: 38 GiB used: 0 KiB (0.0%) file: /swapfile
Sensors:   System Temperatures: cpu: 42.0 C mobo: N/A gpu: amdgpu temp: 54.0 C
           Fan Speeds (RPM): N/A gpu: amdgpu fan: 0
Info:      Processes: 332 Uptime: 19m Memory: 31.27 GiB used: 4.79 GiB (15.3%) Init: systemd Compilers: gcc: 11.1.0
           Packages: 1413 Shell: Bash v: 5.1.8 inxi: 3.3.08

federation · 9 November 2021 04:37

Is your CPU a Ryzen? I have a 3600X and have seen a similar issue.

Daniel-I · 9 November 2021 05:07

yes, AMD Ryzen 5 5600X. I also found that large and hard to follow kernel bug thread, and found another similar issue/thread reported here @ Ryzen mce Issue at Boot

I was somewhat glad to read in your thread that this happens very infrequently for you, considering some people reported very frequent issues. I quite honestly hope this first occurrence for me is my last.

One of the comments in the kernel thread suggested some 3000 series CPU users felt changes AMD made for the 5000 series (my series) introduced the problem… but who knows? With so much packed into a constantly evolving kernel, it’s not outside of the realm of possibility there is a hard to nail-down issue in flight.

Did you happen to read the (year old) attached “possible fix” diff on that kernel bug @ https://bugzilla.kernel.org/attachment.cgi?id=290035&action=diff ? Found it interesting that the focus was on AMDGPU… and there have been a lot of ongoing AMDGPU changes.

javaman · 9 November 2021 05:42

The “Ryzen mce Issue at Boot” turned out to be an error caused by too aggressive RAM timings. I have no recollection of what they were, but I eventually worked it out.

Daniel-I · 9 November 2021 05:53

Hey, how’d you know I referenced your thread?

It’s good to hear you found a resolution @javaman . Sounds like a memtest might be a good test for that, although I know my G.Skill Trident Neo’s (with AMP/XMP enabled) passed 27hrs of memtest when I initially installed them.

And then that also makes me think of other newer technologies like “Resizable BAR” that I do believe I also have enabled for my 6800XT GPU.

But I think before I start adjusting things, I’m going to wait until I get a frequency of occurrence baseline greater than one. Hard to know if any change/test meant anything if you don’t at least have a time range as a frame of reference to know if things improved or got worse.

For right now, I’m going to stick with my original plan: no longer running the Steam “idle game” minimized 24/7. Anything more aggressive at this point is random haystack diving.

javaman · 9 November 2021 06:55

I got the earlier post in email. I clicked the link referencing my post, and I said, “Oh, that’s me!”

I ran Memtest for days with no errors. I ran OCCT under Windows and got an error within 5 minutes. I thought I had a bad CPU. I THINK what I did was go back to JEDEC settings and gradually tune RAM from there. Or at least I backed off FCLK to 1800. It turned out that despite what Memtest said, FCLK 1900 was NOT stable. It is stable now after much messing around. Mostly. The mce error is gone. If you overclocked your CPU or RAM, I would go back to stock settings and see if that fixes your problem. You can always get a tuning/overclocking guide and start improving things from there.

Daniel-I · 9 November 2021 17:13

The closest thing I have to an OC is running the AMP/XMP profile for my RAM… but I’ll definitely keep your suggestion in mind.

Daniel-I · 12 November 2021 22:38

I’m going to keep this previous post of mine in mind with this issue.

Commonality between them being that 5.14.10 is my first non 5.13.x kernel… and I had originally delayed using 5.14 with all the AMDGPU issues/fixes I was finding related to it. So I think this may be “kernel growing pains”.

Daniel-I · 5 December 2021 13:30

The Troubleshooting => Random Reboots section of Ryzen - ArchWiki suggests…

With Ryzen 5, particularly the enthusiast models of 5950X and 5900X there seem to be some slight instability issues under Linux, related possibly to the 5.11+ kernel, as shown by this kernel bug. After investigating and reading reports on the Internet I discovered that out of the box, Windows seems to run the CPUs at higher voltage and lower peak frequencies, compared to the stock Linux kernel, which depending on your draw from the silicon lottery could cause a host of random application crashes or hardware errors that lead to reboots. You will recognise those by dmesg logs that look like:
kernel: mce: [Hardware Error]: Machine check events logged
kernel: mce: [Hardware Error]: CPU 22: Machine Check: 0 Bank 1: bc800800060c0859
lightbringer kernel: mce: [Hardware Error]: TSC 0 ADDR 7ea8f5b00 MISC d012000000000000 IPID 100b000000000 
lightbringer kernel: mce: [Hardware Error]: PROCESSOR 2:a20f10 TIME 1636645367 SOCKET 0 APIC d microcode a201016
The CPU ID and the Processor number may vary. To solve this problem you need to supply higher voltage to your CPU so that it is stable when running at peak frequencies. The easiest way to achieve this is to use the AMD curve optimiser which is accessible via your motherboard’s BIOS. Access it and put a positive offset of 4 points, which will increase the voltage your CPU is getting at higher loads. It will limit overclocking potential due to higher heat dissipation requirements, but it will run stable. For more details check this forum post. When I did this for my 5950X, my processor stabilised and the frequency and voltage ranges were more similar to those observed under Windows.

My 5600X is also a Ryzen 5… I will look into this BIOS setting after my RAID scrubbing completes… I am feeling a bit more comfortable with this after reviewing Voltage Curve Optimizer Overclocking for Zen 3 – Explained and seeing how this applies the desired (small) voltage increase:

It should also be kept in mind that “entering 10” means an offset of 30-50 mV in either direction as each “count” is equal to + or – 3 to 5 mV. It is quite a complicated overclocking procedure but at the end of the day, this is the best method to overclock a Ryzen 5000 series CPU.

As with any CPU overclock, testing is extremely crucial and requires a lot of patience. Since we are dealing with automatic voltage adjustments while undervolting, the CPU might crash under idle conditions a lot due to aggressive undervolting while idle. On the contrary, stress testing might show that your CPU is completely stable.
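As a rough check on the count-to-millivolt figures quoted above, here is a small sketch of the conversion. It assumes only the 3 to 5 mV per count range given in that article; the actual step on any given CPU or board is not something this thread confirms, and the names are illustrative.

```python
MV_PER_COUNT = (3, 5)  # per-count range quoted in the article (assumed, not measured)


def offset_range_mv(counts: int) -> tuple[int, int]:
    """Approximate voltage offset range in mV for a Curve Optimizer magnitude."""
    low, high = MV_PER_COUNT
    return counts * low, counts * high


print(offset_range_mv(10))  # the article's "entering 10" example
print(offset_range_mv(4))   # the 4-point offset suggested by the ArchWiki
```

By this estimate the ArchWiki's 4-point positive offset works out to somewhere around 12 to 20 mV.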

I may also explore this BIOS option from Ryzen - ArchWiki as a preventative measure…

Freeze on shutdown, reboot and suspend
Note: With the latest AGESA firmware version 1.2.0.2 this problem might no longer occur.

This seems to be related to the C6 c-state, that does not seem to be well supported (if at all) in Linux.

To fix this issue, go into your BIOS settings for your motherboard and search for an option labeled something like this: “Power idle control”. Change its value to “Typical current idle”. Note that these names are dependent on what the motherboard manufacturer calls them, so they may be a little different in your particular case.

Other less ideal solutions include disabling c-states in the BIOS or adding processor.max_cstate=1 to your kernel command line arguments.

Note: If you are just reading this solution and want to see more details/links and screenshots showing where I found the options in my BIOS… keep reading the thread. I think it is also important to note that a BIOS update was also performed to ensure I also had “AGESA 1.2.0.2” (as noted above)

Note 2: Unfortunately/Fortunately so far… I’ve only had a random reboot trigger once (between July 11, 2021 and today Dec 9, 2021)… so it’s difficult to validate that “these BIOS changes were 100% my fix”… other than putting some faith in the Arch post being correct; until it proves not to be.

megavolt · 5 December 2021 14:19

Yes, the C6 c-state is a problem… here is an old post of mine:

There I recommend restricting the C-state to a maximum of 5.

Zesko · 5 December 2021 15:17

I can not reproduce this issue when using Ryzen 3600 without X.

Daniel-I · 5 December 2021 15:24

please explain what “x” means… xfce? x11?

Zesko · 5 December 2021 15:24

I mean Ryzen 3600X

Daniel-I · 5 December 2021 15:26

ah, gotcha… 3600 versus 3600X

Zesko · 5 December 2021 15:34

The issue is probably in the BIOS of mainboard. What is your mainboard?

Did you update to the latest version of the BIOS?

Ah

Here is a new version of the BIOS (Release: 2021-09-27).
https://www.msi.com/Motherboard/MEG-X570-UNIFY/support#down-bios

Daniel-I · 5 December 2021 15:44

I’ll be experimenting with the two BIOS tweaks noted above from Ryzen - ArchWiki.

My BIOS is the latest (from earlier this year) “non alpha/beta AGESA” (no letter code after the version)… 7C35vA9 provides Update to AMD ComboAM4PIV2 1.2.0.2… when a newer BIOS with the final/official ‘AGESA 1.2.0.3’ (no a/b/c suffix) is released is when I’ll update my BIOS next.

EDIT:
Oops, just re-reviewed my inxi details and noticed I’m on 7C35vA8 (AGESA 1.2.0.0) and not 7C35vA9 (AGESA 1.2.0.2) like I thought… so definitely a BIOS update to 7C35vA9 is in order!

Daniel-I · 5 December 2021 18:08

Okay, BIOS updated to 7C35vA9 with AGESA 1.2.0.2.

Wasn’t able to find the two BIOS settings I wanted, but…

Regarding “AMD curve optimiser” - it might be gated behind the PBO = Auto default setting that I have never played with.

Screenshot_20211205_115535
Regarding “Power idle control” - like the Ryzen - ArchWiki mentioned… having AGESA 1.2.0.2 might help here now… although I found in my settings that I had actually disabled C-States at one point… hmm I do hate anything that smells like hibernation

Screenshot_20211205_115630

There were a couple noteworthy RAM settings I found set to Auto that may be worth looking at in the future… more so the second “Power Down” one:

stephane · 5 December 2021 20:38

What is your kernel boot command line?
I run a 3600XT on a Gigabyte X570 board.
You can add these parameters:

processor.max_cstate=5 amd_iommu=on rcu_nocbs=0-11

The processor scheduler under Linux is based on ACPI, not CPPC optimization.
That will come in time.

So try to get the best out of the DDR4 memory bus.
Avoid overclocking the CPU if possible; Linux still uses the min, base, and max frequencies (mine are 2200, 3800, and 4500).
For models from the 5800X upward (5800X, 5900X, 5950X), it is better to try a lower CPU working frequency (no undervolt).

Daniel-I · 5 December 2021 23:15

Hi @stephane , and thank you for chiming in.

My grub is fairly default except for 2 changes I made:

removed quiet because I wanted to see more than the root file system check @ boot
added sysrq_always_enabled=1 to support REISUB; just in case

GRUB_CMDLINE_LINUX_DEFAULT="apparmor=1 security=apparmor udev.log_priority=3 sysrq_always_enabled=1"
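If parameters like the ones suggested earlier in the thread were to be added to that line, the merge could be sketched as below. This is a hypothetical helper, not a GRUB tool: it only manipulates the string, assumes parameters are separated by plain spaces (no quoted values containing spaces), and replaces an existing parameter of the same name rather than duplicating it.

```python
def merge_kernel_params(cmdline: str, new_params: list[str]) -> str:
    """Add or replace kernel parameters in a space-separated command line."""

    def key(param: str) -> str:
        # The parameter name is everything before the first "=".
        return param.split("=", 1)[0]

    new_keys = {key(p) for p in new_params}
    # Drop any existing parameter that is about to be overridden.
    kept = [p for p in cmdline.split() if key(p) not in new_keys]
    return " ".join(kept + new_params)


current = "apparmor=1 security=apparmor udev.log_priority=3 sysrq_always_enabled=1"
print(merge_kernel_params(current, ["processor.max_cstate=5", "amd_iommu=on"]))
```

The resulting string would go back between the quotes of GRUB_CMDLINE_LINUX_DEFAULT in /etc/default/grub, after which the GRUB config has to be regenerated for the change to take effect.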

I currently have C-States disabled in the BIOS. Looking at [SOLVED] Ryzen 5800X + x570 Aorus Pro on Linux: freeze on poweroff/reboot - Linux - Level1Techs Forums the poster had tried playing with C-States, Cool&Quiet, and variations with a couple of the GRUB options you shared… but ultimately found “Power idle control” (like I tried to find per Ryzen - ArchWiki) was their ultimate fix… I’m going to try to find that setting again… well I’m glad I took BIOS pics, because I think I just found it right under the C-State setting!

Going to update it to “Typical current idle” and re-enable C-States.

My BIOS has IOMMU set to AUTO… should I force that to Enabled if I’m adding amd_iommu=on in GRUB? Not sure what AUTO picks. BTW… Is it normal to find no results for lsmod | grep amd_iommu, modinfo amd_iommu, and find /lib/modules/$(uname -r) -type f | grep amd_iommu? Maybe it’s inside the kernel and not an external module?
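On the built-in versus module question: one way to tell is to look the option up in the kernel's build configuration. The sketch below classifies an option from config text; CONFIG_AMD_IOMMU is the upstream option name for the AMD IOMMU driver, while the sample text (including CONFIG_FOO) is made up for illustration. On a live system the text could come from /proc/config.gz, assuming the running kernel exposes it.

```python
def config_state(config_text: str, option: str) -> str:
    """Classify a kernel config option as built-in, module, or not set."""
    for line in config_text.splitlines():
        if line == f"{option}=y":
            return "built-in"  # compiled into the kernel image
        if line == f"{option}=m":
            return "module"    # built as a loadable module
    return "not set"


# Illustrative sample only, not taken from the poster's system.
sample = (
    "CONFIG_IOMMU_SUPPORT=y\n"
    "CONFIG_AMD_IOMMU=y\n"
    "# CONFIG_FOO is not set\n"
)
print(config_state(sample, "CONFIG_AMD_IOMMU"))
print(config_state(sample, "CONFIG_FOO"))
```

A built-in result would explain why lsmod, modinfo, and a search under /lib/modules all come up empty, since those only see loadable modules.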

Other than that, my CPU is not overclocked… all AUTO. My TridentZ Neo RAM is the closest thing I have to an OC by enabling A-XMP (profile 1)… DDR4-3600 16GBx2 CL16-19-19-39 @ 1.35V
Screenshot_20211205_171954

EDIT: BIOS updated…

re-enabled 'C-States' and moved 'Power Supply Idle Control' off AUTO

Now if only I could find an alternative (maybe a small bump in Load Line Calibration?) to the Curve Optimizer 4 point positive adjustment (I don’t want to enable PBO [Pandora’s Box Opening] just for that) then I’d be equal to the Ryzen - ArchWiki recommendations.

Daniel-I · 9 December 2021 21:07

I figured out on my motherboard how to apply the “4 point positive curve optimizer” stated as the resolution to Random Reboots (with MCE errors) in the Ryzen - ArchWiki…

1. First I adjusted PBO from “Auto” => “Advanced”, which opened up 4 more sub-options “left on AUTO” (which was comforting), plus exposed the “Curve Optimizer” I was looking for.

Screenshot_20211209_144056

2. Then once inside “Curve Optimizer” I was able to change the CO from “Disabled” => “All Cores”, and magnitude from “0” => “4”… for my mobo, the sign defaulted to “positive”

BIOS save recap...

Screenshot_20211209_144859

I do not believe I have seen a temp increase at idle either with the higher voltage (4 pts is ~12-20 mV)… from memory, my lowest CPU temp was around 36C, and I have observed 35.88C as the “lowest temp” so far.

Unfortunately/Fortunately so far… I’ve only had a random reboot trigger once (between July 11, 2021 and today Dec 9, 2021)… so it’s going to be difficult for me to validate that “this was 100% my fix”… other than putting some faith in the Arch post being correct.

I also have some confidence in both the updated firmware (AGESA 1.2.0.2) and “Power Idle Control = Typical current idle” working to prevent any freezing (had some odd freezing in the past)… so hopefully the trifecta of changes proves to combine and provide stability over time.

1. BIOS @ AGESA 1.2.0.2 (will explore AGESA 1.2.0.3 once officially released… i.e. non-beta)
2. Power Idle Control = Typical current idle
3. Curve Optimizer = +4 on all cores (noting if +4 doesn’t yield good results, step up slowly/incrementally to +8 if required)

Note:

leaving “Global C-State Control = AUTO” until I experience a freeze with #1 or #2
switched IOMMU in the BIOS from “AUTO” => “enabled”… although I have not yet explored adding “amd_iommu=on” in grub… or if it’s really necessary as lspci -vvv seems to find it…

00:00.2 IOMMU: Advanced Micro Devices, Inc. [AMD] Starship/Matisse IOMMU
        Subsystem: Micro-Star International Co., Ltd. [MSI] Device 7c35
        Control: I/O- Mem- BusMaster- SpecCycle- MemWINV- VGASnoop- ParErr+ Stepping- SERR- FastB2B- DisINTx+
        Status: Cap+ 66MHz- UDF- FastB2B- ParErr- DEVSEL=fast >TAbort- <TAbort- <MAbort- >SERR- <PERR- INTx-
        Interrupt: pin ? routed to IRQ 27
        Capabilities: <access denied>