System auto-rebooted... mce: [Hardware Error] in dmesg related to CPU

Daniel, I am here to bother you again :slight_smile:

Your method works for checking errors, but I suggest a more powerful one: filter on the message priority that the OS (or the apps) assigned to each message. Other important messages, such as a segmentation fault crash, may not contain the text “error” at all, or may use only a synonym (“broken”, “misbehaving detected”), so grep will filter them out of the output you see.
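
To see the gap in practice, here is a small sketch on made-up sample lines (they are for illustration only, not from a real log). A text match only finds the lines that happen to contain the word, and the other two problems slip through either way:

```shell
# Made-up sample lines for illustration only, not real kernel output.
sample_log() {
    cat <<'EOF'
kernel: mce: [Hardware Error]: Machine Check
kernel: disk: I/O error on device
kernel: app: segmentation fault
kernel: usb: misbehaving detected
EOF
}

# Case-sensitive match: counts only the line with the exact text "Error"
sample_log | grep -c 'Error'

# Case-insensitive match: also counts the lowercase "error" line,
# but the segmentation fault and "misbehaving" lines are still missed
sample_log | grep -ci 'error'
```

The first count is 1 and the second is 2, out of four lines that all describe a problem.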

So, to make your error filtering more flexible, you can:

  1. Add the “ignore case” option (so that lowercase “error” is found too):
    sudo dmesg | grep -i Error
    
  2. At the next level, switch to the built-in filtering by message priority:
    sudo dmesg --level err
    
  3. At the level after that, switch to `journalctl -k`, which can list kernel messages from any boot (current or earlier), not only the recent ones from the current boot. That is why `journalctl -k` can be preferable to `dmesg` for listing kernel events:
    journalctl -k --priority err --boot 0
    
  4. When you see no errors but still suspect some misbehavior, drop to a lower priority:
    journalctl -k --priority warning --boot 0
    
  5. --priority warning also includes all higher priorities, so for yourself, and when reporting an issue, you can split the output into parts by priority range:
    show warnings and higher (--priority warning) or only warnings (--priority warning..warning), alongside a separate --priority err listing, so that you and your readers can tell which priority level each line has.
    That matters because --priority warning is the same as --priority emerg..warning, and high-priority messages can get lost in a wall of plain warnings.
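
One way to do the split from item 5 could look like the sketch below. The function name kernel_log_by_priority and the section headers are my own invention; the journalctl options are the ones used above plus --no-pager:

```shell
# Print kernel messages for one boot, one section per priority level,
# so the level of each line is clear from the header above it.
# kernel_log_by_priority is a hypothetical helper name.
kernel_log_by_priority() {
    local boot="${1:-0}" level
    for level in emerg alert crit err warning; do
        printf '===== %s =====\n' "$level"
        # LEVEL..LEVEL limits the range to exactly that one priority
        journalctl -k --no-pager --priority "$level..$level" --boot="$boot"
    done
}

# Usage (current boot):
#   kernel_log_by_priority 0
```

Pasting that output into a report keeps the few err lines visibly separate from the many warning lines.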

I think you (as an enthusiast) might be interested in this upgrade of “your firmware” to more flexible error search methods :slight_smile:


Always willing to learn, so many thanks for the tips @alven !

Just to make sure I understand… --boot=0 is just the current boot, right? If I wanted the previous one, I’d need --boot=-1… or could I do something like --boot=0..-10 for the current and previous 10 boots?


man journalctl is my friend: if I wanted all boots, it looks like I can just specify --boot=all. Other than that, it appears to have no range option, but it can target a specific boot, e.g. --boot=-1

       -b [[ID][±offset]|all], --boot[=[ID][±offset]|all]
           Show messages from a specific boot. This will add a match for "_BOOT_ID=".

           The argument may be empty, in which case logs for the current boot will be shown.

           If the boot ID is omitted, a positive offset will look up the boots starting from the beginning of the journal, and an equal-or-less-than zero
           offset will look up boots starting from the end of the journal. Thus, 1 means the first boot found in the journal in chronological order, 2 the
           second and so on; while -0 is the last boot, -1 the boot before last, and so on. An empty offset is equivalent to specifying -0, except when the
           current boot is not the last boot (e.g. because --directory was specified to look at logs from a different machine).

           If the 32-character ID is specified, it may optionally be followed by offset which identifies the boot relative to the one given by boot ID.
           Negative values mean earlier boots and positive values mean later boots. If offset is not specified, a value of zero is assumed, and the logs for
           the boot given by ID are shown.

           The special argument all can be used to negate the effect of an earlier use of -b.
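
Since there seems to be no range option, a small loop over the offsets might do the job instead. This is only a sketch: errors_for_recent_boots is a made-up name, and it assumes the journal still holds that many boots (journalctl will complain about any offset it cannot find):

```shell
# Show kernel errors for the last N boots, newest first.
# errors_for_recent_boots is a hypothetical helper name.
errors_for_recent_boots() {
    local count="${1:-3}" offset=0
    while [ "$offset" -lt "$count" ]; do
        printf '===== boot -%s =====\n' "$offset"
        # -0 is the last boot, -1 the one before it, and so on
        journalctl -k --no-pager --priority err --boot="-$offset"
        offset=$((offset + 1))
    done
}

# Usage (current boot plus the previous 10):
#   errors_for_recent_boots 11
```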

I can’t imagine a time I’d want to use the boot ID… but it’s pretty cool that the offset works with it. So if I had a baseline boot I was focused on, with a boot ID of 33a33dcf74e249e3b3105aa38c89e12a, I could do…

  • --boot=33a33dcf74e249e3b3105aa38c89e12a for its details
  • --boot=33a33dcf74e249e3b3105aa38c89e12a-1 for the boot before it, to compare
  • --boot=33a33dcf74e249e3b3105aa38c89e12a+1 for the boot after it, to compare
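
Put together, the three lookups could be wrapped like this. It is a rough sketch with a made-up function name; the offset is appended directly to the 32-character ID, with no space in between:

```shell
# Show kernel errors for the boot before, the boot itself, and the boot after.
# compare_around_boot is a hypothetical helper name.
compare_around_boot() {
    local id="$1" rel
    for rel in -1 '' +1; do
        printf '===== boot %s%s =====\n' "$id" "$rel"
        # An empty offset means the boot given by the ID itself
        journalctl -k --no-pager --priority err --boot="$id$rel"
    done
}

# Usage:
#   compare_around_boot 33a33dcf74e249e3b3105aa38c89e12a
```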

Hmm… I think I just thought of a use for the boot ID. Since I only had the out-of-the-blue reboot once, instead of having to remember an ever-changing offset, I can just note the boot ID and always be able to reference it…

$ journalctl -k --priority err --boot 5c9af688fbef4d82af93ec5dc7b786cf
-- Journal begins at Tue 2021-07-13 15:47:15 CDT, ends at Thu 2021-12-09 19:01:39 CST. --
Nov 08 19:05:25 AM4-5600X-Linux kernel: mce: [Hardware Error]: CPU 8: Machine Check: 0 Bank 5: bea0000001000108
Nov 08 19:05:25 AM4-5600X-Linux kernel: mce: [Hardware Error]: TSC 0 ADDR ffffffc0ecab9c MISC d0130fff00000000 SYND 4d000000 IPID 500b000000000 
Nov 08 19:05:25 AM4-5600X-Linux kernel: mce: [Hardware Error]: PROCESSOR 2:a20f10 TIME 1636419924 SOCKET 0 APIC 5 microcode a201009

That’s cool, I can always reference that one boot by id.


Well look at that… using $ journalctl -k --priority err --boot all | grep -i mce: I can see that my 1st mce error was actually back on August 11 (Nov 8 was my second)…

$ journalctl -k --priority err --boot all | grep -i mce:
Aug 11 01:19:12 AM4-x5600-Linux kernel: mce: [Hardware Error]: CPU 2: Machine Check: 0 Bank 5: bea0000000000108
Aug 11 01:19:12 AM4-x5600-Linux kernel: mce: [Hardware Error]: TSC 0 ADDR ffffffc0f0873e MISC d012000100000000 SYND 4d000000 IPID 500b000000000 
Aug 11 01:19:12 AM4-x5600-Linux kernel: mce: [Hardware Error]: PROCESSOR 2:a20f10 TIME 1628662751 SOCKET 0 APIC 4 microcode a201009
Nov 08 19:05:25 AM4-5600X-Linux kernel: mce: [Hardware Error]: CPU 8: Machine Check: 0 Bank 5: bea0000001000108
Nov 08 19:05:25 AM4-5600X-Linux kernel: mce: [Hardware Error]: TSC 0 ADDR ffffffc0ecab9c MISC d0130fff00000000 SYND 4d000000 IPID 500b000000000 
Nov 08 19:05:25 AM4-5600X-Linux kernel: mce: [Hardware Error]: PROCESSOR 2:a20f10 TIME 1636419924 SOCKET 0 APIC 5 microcode a201009

Yah… I was just under 1 month in on GNU/Linux back then… so I’m not surprised that went straight over my head… probably didn’t know how to look beyond the 1,000 entry limit of KSystemLog yet.

Don’t get me wrong… lots still goes over my head, maybe just a bit less now :rofl:


extra brownie points

I really liked seeing the -- Journal begins at... line that journalctl outputs before the grep… and couldn’t figure out how to do an “OR” to keep it in as well, but found (via DDG university) that egrep can do it easily…

$ journalctl -k --priority emerg --boot all | egrep -i 'mce:|-- Journal'
-- Journal begins at Tue 2021-07-13 15:47:15 CDT, ends at Thu 2021-12-09 19:56:03 CST. --
Aug 11 01:19:12 AM4-x5600-Linux kernel: mce: [Hardware Error]: CPU 2: Machine Check: 0 Bank 5: bea0000000000108
Aug 11 01:19:12 AM4-x5600-Linux kernel: mce: [Hardware Error]: TSC 0 ADDR ffffffc0f0873e MISC d012000100000000 SYND 4d000000 IPID 500b000000000 
Aug 11 01:19:12 AM4-x5600-Linux kernel: mce: [Hardware Error]: PROCESSOR 2:a20f10 TIME 1628662751 SOCKET 0 APIC 4 microcode a201009
Nov 08 19:05:25 AM4-5600X-Linux kernel: mce: [Hardware Error]: CPU 8: Machine Check: 0 Bank 5: bea0000001000108
Nov 08 19:05:25 AM4-5600X-Linux kernel: mce: [Hardware Error]: TSC 0 ADDR ffffffc0ecab9c MISC d0130fff00000000 SYND 4d000000 IPID 500b000000000 
Nov 08 19:05:25 AM4-5600X-Linux kernel: mce: [Hardware Error]: PROCESSOR 2:a20f10 TIME 1636419924 SOCKET 0 APIC 5 microcode a201009

I also duplicated this command swapping the priority with “emerg” (0), “alert” (1), and “crit” (2) to see how high these errors were actually flagged… and they survived all the way to “emerg”, so pretty important.
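
The same check could be done in one go with a loop that counts the matching lines per priority threshold. Just a sketch, with a function name I made up:

```shell
# Count mce lines at each priority threshold across all boots.
# mce_count_by_priority is a hypothetical helper name.
mce_count_by_priority() {
    local level
    for level in emerg alert crit err; do
        printf '%-6s ' "$level"
        # grep -c prints the number of matching lines (0 if there are none)
        journalctl -k --priority "$level" --boot=all | grep -ci 'mce:'
    done
}

# Usage:
#   mce_count_by_priority
```

If the count is already non-zero on the emerg line, the messages were logged at the highest priority.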

And I also learned from a DistroTube video that when I’m in a man page I can type “/” to enter a search term, which helps me find what I am looking for.


The indexes and IDs come from journalctl --list-boots, which also shows each boot’s start time and shutdown (last event) time.
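
To note down a boot ID for later, something like the following could pull it out of that list. This is a sketch under one assumption: that the first column of the output is the offset and the second is the boot ID (newer versions also print a header line, which this skips because it never matches an offset). The function name is made up:

```shell
# Read `journalctl --list-boots` output on stdin and print the boot ID
# found in the row for the given offset.
# boot_id_for_offset is a hypothetical helper name.
boot_id_for_offset() {
    awk -v want="$1" '$1 == want { print $2 }'
}

# Usage (ID of the previous boot):
#   journalctl --list-boots | boot_id_for_offset -1
```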

grep -iE "mce:|-- Journal" works too.

I did not know that. Before you taught me, I usually redirected stdout to a file (man gpg > "man gpg.txt") or printed the page to the console without a pager (man gpg | cat) and then used Ctrl+F (Ctrl+Shift+F) to search it. Now that you have made me aware of searching directly in man, it is easier for me. Thx!


omg… I was using the wrong quote/apostrophe symbol :rofl: Thanks for noting the syntax error!

And that’s cool I taught you something about searching man pages. Kind of makes one wonder why man man doesn’t point that out :wink:

No :grinning: that is not what I meant. You used
egrep -i,
and I used the same thing, but with the “core” grep and its option:
grep -iE

BTW, try changing the quote type in this awk command:

echo {1..10}"element" | awk '{print $3}'
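
A shortened sketch of what the quote change does (three fields instead of ten, and wrapped in a function so that the shell’s own $3 is certainly empty; the function name is made up):

```shell
# awk_quote_demo is a hypothetical name; call it without arguments.
awk_quote_demo() {
    line='1element 2element 3element'

    # Single quotes: the shell passes $3 to awk untouched, awk prints field 3
    printf '%s\n' "$line" | awk '{print $3}'

    # Double quotes: the shell expands $3 first (empty here), so awk
    # receives {print } and prints the whole line
    printf '%s\n' "$line" | awk "{print $3}"
}

awk_quote_demo
```

The first command prints 3element, the second prints the full line.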

I mean I tried doing it in grep using the wrong quotes/wrappers and didn’t share what “failed”… but shared the working egrep… so with your working grep example I saw 2 things I did wrong:

  • I used -e, not -E
  • I used apostrophe ', not quotes "

grep -i -e 'mce:|Journal' was as close as my failed attempts got to your working grep -iE "mce:|-- Journal" before I abandoned those efforts and found egrep -i 'mce:|-- Journal' via DDG (which was fine with ' and didn’t need an extra -E parameter).
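
For what it’s worth, a quick check on a few sample lines (shortened and partly made up) suggests the -e versus -E difference is the one that matters for this pattern, and that either quote style works:

```shell
# Sample text for illustration; the lines are shortened, not a real log.
sample='-- Journal begins at Tue 2021-07-13
kernel: mce: [Hardware Error]: CPU 8
kernel: something unrelated'

# -e only introduces the pattern; in basic regex "|" is an ordinary
# character, so no line matches (|| true because grep exits 1 on no match)
printf '%s\n' "$sample" | grep -c -i -e 'mce:|Journal' || true

# -E switches to extended regex, where "|" means OR;
# single and double quotes give the same result for this pattern
printf '%s\n' "$sample" | grep -c -iE 'mce:|-- Journal'
printf '%s\n' "$sample" | grep -c -iE "mce:|-- Journal"
```

The counts come out as 0, 2 and 2.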


I love awk… use it, grep and in some cases sed (sed really confuses me… pure copy/paste from DDG to remove elements I wanted gone) in my conky… for example:
${color5}GPU Fabric Clk: ${color4}${exec cat /sys/class/hwmon/hwmon6/device/pp_dpm_fclk | grep '*$' | awk '{print $2}' | sed 's/\([a-zA-Z: \t]\)//g' | awk -v OFMT="%.2f%" '{print $1/1941*100}'}$alignr${color4}${exec cat /sys/class/hwmon/hwmon6/device/pp_dpm_fclk | grep '*$' | awk '{print $2}'} ${color}/ 1941Mhz
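
In case it helps with the sed confusion: the grep, sed and second awk steps could probably be folded into a single awk call. This is only a sketch, and it assumes the file has lines like “1: 1941Mhz *” with the star marking the active level (guessed from the pipeline above, not checked on real hardware). The function name is made up, and the output always has two decimals, which may differ slightly from the original formatting:

```shell
# Print the active (starred) clock as a percentage of the maximum.
# Reads pp_dpm_fclk-style lines on stdin; the maximum defaults to 1941.
# fclk_percent is a hypothetical helper name.
fclk_percent() {
    # awk reads the leading digits of "1941Mhz" as the number 1941,
    # so no sed step is needed to strip the unit
    awk -v max="${1:-1941}" '/\*$/ { printf "%.2f%%\n", $2 / max * 100 }'
}

# Assumed sample of the file layout, for illustration only
printf '0: 1200Mhz\n1: 1941Mhz *\n' | fclk_percent
```

With the sample input this prints 100.00%. In conky it would be fed from the real file, e.g. fclk_percent < /sys/class/hwmon/hwmon6/device/pp_dpm_fclk.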
