Can't boot past GRUB after update

Hi all. This is my first post in the forum. I’m a bit of a Linux newbie - I’ve used MX-Linux for about 1.5 years and chose to switch to Manjaro for my new build as of March 2.

Motherboard: MSI MPG Z490 Gaming Edge WiFi
Processor: Intel Core i7-10700KF
Graphics: XFX Radeon RX 580 8G, switched to EVGA GeForce GTX 1060 6G to try to isolate problem
PSU: Rosewill ARC Series 650W Gaming Power

The system was running fine until Mar 11, when I ran the update and rebooted as recommended. After rebooting I was playing a reasonably intensive Minecraft modpack (which had run smoothly prior to the update), and the system crashed and powered down. I rebooted, tried to reload the game, and the system froze. I waited for it to respond but manually powered down when I heard the fans accelerating (I presumed the system was heating up). Afterwards I could no longer boot.
I got the following messages (transcribed):

[ 0.712245] mce: [Hardware Error]: CPU 7: Machine Check: 0 Bank 0: 9000004000010005
[ 0.712251] mce: [Hardware Error]: TSC b3820c9f6
[ 0.712252] mce: [Hardware Error]: PROCESSOR 0:a0655 TIME 1615516484 SOCKET 0 APIC e microcode e0

I can access the GRUB loader either by holding shift during boot or by attempting to boot from a live USB (both Manjaro and MX), and I can get into the GRUB command line. There are no boot logs since the reboot before the final system freeze.

I’ve tried:
Changing the graphics card, as noted above (we happened to have it around)
Changing boot parameters in the GRUB editor, including mce=off, 3, single… and others I can’t remember.
Running Memtest86, which passed once on each of two occasions. I got impatient.
Googling… and googling… and goooooogling.

Am I missing anything obvious, other than that I’ve had hardware errors returned and I haven’t submitted an RMA request yet?

Any chance that this is a microcode issue?

Hi @JediArachn, and welcome!

I have absolutely no idea what it could be, but I’ve found this after a short bit of googling

So maybe it helps you. I honestly don’t have anything else for you. Sorry 'bout that.

So, I was curious (yes, I know - I get fascinated by the strangest things), and I dug some more, landing on this page. According to that page,

linux - How do I interpret the output of mce? - Super User

The mcelog program can supply some explanations:

So, I installed it:

$ pamac install mcelog
Warning: mcelog is only available from AUR
Cloning mcelog build files...
Checking mcelog dependencies...
Resolving dependencies...
Checking inter-conflicts...

To build (1):
mcelog  175-1    AUR

Edit build files : [e] 
Apply transaction ? [e/y/N] y

Building mcelog...
Transaction successfully finished.

After which I ran those 3 lines of yours through it

$ cat /tmp/mcelogtest

[ 0.712245] mce: [Hardware Error]: CPU 7: Machine Check: 0 Bank 0: 9000004000010005
[ 0.712251] mce: [Hardware Error]: TSC b3820c9f6
[ 0.712252] mce: [Hardware Error]: PROCESSOR 0:a0655 TIME 1615516484 SOCKET 0 APIC e microcode e0

$ mcelog --ascii --file /tmp/mcelogtest

And got the following result:

Hardware event. This is not a software error.
CPU 7 BANK 0 TSC b3820c9f6
TIME 1615516484 Fri Mar 12 04:34:44 2021
MCG status:
MCi status:
Corrected error
Error enabled
MCA: Internal parity error
STATUS 9000004000010005 MCGSTATUS 0
CPUID Vendor Intel Family 6 Model 165 Step 5
SOCKET 0 APIC e microcode e0

So, I’m not sure. I am almost certain it has to do with your CPU, but it might also be RAM. I’m not even close to sure. It looks like hardware, though.
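For anyone curious how mcelog gets from 9000004000010005 to "Corrected error / Error enabled / Internal parity error", here is a rough Python sketch of the decoding. It is not mcelog's code: the bit positions are my reading of the IA32_MCi_STATUS layout in Intel's Software Developer's Manual, the error-code table is deliberately tiny, and the function names (decode_mci_status, decode_cpuid_signature) are made up for illustration. The corrected-error count field is only meaningful on CPUs that report it, which I am assuming here.

```python
from datetime import datetime, timezone

# A few "simple" MCA error codes (low 16 bits of the status word), as I read
# them from the Intel SDM. Anything not listed is reported as unknown.
SIMPLE_MCA_CODES = {
    0x0000: "No error",
    0x0001: "Unclassified",
    0x0002: "Microcode ROM parity error",
    0x0003: "External error",
    0x0004: "FRC error",
    0x0005: "Internal parity error",
}


def decode_mci_status(status):
    """Split a 64-bit IA32_MCi_STATUS value into its main fields."""
    mca_code = status & 0xFFFF
    return {
        "valid": bool((status >> 63) & 1),        # VAL: register holds an error
        "overflow": bool((status >> 62) & 1),     # OVER: an earlier error was lost
        "uncorrected": bool((status >> 61) & 1),  # UC: 0 means corrected
        "enabled": bool((status >> 60) & 1),      # EN: reporting was enabled
        # Bits 52:38, assumed valid (depends on CPU support).
        "corrected_count": (status >> 38) & 0x7FFF,
        "model_specific_code": (status >> 16) & 0xFFFF,
        "mca_code": mca_code,
        "description": SIMPLE_MCA_CODES.get(mca_code, "not in this small table"),
    }


def decode_cpuid_signature(sig):
    """Return (family, model, stepping) from a CPUID signature such as 0xa0655."""
    stepping = sig & 0xF
    base_model = (sig >> 4) & 0xF
    base_family = (sig >> 8) & 0xF
    ext_model = (sig >> 16) & 0xF
    ext_family = (sig >> 20) & 0xFF
    # Extended fields only apply for certain base family values.
    family = base_family + ext_family if base_family == 0xF else base_family
    if base_family in (0x6, 0xF):
        model = (ext_model << 4) | base_model
    else:
        model = base_model
    return family, model, stepping


if __name__ == "__main__":
    # Values copied from the three mce lines in the first post.
    print(decode_mci_status(0x9000004000010005))
    print(decode_cpuid_signature(0xA0655))
    # TIME is seconds since the Unix epoch; printed here in UTC.
    print(datetime.fromtimestamp(1615516484, tz=timezone.utc))
```

With the values from the first post this agrees with mcelog on the points that matter: the error is flagged as corrected rather than uncorrected, the code is 5 (internal parity error), and a0655 comes out as family 6, model 165, stepping 5. The timestamp prints in UTC, so it will differ from the mcelog output above by whatever the local time zone offset was.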

Best would be to install it yourself, and have it analyze your syslog or something.
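If you go that route, it may be handy to pull just the mce lines out of a bigger log first and save them to a small file, the same way I did by hand with /tmp/mcelogtest. Below is a minimal Python sketch of that filtering step. The function name extract_mce_lines and the sample text are made up for illustration, and I have not checked whether mcelog actually needs the log filtered beforehand, so treat it as a convenience only.

```python
import re
from typing import List

# Matches kernel lines such as:
# [ 0.712245] mce: [Hardware Error]: CPU 7: Machine Check: 0 Bank 0: ...
MCE_LINE = re.compile(r"mce: \[Hardware Error\]")


def extract_mce_lines(log_text: str) -> List[str]:
    """Return only the lines of a kernel log that report an mce hardware error."""
    return [
        line.strip()
        for line in log_text.splitlines()
        if MCE_LINE.search(line)
    ]


if __name__ == "__main__":
    # Hypothetical sample: two real lines from this thread plus one unrelated line.
    sample = """\
[ 0.700000] some unrelated kernel message
[ 0.712245] mce: [Hardware Error]: CPU 7: Machine Check: 0 Bank 0: 9000004000010005
[ 0.712251] mce: [Hardware Error]: TSC b3820c9f6
"""
    for line in extract_mce_lines(sample):
        print(line)
```

The printed lines can then be saved to a file and passed to mcelog --ascii --file, as in the command I ran above.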

OK, now I’ve learnt something new as well, so I’d say today wasn’t too bad of a day…

Thanks for the info and the welcome, @Mirdarthos! I didn’t know about mcelog, so I can at least install it on an older system for now. I’m going to try switching the RAM around when I get home later today and see if anything changes (and do a better job of logging my other attempts at fixing it).

You’re welcome!

I didn’t know about mcelog before today, either.

So, like I said, I learnt something new, so it was a good day!

Well, I finally confirmed (in my opinion) that the problem rests with the CPU. I found a setting in the BIOS that let me limit the number of enabled cores, and it boots again! So I guess I need to RMA that sucker.

Any likelihood that it was the update or the game that fried it? Or was it just faulty? (Not too keen on repeating this experience!)
