Something strange and unsettling happened to me today. I woke up to my screen no longer powering back on after moving the mouse, which is not an entirely new occurrence. I restarted and was surprised to see that right before the login screen the monitor would power itself off, and this time I was unable to do a clean shutdown by pressing the power button. It soon became apparent that the computer would stay frozen for roughly a minute, then restart itself and repeat the cycle. After one restart I was able to catch the following error message in the console:
I realized it must be hardware related, since I hadn’t installed any updates or changed the system configuration for over a week, and this didn’t happen yesterday on the exact same system. To confirm it I booted a live image and got the exact same behavior there. I pulled out the memory modules and tried them in sets, disconnected all hard drives, tried two different screens (over HDMI and DisplayPort cables), booted two kernels (5.14 and 5.15), tried radeon vs amdgpu, and reset the CMOS via the pins. In the end the only thing that worked was removing my video card and plugging in an older one.
What makes this extremely bizarre is that I get a picture right up until Linux boots: I can enter the BIOS just fine and see GRUB, and there are no GPU freezes or graphical corruption. This all seems to be Linux detecting an error and freaking out over it. All error messages are prefixed with “mce” and, oddly enough, reference a CPU issue. The rest of my hardware works just fine, so it’s not the processor, thank god.
Does anyone know what could break in a video card that would make Linux do this? I saw a reference to an mcelog command for these errors, but like I said the machine becomes completely inoperable after the error is printed, so I can’t issue any commands. If you can suggest further tests I’ll take a look, but please mention everything I could test up front, as I don’t feel comfortable plugging and pulling the video card on my motherboard so often and risking breaking things (I did it twice today). If this is a hardware issue that can’t be solved from the kernel, I have no choice but to spend a large sum of money I didn’t want to spend. I figured I’d ask for help here first so I know I tried everything else.
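Since the machine boots again with the old card, one thing I could try is pulling the “mce” lines out of an earlier boot’s kernel log. Below is a rough Python sketch of that idea, not something I have confirmed works here: it assumes systemd’s journal is set to persistent storage and that the crashing boots got far enough to write anything to disk, which may well not be the case. The function names `filter_mce_lines` and `read_kernel_log` are made up for illustration.

```python
import subprocess


def filter_mce_lines(log_text):
    """Return the lines of a kernel log that contain 'mce' (case-insensitive)."""
    return [line for line in log_text.splitlines() if "mce" in line.lower()]


def read_kernel_log(boot_offset):
    """Read the kernel messages of an earlier boot via journalctl.

    boot_offset is relative: -1 is the previous boot, -2 the one before it.
    Assumption: journald keeps logs across reboots (persistent storage).
    Returns an empty string if journalctl is not installed.
    """
    try:
        result = subprocess.run(
            ["journalctl", "-k", "-b", str(boot_offset), "--no-pager"],
            capture_output=True,
            text=True,
            check=False,
        )
    except FileNotFoundError:
        return ""
    return result.stdout


if __name__ == "__main__":
    # Print whatever mce lines the previous boot managed to log, if any.
    for line in filter_mce_lines(read_kernel_log(-1)):
        print(line)
```

If nothing is printed, that most likely just means the freeze happened before the log reached the disk, so this would not rule anything out.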
I never did any overclocking on my last system; all frequency and performance settings are at their defaults. The only exception is the memory frequency: I have 3200 MHz RAM but the BIOS runs it at 2400 MHz by default, so I have to set it manually. That is the rated specification of the memory modules and it works perfectly fine. For this error I tried a freshly reset CMOS and the issue persists.
The video card in question had one issue just a year ago, where overheating would cause square-shaped corruption and system crashes. It was resolved by simply repasting the GPU and cleaning the dust out. There were no issues under load the last time it worked, and it’s still clean inside. The GPU temperature maxed out around the normal 93 °C under load, and the CPU temperature never passes 84 °C.
There is one more detail that comes to mind now: for the past few months I’ve had issues with the monitor randomly not powering on, and I had to restart a few times for it to be detected. I always thought that was normal, but it went away after putting in the old card. I also had weird lines temporarily appearing at the bottom of my monitor when running it at 144 Hz. I thought the monitor might be failing, but that too has gone away now. I also forgot to mention, for reference, that the broken card is an AMD Radeon™ R9 390X from XFX:
I don’t suspect an actual CPU or RAM issue, thank goodness: I’m on the exact same system now, just with an older and slower video card, and everything works absolutely perfectly at all times. My CPU is however a Ryzen, so that issue may be relevant if I ever run into a similar problem.