I use a Xeon E5-2640 v3 with 16 GB of registered DDR4 and a Vega 56 GPU. Manjaro runs off an ADATA SU650 SATA SSD.
The system is perfectly stable on Windows even with quite aggressive undervolting, but I can’t get rid of amdgpu crashes on Linux at all. The only thing that seemed to make them rarer is maxing out the GPU voltage. I found a post describing the exact same issue (from time to time I also get artifacts after the driver attempts to recover), but for that OP the problem persisted in Windows, while I can’t get the driver to crash in Windows at all without serious overclocking. And I even use a 3rd party beta driver there, which isn’t really supposed to be stable. Additionally, I use LACT on Linux to tweak my card, but the crashes are the same with CoreCtrl, and disabling both doesn’t seem to help.
I googled the problem and saw some people with Vega cards on the Arch forums suggest these kernel options for the CPU and the amdgpu module respectively:
processor.max_cstate=1
rcu_nocbs=0-15
idle=nomwait
pcie_aspm=off
iommu=pt
amdgpu.lockup_timeout=0
amdgpu.dc=1
amdgpu.vm_update_mode=0
amdgpu.dpm=-1
amdgpu.ppfeaturemask=0xffffffff
amdgpu.vm_fault_stop=2
amdgpu.vm_debug=1
amdgpu.gpu_recovery=0
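In case someone wants to verify that the options above actually reached the kernel (a typo in the bootloader config is easy to miss), here is a minimal Python sketch. The helper name missing_options is my own, and the WANTED list is simply the set I tried, not a recommendation. It only compares strings against /proc/cmdline and does not check whether the amdgpu module accepted the values.

```python
from pathlib import Path

# The options listed above. This is just the set that was tried,
# not a known-good configuration.
WANTED = [
    "processor.max_cstate=1",
    "rcu_nocbs=0-15",
    "idle=nomwait",
    "pcie_aspm=off",
    "iommu=pt",
    "amdgpu.lockup_timeout=0",
    "amdgpu.dc=1",
    "amdgpu.vm_update_mode=0",
    "amdgpu.dpm=-1",
    "amdgpu.ppfeaturemask=0xffffffff",
    "amdgpu.vm_fault_stop=2",
    "amdgpu.vm_debug=1",
    "amdgpu.gpu_recovery=0",
]


def missing_options(cmdline, wanted=WANTED):
    """Return the wanted options that do not appear on the kernel command line."""
    present = set(cmdline.split())
    return [opt for opt in wanted if opt not in present]


if __name__ == "__main__":
    proc_cmdline = Path("/proc/cmdline")
    if proc_cmdline.exists():
        missing = missing_options(proc_cmdline.read_text())
        if missing:
            print("Not on the running kernel's command line:")
            for opt in missing:
                print("  " + opt)
        else:
            print("All listed options are present.")
```

An exact string match is deliberate here: it flags a wrong value (for example amdgpu.dpm=1 instead of amdgpu.dpm=-1) as missing rather than silently accepting it.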
Obviously, none of these kernel options helped. I tried kernels 6.7, 6.6 and 6.1, and even downloaded an old (20.45) linux-firmware package and replaced my amdgpu folder in /usr/lib/firmware with the one from that package.
As for the logs, after a crash I usually see something like this in journalctl:
[drm:amdgpu_job_timedout [amdgpu]] *ERROR* ring gfx_low timeout, signaled seq=7773, emitted seq=7775
[drm:amdgpu_job_timedout [amdgpu]] *ERROR* Process information: process firefox pid 4018 thread firefox:cs0 pid 4176
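To compare crashes across sessions, these two lines can be pulled out of a saved journal dump with a small Python sketch like the one below. The regular expressions are modelled only on the two lines quoted above, so other kernel versions may word the message differently, and parse_timeouts is a name I made up for illustration.

```python
import re

# Patterns modelled on the two journal lines quoted above (an assumption:
# other kernels may format these messages differently).
TIMEOUT_RE = re.compile(
    r"\*ERROR\* ring (?P<ring>\S+) timeout, "
    r"signaled seq=(?P<signaled>\d+), emitted seq=(?P<emitted>\d+)"
)
PROCESS_RE = re.compile(
    r"\*ERROR\* Process information: process (?P<process>\S+) pid (?P<pid>\d+)"
)


def parse_timeouts(lines):
    """Collect ring timeouts and attach the process line that follows each one."""
    events = []
    for line in lines:
        match = TIMEOUT_RE.search(line)
        if match:
            events.append({
                "ring": match["ring"],
                "signaled": int(match["signaled"]),
                "emitted": int(match["emitted"]),
                "process": None,
                "pid": None,
            })
            continue
        match = PROCESS_RE.search(line)
        if match and events and events[-1]["process"] is None:
            events[-1]["process"] = match["process"]
            events[-1]["pid"] = int(match["pid"])
    return events


if __name__ == "__main__":
    sample = [
        "[drm:amdgpu_job_timedout [amdgpu]] *ERROR* ring gfx_low timeout, "
        "signaled seq=7773, emitted seq=7775",
        "[drm:amdgpu_job_timedout [amdgpu]] *ERROR* Process information: "
        "process firefox pid 4018 thread firefox:cs0 pid 4176",
    ]
    for event in parse_timeouts(sample):
        print(event)
```

Counting the results per ring and per process over a few days should show whether the timeouts really cluster around one application or are spread evenly.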
The timeout appears out of the blue, without anything remarkable preceding it in the log; usually the only related entry is me starting the process several minutes earlier. Crashes can occur whenever the GPU is at least somewhat utilized, but they happen especially often with one game (Redout 2). I think spikes in GPU utilization have something to do with it, since that game has a lot of them. But again, it has all the same spikes on Windows and nothing crashes there. Other than this, my gaming performance on Linux is perfect, and I do have all the Vulkan and OpenCL drivers installed; I even reinstalled them yesterday.
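If anyone wants to check the utilization-spike theory on their own card, below is a rough Python sketch that samples the amdgpu gpu_busy_percent sysfs file and reports sudden jumps. The card0 path is an assumption (the card may be card1 on other systems), the 50-point threshold is arbitrary, and find_spikes and sample_busy are hypothetical helper names. It only shows when load jumps, not why the driver times out.

```python
import time
from pathlib import Path

# Assumption: the Vega is card0; adjust the index if the system has
# more than one GPU.
BUSY_FILE = Path("/sys/class/drm/card0/device/gpu_busy_percent")


def find_spikes(samples, jump=50):
    """Return indices where utilization rose by at least `jump` points
    compared with the previous sample."""
    return [
        i for i in range(1, len(samples))
        if samples[i] - samples[i - 1] >= jump
    ]


def sample_busy(count=20, interval=0.1):
    """Read the GPU busy percentage `count` times, `interval` seconds apart."""
    values = []
    for _ in range(count):
        values.append(int(BUSY_FILE.read_text().strip()))
        time.sleep(interval)
    return values


if __name__ == "__main__" and BUSY_FILE.exists():
    readings = sample_busy()
    print("samples:", readings)
    print("spike positions:", find_spikes(readings))
```

Logging the spike timestamps next to the journal would show whether the ring timeouts tend to land right after a jump in load.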
I should also mention that I tried switching to Mesa-git, and the problem persists.