An AMD RX 6700 XT crashes continuously

The PCIE lanes are set correctly in that picture - it seems to be the PCIE bifurcation toggle, which you presumably don’t need to mess around with (assuming you don’t have a PCIE splitter card of some description).

However, I have two ideas that might explain things, as I had some similar issues. The most likely one: how old is your power supply? And are you confident that it’s not developed a fault? I had a lot of GPU crashes due to my own PSU developing a fault and not being able to hold stable voltages, causing GPU crashes - despite the PSU having more than enough capacity on paper. If your crashes are typically when you hit the GPU with a heavy load, then it’s quite likely to be PSU related. Testing for this is easy; just download something like the Superposition benchmark and see if you crash during it. Of course, this also could indicate a faulty GPU as well, so fully diagnosing the problem might require borrowing a known good PSU or GPU from someone.

Secondly: You’re running 128GiB of memory on a Zen 3 system. I think that means 4 sticks of dual rank RAM. The memory controller on AMD platforms isn’t the greatest, and that configuration is going to be asking a lot of it. You may want to try either removing 2 sticks of RAM (from the correct slots, of course), or lowering your RAM timings all the way down to JEDEC speeds. If your crashes are happening at random all the time i.e. with no GPU load, then this is more likely.

Thanks I’ll try asap.

The PSU is a Be Quiet DARK POWER 12 1000W, 80 PLUS PLATINUM, purchased in March 2021.

The RAM is 4 sticks of Patriot Viper Steel DDR4 3600 MHz 64GB (2x32GB) C18, and in the BIOS I set the frequency fixed to 3600.

I don’t have any PCIE splitter card.

There’s a lot that could be causing issues with this configuration:

  1. I’m assuming you set the rate to 3600, not the frequency, which for a rate of 3600 would be 1800 (because it’s Double Data Rate memory).
  2. Hopefully you applied an XMP profile or similar to change the rate, because 3600 is technically an overclock and may require additional voltage.
  3. Lots of memory manufacturers mix/match the memory ICs for the same SKU (and Patriot is confirmed to do this on the Viper Steel line), so mixing two separate kits can end up with two different memory ICs, which causes problems when applying XMP profiles.
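To make the rate/frequency distinction in point 1 concrete, here is a minimal sketch of the arithmetic. The function name is made up for illustration; the only fact it relies on is that DDR memory transfers data twice per clock cycle.

```python
def ddr_clock_mhz(data_rate_mts: int) -> float:
    """Return the memory clock in MHz for a DDR data rate given in MT/s."""
    # DDR transfers data on both edges of the clock,
    # so the actual clock is half the advertised rate.
    return data_rate_mts / 2


for rate in (3200, 3600):
    print(f"DDR4-{rate}: {ddr_clock_mhz(rate):.0f} MHz memory clock")
```

So a kit sold as "3600" runs its memory clock at 1800 MHz, which is the number some BIOS screens and monitoring tools report.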

Given this, it may be worth doing a run of MemTest (from Manjaro’s boot menu) just to check the memory is OK. And possibly dropping down to 3200 speeds, and/or removing two sticks of RAM which come from the same memory kit.

I’ll try and report here the results.

First benchmark test: RAM frequency

With frequency set fixed to 3600


the benchmark starts and immediately gives a black screen.

In dmesg I see:

[   76.232169] systemd-journald[417]: /var/log/journal/20b860dfa515404eade47678bbfd1b08/user-1000.journal: Journal file uses a different sequence number ID, rotating.
[  467.487660] nvme nvme0: request 0x181 genctr mismatch (got 0xd expected 0x0)
[  467.487665] nvme nvme0: invalid id 53633 completed on queue 8208
[  497.580551] nvme nvme0: I/O tag 388 (7184) opcode 0x1 (Write) QID 16 timeout, aborting req_op:WRITE(1) size:131072
[  497.587477] nvme nvme0: Abort status: 0x0
[  527.659995] nvme nvme0: I/O tag 388 (7184) opcode 0x1 (Write) QID 16 timeout, reset controller
[  527.725734] nvme nvme0: D3 entry latency set to 10 seconds
[  527.727399] nvme nvme0: 16/0/0 default/read/poll queues
[  708.319517] [drm:amdgpu_job_timedout [amdgpu]] *ERROR* ring gfx_0.0.0 timeout, but soft recovered
[  718.560177] [drm:amdgpu_job_timedout [amdgpu]] *ERROR* ring gfx_0.0.0 timeout, but soft recovered
[  730.730471] [drm:amdgpu_job_timedout [amdgpu]] *ERROR* ring gfx_0.0.0 timeout, signaled seq=41546, emitted seq=41548
[  730.730783] [drm:amdgpu_job_timedout [amdgpu]] *ERROR* Process information: process superposition pid 4871 thread superposit:cs0 pid 4908
[  730.731058] amdgpu 0000:2f:00.0: amdgpu: GPU reset begin!
[  730.918126] amdgpu 0000:2f:00.0: amdgpu: MODE1 reset
[  730.918130] amdgpu 0000:2f:00.0: amdgpu: GPU mode1 reset
[  730.918188] amdgpu 0000:2f:00.0: amdgpu: GPU smu mode1 reset
[  742.407514] amdgpu 0000:2f:00.0: amdgpu: GPU reset succeeded, trying to resume
[  742.407995] [drm] PCIE GART of 512M enabled (table at 0x00000082FEB00000).
[  742.408058] [drm] VRAM is lost due to GPU reset!
[  742.408060] amdgpu 0000:2f:00.0: amdgpu: PSP is resuming...
[  749.795059] [drm:psp_v11_0_memory_training [amdgpu]] *ERROR* send training msg failed.
[  749.795246] amdgpu 0000:2f:00.0: amdgpu: Failed to process memory training!
[  749.795248] [drm:amdgpu_device_fw_loading [amdgpu]] *ERROR* resume of IP block <psp> failed -62
[  749.795376] amdgpu 0000:2f:00.0: amdgpu: GPU reset(4) failed
[  749.908783] snd_hda_intel 0000:2f:00.1: CORB reset timeout#2, CORBRP = 65535
[  749.908817] amdgpu 0000:2f:00.0: amdgpu: GPU reset end with ret = -62
[  749.908819] [drm:amdgpu_job_timedout [amdgpu]] *ERROR* GPU Recovery Failed: -62
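
For anyone reading a log like this, a rough sketch of how to tally which devices are reporting problems may help. The patterns below are assumptions tuned to the lines in this particular log, not a general-purpose dmesg parser, and the sample lines are copied from the output above.

```python
import re
from collections import Counter

# Hypothetical patterns chosen for this thread's log; adjust for other drivers.
PATTERNS = {
    "nvme": re.compile(r"\bnvme\d*\b.*(timeout|mismatch|invalid id)"),
    "amdgpu": re.compile(r"amdgpu.*(\*ERROR\*|reset|failed)", re.IGNORECASE),
}


def summarize(lines):
    """Count how many log lines match each device's error pattern."""
    counts = Counter()
    for line in lines:
        for name, pattern in PATTERNS.items():
            if pattern.search(line):
                counts[name] += 1
    return counts


sample = [
    "[  467.487660] nvme nvme0: request 0x181 genctr mismatch (got 0xd expected 0x0)",
    "[  497.580551] nvme nvme0: I/O tag 388 (7184) opcode 0x1 (Write) QID 16 timeout, aborting req_op:WRITE(1) size:131072",
    "[  708.319517] [drm:amdgpu_job_timedout [amdgpu]] *ERROR* ring gfx_0.0.0 timeout, but soft recovered",
    "[  742.408058] [drm] VRAM is lost due to GPU reset!",
]

for device, count in summarize(sample).items():
    print(f"{device}: {count} matching line(s)")
```

Errors from two unrelated devices (the NVMe drive and the GPU) are at least consistent with the memory-instability theory discussed in this thread, though that is an interpretation and not something the log proves on its own.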

Then I set frequency to “Auto”

and now the benchmark starts and completes

Now I’ll try to stress the machine.

BTW do you have any recommendation for a 128GB RAM configuration for an MSI MEG X570 ACE?

It’s possible that what you have will work, especially at a lower speed? It’s just always a bit of a gamble with mixing/matching kits when the ICs aren’t guaranteed to match. Also, as it looks like your motherboard doesn’t support some variant of XMP, you may have to increase the voltages a bit to be stable at 3600 - although obviously there may be warranty issues if the RAM is new.

As far as new kits go, the best place to start would be the QVL: MEG X570 ACE | Motherboard | MSI Global. That’s the list of all memory kits that MSI thinks will work with the motherboard; you’ll want to find something that is tested in a 4 DIMM configuration and has a size of 32GiB per stick to get to the 128GiB target. Searching for “128” and sorting by the 1|2|4 DIMM column should get you most of the way to a list of what you want. (Incidentally, your current Patriot kit isn’t on this list, so that tracks with what you’re seeing so far).

Just make sure to buy an actual 128GiB kit; it’s very hard to know for certain what memory ICs you’re getting, and in general mixing and matching memory ICs doesn’t work well when overclocking the memory, even if the overclocked frequencies are what is printed on the box.

Thanks a lot

It does though, there’s a big button on the BIOS entry screen to enable it.

@mirto have you tried that?

Well, no.
I feared to create a profile and limited myself to:

  • disable the secure boot
  • enable SVM
  • change the RAM frequency

For now I’m stressing the machine and everything works correctly

You can’t just change the DRAM frequency like that. Higher frequencies require slower timings and probably higher voltage. XMP does all that for you automatically.
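As a rough illustration of why timings have to change with frequency: timings are counted in clock cycles, so the same absolute latency needs more cycles at a higher clock. The sketch below uses the standard conversion from CAS latency in cycles to nanoseconds. The C18 at 3600 figure comes from the kit described earlier in the thread; the CL16 at 3200 pairing is only an illustrative assumption, not a setting anyone here tested.

```python
def cas_latency_ns(cl_cycles: int, data_rate_mts: int) -> float:
    """Convert a CAS latency in clock cycles to nanoseconds.

    The memory clock is half the DDR data rate, so one cycle lasts
    2000 / data_rate nanoseconds when the rate is given in MT/s.
    """
    return cl_cycles * 2000 / data_rate_mts


# (CL, data rate) pairs: the second is the kit's rated setting,
# the first is a hypothetical lower-speed setting for comparison.
for cl, rate in ((16, 3200), (18, 3600)):
    print(f"CL{cl} at {rate} MT/s = {cas_latency_ns(cl, rate):.2f} ns")
```

Both settings work out to the same latency in nanoseconds, which is why raising the frequency without also raising the cycle counts effectively asks the memory to respond faster than it is rated for.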

I usually choose Profile 1 which automatically adjusts to sane settings; with a nice stable performance outcome. I think this is a fair choice.

Thanks; my fault.
I’ll investigate the XMP profiles

Thanks
Is there any documentation about AMI BIOS XMP profiles?

Edited as I spotted an earlier post I had missed about XMP availability:

I’m not convinced XMP will work here. Firstly, XMP is designed for Intel platforms, and it’s always a bit of a hack on AMD. Secondly, and perhaps more importantly, the XMP profile is going to be for 2 DIMMs, not 4, so it may not work regardless. It’s always harder to get 4 DIMMs working stably, especially dual rank DIMMs.

Of course, I’d still recommend doing some memtests. It’s entirely possible the RAM is just faulty and needs RMA-ing.

Yes, the XMP profile is probably intended for only 2 sticks if they were 2 stick kits, but it is still more likely to work than changing the DRAM frequency without appropriate timings. There’s nothing wrong with XMP on modern AMD systems.

If it doesn’t work then MSI motherboards have a Memory Try It! feature which has a long list of presets to try. So maybe you can get 3200 or whatever working. This is for B550 but X570 is the same - https://www.msi.com/blog/b550-memory-try-it . Obviously you need to stress test after changing memory configuration.

I would imagine your BIOS manual should have some information, or perhaps the manufacturer website.

Thanks,
the manual only says that there are 6 overclocking profiles; but (from the manual illustration) they only set voltages and not the RAM frequency.

I’ll investigate

Thanks to everyone for your time and your kindness.

In the end, the solution is:

Set the RAM frequency to “Auto” in the BIOS

Something to assist your investigation:

Thanks a lot