Machines with AMD CPUs Ryzen 5 1600 and Ryzen 7 1700 crash regularly

After reboot the dmesg output is like this: hastebin

Not a bad idea to install linux516 which is rc3 currently on Manjaro.

@cscs here is the dmesg result on linux516 hastebin
Mostly identical.

Looks like you lost a few things from pt originally…

[    0.227600] pci 0000:03:00.2: Adding to iommu group 12
[    0.227616] pci 0000:04:05.0: Adding to iommu group 12
[    0.227620] pci 0000:04:06.0: Adding to iommu group 12
[    0.227623] pci 0000:04:07.0: Adding to iommu group 12
[    0.227627] pci 0000:04:08.0: Adding to iommu group 12

Then with 5.16 you got the elan error:

[   10.780182] i2c_hid_acpi: probe of i2c-ELAN1200:00 failed with error -22

Hey folks, I should have checked the Arch Wiki
https://wiki.archlinux.org/title/Ryzen
It mentions random reboots and soft lock freezes.

The CPU ID and the Processor number may vary. To solve this problem you need to supply higher voltage to your CPU so that it is stable when running at peak frequencies. The easiest way to achieve this is to use the AMD curve optimiser which is accessible via your motherboard's BIOS. Access it and put a positive offset of 4 points, which will increase the voltage your CPU is getting at higher loads. It will limit overclocking potential due to higher heat dissipation requirements, but it will run stable. For more details check this forum post. When I did this for my 5950X, my processor stabilised and the frequency and voltage ranges were more similar to those observed under Windows.

Can you disable the baloo index?

balooctl disable

Also test 'amd_iommu=on' as a kernel boot parameter.

It is still freezing and crashing with all the boot parameters rcu_nocbs=0-15 amd_iommu=on idle=nomwait and linux516.
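Before ruling these parameters out, it may be worth confirming that they actually reached the running kernel. A minimal sketch, assuming a Linux system with /proc/cmdline; the helper name has_param is made up for illustration:

```shell
#!/bin/sh
# has_param CMDLINE PARAM: succeed if PARAM appears as a whole
# space-separated token in CMDLINE (hypothetical helper).
has_param() {
    case " $1 " in
        *" $2 "*) return 0 ;;
        *) return 1 ;;
    esac
}

# Command line of the running kernel (empty if /proc is unavailable).
cmdline=$(cat /proc/cmdline 2>/dev/null || true)

# The three parameters mentioned in the post above.
for p in rcu_nocbs=0-15 amd_iommu=on idle=nomwait; do
    if has_param "$cmdline" "$p"; then
        echo "$p: active"
    else
        echo "$p: missing"
    fi
done
```

A parameter reported as missing usually means the bootloader configuration was edited but not regenerated, or that a different boot entry was used.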

Will KFind work if I disable baloo?

KFind doesn't use the Baloo index at all. And if you use the Dolphin search without the baloo index it also works, it just doesn't use the index (if you have an SSD, it works well enough without this trash indexing).

After all those modifications I'm still getting issues. Unwelcome reboots happen once or twice every day or two, more often than freezes.

And this is the output when the laptop forced reboots:

I might try some of those:
https://aur.archlinux.org/packages/ryzen-stabilizator-git/
https://aur.archlinux.org/packages/amd-disable-c6/
https://aur.archlinux.org/packages/disable-c6-systemd/
https://aur.archlinux.org/packages/zenstates-git/

If we hadn't already disabled C6, you can do it with this option:
processor.max_cstate=5
But zenstates can also be useful.

So far no more occurrences of reboots or freezes. *knock on wood :crossed_fingers:

Happening again, twice.

Boo. Let's lower that to max_cstate=4?

Actually … of interest may be running this:

grep . /sys/devices/system/cpu/cpu0/cpuidle/state*/name
~ >>> grep . /sys/devices/system/cpu/cpu0/cpuidle/state*/name
/sys/devices/system/cpu/cpu0/cpuidle/state0/name:POLL
/sys/devices/system/cpu/cpu0/cpuidle/state1/name:C1
/sys/devices/system/cpu/cpu0/cpuidle/state2/name:C2

I think I'll try migrating to Firefox first.

Mk … looks like you just have 2 C-states. So instead we would try with max_cstate=1
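To see a bit more than the state names, the same sysfs directory also exposes each state's exit latency and how often it has been entered. A rough sketch, assuming the standard cpuidle sysfs layout (name, latency and usage files per state); list_cstates is a made-up helper name:

```shell
#!/bin/sh
# list_cstates [DIR]: print one line per cpuidle state found under DIR
# (hypothetical helper; defaults to cpu0 on the running system).
list_cstates() {
    base=${1:-/sys/devices/system/cpu/cpu0/cpuidle}
    for d in "$base"/state*; do
        # Skip the unexpanded glob when no states exist.
        [ -d "$d" ] || continue
        printf '%s name=%s latency_us=%s usage=%s\n' \
            "${d##*/}" \
            "$(cat "$d/name" 2>/dev/null)" \
            "$(cat "$d/latency" 2>/dev/null)" \
            "$(cat "$d/usage" 2>/dev/null)"
    done
}

list_cstates
```

Comparing the usage counters before and after a change to max_cstate is one way to check whether the deeper state is really no longer being entered.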

Also of interest:

pcie_aspm=off
pcie_aspm=force
pcie_aspm.policy=performance
nvme_core.default_ps_max_latency_us=0
acpi_osi=Linux
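If you try the ASPM options, the kernel reports the policy currently in effect in /sys/module/pcie_aspm/parameters/policy, with the active one shown in square brackets. A small sketch for pulling out the active value, assuming that file is present on your kernel; active_policy is a made-up helper name:

```shell
#!/bin/sh
# active_policy: read a line such as "[default] performance powersave"
# on stdin and print only the bracketed (active) word.
active_policy() {
    sed -n 's/.*\[\([^]]*\)].*/\1/p'
}

policy_file=/sys/module/pcie_aspm/parameters/policy
if [ -r "$policy_file" ]; then
    echo "ASPM policy: $(active_policy < "$policy_file")"
else
    echo "ASPM policy file not available"
fi
```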

Hello, early Zen stepping owner. I'm going to disregard other replies to take an alternative route:
In Firefox go to about:crashes and see what kind of error messages there are for crashes. If there's a colourful variety of different reasons listed for crashes, this might be the early stepping issue. Ever segfaulted during compiles? :slight_smile:

processor       : 15
vendor_id       : AuthenticAMD
cpu family      : 23
model           : 1
model name      : AMD Ryzen 7 1700 Eight-Core Processor
stepping        : 1

Which is "rev" in inxi's output apparently. I've heard XMRig disables opcache when running on… our CPUs, so just as a test, disable the micro-op cache in the BIOS settings. I think this is "Opcache control" in acutbal's screenshot. (#4)
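For anyone wanting to compare against the family, model and stepping values quoted above, here is a rough sketch that pulls them from the first processor entry of /proc/cpuinfo. It assumes the x86 field names shown in that output; cpu_ident is a made-up helper name:

```shell
#!/bin/sh
# cpu_ident: read /proc/cpuinfo-style text on stdin and print the
# family, model and stepping of the first processor listed.
cpu_ident() {
    awk -F': ' '
        /^cpu family/   && fam  == "" { fam  = $2 }
        /^model[ \t]+:/ && mdl  == "" { mdl  = $2 }
        /^stepping/     && step == "" { step = $2 }
        END { printf "family=%s model=%s stepping=%s\n", fam, mdl, step }
    '
}

if [ -r /proc/cpuinfo ]; then
    cpu_ident < /proc/cpuinfo
fi
```

Family 23 is 17h in hexadecimal, the first Zen generation.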

About thermal limits: the temperature doesn't noticeably affect the frequency-voltage curve until ~75C on Zen1. 80C was the breaking point for my overclock under all-core ffmpeg load (vp9->h265, x265). I don't think you have issues due to this.

AMD has offered CPU swaps for people affected by the early stepping bug (the support guy didn't need anything more than a screenshot of a ryzen-kill segv to approve it). That was past year 2 (2019). Unfortunately I couldn't send in the CPU for time and other reasons, and I haven't tried again since. I don't know if they still offer it. For the record, I've had inexplicable crashes on Windows too; Firefox especially had a neat record of them (modern web is heavy, ok).

PS: If you do not use Firefox at all, see if Chromium has a similar accessible crash report page. Otherwise, use Firefox for a couple of weeks.

I think I found another approach:

Fans
lm_sensors do not detect any interfaces for fan control. However, fan control through NBFC works well; and there is a profile for this laptop. Enabling and starting the service files, and applying the configurations are sufficient.

NBFC-Linux
NBFC-Linux is a lightweight implementation of NBFC, written in C. It does not depend on the Mono framework. It can be installed as nbfc-linux (AUR) or nbfc-linux-git (AUR).

So I installed nbfc-linux, then ran

sudo nbfc config --set 'Asus ROG GL702ZC'
sudo nbfc start

It works fine so far; the fan speed is up from 2300 to 3200, which I can hear.
I will probably run

sudo systemctl enable nbfc_service

to make it start on boot.

But it's still crashing when using the browser (ungoogled-chromium).