Fix Ryzen lockups related to low system usage

ryzen

#1

@mioc has described a method for reliably disabling the C6 power state in another topic. here, i want to add my own experiences, simplify the process, and create an editable wiki so other people can add theirs.


there are a couple of tips floating around on the internet for fixing Ryzen lockups related to low system usage. typically, your system is not doing much at all (e.g. playing a movie or music, or showing the same simple website for an extended period of time), and when it suddenly needs more CPU power, e.g. because you move the mouse, it freezes.
the cause of this freeze/crash is a bug in the C6 power/sleep state of first generation Ryzen CPUs.

i have collected and tested these tips in the past:

  • setting rcu_nocbs=0-11 (for a 12 thread CPU) as your boot parameter in /etc/default/grub. this setting offloads RCU callback processing from the listed CPUs to kernel threads (it has nothing to do with ASLR), which is supposed to decrease the number of times Ryzen CPUs enter the C6 sleep state. my system still kept crashing!

  • setting processor.max_cstate=5 as your boot parameter in /etc/default/grub. this setting is supposed to disable the C6 sleep state altogether, but my system kept crashing. probably the setting was overridden by something else.
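
for either boot parameter, here is a minimal sketch for checking after a reboot that the parameter actually reached the kernel. the helper name has_kernel_param is made up for illustration, and the grub line in the comment is only an example, so keep whatever options you already have.

```shell
# hypothetical helper: succeed if PARAM appears as a whole word in a kernel command line.
# usage: has_kernel_param PARAM "CMDLINE"
has_kernel_param() {
    param=$1
    cmdline=$2
    # pad with spaces so that only complete, space-separated parameters match
    case " $cmdline " in
        *" $param "*) return 0 ;;
        *) return 1 ;;
    esac
}

# example line in /etc/default/grub (assumption, adapt to your own file):
#   GRUB_CMDLINE_LINUX_DEFAULT="quiet processor.max_cstate=5"
# regenerate the grub config (on Manjaro: sudo update-grub), reboot, then run:
#   has_kernel_param processor.max_cstate=5 "$(cat /proc/cmdline)" && echo "parameter is active"
```

this only tells you that the kernel saw the parameter, not that the C6 state is really off.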

the only method that works for me (i have been without crashes for almost a month now) is described in the following tutorial. i have only tested it on kernels 4.14 and 4.15.

  1. load MSR kernel module during boot:
    sudo nano /etc/modules-load.d/modules.conf
    add the following line and save the file:

    msr

  2. get zenstates from github:
    cd ~
    git clone https://github.com/r4m0n/ZenStates-Linux.git
    copy zenstates.py to a place where you can leave it and forget about it:
    sudo cp ZenStates-Linux/zenstates.py /usr/local/bin/

  3. create systemd service:
    sudo nano /usr/lib/systemd/system/ryzen-disable-c6.service
    enter the following code and save it:

[Unit]
Description=Disable C6 power state on Ryzen CPUs
DefaultDependencies=no
After=sysinit.target local-fs.target
Before=basic.target

[Service]
Type=oneshot
ExecStart=/usr/local/bin/zenstates.py --c6-disable

[Install]
WantedBy=basic.target

  4. enable systemd service:
    sudo systemctl enable ryzen-disable-c6

  5. delete downloaded folder from github:
    cd ~
    sudo rm -r ZenStates-Linux

  6. reboot your system

  7. make sure everything has worked:

    • check whether the msr kernel module is loaded (the following command should produce output):
      lsmod | grep msr

    • check whether the C6 power state is disabled:
      sudo /usr/local/bin/zenstates.py -l
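
if you do not want to read the listing by eye, here is a rough sketch that checks it for you. it assumes that zenstates.py -l prints lines like "C6 State - Package - Disabled" and "C6 State - Core - Disabled". i have not verified that for every version of the script, and the helper name c6_is_disabled is made up.

```shell
# hypothetical helper: read `zenstates.py -l` output on stdin and succeed only if
# at least one "C6 State" line is present and every such line ends in "Disabled".
# assumption: lines look like "C6 State - Package - Disabled"
c6_is_disabled() {
    awk '
        /^C6 State/ { seen = 1; if ($NF != "Disabled") bad = 1 }
        END { if (seen && !bad) exit 0; exit 1 }
    '
}

# usage after the reboot:
#   sudo /usr/local/bin/zenstates.py -l | c6_is_disabled && echo "C6 is disabled"
```

the check fails on purpose when no "C6 State" line shows up at all, so a changed output format is not mistaken for success.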


#2

I just got a lockup with C6 state disabled.

But so far I haven’t gotten one when using the iommu=soft grub option.

My system is:

System:    Host: strit-lenovo Kernel: 4.16.0-1-MANJARO x86_64 bits: 64 gcc: 7.3.0
           Desktop: KDE Plasma 5.12.3 (Qt 5.10.1) Distro: Manjaro Linux
Machine:   Device: laptop System: LENOVO product: 81BR v: Lenovo ideapad 720S-13ARR serial: N/A
           Mobo: LENOVO model: LNVNB161216 v: SDK0J40709 WIN serial: N/A
           UEFI: LENOVO v: 6KCN28WW date: 12/19/2017
Battery    BAT0: charge: 47.9 Wh 100.1% condition: 47.9/48.0 Wh (100%) model: SMP L16M4PB3 status: Charging
CPU:       Quad core AMD Ryzen 7 2700U with Radeon Vega Mobile Gfx (-MT-MCP-) arch: Zen rev.0 cache: 2048 KB
           flags: (lm nx sse sse2 sse3 sse4_1 sse4_2 sse4a ssse3 svm) bmips: 17567
           clock speeds: max: 2200 MHz 1: 1417 MHz 2: 1439 MHz 3: 2629 MHz 4: 3194 MHz 5: 1581 MHz 6: 1638 MHz
           7: 1892 MHz 8: 2112 MHz
Graphics:  Card: Advanced Micro Devices [AMD/ATI] Raven Bridge [Radeon Vega Series / Radeon Vega Mobile Series]
           bus-ID: 03:00.0
           Display Server: x11 (X.Org 1.19.6 ) drivers: ati,amdgpu (unloaded: modesetting,radeon)
           Resolution: 1920x1080@60.02hz
           OpenGL: renderer: AMD RAVEN (DRM 3.23.0 / 4.16.0-1-MANJARO, LLVM 5.0.1)
           version: 4.5 Mesa 17.3.6 Direct Render: Yes
Audio:     Card-1 Advanced Micro Devices [AMD] Device 15e3 driver: snd_hda_intel bus-ID: 03:00.6
           Card-2 Advanced Micro Devices [AMD/ATI] Device 15de driver: snd_hda_intel bus-ID: 03:00.1
           Sound: Advanced Linux Sound Architecture v: k4.16.0-1-MANJARO
Network:   Card: Realtek RTL8821CE 802.11ac PCIe Wireless Network Adapter
           driver: rtl8821ce port: 3000 bus-ID: 01:00.0
           IF: wlp1s0 state: up mac: <filter>
Drives:    HDD Total Size: 512.1GB (34.8% used)
           ID-1: /dev/nvme0n1 model: SAMSUNG_MZVLB512HAJQ size: 512.1GB
Partition: ID-1: / size: 67G used: 11G (16%) fs: ext4 dev: /dev/nvme0n1p6
           ID-2: /home size: 148G used: 11G (8%) fs: ext4 dev: /dev/nvme0n1p7
Sensors:   System Temperatures: cpu: No active sensors found. Have you configured your sensors yet? mobo: N/A gpu: 0.0
Info:      Processes: 204 Uptime: 11 min Memory: 1616.3/7518.9MB Init: systemd Gcc sys: 7.3.0
           Client: Shell (bash 4.4.191) inxi: 2.3.56 


#3

i had a crash with disabled c6 state as well - a month ago.
but crashes/lock-ups can have many reasons.

are you recommending both disabling the C6 state AND using the iommu=soft kernel boot parameter, or ONLY using iommu=soft?


#4

I don’t know really. I just know that so far I have had no lockups while iommu=soft is used.


#5

Just curious, have you tried “amd_iommu=on iommu=pt” instead of “iommu=soft”?
Any benefit?


#6

Haven’t tried that. Will do next time I reboot. :)

Tried them. Got a lockup after like 10 hours of uptime.


#7

Tried them, didn’t work.

But I have also gotten a few lockups with iommu=soft in the last few days.

This is the contents of journalctl -b -1 after a reboot:

mar 28 10:20:40 strit-lenovo kernel: C2H_MAILBOX_STATUS: 
mar 28 10:20:40 strit-lenovo kernel: 01 
mar 28 10:20:40 strit-lenovo kernel: 
mar 28 10:20:42 strit-lenovo kernel: RTW: 
mar 28 10:20:42 strit-lenovo kernel: C2H_MAILBOX_STATUS: 
mar 28 10:20:42 strit-lenovo kernel: 01 
mar 28 10:20:42 strit-lenovo kernel: 
mar 28 10:20:44 strit-lenovo kernel: RTW: 
mar 28 10:20:44 strit-lenovo kernel: C2H_MAILBOX_STATUS: 
mar 28 10:20:44 strit-lenovo kernel: 01 
mar 28 10:20:44 strit-lenovo kernel: 
mar 28 10:20:46 strit-lenovo kernel: RTW: 
mar 28 10:20:46 strit-lenovo kernel: C2H_MAILBOX_STATUS: 
mar 28 10:20:46 strit-lenovo kernel: 01 
mar 28 10:20:46 strit-lenovo kernel: 
mar 28 10:20:49 strit-lenovo kernel: RTW: 
mar 28 10:20:49 strit-lenovo kernel: C2H_MAILBOX_STATUS: 
mar 28 10:20:49 strit-lenovo kernel: 01 
mar 28 10:20:49 strit-lenovo kernel: 
mar 28 10:20:51 strit-lenovo kernel: RTW: rtw_update_ramask => mac_id:0, networkType:0x0b, mask:0x00000000000f0000
                                              ==> rssi_level:6, rate_bitmap:0x0000000000000000, shortGIrate=1
                                              ==> bw:0, ignore_bw:0x1
mar 28 10:20:51 strit-lenovo kernel: RTW: rtl8821c_set_FwMacIdConfig_cmd(wlp1s0): mac_id=0 raid=0x3 bw=0 mask=0x00000000000f0000
mar 28 10:20:51 strit-lenovo kernel: RTW: rtl8821c_set_FwMacIdConfig_cmd, mask=0x00000000000f0000, mac_id=0x0, raid=0x3, shortGIrate=1, power training=00
mar 28 10:20:51 strit-lenovo kernel: RTW: 
mar 28 10:20:51 strit-lenovo kernel: C2H_MAILBOX_STATUS: 
mar 28 10:20:51 strit-lenovo kernel: 01 
mar 28 10:20:51 strit-lenovo kernel: 
mar 28 10:20:53 strit-lenovo kernel: RTW: 
mar 28 10:20:53 strit-lenovo kernel: C2H_MAILBOX_STATUS: 
mar 28 10:20:53 strit-lenovo kernel: 01 
mar 28 10:20:53 strit-lenovo kernel: 
mar 28 10:20:55 strit-lenovo kernel: RTW: 
mar 28 10:20:55 strit-lenovo kernel: C2H_MAILBOX_STATUS: 
mar 28 10:20:55 strit-lenovo kernel: 01 
mar 28 10:20:55 strit-lenovo kernel: 
mar 28 10:20:57 strit-lenovo kernel: RTW: 
mar 28 10:20:57 strit-lenovo kernel: C2H_MAILBOX_STATUS: 
mar 28 10:20:57 strit-lenovo kernel: 01 
mar 28 10:20:57 strit-lenovo kernel: 
mar 28 10:20:59 strit-lenovo kernel: RTW: 
mar 28 10:20:59 strit-lenovo kernel: C2H_MAILBOX_STATUS: 
mar 28 10:20:59 strit-lenovo kernel: 01 
mar 28 10:20:59 strit-lenovo kernel: 
mar 28 10:21:01 strit-lenovo kernel: RTW: rtw_update_ramask => mac_id:0, networkType:0x0b, mask:0x00000000000f0000
                                              ==> rssi_level:6, rate_bitmap:0x0000000000000000, shortGIrate=1
                                              ==> bw:0, ignore_bw:0x1
mar 28 10:21:01 strit-lenovo kernel: RTW: rtl8821c_set_FwMacIdConfig_cmd(wlp1s0): mac_id=0 raid=0x3 bw=0 mask=0x00000000000f0000
mar 28 10:21:01 strit-lenovo kernel: RTW: rtl8821c_set_FwMacIdConfig_cmd, mask=0x00000000000f0000, mac_id=0x0, raid=0x3, shortGIrate=1, power training=00
mar 28 10:21:01 strit-lenovo kernel: RTW: 
mar 28 10:21:01 strit-lenovo kernel: C2H_MAILBOX_STATUS: 
mar 28 10:21:01 strit-lenovo kernel: 01 
mar 28 10:21:01 strit-lenovo kernel: 

Seems to be Wi-Fi related, but I’m not sure.


#8

what about running BOINC to compute data for World Community Grid or another project that appeals to you? it keeps the CPU from sitting idle but doesn’t get in the way when you need the PC, because BOINC automatically pauses when it detects other activity demanding CPU time above a certain percentage.


#9

Also got this in my journal after my latest lockup:

mar 29 20:31:53 strit-lenovo kernel: watchdog: BUG: soft lockup - CPU#6 stuck for 23s! [kworker/6:1:129]
mar 29 20:31:53 strit-lenovo kernel: Modules linked in: sd_mod snd_usb_audio snd_usbmidi_lib usbhid snd_rawmidi snd_seq_device uas usb_storage cdc_ether scsi_mod usbnet r8152 mii bnep btusb btrtl btbcm btintel bluetooth ecdh_generic amd>
mar 29 20:31:53 strit-lenovo kernel:  agpgart pcspkr syscopyarea sp5100_tco sysfillrect sysimgblt sparse_keymap soundcore fb_sys_fops i2c_piix4 tpm_crb rfkill ucsi_acpi tpm_tis typec_ucsi shpchp tpm_tis_core i2c_hid typec wmi battery rt>
mar 29 20:31:53 strit-lenovo kernel: CPU: 6 PID: 129 Comm: kworker/6:1 Tainted: G           O L   4.16.0-1-MANJARO #1
mar 29 20:31:53 strit-lenovo kernel: Hardware name: LENOVO 81BR/LNVNB161216, BIOS 6KCN28WW 12/19/2017
mar 29 20:31:53 strit-lenovo kernel: Workqueue: events netstamp_clear
mar 29 20:31:53 strit-lenovo kernel: RIP: 0010:smp_call_function_many+0x23e/0x270
mar 29 20:31:53 strit-lenovo kernel: RSP: 0018:ffffaed3412f3d58 EFLAGS: 00000202 ORIG_RAX: ffffffffffffff12
mar 29 20:31:53 strit-lenovo kernel: RAX: 0000000000000002 RBX: ffff9d645eda2dc0 RCX: 0000000000000001
mar 29 20:31:53 strit-lenovo kernel: RDX: ffff9d645eca8a80 RSI: 0000000000000000 RDI: ffff9d645eda2dc8
mar 29 20:31:53 strit-lenovo kernel: RBP: ffff9d645eda2df0 R08: 0000000000000007 R09: ffff9d645eda2df0
mar 29 20:31:53 strit-lenovo kernel: R10: ffff9d645eda2dc8 R11: 0000000000000005 R12: 0000000000000001
mar 29 20:31:53 strit-lenovo kernel: R13: 0000000000000140 R14: ffffffff9202cc30 R15: 0000000000000000
mar 29 20:31:53 strit-lenovo kernel: FS:  0000000000000000(0000) GS:ffff9d645ed80000(0000) knlGS:0000000000000000
mar 29 20:31:53 strit-lenovo kernel: CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
mar 29 20:31:53 strit-lenovo kernel: CR2: 0000102d7b098000 CR3: 0000000203a26000 CR4: 00000000003406e0
mar 29 20:31:53 strit-lenovo kernel: Call Trace:
mar 29 20:31:53 strit-lenovo kernel:  ? setup_data_read+0xc0/0xc0
mar 29 20:31:53 strit-lenovo kernel:  ? netif_receive_skb_internal+0x21/0x130
mar 29 20:31:53 strit-lenovo kernel:  smp_call_function+0x36/0x60
mar 29 20:31:53 strit-lenovo kernel:  ? setup_data_read+0xc0/0xc0
mar 29 20:31:53 strit-lenovo kernel:  on_each_cpu+0x2a/0x80
mar 29 20:31:53 strit-lenovo kernel:  ? netif_receive_skb_internal+0x20/0x130
mar 29 20:31:53 strit-lenovo kernel:  ? netif_receive_skb_internal+0x21/0x130
mar 29 20:31:53 strit-lenovo kernel:  text_poke_bp+0x68/0xe0
mar 29 20:31:53 strit-lenovo kernel:  __jump_label_transform.isra.0+0x123/0x130
mar 29 20:31:53 strit-lenovo kernel:  arch_jump_label_transform+0x2b/0x40
mar 29 20:31:53 strit-lenovo kernel:  __jump_label_update+0x7d/0xb0
mar 29 20:31:53 strit-lenovo kernel:  static_key_enable_cpuslocked+0x52/0x80
mar 29 20:31:53 strit-lenovo kernel:  static_key_enable+0x16/0x20
mar 29 20:31:53 strit-lenovo kernel:  process_one_work+0x1ce/0x3f0
mar 29 20:31:53 strit-lenovo kernel:  worker_thread+0x2b/0x3d0
mar 29 20:31:53 strit-lenovo kernel:  ? process_one_work+0x3f0/0x3f0
mar 29 20:31:53 strit-lenovo kernel:  kthread+0x113/0x130
mar 29 20:31:53 strit-lenovo kernel:  ? kthread_create_on_node+0x70/0x70
mar 29 20:31:53 strit-lenovo kernel:  ret_from_fork+0x22/0x40
mar 29 20:31:53 strit-lenovo kernel: Code: 89 c7 e8 56 dd 5f 00 3b 05 f4 23 02 01 0f 83 46 fe ff ff 48 63 c8 48 8b 13 48 03 14 cd 40 04 ed 92 8b 4a 18 83 e1 01 74 0a f3 90 <8b> 4a 18 83 e1 01 75 f6 eb c7 48 c7 c2 80 34 13 93 48 89 ee 89 
mar 29 20:31:53 strit-lenovo kernel: pcieport 0000:00:01.2: AER: Multiple Uncorrected (Non-Fatal) error received: id=0008
mar 29 20:31:53 strit-lenovo kernel: pcieport 0000:00:01.2: PCIe Bus Error: severity=Uncorrected (Non-Fatal), type=Transaction Layer, id=000a(Requester ID)
mar 29 20:31:53 strit-lenovo kernel: pcieport 0000:00:01.2:   device [1022:15d3] error status/mask=00100000/04400000
mar 29 20:31:53 strit-lenovo kernel: pcieport 0000:00:01.2:    [20] Unsupported Request    (First)
mar 29 20:31:53 strit-lenovo kernel: pcieport 0000:00:01.2:   TLP Header: 34000000 02000010 00000000 88468846
mar 29 20:31:53 strit-lenovo kernel: pcieport 0000:00:01.2: broadcast error_detected message
mar 29 20:31:53 strit-lenovo kernel: pcieport 0000:00:01.2: broadcast mmio_enabled message

PS: Tried disabling power saving, TLP etc, did not help.
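
A rough sketch for pulling only the lockup and PCIe error lines out of the previous boot's kernel log, in case anyone wants to compare theirs. The patterns are simply copied from the messages above, so other kernels may word them differently, and the function name lockup_summary is made up for illustration.

```shell
# Hypothetical helper: keep only soft-lockup and PCIe AER lines from kernel log text on stdin.
# Assumption: the messages are worded as in the journal excerpt above.
lockup_summary() {
    # "|| true" keeps the function from failing when nothing matches
    grep -E 'soft lockup|AER:|PCIe Bus Error' || true
}

# Usage on the previous boot's kernel messages:
#   journalctl -k -b -1 --no-pager | lockup_summary
```

An empty result only means that none of these three patterns appeared, not that the boot was clean.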


#10

Well, on my new system with a Ryzen 7 2700 I get lockups when I play a Wine game. So for me it’s not because of low usage.

I have not noticed these lockups happening when using native games or apps.

I did try the rcu_nocbs=0-15 option. Did not help.


#11

what about trying kernel 4.17, have you done that to see if it’s any better?


#12

Not yet. Did they fix anything for Ryzen in it?
I only heard about the Vega stuff.


#13

not specifically Ryzen but there are references to timer fixes. also, while nothing was specifically mentioned about my SSD it works properly with no FIS errors when using kernel 4.17, so it’s entirely possible something may be in there for Ryzen lockups too.

EDIT - this post jinxed that statement about the SSD, although the error frequency has massively reduced.