While testing, I read that these issues could be caused by the graphics card’s power scaling features.
I’ve worked with @shadesmaclean on this issue for a bit, and during my searches I found a script on GitHub that helps control the card’s power/performance settings.
We’ve since seen some success by manually switching the card’s power scaling settings with this script. We did try some of the amdgpu GUI tools that can do the same thing, but honestly we couldn’t get them to work on his setup, so we ended up using the following script:
source: https://github.com/superjamie/snippets/blob/master/radcard
https://wiki.archlinux.org/index.php/AMDGPU
#!/bin/bash
# Radcard
# Script to control radeon DPM power saving
# Ref: https://wiki.archlinux.org/index.php/ATI#Powersaving
# Version: 2019-02-12
# License: GPLv3
# Authors: jamie.bainbridge@gmail.com
CARDPATH="/sys/class/drm/card0/device"

do_set() {
    case "$1" in
        bat*)
            sudo sh -c "echo battery > $CARDPATH/power_dpm_state"
            ;;
        bal*)
            sudo sh -c "echo balanced > $CARDPATH/power_dpm_state"
            ;;
        per*)
            sudo sh -c "echo performance > $CARDPATH/power_dpm_state"
            ;;
        a*)
            sudo sh -c "echo auto > $CARDPATH/power_dpm_force_performance_level"
            ;;
        l*)
            sudo sh -c "echo low > $CARDPATH/power_dpm_force_performance_level"
            ;;
        h*)
            sudo sh -c "echo high > $CARDPATH/power_dpm_force_performance_level"
            ;;
        *)
            do_usage
            ;;
    esac
}

do_get() {
    echo -n "power_dpm_state: "; cat "$CARDPATH/power_dpm_state"
    echo -n "power_dpm_force_performance_level: "; cat "$CARDPATH/power_dpm_force_performance_level"
}

do_usage() {
    echo "Usage: $(basename "$0") [get|set [battery|balanced|performance|auto|low|high|bat|bal|per|a|h|l]]"
    exit 1
}

case "$1" in
    "set")
        shift
        for VAR in "$@"; do
            do_set "$VAR"
        done
        do_get
        ;;
    "get")
        do_get
        ;;
    *)
        do_usage
        ;;
esac

exit 0
Basically, save that as a .sh file (radcard.sh here) and make it executable. Then, from the same directory as the script, we’ve been having shadesmaclean run the following command:
./radcard.sh set performance high
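One thing worth checking first: the script hardcodes card0, which may not be the AMD card on a machine with more than one GPU. Here is a minimal sketch for listing which cards actually expose the DPM control file. The function name list_dpm_cards and its optional directory argument are my own invention, not part of radcard.

```shell
# list_dpm_cards: hypothetical helper, not part of radcard.
# Prints every DRM card device directory that exposes the DPM
# performance-level file, together with the file's current value.
# $1 (optional): DRM class directory, defaults to /sys/class/drm.
list_dpm_cards() {
    local root="${1:-/sys/class/drm}"
    local dev
    for dev in "$root"/card*/device; do
        # Entries without a readable control file (connectors such as
        # card0-HDMI-A-1, or cards without DPM support) are skipped.
        if [ -r "$dev/power_dpm_force_performance_level" ]; then
            printf '%s: %s\n' "$dev" "$(cat "$dev/power_dpm_force_performance_level")"
        fi
    done
}
```

Calling list_dpm_cards with no arguments reads the real sysfs tree and needs no sudo, since it only reads. If the AMD card turns out to be card1, CARDPATH at the top of the script would need to change to match.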
So far this SEEMS to have kept his system from crashing from the same errors as before (while we’re still looking into seemingly unrelated things on the side).
I’m sorry, but I didn’t post my research early enough to track down all the sources I used when I found this solution. The ArchWiki article on ATI powersaving and the one on AMDGPU both have some references to the power saving features. I’m not an expert, but after fiddling with things, hopefully this leads to someone giving a more elaborate and informed solution/answer. There are ways to set this to load automatically at boot, but we haven’t tested that yet as we’re still evaluating its effectiveness.
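For anyone who wants to try the boot-time route, here is one possible sketch, assuming a systemd-based distro. To be clear, we have NOT tested this on the affected system. The function name print_dpm_unit is made up, and I haven’t verified whether the unit would run after the graphics driver has created the sysfs file.

```shell
# print_dpm_unit: hypothetical helper that prints a systemd oneshot unit
# which writes a performance level to the card's sysfs file at boot.
# Untested on the affected system; treat it as a starting point only.
# $1 (optional): sysfs device path, defaults to the path radcard uses.
# $2 (optional): level to write (auto, low or high), defaults to high.
print_dpm_unit() {
    local cardpath="${1:-/sys/class/drm/card0/device}"
    local level="${2:-high}"
    # System services run as root, so no sudo is needed inside the unit.
    cat <<EOF
[Unit]
Description=Force GPU DPM performance level to $level

[Service]
Type=oneshot
ExecStart=/bin/sh -c 'echo $level > $cardpath/power_dpm_force_performance_level'

[Install]
WantedBy=multi-user.target
EOF
}
```

The output would be saved as a .service file under /etc/systemd/system and turned on with systemctl enable. This only covers the "high" half of the command we run by hand; the power_dpm_state half would need a second ExecStart line.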
Perhaps @shadesmaclean can chime in with any extra info on this from his end?
Anyway, TL;DR: it seems to be a power saving bug with some AMD graphics cards, and as a workaround, disabling power saving or manually forcing the performance settings to high seems to avoid the issue.
As a novice myself, any corrections or information you guys can throw at me would be great.
As a side note, we also had @shadesmaclean put his CPU in performance profile mode, to rule out any chance of power settings affecting that too. It’s probably unrelated, but I should mention it since it was part of our workarounds.
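If "performance profile mode" here means the cpufreq scaling governor (that is an assumption; it could just as well be a desktop power profile setting), a quick way to confirm what each CPU is currently using might look like the sketch below. The function name show_cpu_governors is hypothetical.

```shell
# show_cpu_governors: hypothetical helper that prints the cpufreq scaling
# governor currently in use for each CPU.
# $1 (optional): CPU sysfs directory, defaults to /sys/devices/system/cpu.
show_cpu_governors() {
    local root="${1:-/sys/devices/system/cpu}"
    local f cpu
    for f in "$root"/cpu[0-9]*/cpufreq/scaling_governor; do
        # Skip CPUs with no cpufreq support (or an unmatched glob).
        [ -r "$f" ] || continue
        # The path looks like .../cpu3/cpufreq/scaling_governor;
        # strip the tail, then keep only the cpuN directory name.
        cpu="${f%/cpufreq/scaling_governor}"
        printf '%s: %s\n' "${cpu##*/}" "$(cat "$f")"
    done
}
```

Run with no arguments, it only reads from sysfs, so it needs no sudo. If every line says performance, the CPU side of the workaround is in effect.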