Btrfs scrub failing on NVMe with PCIe bus error

Hi,
I am running my system on btrfs, and after cleaning up some big files I decided to run a scrub with Btrfs Assistant. During the scrub my system crashed. I tried again with the same result.
I then booted Manjaro from a USB stick and got these errors from journalctl --dmesg:

Feb 21 21:10:28 manjaro kernel: Bluetooth: RFCOMM TTY layer initialized
Feb 21 21:10:28 manjaro kernel: Bluetooth: RFCOMM socket layer initialized
Feb 21 21:10:28 manjaro kernel: Bluetooth: RFCOMM ver 1.11
Feb 21 21:10:32 manjaro kernel: logitech-hidpp-device 0003:046D:406C.0008: HID++ 4.2 device connected.
Feb 21 21:11:24 manjaro kernel: BTRFS: device fsid 703ffe1a-84b0-41ba-9dab-dfc62477af24 devid 1 transid 516824 /dev/nvme1n1p2 (259:2) scanned by pool-udisksd (2553)
Feb 21 21:11:24 manjaro kernel: BTRFS info (device nvme1n1p2): first mount of filesystem 703ffe1a-84b0-41ba-9dab-dfc62477af24
Feb 21 21:11:24 manjaro kernel: BTRFS info (device nvme1n1p2): using crc32c (crc32c-intel) checksum algorithm
Feb 21 21:11:24 manjaro kernel: BTRFS info (device nvme1n1p2): disk space caching is enabled
Feb 21 21:11:24 manjaro kernel: BTRFS warning (device nvme1n1p2): space cache v1 is being deprecated and will be removed in a future release, please use -o space_cache=v2
Feb 21 21:11:39 manjaro kernel: BTRFS info (device nvme1n1p2): scrub: started on devid 1
Feb 21 21:18:51 manjaro kernel: pcieport 0000:00:1d.0: AER: Multiple Correctable error message received from 0000:00:1d.0
Feb 21 21:18:51 manjaro kernel: pcieport 0000:00:1d.0: PCIe Bus Error: severity=Correctable, type=Physical Layer, (Receiver ID)
Feb 21 21:18:51 manjaro kernel: pcieport 0000:00:1d.0:   device [8086:a298] error status/mask=00002041/00002000
Feb 21 21:18:51 manjaro kernel: pcieport 0000:00:1d.0:    [ 0] RxErr                  (First)
Feb 21 21:18:51 manjaro kernel: pcieport 0000:00:1d.0:    [ 6] BadTLP                
Feb 21 21:19:22 manjaro kernel: nvme nvme1: controller is down; will reset: CSTS=0xffffffff, PCI_STATUS=0x10
Feb 21 21:19:22 manjaro kernel: nvme nvme1: Does your device have a faulty power saving mode enabled?
Feb 21 21:19:22 manjaro kernel: nvme nvme1: Try "nvme_core.default_ps_max_latency_us=0 pcie_aspm=off pcie_port_pm=off" and report a bug
Feb 21 21:19:22 manjaro kernel: nvme 0000:09:00.0: enabling device (0000 -> 0002)
Feb 21 21:19:22 manjaro kernel: nvme nvme1: Disabling device after reset failure: -19
Feb 21 21:19:22 manjaro kernel: BTRFS error (device nvme1n1p2): bdev /dev/nvme1n1p2 errs: wr 0, rd 1, flush 0, corrupt 0, gen 0
Feb 21 21:19:22 manjaro kernel: BTRFS error (device nvme1n1p2): bdev /dev/nvme1n1p2 errs: wr 0, rd 2, flush 0, corrupt 0, gen 0
Feb 21 21:19:22 manjaro kernel: BTRFS error (device nvme1n1p2): bdev /dev/nvme1n1p2 errs: wr 0, rd 3, flush 0, corrupt 0, gen 0
Feb 21 21:19:22 manjaro kernel: BTRFS error (device nvme1n1p2): bdev /dev/nvme1n1p2 errs: wr 0, rd 4, flush 0, corrupt 0, gen 0
Feb 21 21:19:22 manjaro kernel: BTRFS error (device nvme1n1p2): bdev /dev/nvme1n1p2 errs: wr 0, rd 6, flush 0, corrupt 0, gen 0
Feb 21 21:19:22 manjaro kernel: BTRFS error (device nvme1n1p2): bdev /dev/nvme1n1p2 errs: wr 0, rd 5, flush 0, corrupt 0, gen 0
Feb 21 21:19:22 manjaro kernel: BTRFS error (device nvme1n1p2): bdev /dev/nvme1n1p2 errs: wr 0, rd 7, flush 0, corrupt 0, gen 0
Feb 21 21:19:22 manjaro kernel: BTRFS error (device nvme1n1p2): bdev /dev/nvme1n1p2 errs: wr 0, rd 8, flush 0, corrupt 0, gen 0
Feb 21 21:19:22 manjaro kernel: BTRFS error (device nvme1n1p2): bdev /dev/nvme1n1p2 errs: wr 0, rd 9, flush 0, corrupt 0, gen 0
Feb 21 21:19:22 manjaro kernel: BTRFS error (device nvme1n1p2): bdev /dev/nvme1n1p2 errs: wr 0, rd 10, flush 0, corrupt 0, gen 0
Feb 21 21:19:22 manjaro kernel: BTRFS error (device nvme1n1p2): unable to fixup (regular) error at logical 4950066528256 on dev /dev/nvme1n1p2 physical 297511944192
Feb 21 21:19:22 manjaro kernel: BTRFS error (device nvme1n1p2): unable to fixup (regular) error at logical 4950066724864 on dev /dev/nvme1n1p2 physical 297512140800
Feb 21 21:19:22 manjaro kernel: BTRFS error (device nvme1n1p2): unable to fixup (regular) error at logical 4950066921472 on dev /dev/nvme1n1p2 physical 297512337408
Feb 21 21:19:22 manjaro kernel: BTRFS error (device nvme1n1p2): unable to fixup (regular) error at logical 4950066790400 on dev /dev/nvme1n1p2 physical 297512206336
Feb 21 21:19:22 manjaro kernel: BTRFS error (device nvme1n1p2): unable to fixup (regular) error at logical 4950066659328 on dev /dev/nvme1n1p2 physical 297512075264
Feb 21 21:19:22 manjaro kernel: BTRFS error (device nvme1n1p2): unable to fixup (regular) error at logical 4950066724864 on dev /dev/nvme1n1p2 physical 297512140800
Feb 21 21:19:22 manjaro kernel: BTRFS error (device nvme1n1p2): unable to fixup (regular) error at logical 4950066528256 on dev /dev/nvme1n1p2 physical 297511944192
Feb 21 21:19:22 manjaro kernel: BTRFS error (device nvme1n1p2): unable to fixup (regular) error at logical 4950066987008 on dev /dev/nvme1n1p2 physical 297512402944
Feb 21 21:19:22 manjaro kernel: BTRFS error (device nvme1n1p2): unable to fixup (regular) error at logical 4950066659328 on dev /dev/nvme1n1p2 physical 297512075264
Feb 21 21:19:22 manjaro kernel: BTRFS error (device nvme1n1p2): unable to fixup (regular) error at logical 4950066724864 on dev /dev/nvme1n1p2 physical 297512140800
Feb 21 21:19:22 manjaro kernel: BTRFS: error (device nvme1n1p2) in btrfs_commit_transaction:2523: errno=-5 IO failure (Error while writing out transaction)
Feb 21 21:19:22 manjaro kernel: BTRFS info (device nvme1n1p2 state E): forced readonly
Feb 21 21:19:22 manjaro kernel: BTRFS warning (device nvme1n1p2 state E): Skipping commit of aborted transaction.
Feb 21 21:19:22 manjaro kernel: BTRFS error (device nvme1n1p2 state EA): Transaction aborted (error -5)
Feb 21 21:19:22 manjaro kernel: BTRFS: error (device nvme1n1p2 state EA) in cleanup_transaction:2017: errno=-5 IO failure
Feb 21 21:19:22 manjaro kernel: BTRFS info (device nvme1n1p2 state EA): scrub: not finished on devid 1 with status: -5
Feb 21 21:19:40 manjaro kernel: btrfs_dev_stat_inc_and_print: 966 callbacks suppressed
Feb 21 21:19:40 manjaro kernel: BTRFS error (device nvme1n1p2 state EA): bdev /dev/nvme1n1p2 errs: wr 38, rd 957, flush 0, corrupt 0, gen 0
Feb 21 21:19:40 manjaro kernel: BTRFS error (device nvme1n1p2 state EA): bdev /dev/nvme1n1p2 errs: wr 38, rd 958, flush 0, corrupt 0, gen 0
Feb 21 21:19:40 manjaro kernel: BTRFS error (device nvme1n1p2 state EA): bdev /dev/nvme1n1p2 errs: wr 38, rd 959, flush 0, corrupt 0, gen 0
Feb 21 21:19:40 manjaro kernel: BTRFS error (device nvme1n1p2 state EA): bdev /dev/nvme1n1p2 errs: wr 38, rd 960, flush 0, corrupt 0, gen 0
Feb 21 21:19:40 manjaro kernel: BTRFS error (device nvme1n1p2 state EA): bdev /dev/nvme1n1p2 errs: wr 38, rd 961, flush 0, corrupt 0, gen 0
Feb 21 21:19:40 manjaro kernel: BTRFS error (device nvme1n1p2 state EA): bdev /dev/nvme1n1p2 errs: wr 38, rd 962, flush 0, corrupt 0, gen 0
Feb 21 21:19:40 manjaro kernel: BTRFS error (device nvme1n1p2 state EA): bdev /dev/nvme1n1p2 errs: wr 38, rd 963, flush 0, corrupt 0, gen 0
Feb 21 21:19:40 manjaro kernel: BTRFS error (device nvme1n1p2 state EA): bdev /dev/nvme1n1p2 errs: wr 38, rd 964, flush 0, corrupt 0, gen 0
Feb 21 21:19:40 manjaro kernel: BTRFS error (device nvme1n1p2 state EA): bdev /dev/nvme1n1p2 errs: wr 38, rd 965, flush 0, corrupt 0, gen 0
Feb 21 21:19:40 manjaro kernel: BTRFS error (device nvme1n1p2 state EA): bdev /dev/nvme1n1p2 errs: wr 38, rd 966, flush 0, corrupt 0, gen 0

btrfs check did not find any errors.

In case it is related: since the last Manjaro update (2025-02-16) the desktop sometimes does not start properly on boot, but systemctl restart sddm resolves it.

inxi:

System:
  Kernel: 6.12.12-2-MANJARO arch: x86_64 bits: 64 compiler: gcc v: 14.2.1
  Desktop: KDE Plasma v: 6.2.5 tk: Qt v: N/A wm: kwin_x11 dm: SDDM
    Distro: Manjaro base: Arch Linux
Machine:
  Type: Desktop Mobo: ASUSTeK model: MAXIMUS IX FORMULA v: Rev 1.xx
    serial: <superuser required> part-nu: SKU UEFI: American Megatrends v: 1301
    date: 03/14/2018
Battery:
  Device-1: hidpp_battery_0 model: Logitech G603 Wireless Gaming Mouse
    serial: <filter> charge: 100% (should be ignored) status: discharging
CPU:
  Info: quad core model: Intel Core i7-7700K bits: 64 type: MT MCP
    arch: Kaby Lake rev: 9 cache: L1: 256 KiB L2: 1024 KiB L3: 8 MiB
  Speed (MHz): avg: 800 min/max: 800/4700 cores: 1: 800 2: 800 3: 800 4: 800
    5: 800 6: 800 7: 800 8: 800 bogomips: 67224
  Flags: avx avx2 ht lm nx pae sse sse2 sse3 sse4_1 sse4_2 ssse3 vmx
Graphics:
  Device-1: Advanced Micro Devices [AMD/ATI] Navi 32 [Radeon RX 7700 XT /
    7800 XT] vendor: Tul / PowerColor driver: amdgpu v: kernel arch: RDNA-3
    pcie: speed: 16 GT/s lanes: 16 ports: active: HDMI-A-1 empty: DP-1, DP-2,
    DP-3, Writeback-1 bus-ID: 03:00.0 chip-ID: 1002:747e
  Display: x11 server: X.Org v: 21.1.15 with: Xwayland v: 24.1.5
    compositor: kwin_x11 driver: X: loaded: amdgpu unloaded: modesetting,radeon
    alternate: fbdev,vesa dri: radeonsi gpu: amdgpu display-ID: :0 screens: 1
  Screen-1: 0 s-res: 3840x1080 s-dpi: 96
  Monitor-1: HDMI-A-1 mapped: HDMI-A-0 model: Samsung C49HG9x res: 3840x1080
    hz: 120 dpi: 81 diag: 1242mm (48.9")
  API: EGL v: 1.5 platforms: device: 0 drv: radeonsi device: 1 drv: swrast
    gbm: drv: kms_swrast surfaceless: drv: radeonsi x11: drv: radeonsi
    inactive: wayland
  API: OpenGL v: 4.6 compat-v: 4.5 vendor: amd mesa v: 24.3.4-arch1.1
    glx-v: 1.4 direct-render: yes renderer: AMD Radeon RX 7800 XT (radeonsi
    navi32 LLVM 19.1.7 DRM 3.59 6.12.12-2-MANJARO) device-ID: 1002:747e
  API: Vulkan v: 1.4.303 surfaces: xcb,xlib device: 0 type: discrete-gpu
    driver: N/A device-ID: 1002:747e
  Info: Tools: api: clinfo, eglinfo, glxinfo, vulkaninfo
    de: kscreen-console,kscreen-doctor gpu: nvidia-smi wl: wayland-info
    x11: xdpyinfo, xprop, xrandr
Audio:
  Device-1: Intel 200 Series PCH HD Audio vendor: ASUSTeK
    driver: snd_hda_intel v: kernel bus-ID: 00:1f.3 chip-ID: 8086:a2f0
  Device-2: Advanced Micro Devices [AMD/ATI] Navi 31 HDMI/DP Audio
    driver: snd_hda_intel v: kernel pcie: speed: 16 GT/s lanes: 16
    bus-ID: 03:00.1 chip-ID: 1002:ab30
  API: ALSA v: k6.12.12-2-MANJARO status: kernel-api with: aoss
    type: oss-emulator
  Server-1: JACK v: 1.9.22 status: off
  Server-2: PipeWire v: 1.2.7 status: off with: pipewire-media-session
    status: active
  Server-3: PulseAudio v: 17.0-43-g3e2bb status: active with:
    1: pulseaudio-alsa type: plugin 2: pulseaudio-jack type: module
Network:
  Device-1: Intel Ethernet I219-V vendor: ASUSTeK driver: e1000e v: kernel
    port: N/A bus-ID: 00:1f.6 chip-ID: 8086:15b8
  IF: enp0s31f6 state: down mac: <filter>
  Device-2: Realtek RTL8125 2.5GbE driver: r8169 v: kernel pcie:
    speed: 5 GT/s lanes: 1 port: d000 bus-ID: 06:00.0 chip-ID: 10ec:8125
  IF: enp6s0 state: up speed: 2500 Mbps duplex: full mac: <filter>
  IF-ID-1: br-d27b5779686b state: down mac: <filter>
  IF-ID-2: docker0 state: down mac: <filter>
Bluetooth:
  Device-1: ASUSTek Qualcomm Bluetooth 4.1 driver: btusb v: 0.8 type: USB
    rev: 1.1 speed: 12 Mb/s lanes: 1 bus-ID: 1-11:5 chip-ID: 0b05:1825
  Report: rfkill ID: hci0 rfk-id: 0 state: up address: see --recommends
Drives:
  Local Storage: total: 2.72 TiB used: 1015.56 GiB (36.5%)
  ID-1: /dev/nvme0n1 vendor: Corsair model: Force MP510 size: 894.25 GiB
    speed: 31.6 Gb/s lanes: 4 serial: <filter> temp: 45.9 C
  ID-2: /dev/nvme1n1 vendor: Seagate model: FireCuda 530 ZP2000GM30013
    size: 1.82 TiB speed: 63.2 Gb/s lanes: 4 serial: <filter> temp: 19.9 C
  ID-3: /dev/sda vendor: SanDisk model: Ultra size: 28.64 GiB type: USB
    rev: 2.1 spd: 480 Mb/s lanes: 1 serial: <filter>
Partition:
  ID-1: / size: 893.96 GiB used: 332.73 GiB (37.2%) fs: btrfs
    dev: /dev/nvme0n1p2
  ID-2: /boot/efi size: 299.4 MiB used: 568 KiB (0.2%) fs: vfat
    dev: /dev/nvme0n1p1
  ID-3: /home size: 893.96 GiB used: 332.73 GiB (37.2%) fs: btrfs
    dev: /dev/nvme0n1p2
  ID-4: /var/log size: 893.96 GiB used: 332.73 GiB (37.2%) fs: btrfs
    dev: /dev/nvme0n1p2
Swap:
  Alert: No swap data was found.
Sensors:
  System Temperatures: cpu: 32.0 C mobo: N/A gpu: amdgpu temp: 44.0 C
    mem: 41.0 C
  Fan Speeds (rpm): N/A gpu: amdgpu fan: 0
Info:
  Memory: total: 32 GiB available: 31.28 GiB used: 4.45 GiB (14.2%)
  Processes: 353 Power: uptime: 2m wakeups: 0 Init: systemd v: 257
    default: graphical
  Packages: pm: pacman pkgs: 1715 Compilers: clang: 19.1.7 gcc: 14.2.1
    Shell: Zsh v: 5.9 running-in: konsole inxi: 3.3.37

smartctl:

smartctl 7.4 2023-08-01 r5530 [x86_64-linux-6.12.12-2-MANJARO] (local build)
Copyright (C) 2002-23, Bruce Allen, Christian Franke, www.smartmontools.org

=== START OF INFORMATION SECTION ===
Model Number:                       Force MP510
Serial Number:                      192182070001277136B8
Firmware Version:                   ECFM12.2
PCI Vendor/Subsystem ID:            0x1987
IEEE OUI Identifier:                0x6479a7
Total NVM Capacity:                 960.197.124.096 [960 GB]
Unallocated NVM Capacity:           0
Controller ID:                      1
NVMe Version:                       1.3
Number of Namespaces:               1
Namespace 1 Size/Capacity:          960.197.124.096 [960 GB]
Namespace 1 Formatted LBA Size:     512
Namespace 1 IEEE EUI-64:            6479a7 1fa1034238
Local Time is:                      Fri Feb 21 21:42:15 2025 CET
Firmware Updates (0x12):            1 Slot, no Reset required
Optional Admin Commands (0x0007):   Security Format Frmw_DL
Optional NVM Commands (0x0054):     DS_Mngmt Sav/Sel_Feat Timestmp
Log Page Attributes (0x0c):         Ext_Get_Lg Telmtry_Lg
Maximum Data Transfer Size:         512 Pages
Warning  Comp. Temp. Threshold:     70 Celsius
Critical Comp. Temp. Threshold:     90 Celsius

Supported Power States
St Op     Max   Active     Idle   RL RT WL WT  Ent_Lat  Ex_Lat
 0 +    10.73W       -        -    0  0  0  0        0       0
 1 +     7.69W       -        -    1  1  1  1        0       0
 2 +     6.18W       -        -    2  2  2  2        0       0
 3 -   0.0490W       -        -    3  3  3  3     2000    2000
 4 -   0.0018W       -        -    4  4  4  4    25000   25000

Supported LBA Sizes (NSID 0x1)
Id Fmt  Data  Metadt  Rel_Perf
 0 +     512       0         2
 1 -    4096       0         1

=== START OF SMART DATA SECTION ===
SMART overall-health self-assessment test result: PASSED

SMART/Health Information (NVMe Log 0x02)
Critical Warning:                   0x00
Temperature:                        43 Celsius
Available Spare:                    100%
Available Spare Threshold:          5%
Percentage Used:                    8%
Data Units Read:                    184.967.564 [94,7 TB]
Data Units Written:                 157.287.543 [80,5 TB]
Host Read Commands:                 751.351.142
Host Write Commands:                1.143.640.682
Controller Busy Time:               4.338
Power Cycles:                       2.505
Power On Hours:                     8.347
Unsafe Shutdowns:                   217
Media and Data Integrity Errors:    0
Error Information Log Entries:      5.646
Warning  Comp. Temperature Time:    24
Critical Comp. Temperature Time:    0

Error Information (NVMe Log 0x01, 16 of 63 entries)
Num   ErrCount  SQId   CmdId  Status  PELoc          LBA  NSID    VS  Message
  0       5646     0  0x0010  0x4004  0x028            0     0     -  Invalid Field in Command
  1       5645     0  0x0005  0x4004      -            0     0     -  Invalid Field in Command
  2       5644     0  0x0000  0x4004  0x028            0     0     -  Invalid Field in Command
  3       5643     0  0x0005  0x4004      -            0     0     -  Invalid Field in Command
  4       5642     0  0x3015  0x4004  0x028            0     0     -  Invalid Field in Command
  5       5641     0  0x0005  0x4004      -            0     0     -  Invalid Field in Command
  6       5640     0  0x0000  0x4004  0x028            0     0     -  Invalid Field in Command
  7       5639     0  0x0005  0x4004      -            0     0     -  Invalid Field in Command
  8       5638     0  0x0000  0x4004  0x028            0     0     -  Invalid Field in Command
  9       5637     0  0x0005  0x4004      -            0     0     -  Invalid Field in Command
 10       5636     0  0x5002  0x4004  0x028            0     0     -  Invalid Field in Command
 11       5635     0  0x0005  0x4004      -            0     0     -  Invalid Field in Command
 12       5634     0  0x0000  0x4004  0x028            0     0     -  Invalid Field in Command
 13       5633     0  0x0005  0x4004      -            0     0     -  Invalid Field in Command
 14       5632     0  0x000c  0x4004  0x028            0     0     -  Invalid Field in Command
 15       5631     0  0x0005  0x4004      -            0     0     -  Invalid Field in Command
... (47 entries not read)

Self-tests not supported

Does somebody know what I can do to resolve the problem?

Welcome to the forum! :vulcan_salute:

This suggests that the problem is in the hardware. A btrfs scrub doesn’t do anything low-level: it reads all data and metadata, verifies the checksums, and repairs a block only if a good copy exists. A problem with either the flash memory on your NVMe drive or with its controller can of course still cause the system to crash.
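
To illustrate why the log points at hardware rather than at the filesystem, here is a minimal Python sketch that takes a few lines copied from the journal output above and lists when each subsystem first reported an error. The helper names (`first_seen`, `max_counters`) and the marker strings are just choices made for this illustration, not part of any tool.

```python
import re

# A few lines copied verbatim from the journal output in the first post.
LOG = """\
Feb 21 21:11:39 manjaro kernel: BTRFS info (device nvme1n1p2): scrub: started on devid 1
Feb 21 21:18:51 manjaro kernel: pcieport 0000:00:1d.0: PCIe Bus Error: severity=Correctable, type=Physical Layer, (Receiver ID)
Feb 21 21:19:22 manjaro kernel: nvme nvme1: controller is down; will reset: CSTS=0xffffffff, PCI_STATUS=0x10
Feb 21 21:19:22 manjaro kernel: nvme nvme1: Disabling device after reset failure: -19
Feb 21 21:19:22 manjaro kernel: BTRFS error (device nvme1n1p2): bdev /dev/nvme1n1p2 errs: wr 0, rd 1, flush 0, corrupt 0, gen 0
Feb 21 21:19:40 manjaro kernel: BTRFS error (device nvme1n1p2 state EA): bdev /dev/nvme1n1p2 errs: wr 38, rd 966, flush 0, corrupt 0, gen 0
"""

# Splits a journal line into its time stamp and the kernel message.
LINE = re.compile(r"^\w{3} +\d+ (\d\d:\d\d:\d\d) \S+ kernel: (.*)$")

# Hypothetical marker strings, chosen to match the messages seen in this log.
MARKERS = {
    "pcie": "PCIe Bus Error",
    "nvme": "controller is down",
    "btrfs": "BTRFS error",
}


def first_seen(log):
    """Return (subsystem, time) for the first error of each subsystem, in log order."""
    seen = {}
    for line in log.splitlines():
        match = LINE.match(line)
        if not match:
            continue
        time, msg = match.groups()
        for name, needle in MARKERS.items():
            if needle in msg and name not in seen:
                seen[name] = time
    return list(seen.items())


def max_counters(log):
    """Return the highest write and read error counters found in 'errs:' lines."""
    pairs = re.findall(r"errs: wr (\d+), rd (\d+)", log)
    return max(int(w) for w, _ in pairs), max(int(r) for _, r in pairs)


if __name__ == "__main__":
    for name, time in first_seen(LOG):
        print(time, name)
    print("max write/read errors:", max_counters(LOG))
```

In this excerpt the PCIe port reports errors first, then the NVMe controller stops responding, and only after that does btrfs start counting read errors.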

Now, the error message hints at the controller rather than at the drive itself, but considering that the drive is plugged into a PCIe slot, it is most likely a problem with the motherboard. I would therefore recommend contacting the vendor and requesting a replacement; hopefully it’s still under warranty.
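
As a worked example, the `error status/mask=00002041/00002000` line from the log can be decoded by hand. The sketch below assumes the bit layout of the PCIe AER Correctable Error Status register and uses the short names the kernel prints (`RxErr`, `BadTLP`, and so on). That the kernel lists only unmasked bits is my reading of the output above, and the function name `decode` is made up for this illustration.

```python
# Bit positions of the PCIe AER Correctable Error Status register,
# paired with the short names the Linux kernel uses in its AER messages.
CORRECTABLE_BITS = {
    0: "RxErr",         # Receiver Error
    6: "BadTLP",        # Bad TLP
    7: "BadDLLP",       # Bad DLLP
    8: "Rollover",      # REPLAY_NUM Rollover
    12: "Timeout",      # Replay Timer Timeout
    13: "NonFatalErr",  # Advisory Non-Fatal Error
    14: "CorrIntErr",   # Corrected Internal Error
    15: "HeaderOF",     # Header Log Overflow
}


def decode(status, mask):
    """Split the set status bits into (reported, masked) lists of names."""
    set_bits = [bit for bit in sorted(CORRECTABLE_BITS) if (status >> bit) & 1]
    reported = [CORRECTABLE_BITS[b] for b in set_bits if not (mask >> b) & 1]
    masked = [CORRECTABLE_BITS[b] for b in set_bits if (mask >> b) & 1]
    return reported, masked


if __name__ == "__main__":
    # Values taken from the "error status/mask=00002041/00002000" log line.
    reported, masked = decode(0x00002041, 0x00002000)
    print("reported:", reported)
    print("masked:  ", masked)
```

Both reported bits are physical or data link layer errors on the link behind port 0000:00:1d.0, which is consistent with the `type=Physical Layer` text in the log. On its own this does not say whether the slot or the drive is at fault.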

Thank you for your response!
Sadly, my motherboard is old; the CPU is from 2017. I wanted to wait one or two more years before replacing everything together. I just upgraded the graphics card and PSU.

I ran the scrub again from the live ISO so that I could replace the screenshot with text. Interestingly, it crashed again at around 75% (according to scrub status).

If it really is a hardware problem, then it's a sad story. I will try reseating the NVMe drive, and just in case I will also try a scrub from a live ISO of another distro.

The following is worth a try:

Try unplugging and re-plugging all cards and cables connected to the motherboard. Sometimes this will resolve a problem that was caused by poor connections. This is also an opportunity to clean up the PC internally.

In any case, it is now time to make another backup.

  • First of the important data
  • Then of the entire system/volume
    (although it is to be expected that if the scrub goes wrong, the system backup may get stuck at the same point)

:footprints:

P.S.

If it is a temperature or throughput problem, the --limit option of scrub could help; see btrfs-scrub(8) in the BTRFS documentation.
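
As a rough sketch of what such a limit does under the hood: to my knowledge, recent kernels expose a per-device cap in sysfs as `scrub_speed_max`, in bytes per second, with 0 meaning unlimited. The Python below only builds that path and writes a value. The function names and the 100 MiB/s figure are assumptions made for illustration. The FSID and devid come from the log in the first post. The filesystem has to be mounted and the script run as root for the real file to exist.

```python
from pathlib import Path

MIB = 1024 * 1024


def scrub_speed_path(fsid, devid, sysfs_root="/sys/fs/btrfs"):
    """Build the sysfs path of the per-device scrub throughput cap."""
    return Path(sysfs_root) / fsid / "devinfo" / str(devid) / "scrub_speed_max"


def set_scrub_speed_max(fsid, devid, bytes_per_sec, sysfs_root="/sys/fs/btrfs"):
    """Write the cap in bytes per second; 0 removes the limit."""
    if bytes_per_sec < 0:
        raise ValueError("limit must be non-negative")
    path = scrub_speed_path(fsid, devid, sysfs_root)
    path.write_text(f"{bytes_per_sec}\n")
    return path


if __name__ == "__main__":
    # FSID and devid as printed in the journal; the limit itself is an example value.
    fsid = "703ffe1a-84b0-41ba-9dab-dfc62477af24"
    print(scrub_speed_path(fsid, 1))
    # Uncomment on the affected machine (needs root and a mounted filesystem):
    # set_scrub_speed_max(fsid, 1, 100 * MIB)
```

A slower scrub keeps the drive cooler and the link less busy, so if the failure moves or disappears with a limit in place, that would be a hint towards a thermal or signal problem rather than a bad region of flash.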

P.S. 2

Did you give btrfs time to completely process the deletion of files / snapshots?

It can take a few minutes for btrfs to settle down after major changes have been made.

I have a small update on this issue. I replaced the mainboard, CPU and RAM, so basically everything has now been replaced. I still get the issue when I scrub my main disk. There is no longer any message in journalctl (dmesg) as before, but the scrub doesn't finish and the drive goes read-only. So the NVMe disk must be broken. I have another fresh disk lying around. I will try to clone my system to it, since I don't want to reinstall all my software. I will follow this guide: Wolfgang Ziegler - Migrating a Linux system to a larger SSD. Wish me luck!

Wow, that took only one hour. I expected problems with booting, but everything is working. I ran a scrub on the new disk and it passed. The problem seems to be solved.