How to debug multiple EXT4-fs errors per session

Although I haven’t experienced any major problems, I decided to have a look at
$ journalctl -p 3 -xb

Besides some errors related to kf.baloo and kwin-x11 crashing on every logout and login, there are some error messages I don’t know how to tackle:

Feb 16 10:19:57 manjaroluia kernel: EXT4-fs (dm-0): Delayed block allocation failed for inode 10357172 at logical offset 96 with max blocks 536 with error 117
Feb 16 10:19:57 manjaroluia kernel: EXT4-fs (dm-0): This should not happen!! Data will be lost

These show up during any session (say, only having Firefox open).

dm-0 seems to be where my system is mounted:

$ dmsetup info /dev/dm-0
Name:              luks-4f08816c-b40b-4a78-b670-8ebe66df2cf8
State:             ACTIVE
Read Ahead:        256
Tables present:    LIVE
Open count:        1
Event number:      0
Major, minor:      254, 0
Number of targets: 1
UUID: CRYPT-LUKS1-4f08816cb40b4a78b6708ebe66df2cf8-luks-4f08816c-b40b-4a78-b670-8ebe66df2cf8
$ lsblk
NAME                                          MAJ:MIN RM   SIZE RO TYPE  MOUNTPOINT
sda                                             8:0    0 232.9G  0 disk  
├─sda1                                          8:1    0   300M  0 part  /boot/efi
├─sda2                                          8:2    0 215.5G  0 part  
│ └─luks-4f08816c-b40b-4a78-b670-8ebe66df2cf8 254:0    0 215.5G  0 crypt /
└─sda3                                          8:3    0  17.1G  0 part  
  └─luks-35b530c7-cd21-409c-b31f-5fadbf7f3af6 254:1    0  17.1G  0 crypt [SWAP]

But I’m unsure how to proceed to find out where the errors are coming from.
I think I could learn a lot from hearing about how a more experienced user would debug this.
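
As a starting point, the kernel message itself can be decoded. On Linux, error 117 is EUCLEAN ("Structure needs cleaning"), which ext4 uses to report on-disk filesystem corruption. Below is a minimal sketch that pulls the device, inode and error number out of such a line. The function name parse_ext4_error is made up for this example, and the follow-up commands in the comments need root:

```shell
#!/bin/sh
# Hypothetical helper: print "<device> <inode> <errno>" for an EXT4-fs
# "Delayed block allocation failed" line read from stdin.
parse_ext4_error() {
    sed -n 's/.*EXT4-fs (\([^)]*\)).*inode \([0-9]*\).*with error \([0-9]*\).*/\1 \2 \3/p'
}

line='Feb 16 10:19:57 manjaroluia kernel: EXT4-fs (dm-0): Delayed block allocation failed for inode 10357172 at logical offset 96 with max blocks 536 with error 117'
printf '%s\n' "$line" | parse_ext4_error
# prints: dm-0 10357172 117

# Possible follow-ups (as root), not run here:
#   find / -xdev -inum 10357172              # which file owns the inode
#   debugfs -R 'ncheck 10357172' /dev/dm-0   # same lookup via e2fsprogs (read-only)
```

Knowing which file the inode belongs to (a browser cache file, for instance) can hint at whether the damage is confined to one spot or spread across the filesystem.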

As recommended, the output from inxi -Fazy follows below:

System:
  Kernel: 5.10.15-1-MANJARO x86_64 bits: 64 compiler: gcc v: 10.2.1 
  parameters: BOOT_IMAGE=/boot/vmlinuz-5.10-x86_64 
  root=UUID=ff20647b-87fe-4836-aa36-89739a4529b5 rw quiet 
  cryptdevice=UUID=4f08816c-b40b-4a78-b670-8ebe66df2cf8:luks-4f08816c-b40b-4a78-b670-8ebe66df2cf8 
  root=/dev/mapper/luks-4f08816c-b40b-4a78-b670-8ebe66df2cf8 
  resume=/dev/mapper/luks-35b530c7-cd21-409c-b31f-5fadbf7f3af6 
  udev.log_priority=3 
  Desktop: KDE Plasma 5.20.5 tk: Qt 5.15.2 wm: kwin_x11 dm: SDDM 
  Distro: Manjaro Linux 
Machine:
  Type: Desktop Mobo: ASUSTeK model: MAXIMUS VIII RANGER v: Rev 1.xx 
  serial: <filter> UEFI: American Megatrends v: 3802 date: 03/15/2018 
CPU:
  Info: Quad Core model: Intel Core i7-6700K bits: 64 type: MT MCP 
  arch: Skylake-S family: 6 model-id: 5E (94) stepping: 3 microcode: E2 
  L2 cache: 8 MiB 
  flags: avx avx2 lm nx pae sse sse2 sse3 sse4_1 sse4_2 ssse3 vmx 
  bogomips: 64026 
  Speed: 800 MHz min/max: 800/4200 MHz Core speeds (MHz): 1: 800 2: 800 3: 800 
  4: 800 5: 800 6: 800 7: 800 8: 800 
  Vulnerabilities: Type: itlb_multihit status: KVM: VMX disabled 
  Type: l1tf 
  mitigation: PTE Inversion; VMX: conditional cache flushes, SMT vulnerable 
  Type: mds mitigation: Clear CPU buffers; SMT vulnerable 
  Type: meltdown mitigation: PTI 
  Type: spec_store_bypass 
  mitigation: Speculative Store Bypass disabled via prctl and seccomp 
  Type: spectre_v1 
  mitigation: usercopy/swapgs barriers and __user pointer sanitization 
  Type: spectre_v2 mitigation: Full generic retpoline, IBPB: conditional, 
  IBRS_FW, STIBP: conditional, RSB filling 
  Type: srbds mitigation: Microcode 
  Type: tsx_async_abort mitigation: Clear CPU buffers; SMT vulnerable 
Graphics:
  Device-1: NVIDIA GM204 [GeForce GTX 970] vendor: Gigabyte driver: nvidia 
  v: 390.141 alternate: nouveau,nvidia_drm bus ID: 01:00.0 chip ID: 10de:13c2 
  class ID: 0300 
  Display: x11 server: X.Org 1.20.10 compositor: kwin_x11 driver: 
  loaded: nvidia display ID: :0 screens: 1 
  Screen-1: 0 s-res: 1920x1080 s-dpi: 81 s-size: 602x343mm (23.7x13.5") 
  s-diag: 693mm (27.3") 
  OpenGL: renderer: llvmpipe (LLVM 11.0.1 256 bits) v: 4.5 Mesa 20.3.4 
  compat-v: 3.1 direct render: Yes 
Audio:
  Device-1: Intel 100 Series/C230 Series Family HD Audio vendor: ASUSTeK 
  driver: snd_hda_intel v: kernel bus ID: 00:1f.3 chip ID: 8086:a170 
  class ID: 0403 
  Device-2: NVIDIA GM204 High Definition Audio vendor: Gigabyte 
  driver: snd_hda_intel v: kernel bus ID: 01:00.1 chip ID: 10de:0fbb 
  class ID: 0403 
  Sound Server: ALSA v: k5.10.15-1-MANJARO 
Drives:
  Local Storage: total: 1.82 TiB used: 41.15 GiB (2.2%) 
  SMART Message: Unable to run smartctl. Root privileges required. 
  ID-1: /dev/nvme0n1 maj-min: 259:0 vendor: Samsung model: SSD 970 EVO 500GB 
  size: 465.76 GiB block size: physical: 512 B logical: 512 B speed: 31.6 Gb/s 
  lanes: 4 rotation: SSD serial: <filter> rev: 1B2QEXE7 temp: 27.9 C 
  scheme: GPT 
  ID-2: /dev/sda maj-min: 8:0 vendor: Samsung model: SSD 850 EVO 250GB 
  size: 232.89 GiB block size: physical: 512 B logical: 512 B speed: 6.0 Gb/s 
  rotation: SSD serial: <filter> rev: 1B6Q scheme: GPT 
  ID-3: /dev/sdb maj-min: 8:16 vendor: Western Digital 
  model: WDS250G2B0A-00SM50 size: 232.89 GiB block size: physical: 512 B 
  logical: 512 B speed: 6.0 Gb/s rotation: SSD serial: <filter> rev: 20WD 
  scheme: GPT 
  ID-4: /dev/sdc maj-min: 8:32 vendor: Samsung model: HD103UJ size: 931.51 GiB 
  block size: physical: 512 B logical: 512 B speed: 3.0 Gb/s serial: <filter> 
  rev: 1118 scheme: MBR 
Partition:
  ID-1: / raw size: 215.45 GiB size: 211.07 GiB (97.97%) 
  used: 41.15 GiB (19.5%) fs: ext4 dev: /dev/dm-0 maj-min: 254:0 
  mapped: luks-4f08816c-b40b-4a78-b670-8ebe66df2cf8 
  ID-2: /boot/efi raw size: 300 MiB size: 299.4 MiB (99.80%) 
  used: 484 KiB (0.2%) fs: vfat dev: /dev/sda1 maj-min: 8:1 
Swap:
  Kernel: swappiness: 60 (default) cache pressure: 100 (default) 
  ID-1: swap-1 type: partition size: 17.13 GiB used: 0 KiB (0.0%) priority: -2 
  dev: /dev/dm-1 maj-min: 254:1 
  mapped: luks-35b530c7-cd21-409c-b31f-5fadbf7f3af6 
Sensors:
  System Temperatures: cpu: 22.0 C mobo: N/A gpu: nvidia temp: 30 C 
  Fan Speeds (RPM): N/A gpu: nvidia fan: 45% 
Info:
  Processes: 243 Uptime: 38m wakeups: 0 Memory: 15.58 GiB 
  used: 2.82 GiB (18.1%) Init: systemd v: 247 Compilers: gcc: 10.2.0 Packages: 
  pacman: 1280 lib: 396 Shell: Zsh v: 5.8 running in: yakuake inxi: 3.3.01

Please post the output of:

smartctl --all /dev/sda

And please take a full data backup immediately! This doesn’t look good!

:fearful:

Thanks for replying!

I copied the most relevant data to another disk as per your recommendation.

The output of smartctl --all /dev/sda:

smartctl 7.2 2020-12-30 r5155 [x86_64-linux-5.10.15-1-MANJARO] (local build)
Copyright (C) 2002-20, Bruce Allen, Christian Franke, www.smartmontools.org

=== START OF INFORMATION SECTION ===
Model Family:     Samsung based SSDs
Device Model:     Samsung SSD 850 EVO 250GB
Serial Number:    S21PNSAG430763K
LU WWN Device Id: 5 002538 da024b697
Firmware Version: EMT01B6Q
User Capacity:    250,059,350,016 bytes [250 GB]
Sector Size:      512 bytes logical/physical
Rotation Rate:    Solid State Device
TRIM Command:     Available
Device is:        In smartctl database [for details use: -P show]
ATA Version is:   ACS-2, ATA8-ACS T13/1699-D revision 4c
SATA Version is:  SATA 3.1, 6.0 Gb/s (current: 6.0 Gb/s)
Local Time is:    Wed Feb 24 12:23:29 2021 WET
SMART support is: Available - device has SMART capability.
SMART support is: Enabled

=== START OF READ SMART DATA SECTION ===
SMART overall-health self-assessment test result: PASSED

General SMART Values:
Offline data collection status:  (0x00) Offline data collection activity
                                        was never started.
                                        Auto Offline Data Collection: Disabled.
Self-test execution status:      (   0) The previous self-test routine completed
                                        without error or no self-test has ever 
                                        been run.
Total time to complete Offline 
data collection:                (    0) seconds.
Offline data collection
capabilities:                    (0x53) SMART execute Offline immediate.
                                        Auto Offline data collection on/off support.
                                        Suspend Offline collection upon new
                                        command.
                                        No Offline surface scan supported.
                                        Self-test supported.
                                        No Conveyance Self-test supported.
                                        Selective Self-test supported.
SMART capabilities:            (0x0003) Saves SMART data before entering
                                        power-saving mode.
                                        Supports SMART auto save timer.
Error logging capability:        (0x01) Error logging supported.
                                        General Purpose Logging supported.
Short self-test routine 
recommended polling time:        (   2) minutes.
Extended self-test routine
recommended polling time:        ( 133) minutes.
SCT capabilities:              (0x003d) SCT Status supported.
                                        SCT Error Recovery Control supported.
                                        SCT Feature Control supported.
                                        SCT Data Table supported.

SMART Attributes Data Structure revision number: 1
Vendor Specific SMART Attributes with Thresholds:
ID# ATTRIBUTE_NAME          FLAG     VALUE WORST THRESH TYPE      UPDATED  WHEN_FAILED RAW_VALUE
  5 Reallocated_Sector_Ct   0x0033   100   100   010    Pre-fail  Always       -       0
  9 Power_On_Hours          0x0032   098   098   000    Old_age   Always       -       7115
 12 Power_Cycle_Count       0x0032   095   095   000    Old_age   Always       -       4577
177 Wear_Leveling_Count     0x0013   095   095   000    Pre-fail  Always       -       96
179 Used_Rsvd_Blk_Cnt_Tot   0x0013   100   100   010    Pre-fail  Always       -       0
181 Program_Fail_Cnt_Total  0x0032   100   100   010    Old_age   Always       -       0
182 Erase_Fail_Count_Total  0x0032   100   100   010    Old_age   Always       -       0
183 Runtime_Bad_Block       0x0013   100   099   010    Pre-fail  Always       -       0
187 Uncorrectable_Error_Cnt 0x0032   100   100   000    Old_age   Always       -       0
190 Airflow_Temperature_Cel 0x0032   074   053   000    Old_age   Always       -       26
195 ECC_Error_Rate          0x001a   200   200   000    Old_age   Always       -       0
199 CRC_Error_Count         0x003e   100   100   000    Old_age   Always       -       0
235 POR_Recovery_Count      0x0012   099   099   000    Old_age   Always       -       132
241 Total_LBAs_Written      0x0032   099   099   000    Old_age   Always       -       33857759467

SMART Error Log Version: 1
No Errors Logged

SMART Self-test log structure revision number 1
No self-tests have been logged.  [To run self-tests, use: smartctl -t]

Warning! SMART Selective Self-Test Log Structure error: invalid SMART checksum.
SMART Selective self-test log data structure revision number 1
 SPAN  MIN_LBA  MAX_LBA  CURRENT_TEST_STATUS
    1        0        0  Not_testing
    2        0        0  Not_testing
    3        0        0  Not_testing
    4        0        0  Not_testing
    5        0        0  Not_testing
  255        0    65535  Read_scanning was never started
Selective self-test flags (0x0):
  After scanning selected spans, do NOT read-scan remainder of disk.
If Selective self-test is pending on power-up, resume after 0 minute delay.

It seems okay, or am I reading it wrong?

:sweat_smile: :+1:

OK, the hardware is fine!
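
One way to arrive at that reading: the attributes that usually point to failing flash or a bad cable (reallocated sectors, program and erase failures, uncorrectable errors, CRC errors) all have a raw value of 0 in the table above. Below is a small sketch that filters smartctl -A output for those IDs. The function name smart_nonzero and the choice of IDs are assumptions made for this example, not an official list:

```shell
#!/bin/sh
# Hypothetical helper: read `smartctl -A` output on stdin and print every
# watched attribute whose RAW_VALUE (10th column) is not 0.
smart_nonzero() {
    awk '$1 ~ /^(5|179|181|182|183|187|199)$/ && $10 != 0 { print $2 "=" $10 }'
}

# Two rows in the same layout as the table above; the value 7 in the
# second row is made up to show what a hit looks like.
smart_nonzero <<'EOF'
  5 Reallocated_Sector_Ct   0x0033   100   100   010    Pre-fail  Always       -       0
199 CRC_Error_Count         0x003e   100   100   000    Old_age   Always       -       7
EOF
# prints: CRC_Error_Count=7

# Real use (as root): smartctl -A /dev/sda | smart_nonzero
# Since no self-tests have been logged yet, one can be started with
#   smartctl -t short /dev/sda
# and read back afterwards with
#   smartctl -l selftest /dev/sda
```

With the real table from this thread the filter prints nothing, which matches the PASSED result. A clean SMART report says nothing about the filesystem itself, though, so the next step is still needed.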

Please boot from a Manjaro USB stick and execute:

fsck /dev/sda1
fsck /dev/dm-0
fsck /dev/dm-1

and follow best practices if you encounter any errors.

(I can’t tell you what to do, as I don’t know what kind of errors you’re going to get: search DuckDuckGo for any messages you don’t recognise.)
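
A note on running this from a live session: /dev/dm-0 only exists once the LUKS container has been unlocked, and dm-1 is the swap area, which contains no filesystem for fsck to check. A rough sketch of the sequence for the root filesystem follows. It assumes the LUKS partition is still /dev/sda2 as in the lsblk output above (device names can change when booting from USB, so check with lsblk first), and "cryptroot" is an arbitrary mapper name chosen for this example. By default it only prints the commands:

```shell
#!/bin/sh
# Sketch: check the encrypted root filesystem from a live USB session.
# Defaults to a dry run; set DRY_RUN=0 and run as root to execute for real.
DRY_RUN=${DRY_RUN:-1}

run() {
    # In dry-run mode print the command instead of executing it.
    if [ "$DRY_RUN" = 1 ]; then
        echo "$*"
    else
        "$@"
    fi
}

check_encrypted_root() {
    part=${1:-/dev/sda2}     # assumed LUKS partition, verify with lsblk
    name=${2:-cryptroot}     # arbitrary mapper name for this example
    run cryptsetup open "$part" "$name" || return 1   # asks for the passphrase
    run e2fsck -f "/dev/mapper/$name"                 # -f: check even if marked clean
    run cryptsetup close "$name"
}

check_encrypted_root
```

The filesystem must not be mounted while e2fsck runs, which is why this is done from the USB stick rather than from the installed system.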