Btrfs and copy duration

Small question about the copy-on-write feature.
From reading the documentation, I understand that, when modifying a file, the modified data is written as a new copy rather than overwriting the existing data on disk. Something I have a harder time grasping is: in the long run, what happens to the previous copies?
I found that the notion of a “commit” exists, and that there’s a configurable time-to-commit. Is that related? Like, after x time, the copy is considered the “definitive” version and the previous one is deleted?

It’s a fundamentally different paradigm, but I can explain it with the ZFS lexicon, which uses the same principles as Btrfs or any CoW system. Need to crack my knuckles for this one.


For ZFS (and substitute the proper terminology for the Btrfs landscape) we have the following units of data:

  • Sectors: size (bytes/kilobytes) determined at the hardware level by the drive, but to make things smoother, assume that your drives use 4K sectors.
  • Blocks: size (bytes/kilobytes) determined at the virtual device level (vdev), which you use to build a pool, datasets (subvolumes), etc. Sometimes referred to as the “ashift” value. It’s best to match your drive’s sector size, hence an ashift that yields 4K blocks works best for 4K sectors.
  • Records: size, i.e. up to how many “blocks” can make up a record. For instance, the maximum record size can be set to fit only one block (4K! :open_mouth:), which completely kills inline compression and requires more metadata, as you will have tons and tons and tons of records no matter what types of files you’re dealing with. Or you can have really large records that hold many blocks (ZFS allows up to 1MB record sizes, which means a single record can hold up to 256 blocks; see the quick arithmetic sketch after this list).
  • Files: made up of records; which records to load, and in which order, is how the file is loaded into memory.
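
To put rough numbers on that list, here’s a quick Python sketch of the arithmetic (purely illustrative, not ZFS commands; it assumes 4K sectors, ashift=12, and the 1MB maximum recordsize mentioned above):

```python
# Quick arithmetic for the list above (plain Python, not ZFS code).
# Assumes 4K sectors, ashift=12, and the 1MB maximum recordsize.

sector_size = 4096            # reported by the drive
ashift = 12                   # chosen when the vdev is created
block_size = 2 ** ashift      # 4096 bytes; ideally matches the sector size
record_size = 1024 * 1024     # 1MB (ZFS's default recordsize is 128K)

print(record_size // block_size)   # 256 blocks fit in one full-size record

file_size = 2 * 1024**3            # a 2GB file...
print(file_size // record_size)    # ...is described by 2048 full records
```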

The file in its entirety is not written as a new copy: only the records which are modified are. Every bit of data making up the current version of the file is left untouched on disk, and the filesystem table will not point to the newly modified records until it is 100% sure the operation completed successfully. At this point, the old records still exist on the device, but the file no longer points to them. (However, if you made a snapshot of the dataset/subvolume, then the snapshot will still point to those earlier records, in case you should choose to revert/restore.)

So let’s say you have a record size of 1MB. You’ve got yourself a large 2GB media file. You load it into some app and make modifications. Throughout the entirety of this media file, for some reason only 10 records are modified. This means that 10 new 1MB records are written to the dataset/subvolume, so only 10MB of new data hits the disk. If you never made a snapshot at an earlier point in time, the 10 old records are freed, so no extra space is used up overall. If you did? Then your current live dataset/subvolume, compared to the earlier snapshot, will have a difference of 10MB, represented by those 10 older records.
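
The same example as back-of-the-envelope Python, if numbers help (plain arithmetic, nothing filesystem-specific):

```python
# Back-of-the-envelope numbers for the example above (plain Python arithmetic,
# not filesystem code).

record_size = 1 * 1024**2     # 1MB recordsize
file_size = 2 * 1024**3       # 2GB media file
modified_records = 10         # records actually touched by the edit

new_bytes_written = modified_records * record_size
print(new_bytes_written // 1024**2)   # 10MB of new records land on the disk

# No snapshot: the 10 old records lose their last reference, so net growth is ~0.
# Snapshot taken earlier: the old records stay referenced, so the live
# dataset/subvolume differs from the snapshot by roughly:
print(modified_records * record_size // 1024**2)   # 10MB
```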

This is fundamentally different from copying a whole new/temporary 2GB file all over again. Only what is modified is “copied-on-write”.

With ZFS and Btrfs, no files are ever modified “inplace”, even if the software believes it is modifying a file inplace.


Thanks for the detailed explanation, but that doesn’t answer my question.

They get marked for deletion, unless a snapshot references them.


Is that related to the “commit”?

I’m under-qualified for this; @winnie has explained it better than I could. I don’t know what the “commit” is, though I suspect it may be related to something journal-like, and I use zfs so my knowledge of btrfs is very limited.

However, my understanding of filesystems is that once there are no references to the data (hard links/inodes or whatever) left, the space is freed/marked for deletion/marked as free/available to be written, whatever you want to call it.

zfs snapshots also contain references, so as long as a snapshot references the old record, it is kept; once there are no references left, it becomes “empty” space that can be written to.

I seem to have forgotten most of the lower level stuff. :frowning_face:

This is pretty much it. :point_up: The main difference is that these records (or whatever they’re called in Btrfs) are not “marked” for deletion. They simply have nothing left that points to them: the live filesystem table does not point to them and there are no snapshots that point to them.

Forensically? Yes, they can be sniffed out if no encryption is used and those areas of the physical disks were never used again for new data. For end-users? Those areas of the disk are considered “unused” and will not take up any extra space in the pool/dataset/subvolume capacity.
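
If it helps, here’s a toy Python model of that “nothing points to them” idea. The record names are made up for the sketch, and this is not how ZFS or Btrfs actually store their trees:

```python
# Toy model of "nothing points to them anymore". Illustration only; the record
# names are invented and this is not how either filesystem stores its metadata.

all_on_disk = {"rec1", "rec2_old", "rec2_new", "rec3"}    # physical records on the device
live_refs = {"rec1", "rec2_new", "rec3"}                  # what the live filesystem points to
snapshot_refs = {"snap-A": {"rec1", "rec2_old", "rec3"}}  # what each snapshot points to

def unreferenced(on_disk, live, snaps):
    """Records no longer pointed to by the live filesystem or any snapshot."""
    referenced = set(live)
    for refs in snaps.values():
        referenced |= refs
    return on_disk - referenced

print(unreferenced(all_on_disk, live_refs, snapshot_refs))  # set(): snap-A still holds rec2_old

del snapshot_refs["snap-A"]                                 # destroy the snapshot
print(unreferenced(all_on_disk, live_refs, snapshot_refs))  # {'rec2_old'}: free to be overwritten
```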


Another way to put it: If you never use snapshots (which defeats one of the main strengths of CoW filesystems) then old copies of records are gone forever.

Scenario A (no snapshots):

  • Let’s say you’re making a custom ISO for a distro. It’s about 2GB large.
  • The next day you realize you need to change some included packages which take up about 100MB near the “middle” of the ISO.
  • The software you use “modifies” the .iso file “inplace”.
  • However, your CoW filesystem saves these newly modified 100MB of changes on different locations on the disk.
  • After successfully saving these new records, the live filesystem now points to the new 100MB worth of records, while the old 100MB worth of records are no longer referenced. (They’re “gone” forever.)
  • If you decide to “revert” back to the older version of your ISO, you cannot.

Scenario B (snapshots):

  • Let’s say you’re making a custom ISO for a distro. It’s about 2GB large.
  • You’re happy with this version of the ISO file, so you make a snapshot of the dataset/subvolume named "important-2021-07-31-18-00"
  • The next day you realize you need to change some included packages which take up about 100MB near the “middle” of the ISO.
  • The software you use “modifies” the .iso file “inplace”.
  • However, your CoW filesystem saves these newly modified 100MB of changes on different locations on the disk.
  • After successfully saving these new records, the live filesystem now points to the new 100MB worth of records, while the old 100MB worth of records are still referenced by the snapshot "important-2021-07-31-18-00"
  • If you decide to “revert” back to the older version of your ISO, you can either retrieve a copy of it from “important-2021-07-31-18-00” (possible with ZFS, not sure about Btrfs) or simply revert the entire filesystem (dataset/subvolume) to “important-2021-07-31-18-00”
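
Here’s the space accounting for both scenarios as a rough Python sketch (illustrative only; the sizes mirror the example above, and nothing here is a real filesystem API):

```python
# Rough space accounting for the two ISO scenarios above (illustrative Python,
# not a filesystem API; the numbers mirror the example).

ISO_SIZE = 2 * 1024**3        # ~2GB ISO file
MODIFIED = 100 * 1024**2      # ~100MB of records rewritten the next day

def after_inplace_edit(snapshot_taken: bool) -> dict:
    """What the pool roughly looks like after the 'inplace' modification."""
    # The live file always references ~2GB of (mostly shared) records.
    # The old 100MB of records survive only if a snapshot still references them.
    return {
        "live_file_bytes": ISO_SIZE,
        "extra_bytes_held_by_old_records": MODIFIED if snapshot_taken else 0,
        "can_revert_to_old_iso": snapshot_taken,
    }

print(after_inplace_edit(snapshot_taken=False))  # Scenario A: nothing to revert to
print(after_inplace_edit(snapshot_taken=True))   # Scenario B: ~100MB kept by the snapshot
```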

In scenario B, while using the live filesystem your 2GB ISO file is the “new” version with your modifications, yet there still exist 100MB worth of old records being referenced by the snapshot “important-2021-07-31-18-00”. You can always restore/revert back to this version.

If you want to free up space, you can delete the snapshot “important-2021-07-31-18-00”; then there are no longer any references pointing to those pre-modification 100MB of records, and you free up 100MB of space.


If you don’t use snapshots, no extra space is taken up and the old copies of records are “gone” (as the live filesystem now points to the new records.) The areas of the disks where these old copies exist can be overwritten in the future.

If you do use snapshots, the previous copies of records are still being referenced by earlier snapshots, and are kept safe until the snapshots that reference them are destroyed.

Using the live filesystem, you will not have access to these previous versions of data in your daily use. ZFS allows you to navigate to a “secret” folder (the hidden .zfs/snapshot directory) to pick and choose which previous versions of files you want to retrieve, or you can simply revert the entire filesystem (dataset/subvolume) to the snapshot in question. (I’m not sure if Btrfs allows the former, but I know it allows the latter.)


If your concern is forensics, then yes, it’s “possible” for a cybersecurity professional to retrieve these older records that still physically live on the disk, but this is no different from any other filesystem that doesn’t use encryption. You know the Recycle Bin in Windows? When you “empty” it, nothing is actually wiped. That data can be retrieved using cheap software.


I’m not familiar with “commit” for Btrfs. It might be similar to ZFS’s “write intent”, which handles “synchronous writes” using a write intent log. (If there’s a power outage or system crash, upon reloading the ZFS pool, it checks its write intent log to see what still needs to be written to the filesystem.) Perhaps Btrfs has a tunable where you can change how long it waits for the “okay”.


That confused me, because it is different from what I observed when experimenting, so I went to check why.

I did my experiment with an OpenSUSE VM, since I’ve known them to use Btrfs for a long (the longest?) time. What misled me was what I saw as “copies” in Snapper, since I never manually made snapshots. And as those kept accumulating, I was wondering when they would be “purged”. I found it weird that a filesystem would just keep using more space without ever freeing it. Hence my question…
But those were actually snapshots, automatically triggered by Yast and Zypper as I installed packages. The behavior I saw then made sense.

Thanks for your time.

Glad it makes more sense to you now. :v:

Just remember that CoW and snapshots operate on the blocks/records, not the files. The way things are presented to the end-user makes it seem like it’s operating on “whole files”.

Snapshots only take up extra space on what is different in terms of records/blocks.

Hence, you see two versions of the same 2GB file in your snapshot history and intuitively think, “it’s taking up an extra 2GB to keep an older version of this large file?” :open_mouth:

Yet in reality, the CoW filesystem is only concerned with the units of data that comprise the file, and keeps track of the differences as blocks/records dispersed around the disks.

Perception. The end user sees:
“Two versions of the same 2GB file! That’s 2GB + 2GB! Yuck! If I delete the previous snapshot I’ll free up 2GB of space, right?” (It’s not true, though; see below.)

Reality. The CoW filesystem knows:
“Going back since the previous snapshot, this 2GB file has had 100MB worth of modifications, and thus the previous snapshot is using up 100MB of extra space in order to preserve the blocks/records that are necessary to reference the entire previous version of this 2GB file. If the user deletes the previous snapshot, they will free up only 100MB of space.”
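
Sketched as plain Python arithmetic with the numbers from this example (illustrative only):

```python
# The "perception vs. reality" math, sketched in Python (illustrative only).

MiB = 1024**2
file_size = 2048 * MiB               # the 2GB file, visible in both the live fs and the snapshot
changed_since_snapshot = 100 * MiB   # records rewritten since the snapshot was taken

perceived_usage = 2 * file_size                      # the "two full copies" intuition
actual_usage = file_size + changed_since_snapshot    # shared records are counted only once

print(perceived_usage // MiB)             # 4096 MB "perceived"
print(actual_usage // MiB)                # 2148 MB actually allocated
print((actual_usage - file_size) // MiB)  # 100 MB freed if the snapshot is destroyed, not 2048
```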

On this point, it’s not as grave as it seems, since the “accumulation” of extra space is very small in real-world use: 10MB here, another 16MB there, maybe an extra 1MB there, another 5MB here. (It only takes up as much space as what has been modified in the past.) It will take a long time to reach the point where you’re running out of space on high-capacity storage devices.

You can also “prune” old snapshots yourself, or have an automated cron job or systemd timer destroy old snapshots on a routine basis, which frees up the space they were holding.
