First off – nothing I’m going to talk about in this post is novel or overly surprising. I just haven’t found a clear writeup of it before. I’m not criticising any design decisions or claiming this is an important issue, just raising something that people might otherwise be unaware of.
With that out of the way: automatic deduplication of data is a feature of modern filesystems like zfs and btrfs. It takes two forms – inline, where the filesystem detects that data being written to disk is identical to data that already exists on disk and simply references the existing copy rather than writing a second one, and offline, where tooling retroactively identifies duplicated data and removes the duplicate copies (zfs supports inline deduplication, btrfs currently supports only offline). In a world where disks end up with multiple copies of cloud or container images, deduplication can free up significant amounts of disk space.
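To make the inline case concrete, here is a toy model in Python. It is only a sketch of the general idea (hash each block, store a block once, count references), not how zfs or btrfs actually implement it. The `DedupStore` class, the fixed 4096-byte `BLOCK_SIZE` and the file names are all made up for illustration.

```python
import hashlib

BLOCK_SIZE = 4096  # assumed fixed block size for this toy model


class DedupStore:
    """Toy inline block-level deduplication (illustrative only)."""

    def __init__(self, total_blocks):
        self.total_blocks = total_blocks
        self.blocks = {}  # block hash -> reference count
        self.files = {}   # (owner, name) -> list of block hashes

    def free_blocks(self):
        # Only unique blocks consume space, however many files reference them.
        return self.total_blocks - len(self.blocks)

    def write(self, owner, name, data):
        # Simplification: overwriting an existing file is not handled.
        hashes = []
        for offset in range(0, len(data), BLOCK_SIZE):
            digest = hashlib.sha256(data[offset:offset + BLOCK_SIZE]).hexdigest()
            # Inline dedup: a block that is already stored just gains a reference.
            self.blocks[digest] = self.blocks.get(digest, 0) + 1
            hashes.append(digest)
        self.files[(owner, name)] = hashes


store = DedupStore(total_blocks=1000)
image = b"".join(bytes([i]) * BLOCK_SIZE for i in range(4))  # four distinct blocks

store.write("a", "image.raw", image)
after_first = store.free_blocks()   # 996: four new blocks stored
store.write("b", "copy.raw", image)
after_second = store.free_blocks()  # still 996: only reference counts changed
```

Note that the store never looks at who owns a block when deciding whether to share it.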
What’s the security implication? The problem is that deduplication doesn’t recognise ownership – if two users have copies of the same file, only one copy of the file will be stored. So, if user a stores a file, the amount of free space will decrease. If user b stores another copy of the same file, the amount of free space will remain the same. If user b is able to check how much free space is available, user b can determine whether the file already exists.
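As a rough sketch of what user b's check could look like, the Python below compares free space before and after writing a candidate file. The function names and the 50% `tolerance` threshold are my own assumptions, not an established tool. It uses `os.statvfs`, so it is Unix-only, and on a real system the result would be noisy: other users are writing at the same time, compression changes how much space a file takes, and free-space accounting may lag behind the write. A large candidate file helps the signal stand out.

```python
import os
import tempfile


def free_bytes(path):
    """Free space available to unprivileged users on the filesystem holding path."""
    st = os.statvfs(path)
    return st.f_bavail * st.f_frsize


def looks_deduplicated(free_before, free_after, size, tolerance=0.5):
    """Guess whether a write of `size` bytes was absorbed by deduplication.

    `tolerance` is an arbitrary threshold: if the write consumed less than
    that fraction of its size, assume the data was already on disk.
    """
    consumed = free_before - free_after
    return consumed < size * tolerance


def probe(directory, candidate):
    """Write `candidate` (bytes) into `directory` and report whether it looked deduplicated."""
    before = free_bytes(directory)
    fd, path = tempfile.mkstemp(dir=directory)
    try:
        with os.fdopen(fd, "wb") as f:
            f.write(candidate)
            f.flush()
            os.fsync(f.fileno())  # push the data to disk before measuring
        after = free_bytes(directory)
    finally:
        os.unlink(path)  # clean up the probe file
    return looks_deduplicated(before, after, len(candidate))
```

Calling `probe("/home/b", candidate_bytes)` (a hypothetical path) on a filesystem with inline deduplication would, noise permitting, return `True` when some other user already holds the same data. On a filesystem without deduplication it should return `False` for any reasonably large candidate.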
This doesn’t seem like a huge deal in most cases, but it is a violation of expected behaviour (if user b doesn’t have permission to read user a’s files, user b shouldn’t be able to determine whether user a has a specific file). We can also come up with some convoluted cases where it becomes more relevant, such as law enforcement gaining unprivileged access to a system and then being able to demonstrate that a specific file already exists on that system. Perhaps more interestingly, it’s been demonstrated that free space isn’t the only sidechannel exposed by deduplication – deduplication has an impact on access timing, and can be used to infer the existence of data across virtual machine boundaries.
As I said, this is almost certainly not something that matters in most real world scenarios. But with so much discussion of CPU sidechannels over the past couple of years, it’s interesting to think about what other features also end up leaking information in ways that may not be obvious.
Deduplication is usually done at the block level rather than the file level, but given zfs’s support for variable sized blocks, identical files should be deduplicated even if they’re smaller than the maximum record size.