ZFS Reference & Cheat Sheet
A working ZFS cheat sheet covering pools, datasets, snapshots, send/recv, compression and the bits you forget at 2am. Started on OpenSolaris around 2008, kept current with OpenZFS on FreeBSD and Proxmox.
!! NOTE
This was originally written for OpenSolaris back when I picked up ZFS around 2007-2008 and have been updating it ever since. Most of these commands work across Solaris, FreeBSD and OpenZFS on Linux but I’ve called out version-specific bits where it matters.
A quick history
ZFS landed in Solaris 10 in June 2006 and the moment I read about copy-on-write, end-to-end checksums and pooled storage I knew RAID5 on a hardware controller was on borrowed time. Sun open-sourced it shortly after and OpenSolaris was where most of us tinkered with it.
Then came the unfortunate Oracle bit in 2010. Pool version 28 was the last open release. The community forked it as OpenZFS which is what powers ZFS on FreeBSD, Linux, illumos and macOS today. Modern OpenZFS uses feature flags instead of monotonic version numbers so you can mix and match what your platform supports.
A rough timeline of the bits you’ll actually care about:
| Year | Release | Notable |
|---|---|---|
| 2006 | Solaris 10 6/06 | ZFS arrives |
| 2008 | FreeBSD 7.0 | First non-Solaris port |
| 2009 | Pool v17 | RAIDZ3 (triple parity) |
| 2010 | Pool v28 | Last open Solaris version |
| 2013 | ZoL 0.6.1 | First stable ZFS on Linux, lz4 compression |
| 2019 | OpenZFS 0.8 | Native encryption, TRIM, special vdevs, sequential resilver |
| 2020 | OpenZFS 2.0 | Linux and FreeBSD on one codebase, zstd compression |
| 2021 | OpenZFS 2.1 | dRAID |
| 2023 | OpenZFS 2.2 | Block cloning, BLAKE3 checksums |
| 2024 | OpenZFS 2.3 | RAIDZ expansion (finally!), Direct IO |
Most of what follows works on anything from pool v15 onwards. Where it doesn’t, I’ve flagged the minimum version.
Pool Topology
Before you touch a single command, pick your topology. You can’t change a pool’s redundancy level once it’s created (well, you couldn’t until RAIDZ expansion in 2.3 which is still a fairly limited operation).
| Type | Min Disks | Parity | Notes |
|---|---|---|---|
| `stripe` | 1 | 0 | No redundancy, lose a disk lose the pool |
| `mirror` | 2 | n-1 | Best for IOPS, n-way mirrors supported |
| `raidz1` | 3 | 1 | Tolerates 1 disk loss, like RAID5 |
| `raidz2` | 4 | 2 | Tolerates 2 disk loss, like RAID6, the sensible default |
| `raidz3` | 5 | 3 | Tolerates 3 disk loss (pool v17+) |
| `draid` | varies | varies | Distributed parity (OpenZFS 2.1+) for very large arrays |
A few rules of thumb that have served me well:
- For anything bigger than 4TB drives use `raidz2`. Resilver times on big disks are scary and `raidz1` leaves you exposed during that window.
- If you care about random IOPS (VM storage, databases) use mirrors. A pool of mirrored pairs gives you the IOPS of N drives versus 1 for a raidz vdev.
- You can stripe across vdevs, but never stripe across raidz vdevs of different widths.
- `dRAID` is for the folks running 30+ disks. If you're not, stick with raidz.
Device Naming
This bit changed a lot over the years. On Solaris/OpenSolaris we had the lovely `c0t0d0` controller-target-disk style. FreeBSD uses `ada0`, `da0`. Linux has `/dev/sdX`, which is awful because the letters can shuffle on reboot.
Always reference disks by their stable identifier. On Linux that’s /dev/disk/by-id/ (the WWN or serial number ones). On FreeBSD use the gptid or diskid. Saves you a lot of pain when a controller renumbers things.
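On Linux, a quick way to see the stable names (the exact IDs will of course differ on your hardware):

```sh
ls -l /dev/disk/by-id/ | grep -v part    # whole-disk IDs only, skip partition links
```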
Pool Creation
The basics, with examples mirroring my own kit (gandalf is one of my Proxmox boxes, zeus is the FreeBSD machine).
A simple stripe (don’t do this for anything you care about):
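```sh
# single disk, no redundancy -- pool name and device path are placeholders
zpool create -o ashift=12 tank /dev/disk/by-id/ata-DISK1
```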
A two-way mirror, which is what I run for VM storage:
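```sh
# two-way mirror (swap in your own by-id paths)
zpool create -o ashift=12 tank mirror \
    /dev/disk/by-id/ata-DISK1 /dev/disk/by-id/ata-DISK2
```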
A six-disk raidz2 for bulk storage (movies, photos, the ~2PB of Aero/Astro test data I’ve been smashing through with Smash):
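```sh
# six-disk raidz2 (placeholder device IDs)
zpool create -o ashift=12 tank raidz2 \
    /dev/disk/by-id/ata-DISK1 /dev/disk/by-id/ata-DISK2 \
    /dev/disk/by-id/ata-DISK3 /dev/disk/by-id/ata-DISK4 \
    /dev/disk/by-id/ata-DISK5 /dev/disk/by-id/ata-DISK6
```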
Striped mirrors (three vdevs of two-way mirrors), great for VM storage with 6 SSDs:
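```sh
# three two-way mirror vdevs striped together (placeholder device IDs)
zpool create -o ashift=12 tank \
    mirror /dev/disk/by-id/nvme-SSD1 /dev/disk/by-id/nvme-SSD2 \
    mirror /dev/disk/by-id/nvme-SSD3 /dev/disk/by-id/nvme-SSD4 \
    mirror /dev/disk/by-id/nvme-SSD5 /dev/disk/by-id/nvme-SSD6
```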
About ashift
ashift=12 means 4K sectors (2^12 = 4096 bytes). Almost every modern drive is 4K native or 4K-emulated, so ashift=12 is what you want. NVMe is sometimes happier on ashift=13 (8K). You cannot change ashift after pool creation, so get it right the first time.
If you’re not sure, peek at the drive:
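```sh
# physical sector size on Linux; 4096 means ashift=12 (/dev/sda is a placeholder)
lsblk -o NAME,PHY-SEC,LOG-SEC
smartctl -i /dev/sda | grep -i 'sector size'
```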
Useful create-time properties
Set these at creation rather than fiddling later:
- `compression=zstd` (OpenZFS 2.0+), or `lz4` for older. Always on, it's faster than no compression for most workloads.
- `atime=off` stops every read updating access times, which is pointless write amplification.
- `xattr=sa` stores extended attributes inline (Linux), much faster.
- `acltype=posixacl` enables POSIX ACLs (Linux).
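Pulled together at create time it looks roughly like this (`-O` sets filesystem properties on the root dataset; pool and device names are placeholders):

```sh
zpool create -o ashift=12 \
    -O compression=zstd -O atime=off -O xattr=sa -O acltype=posixacl \
    tank mirror /dev/disk/by-id/ata-DISK1 /dev/disk/by-id/ata-DISK2
```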
Pool Status & Inspection
The commands you’ll run a hundred times.
Show all pools at a glance:
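```sh
zpool list
```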
The full status with vdev tree, errors and resilver progress:
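```sh
zpool status -v          # add a pool name to limit it to one pool
```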
I/O statistics, refreshed every 2 seconds:
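```sh
zpool iostat -v 2
```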
Pool history (every command run against the pool, ever, kept on-pool):
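```sh
zpool history tank       # 'tank' is a placeholder pool name
```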
That last one has saved my bacon more than once when trying to remember what someone (me) did to a pool six months ago.
Datasets
Datasets are the ZFS equivalent of filesystems but they’re cheap to create, can be nested and inherit properties from their parent. Make liberal use of them.
Create a dataset:
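```sh
# dataset names are placeholders
zfs create tank/media
zfs create tank/media/photos     # nested child, inherits from tank/media
```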
Set a property (children inherit unless overridden):
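```sh
zfs set compression=zstd tank/media
```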
recordsize matters for performance. Big media files love 1M. Databases want 8K or 16K to match their page size. The default of 128K is fine for general purpose stuff.
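For example (dataset names are placeholders):

```sh
zfs set recordsize=1M tank/media     # big sequential files
zfs set recordsize=16K tank/db       # match the database page size
```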
Inspect properties (the ones that aren’t default):
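```sh
zfs get all -s local tank/media      # only properties set locally on this dataset
```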
List all datasets with size info:
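```sh
zfs list -o name,used,avail,refer,mountpoint
```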
Volumes (zvols) are block devices backed by ZFS, used by Proxmox for VM disks:
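```sh
# 32G zvol with 16K volblocksize (name and sizes are placeholders)
zfs create -V 32G -b 16K tank/vm-100-disk-0
```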
The -b is the volblocksize, similar concept to recordsize but for zvols. Set it to match the guest filesystem block size for best performance.
Snapshots
Snapshots are essentially free (copy-on-write means they cost nothing until blocks change) and they’re the killer feature.
Take a snapshot:
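```sh
zfs snapshot tank/media@before-upgrade     # dataset and snapshot names are placeholders
```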
Recursive snapshot of a dataset and all children:
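```sh
zfs snapshot -r tank@nightly               # -r includes every child dataset
```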
List snapshots for a dataset:
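```sh
zfs list -t snapshot -r tank/media
```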
Browse the contents of a snapshot (it’s right there at .zfs/snapshot/<name> in the dataset root):
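```sh
ls /tank/media/.zfs/snapshot/before-upgrade/    # path depends on your mountpoint
```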
If .zfs isn’t visible, set snapdir=visible:
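```sh
zfs set snapdir=visible tank/media
```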
Roll back to a snapshot (loses all changes since):
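```sh
zfs rollback tank/media@before-upgrade
```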
If there are newer snapshots between now and the target, you’ll need -r to discard them.
Clone a snapshot to a new writeable dataset (useful for spinning up a VM disk from a known-good template):
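```sh
# clone a known-good template into a fresh dataset (names are placeholders)
zfs clone tank/templates@golden tank/vm-101-disk-0
```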
Promote the clone if you want to delete the original:
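```sh
zfs promote tank/vm-101-disk-0
```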
Delete a snapshot:
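```sh
zfs destroy tank/media@before-upgrade
# ranges work too: zfs destroy tank/media@snap1%snap5
```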
For automatic snapshots I’ve used zfs-auto-snapshot on Linux for years and zfsnap on FreeBSD. On Proxmox I let it manage VM/CT snapshots itself and run a cron for dataset-level ones.
Send and Receive (the magic bit)
zfs send and zfs receive is how you get data off one box and onto another, byte-for-byte, with full ZFS metadata. This is replication, backup and migration all in one tool.
Send a snapshot to a file:
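```sh
zfs send tank/media@before-upgrade > /backup/media.zfs    # names and paths are placeholders
```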
Restore from that file:
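```sh
zfs receive tank/media-restored < /backup/media.zfs
```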
Send to another box over SSH (this is the bread and butter):
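```sh
# 'zeus' and the dataset names are placeholders
zfs send tank/media@before-upgrade | ssh zeus zfs receive backup/media
```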
Incremental send (only the changes between two snapshots):
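```sh
zfs send -i @monday tank/media@tuesday | ssh zeus zfs receive backup/media
```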
Recursive incremental of a whole tree (use -R):
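```sh
# -F lets the receive roll back to the matching snapshot before applying the stream
zfs send -R -i @monday tank@tuesday | ssh zeus zfs receive -F backup/tank
```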
Some flags worth knowing:
- `-c` send compressed-on-disk blocks as-is, no decompress/recompress (OpenZFS 0.8+)
- `-w` raw send, including encrypted blocks without decrypting (OpenZFS 0.8+)
- `-L` use large blocks
- `-e` use embedded data blocks
- `-p` include properties
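Combined, an efficient replication run looks something like this (host and dataset names are placeholders):

```sh
zfs send -c -L -i @monday tank/media@tuesday | ssh zeus zfs receive backup/media
```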
For ongoing replication I use syncoid by Jim Salter which wraps all of this up nicely. It even handles the resume tokens if a transfer dies halfway. Highly recommended.
Compression
Always on. Modern CPUs decompress faster than disks can read uncompressed data, and the algorithm is per-block so incompressible data just stores raw.
The options:
| Algorithm | Ratio | Speed | When |
|---|---|---|---|
| `off` | 1.00x | fastest | almost never the right answer |
| `lz4` | ~1.5x | very fast | the default since 2013, fine for everything |
| `zstd` | ~2x | fast | OpenZFS 2.0+, better ratio than lz4 and still fast |
| `zstd-N` | up to 3x | tunable 1-19 | higher = slower, use for archive datasets |
| `gzip-N` | ~2x | slow | mostly historical, prefer zstd |
| `zle` | minimal | very fast | only collapses zero runs |
Set it (recursively if you want):
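```sh
zfs set compression=zstd tank            # children inherit it
zfs set compression=zstd-9 tank/archive  # heavier level for cold data
```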
Check the compression ratio you’re actually getting:
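```sh
zfs get compressratio tank/media
zfs list -o name,used,compressratio      # the whole tree at a glance
```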
Note that compression only applies to newly-written data. Existing data stays at whatever compression was active when it was written. Rewrite (e.g. via send/receive into a new dataset) if you change the algorithm and want the existing data recompressed.
Deduplication
Don’t.
Alright, that’s a bit harsh. ZFS dedupe is real but the rule of thumb is 5GB of RAM per TB of deduped data because the dedupe table (DDT) needs to live in ARC for performance. If your DDT spills to disk, performance falls off a cliff and recovering is painful.
If you genuinely have a high-dedup-ratio workload (VM disk images of identical OSes, mostly) then it’s worth considering. Otherwise use compression and call it a day.
OpenZFS 2.3 brings Fast Dedup, which fixes most of the historical pain. Worth revisiting once it's stable on your platform.
If you must:
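```sh
zfs set dedup=on tank/vm-images          # dataset name is a placeholder
```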
Check your current DDT size:
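```sh
zpool status -D tank                     # DDT histogram and in-core size
```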
Encryption (OpenZFS 0.8+)
Native encryption arrived in 0.8 and it’s brilliant because it’s per-dataset, not per-pool. Different keys for different datasets, raw-send to an untrusted backup target without decrypting, etc.
Create an encrypted dataset:
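```sh
zfs create -o encryption=on -o keyformat=passphrase tank/secure
```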
It will prompt for a passphrase. To use a key file instead:
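```sh
# key file path is a placeholder; keyformat=raw wants exactly 32 random bytes
dd if=/dev/urandom of=/root/tank-secure.key bs=32 count=1
zfs create -o encryption=on -o keyformat=raw \
    -o keylocation=file:///root/tank-secure.key tank/secure
```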
Load the key after a reboot:
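```sh
zfs load-key tank/secure
zfs mount tank/secure
```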
Or load all keys at once:
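```sh
zfs load-key -a
zfs mount -a
```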
Raw send keeps everything encrypted in transit and at rest on the destination:
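```sh
zfs send -w tank/secure@nightly | ssh zeus zfs receive backup/secure
```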
ARC, L2ARC, ZIL and SLOG
Confusing acronyms, simple ideas.
- ARC is the read cache in RAM. ZFS will use most of your free RAM for it. This is normal and good.
- L2ARC is a second-level read cache on a fast SSD. Useful only if you have specific workloads with hot data that doesn’t fit in RAM.
- ZIL (ZFS Intent Log) is where synchronous writes get logged. Every pool has one, by default it’s part of the pool itself.
- SLOG (Separate intent LOG) is a dedicated fast device (NVMe, Optane) for the ZIL. Only helps synchronous workloads (NFS, databases). For async writes it does nothing.
Limit ARC size on Linux (it’s hungry by default):
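```sh
# 8 GiB cap, applied at module load
echo "options zfs zfs_arc_max=8589934592" >> /etc/modprobe.d/zfs.conf
```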
That sets it to 8GB. Reboot for it to take effect, or set it live via /sys/module/zfs/parameters/zfs_arc_max.
Add an L2ARC to a pool:
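```sh
zpool add tank cache /dev/disk/by-id/nvme-SSD1     # placeholder device ID
```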
Add a SLOG (mirror it, because losing your SLOG mid-write loses synchronous data):
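```sh
zpool add tank log mirror \
    /dev/disk/by-id/nvme-SSD1 /dev/disk/by-id/nvme-SSD2
```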
Check ARC stats (Linux):
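```sh
arc_summary                              # exact command name varies slightly by distro
cat /proc/spl/kstat/zfs/arcstats         # the raw counters
```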
Special vdevs (OpenZFS 0.8+)
Special vdevs hold metadata (and optionally small blocks) on faster storage. Massive performance win for metadata-heavy workloads (lots of small files, snapshot listings, etc).
Heads up
A special vdev becomes part of the pool. If it dies, the pool dies. Mirror them. Always.
Add a mirrored special vdev:
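```sh
zpool add tank special mirror \
    /dev/disk/by-id/nvme-SSD1 /dev/disk/by-id/nvme-SSD2
```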
Steer small blocks to the special vdev as well:
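```sh
zfs set special_small_blocks=64K tank
```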
That sends any block smaller than 64K to the special vdev. Set per-dataset for finer control.
Maintenance
Scrub
Reads all data, verifies checksums, repairs from parity if anything’s bad. Run monthly on consumer drives, quarterly on enterprise.
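Kick one off (`tank` is a placeholder pool name):

```sh
zpool scrub tank
```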
Watch progress:
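```sh
zpool status tank        # progress shows in the scan: line
```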
Stop a scrub:
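```sh
zpool scrub -s tank      # -p pauses instead of cancelling
```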
On Linux, the zfs-zed and zfsutils-linux packages typically install a cron entry that scrubs on the second Sunday of each month. Check /etc/cron.d/zfsutils-linux.
Resilver
Happens automatically when you replace a failed disk. Sequential resilver (0.8+) is dramatically faster than the old block-pointer-tree walk for full-disk replacements.
TRIM (OpenZFS 0.8+)
For SSDs. Tells the drive about free blocks so it can do its garbage collection properly.
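A one-off manual pass (pool name is a placeholder):

```sh
zpool trim tank
```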
Enable autotrim if you’d rather not remember:
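```sh
zpool set autotrim=on tank
```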
Replacing a failed disk
The drill, more or less. Replace identifiers with your actual stable IDs.
Find the failed disk:
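```sh
zpool status -x          # only lists pools with problems
```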
Offline it (if it’s not already faulted automatically):
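```sh
zpool offline tank /dev/disk/by-id/ata-OLD-DISK     # placeholder IDs throughout
```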
Physically swap the disk. Then replace it in the pool:
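```sh
zpool replace tank /dev/disk/by-id/ata-OLD-DISK /dev/disk/by-id/ata-NEW-DISK
```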
Watch the resilver:
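```sh
zpool status -v tank     # or: watch -n 10 zpool status tank
```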
If the new disk has the same name (because the slot got reused), the second argument is optional.
Properties Cheat Sheet
The ones I tweak most often.
| Property | Values | Notes |
|---|---|---|
| `compression` | lz4, zstd, zstd-N, off | Always on, default to zstd on 2.0+ |
| `atime` | on, off | Off unless you actually need access times |
| `recordsize` | 512B to 16M | Match workload, 1M for media, 16K for DBs |
| `xattr` | on, sa | sa on Linux, much faster |
| `mountpoint` | path or none | none for parents that just hold children |
| `quota` | size or none | Hard cap including children |
| `reservation` | size or none | Guaranteed space |
| `refquota` | size | Quota excluding snapshots/clones |
| `sync` | standard, always, disabled | Don't disable in production |
| `dedup` | off, on, verify | Don't, see above |
| `copies` | 1, 2, 3 | Extra copies of your data, useful on single-disk |
| `checksum` | on, sha256, blake3 | blake3 on 2.2+ is fast and strong |
Useful one-liners
What’s actually using my space, sorted:
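```sh
zfs list -o space -S used | head -20
```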
How much space would I free if I deleted snapshots:
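```sh
zfs list -o name,usedbysnapshots -S usedbysnapshots
```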
Find datasets without compression set:
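```sh
zfs get -t filesystem -o name,value compression | grep -w off
```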
Total of all snapshots for a dataset:
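```sh
# 'tank/media' is a placeholder dataset
zfs list -Hp -t snapshot -r -o used tank/media | awk '{s+=$1} END {printf "%.1f GiB\n", s/2^30}'
```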
Pool fragmentation (high frag means writes will get slower):
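```sh
zpool list -o name,size,capacity,fragmentation
```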
Common gotchas
A handful of things that have bitten me over the years.
- Don’t fill pools past 80%. ZFS gets dramatically slower as it approaches full because the allocator has to work harder. 80% is the soft limit, 90% is the panic limit.
- Mirror your SLOG and special vdevs. Losing them loses the pool.
- `ashift` is forever. Get it right at create time, you can't change it.
- `zpool import -f` if a pool was last used elsewhere. Don't do this if the original system still has it imported, you'll corrupt the pool.
- Send streams aren't backups by themselves. Always keep two snapshots either side of an incremental, in case the stream is corrupt.
- Watch your free space on send/recv targets. A receive that runs out of space leaves a partial dataset that needs cleaning up.
- `atime=off` everywhere. I haven't found a workload where access times were worth the write amplification.
References
- OpenZFS Wiki - the canonical docs
- OpenZFS Feature Flags - what’s supported where
- Jim Salter’s ZFS articles on Ars Technica - the best explainers I’ve read
- `sanoid`/`syncoid` - automated snapshots and replication
- `zfs-auto-snapshot` - simpler automatic snapshots
- Aaron Toponce's ZFS Administration series - dated but still excellent fundamentals
- Proxmox ZFS Documentation - if you’re on Proxmox like me
