zfs
tags: configuration, operating_systems, linux, file-systems, zfs

basics

The two major programs you must learn to use are called zfs and zpool. Both are as important to know as mkfs.ext3 and e2fsck when using an ext file system. The syntax sometimes looks a bit inconsistent, because some commands given to these programs expect the name of a zpool, some the name of a dataset, and some need both separated by a slash. [DO NOT COPY AND PASTE, BUT UNDERSTAND AND TRY ONE BY ONE!]{style="display: block; background-color: red; text-align: center; color: white; font-weight: bold;"}

zfs create zpool/dataset ;# create a new dataset in zpool (use their names!)
zfs list ;# lists all zfs pools and datasets and displays disk usage
zfs destroy zpool/dataset ;# delete the dataset again.

zpool list ;# lists all pools and displays disk usage
zpool status ;# displays the health status and device configuration for a given pool

zpool create zpool vdev ;# creates a new pool named "zpool" on top of the given vdev(s)
zpool remove zpool vdev ;# removes the vdev from the pool again (only possible for some vdev types)
zpool destroy zpool ;# destroys the whole zpool (not a dataset like above), so that everything is gone!
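
A minimal round trip combining the two tools might look like this (a sketch; "tank", "documents" and /dev/sdX are placeholder names for a spare disk you can wipe):

zpool create tank /dev/sdX ;# create a pool named "tank" on a single disk
zfs create tank/documents ;# create a dataset inside that pool
zfs list ;# verify that pool and dataset exist
zfs destroy tank/documents ;# remove the dataset again
zpool destroy tank ;# remove the whole pool, including all data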

Enabling zfs during boot (zfsonlinux with systemd)

systemctl enable zfs.target

zfs.target can then be used in service files as a dependency, e.g. useful for libvirt-guests.service
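
A minimal sketch of such a dependency, assuming a drop-in file like /etc/systemd/system/libvirt-guests.service.d/zfs.conf (the path and unit name are assumptions; adjust them to your setup):

[Unit]
Requires=zfs.target
After=zfs.target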

what is what

Some of the names ZFS chose for its features are rather confusing, and there is far too little information on how to use ZFS in the field. The internet is full of examples of how to do things with ZFS, but not of the scenarios in which these features actually make sense.

zfs clone should be called
zfs branch
because a clone gets created from a snapshot (like a commit). Unlike a snapshot, the clone can be modified (written to) and finally promoted to an independent dataset, which can be renamed to replace the original dataset (which is like a 'merge').
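
A sketch of that 'branch and merge' workflow, with zpool/dataset as a placeholder name:

zfs snapshot zpool/dataset@before-change ;# the 'commit'
zfs clone zpool/dataset@before-change zpool/dataset-branch ;# the 'branch': a writable copy
# ...modify files under the clone's mountpoint...
zfs promote zpool/dataset-branch ;# make the clone independent of its origin snapshot
zfs rename zpool/dataset zpool/dataset-old ;# move the original out of the way
zfs rename zpool/dataset-branch zpool/dataset ;# the 'merge': the branch replaces the original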

Network file sharing

samba/cifs/windows

In order to share files among different devices and operating systems, samba has proven to be a solid solution, which is why it is widely used in industrial environments as well. ZFS relies on a relatively new samba feature called usershare. It is not enabled in default installations and must be explicitly set in /etc/samba/smb.conf, like this:

usershare path = /var/lib/samba/usershares
usershare max shares = 100
usershare allow guests = no
usershare owner only = yes

The folder in this configuration snippet must be created by hand and *afaik* its name is hard-coded in zfs. Other sites use a different directory (usershare) in their examples, which I found does not work together with zfs.

groupadd sambashare
mkdir -p /var/lib/samba/usershares
chown root:sambashare /var/lib/samba/usershares
chmod ug+rwx /var/lib/samba/usershares
# finally: enable samba so that it gets started after rebooting
# (unit names differ between distributions; enable whichever exist on your system)
systemctl enable samba
systemctl enable smbd
systemctl enable nmbd
zfs get sharesmb ;# displays the status of smb shares managed by zfs
zfs set sharesmb=on zfs_pool ;# shares all datasets within the pool
zfs set sharesmb=on zfs_pool/dataset ;# to share a single dataset
zfs inherit sharesmb zfs_pool/dataset ;# will configure the dataset so that it inherits its sharesmb preference from the pool

If this does not work, you might get a better error message when trying to create a usershare by hand. This is done with:

net usershare add randomname /path/to/share
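
If the share got created, it should show up and can be removed again like this (randomname is just a placeholder, as above):

net usershare list ;# list all usershares
net usershare info randomname ;# show the definition of that share
net usershare delete randomname ;# remove it again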

NFS - the network file system

NFS generally offers better support for filesystem features than samba, can be tuned to be slightly faster, and is sometimes easier to integrate into fstab.

zfs get sharenfs ;# displays the status of nfs shares managed by zfs
zfs set sharenfs=on zfs_pool ;# shares all datasets within the pool
zfs set sharenfs=ro zfs_pool/dataset ;# to share a single dataset and do it readonly (requires nfs4)
zfs set sharenfs='rw=@10.23.0.0/24' zfs_pool/dataset ;# limit the zfs sharing to a private/vpn network living in this address space
zfs inherit sharenfs zfs_pool/dataset ;# will configure the dataset so that it inherits its sharenfs preference from the pool
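
On the client side such a share can then be mounted via fstab, for example with an entry like this (a sketch; server address and paths are assumptions):

# /etc/fstab on a client machine
10.23.0.1:/zfs_pool/dataset  /mnt/dataset  nfs  defaults,_netdev  0  0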

how2: restart nfs

If you restart the nfs daemon, you must reshare the zfs shares as well, or the exports might no longer represent your configuration. So get used to doing:

systemctl restart nfs-server
zfs unshare -a
zfs share -a

nfs version issue

Be aware that there are NFSv4 and NFSv3 out there, which use different configuration file formats, although the configuration file usually has the same name, namely /etc/exports. ZFS uses the older NFSv3 format *afaik*, so you must take care not to mix up both configuration file formats; stick with NFSv3, because mixing the two can lead to unexpected behaviour.
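
As a Linux example of the classic exports syntax (path and network are hypothetical); whatever format your ZFS writes, additional manual shares should follow the same style:

/srv/data  10.23.0.0/24(rw,no_subtree_check)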

FreeBSD

FreeBSD users can configure a set of export configuration files in their /etc/rc.conf by adding a line

mountd_flags="-e -r /etc/exports /etc/zfs/exports /etc/exports-v4"

for example. But that will cause trouble, because NFSv3 and NFSv4 can easily get mixed up by this configuration style, which makes both versions of NFS active and leads to unresponsive hosts or connection errors. As a rule of thumb: look which configuration file format is used by your ZFS and then append shares in that format accordingly.

SELinux: Shared file systems need the right context to be set

You must tell zfs in which selinux security context it shall mount the dataset or pool. It is much like the fstab option I mentioned here.

zfs set rootcontext=system_u:object_r:public_content_rw_t:s0 zpool/datasetname
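
Whether the context actually got applied can be checked on the mountpoint after mounting (the path is a placeholder):

ls -Zd /zpool/datasetname ;# should show public_content_rw_t in the context column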

FreeBSD: Alignment and Sector size

There seem to be different implementations of ZFS, where some support setting a fixed sector size and others do not. The topic is somewhat complex because it is hardware related: while some manufacturers have made their devices report a sector size of 4096 bytes, others fake 512 bytes for compatibility with some operating systems. But using 512 bytes makes things slower. In short: one needs a trick to make zfs use 4096 bytes, and that is done with gnop and gpart like so:

gnop create -S 4096 /dev/ada0
zpool create -m /mnt/zfs_pool zfs_pool /dev/ada0.nop

This will use the temporary nop device to create the pool with the nop device's sector size. The same can be done to attach a disk to an existing pool:

gnop create -S 4096 /dev/ada1
zpool attach zfs_pool /dev/ada0 /dev/ada1.nop

This will initialize a resilver from /dev/ada0 to /dev/ada1.nop, effectively making ada1.nop a mirror of ada0. As said before: we only wanted to use the nop devices temporarily, so we can do

zpool export zfs_pool
gnop unload
zpool import zfs_pool

which will reimport the pool without that nop device.
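
To verify that the pool really ended up with the larger sector size, the ashift value can be checked afterwards (zfs_pool as above):

zdb -C zfs_pool | grep ashift ;# should report ashift: 12 for 4096-byte sectors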

Backup strategy

zfs can be backed up easily using the zfs send and zfs recv commands. This makes incremental backups possible without having to scan the whole filesystem for changes as rsync would. Basically you must have a snapshot lying around for each backup you want to make. I suggest using the current date in the form yyyy-mm-dd (see: ISO 8601). The steps are:

# plug in your external usb drive with zfs on it
zpool import ;# shows all importable zfs pools
zpool import zpool_external ;# will import the zpool with that descriptive name
# in this example I have called the local zpool "zpool_internal"
DATERFC3339=$(date --rfc-3339=date) ;# will return something like 2014-11-26
zfs snapshot -r zpool_internal@backup-$DATERFC3339 ;# recursively creates a snapshot over all datasets
zfs list -t snapshot ;# lists the names of all snapshots
zfs send -R zpool_internal@backup-$DATERFC3339 | zfs receive -F -d -u -v zpool_external ;# send a snapshot from one zpool to another
# after having done this once you can switch to incremental backups like so:
zfs snapshot -r zpool_internal@backup-$DATERFC3339 ;# same as above
zfs list -t snapshot ;# lists the names of all snapshots
zfs send -R -i zpool_internal@name_of_previous_snapshot zpool_internal@backup-$DATERFC3339 | zfs receive -F -d -u -v zpool_external
# which will effectively only send the changes made since the last backup
zpool export zpool_external ;# prepare the zpool to be unplugged and mounted anywhere else
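
Before the final zpool export it is worth checking that the snapshots really arrived on the external pool:

zfs list -r -t snapshot zpool_external ;# the backup snapshots should show up here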

zfs send | zfs receive

The pv tool can be used to monitor the progress of the transfer. zfs send can also predict the amount of data to be transferred:

zfs send -R -nv -i zpool_internal@name_of_previous_snapshot zpool_internal@backup-$DATERFC3339 ;# display the size without sending anything
zfs send -R -i zpool_internal@name_of_previous_snapshot zpool_internal@backup-$DATERFC3339 | pv | zfs receive -F -d -u -v zpool_external ;# transfer with status

example

zfs send  -L -e zpool/projects@2016-01-22 | pv | zfs recv -F -d -u -v ext_silver ; 
zfs send  -L -e zpool/people@2016-01-22 | pv | zfs recv -F -d -u -v ext_silver ; 
zfs send  -L -e zpool/private@2016-01-22 | pv | zfs recv -F -d -u -v ext_silver ;
zfs send  -L -e zpool/http@2016-01-22 | pv | zfs recv -F -d -u -v ext_silver ;
zfs send  -L -e zpool/mysql@2016-01-22 | pv | zfs recv -F -d -u -v ext_silver
receiving full stream of zpool/projects@2016-01-22 into ext_silver/projects@2016-01-22
 330GiB 1:30:11 [62.5MiB/s] [<=>]
received 330GB stream in 5423 seconds (62.4MB/sec)
receiving full stream of zpool/people@2016-01-22 into ext_silver/people@2016-01-22
 389GiB 1:51:08 [59.8MiB/s] [<=>]
received 389GB stream in 6670 seconds (59.7MB/sec)
receiving full stream of zpool/private@2016-01-22 into ext_silver/private@2016-01-22
 282GiB 1:15:35 [63.9MiB/s] [<=>]
received 283GB stream in 4584 seconds (63.2MB/sec)
receiving full stream of zpool/http@2016-01-22 into ext_silver/http@2016-01-22
28.8GiB 0:06:26 [76.3MiB/s] [<=>]
received 28.8GB stream in 390 seconds (75.6MB/sec)
receiving full stream of zpool/mysql@2016-01-22 into ext_silver/mysql@2016-01-22
 193MiB 0:00:02 [82.5MiB/s] [<=>]
received 193MB stream in 2 seconds (96.6MB/sec)

performance tuning

The performance of zfs can really suck when using the wrong block size for your devices, and a wrong block size can also waste disk space. This setting can only be chosen when creating a pool and cannot be modified afterwards.

zdb ;# the ashift value tells you which block size your zpool uses
blkid -i /dev/sdX ;# at least under linux tells you the block size of your devices*

* some devices, mostly hard disk drives 'lie' about their block sizes in order to preserve compatibility with legacy operating systems

Bugs and workarounds

No nfs shares after reboot

#2883: Currently a bug is preventing zfs from sharing datasets via nfs during boot. It has something to do with the order in which the zfs services are started by systemd, and someone has suggested dividing the existing systemd scripts into smaller portions in order to gain more control over the order in which commands are executed. I have had success working around this with this ugly script:

[Unit]
Description=encryption
After=nfs.target

[Service]
ExecStartPre=/usr/sbin/zfs unshare -a
ExecStart=/usr/sbin/zfs share -a
Type=oneshot

[Install]
WantedBy=multi-user.target
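
Assuming the unit file was saved as /etc/systemd/system/zfs-reshare.service (a name I made up), it is activated the usual way:

systemctl daemon-reload
systemctl enable zfs-reshare.service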

Permanent errors after scrub with silly names

errors: Permanent errors have been detected in the following files:
        zroot/var/log:<0x20>
        zroot/var/log:<0x36>
        zroot/var/log:<0x57>

These hex numbers represent the inode numbers of the broken files. We can locate them with find, but find expects decimal numbers. Bash can convert those:

find /var/log/ -inum $((16#20)) -or -inum $((16#36)) -or -inum $((16#57))

They must be deleted or restored from backup, and the error message will be gone after the following scrub.
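
For example (a sketch, assuming the pool is called zroot as in the message above):

find /var/log/ -inum $((16#20)) -delete ;# or restore the file from a backup instead of deleting it
zpool scrub zroot ;# re-run the scrub
zpool status -v zroot ;# the permanent errors should be gone once the scrub has finished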

Command cheat sheet

To display the size currently occupied without snapshots (= refer); the advanced variant requires bc:

The simple way

zfs list -o name,logicalused,usedbysnapshots -s logicalused 

Or more advanced

echo -e "scale=4\n(0"$(printf "+%d" $(zfs list -s refer -o refer -H -p))\)/1024^3"\n"  | bc
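
An alternative without bc is to sum the refer column with awk (a sketch that prints the total in GiB):

zfs list -H -p -o refer | awk '{ sum += $1 } END { printf "%.4f GiB\n", sum / 1024^3 }'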

Pool management

In a mirror configuration all disk drives contain the same data. We can add new devices to the mirror configuration with:

zpool attach [pool] [device] [newdevice]

Note: Do not confuse zpool attach with zpool add, because the latter is a one-way ticket: zpool add also adds the disk to the pool, but extends the pool and its storage by a new top-level vdev that cannot simply be removed again. We can remove individual disks from a mirror configuration with:

zpool detach [pool] [device]

and we can now also separate a device from the pool with all its data and make that device the first in a new pool:

zpool split [pool] [newpool] [device]
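
With placeholder pool and device names this could look like:

zpool attach tank /dev/sda /dev/sdb ;# mirror the single disk /dev/sda onto /dev/sdb
zpool detach tank /dev/sdb ;# undo that again: shrink the mirror back to a single disk
# ...or instead of detaching, split the mirror into two independent pools:
zpool split tank tank_copy /dev/sdb ;# /dev/sdb becomes the first device of the new pool "tank_copy"
zpool import tank_copy ;# the new pool is left exported and must be imported before use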

To be tried

Just as a side note for me

  • does wipefs do the same thing as zfs labelclear?

ashift and sector sizes

When creating a ZFS pool on a device with zpool create, one can configure the physical sector size of a hard disk with -o ashift. A wrong value for ashift hurts the disk's write performance, but it sometimes happens that ZFS is not able to determine the value for ashift automatically, especially if you are not working with a device directly but with an encryption layer under /dev/mapper/. How do you determine the physical sector size of your devices?

lsblk -o +phy-sec

This displays the physical sector size, which corresponds to 2 ^ ashift, so we can use the basic calculator, e.g. echo 'l(4096) / l(2)' | bc -l, to determine the ashift value, or read it from the following table:

block size   ashift value
     16384   14
      8192   13 *
      4096   12 *
      2048   11
      1024   10
       512    9 *
       256    8
       128    7
        64    6
        32    5
        16    4
         8    3
         4    2
         2    1

…okay, we had some fun here, but I have marked the most common sizes with an asterisk. We will then create the zpool with something like

zpool create -o ashift=12 -O compression=lz4 -O mountpoint=/mnt/backup01 backup01 /dev/mapper/backup01