basics #
The two major programs you must learn to use are called zfs and zpool. Both of them are as important to know as mkfs.ext3 and e2fsck when using an ext file system. The syntax sometimes looks a bit inconsistent, because some commands given to these programs expect the name of a zpool, some the name of a dataset and some need both separated with a slash. DO NOT COPY AND PASTE, BUT UNDERSTAND AND TRY ONE BY ONE!
zfs create zpool/dataset ;# create a new dataset in zpool (use their names!)
zfs list ;# lists all zfs pools and datasets and displays disk usage
zfs destroy zpool/dataset ;# delete the dataset again.
zpool list ;# lists all pools and displays disk usage
zpool status ;# displays the health status and device configuration for a given pool
zpool create zpool vdev ;# creates a new pool named "zpool" from the given vdev (disk, partition, file, ...)
zpool remove zpool vdev ;# remove the vdev from the pool again (only possible for certain vdev types / recent zfs versions)
zpool destroy zpool ;# destroys the zpool (not a dataset like above), so that everything is gone!
Enabling zfs during boot (zfsonlinux with systemd) #
systemctl enable zfs.target
zfs.target can then be used in service files as a dependency, e.g. useful for libvirt-guests.service
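A minimal sketch of such a dependency, assuming libvirt-guests.service is the unit that should wait for ZFS; it could be added as a drop-in via systemctl edit libvirt-guests.service:
[Unit]
After=zfs.target
Requires=zfs.target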
what is what #
For a better understanding of the features ZFS offers I find some of the names rather confusing. Also there is way too little information on how to use ZFS in the field. The internet is full of examples on how to do stuff with ZFS, but not in which scenarios these features actually make sense.
- zfs clone should rather be called zfs branch, because a clone gets created from a snapshot (like a commit). Unlike a snapshot the clone can be modified (written to) and finally converted back to a regular dataset, which can then be renamed to replace the original dataset (which is like a 'merge'), as sketched below.
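A sketch of that 'branching' workflow with hypothetical dataset names (zpool/www is assumed to exist):
zfs snapshot zpool/www@before-experiment ;# the 'commit'
zfs clone zpool/www@before-experiment zpool/www-testing ;# the writable 'branch'
# ...modify files below the clone's mountpoint...
zfs promote zpool/www-testing ;# make the clone independent of its origin snapshot
zfs rename zpool/www zpool/www-old ;# move the original out of the way
zfs rename zpool/www-testing zpool/www ;# the 'merge': the clone replaces the original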
Network file sharing #
samba/cifs/windows #
In order to share files among different devices and operating systems, samba has proven to be a solid solution, so it is widely used in industrial environments as well. ZFS relies on a relatively new feature of samba, called usershare. It is not enabled in default installations and must be explicitly set in /etc/samba/smb.conf, like this:
usershare path = /var/lib/samba/usershares
usershare max shares = 100
usershare allow guests = no
usershare owner only = yes
The folder in this configuration snippet must be created by hand and *afaik* its name is hard-coded in zfs. Other sites use usershare as an example, which I found does not work together with zfs.
groupadd sambashare
mkdir -p /var/lib/samba/usershares
chown root:sambashare /var/lib/samba/usershares
chmod ug+rwx /var/lib/samba/usershares
# finally: activate samba so that it gets started after rebooting
systemctl enable samba
systemctl enable smbd
systemctl enable nmbd
zfs get sharesmb ;# displays the status of smb shares managed by zfs
zfs set sharesmb=on zfs_pool ;# shares all datasets within the pool
zfs set sharesmb=on zfs_pool/dataset ;# to share a single dataset
zfs inherit sharesmb zfs_pool/dataset ;# will configure the dataset so that it inherits its sharesmb preference from the pool
If this does not work you might get a better error message when trying to create a user share by hand. This is done with:
net usershare add randomname /path/to/share
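To check what actually got shared ('randomname' is the hypothetical share from above):
net usershare list ;# lists all existing usershares
net usershare info randomname ;# shows path, acl and guest settings of a single share
smbclient -L localhost -N ;# asks the running samba server which shares it announces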
NFS - the network file system #
NFS generally offers better support for filesystem features than samba, can be tuned to be slightly faster and is sometimes easier to integrate into fstab.
zfs get sharenfs ;# displays the status of nfs shares managed by zfs
zfs set sharenfs=on zfs_pool ;# shares all datasets within the pool
zfs set sharenfs=ro zfs_pool/dataset ;# to share a single dataset and do it readonly (requires nfs4)
zfs set sharenfs='rw=@10.23.0.0/24' zfs_pool/dataset ;# limit the zfs sharing to a private/vpn network living in this address space
zfs inherit sharenfs zfs_pool/dataset ;# will configure the dataset so that it inherits its sharenfs preference from the pool
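To verify that the exports are actually active (run on the NFS server; localhost is just a placeholder):
showmount -e localhost ;# lists the currently exported file systems
exportfs -v ;# shows the active exports including their options (Linux)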
how2: restart nfs #
If you restart the nfs daemon you must reshare the zfs shares as well, or the exports might no longer represent your configuration. So get used to doing:
systemctl restart nfs-server
zfs unshare -a
zfs share -a
nfs version issue #
Be aware that there are NFS4 and NFS3 out there, which use different configuration file formats, although the configuration file usually has the same name, namely /etc/exports. ZFS uses the older nfs3 format *afaik*, so you must take care yourself not to mix up both configuration file formats, but stick with nfs3, because mixing the two can lead to unexpected behaviour.
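For orientation, a classic NFS3-style line in /etc/exports on Linux looks like this (path and network are placeholders):
/mnt/zfs_pool/dataset 10.23.0.0/24(rw,sync,no_subtree_check)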
FreeBSD #
FreeBSD users can configure a set of configuration files in their /etc/rc.conf by adding a line
mountd_flags="-e -r /etc/exports /etc/zfs/exports /etc/exports-v4"
for example. But that will cause trouble, because NFS3 and NFS4 can easily get mixed up by this configuration style, which makes both versions of NFS active and leads to unresponsive hosts or connection errors. As a rule of thumb: look at which configuration file format is used by your ZFS and then append shares in that format accordingly.
SELinux: Shared file systems need the right context to be set #
You must tell zfs in which selinux security context it shall mount the dataset or pool. It is much like the fstab option I mentioned elsewhere.
zfs set rootcontext=system_u:object_r:public_content_rw_t:s0 zpool/datasetname
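To check whether the property and the resulting mount look sane (the dataset name and mountpoint are placeholders):
zfs get rootcontext zpool/datasetname ;# shows the configured selinux context property
ls -Zd /path/to/mountpoint ;# shows the context the mounted dataset actually got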
FreeBSD: Alignment and Sector size #
There seem to be different implementations of ZFS where some support the creation of a pool with a fixed sector size and others do not. The topic is somewhat complex, because it is hardware related: while some manufacturers have made their devices report a sector size of 4096 bytes, others fake 512 bytes for compatibility with some operating systems. But using 512 bytes makes things slower. In short: one needs a trick to make zfs use 4096 bytes, and that is done with gnop and gpart like so:
gnop create -S 4096 /dev/ada0
zpool create -m /mnt/zfs_pool zfs_pool /dev/ada0.nop
This will use the temporary nop device to create the pool with the nop device's sector size. The same trick can be used to attach a disk to an existing pool:
gnop create -S 4096 /dev/ada1
zpool attach zfs_pool /dev/ada0 /dev/ada1.nop
This will initiate a resilver from /dev/ada0 to /dev/ada1.nop, effectively making ada1.nop a mirror of ada0. As said before: we only wanted to use the nop device temporarily, so we can do
zpool export zfs_pool
gnop unload
zpool import zfs_pool
which will reimport the pool without that nop device.
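Afterwards one can verify that the pool really uses the larger sector size (zfs_pool is the name from above):
zdb -C zfs_pool | grep ashift ;# ashift: 12 corresponds to 2^12 = 4096 byte sectors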
Backup strategy #
zfs can easily be backed up using the zfs send and zfs recv commands. It makes incremental backups possible without scanning for changes in real time as rsync would. Basically you must have a snapshot lying around for each backup you want to make. I suggest using the current date in the form of yyyy-mm-dd (see: ISO 8601). The steps are:
# plug in your external usb drive with zfs on it
zpool import ;# show all importable zfs pools
zpool import zpool_external ;# will import the zpool with that descriptive name
# in this example I have called the local zpool "zpool_internal"
DATERFC3339=$(date --rfc-3339=date) ;# will return something like 2014-11-26
zfs snapshot -r zpool_internal@backup-$DATERFC3339 ;# recursively creates a snapshot over all datasets
zfs list -t snapshot ;# lists the names of all snapshots
zfs send -R zpool_internal@backup-$DATERFC3339 | zfs receive -F -d -u -v zpool_external ;# send a snapshot from one zpool to another
# after having done this once you can switch to incremental backups like so:
zfs snapshot -r zpool_internal@backup-$DATERFC3339 ;# same as above
zfs list -t snapshot ;# lists the names of all snapshots
zfs send -R -i zpool_internal@name_of_previous_snapshot zpool_internal@backup-$DATERFC3339 | zfs receive -F -d -u -v zpool_external
# which will effectively only send changes made since the last backup
zpool export zpool_external ;# prepare the zpool to be unplugged and mounted anywhere else
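Over time old backup snapshots pile up. A small sketch for removing one that is no longer needed (the snapshot name below is just an example):
zfs list -t snapshot -o name ;# double check what exists before destroying anything
zfs destroy -r zpool_internal@backup-2014-10-01 ;# recursively removes that snapshot from all datasets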
zfs send | zfs receive #
The pv tool can be used to monitor the progress of the transfer. Also zfs send can predict the amount of data to be transferred:
zfs send -R -nv -i zpool_internal@name_of_previous_snapshot zpool_internal@$DATERFC3339 ;# display size
zfs send -R -i zpool_internal@name_of_previous_snapshot zpool_internal@$DATERFC3339 | pv | zfs receive -F -d -u -v zpool_external ;# transfer with status
Example:
zfs send -L -e zpool/projects@2016-01-22 | pv | zfs recv -F -d -u -v ext_silver ;
zfs send -L -e zpool/people@2016-01-22 | pv | zfs recv -F -d -u -v ext_silver ;
zfs send -L -e zpool/private@2016-01-22 | pv | zfs recv -F -d -u -v ext_silver ;
zfs send -L -e zpool/http@2016-01-22 | pv | zfs recv -F -d -u -v ext_silver ;
zfs send -L -e zpool/mysql@2016-01-22 | pv | zfs recv -F -d -u -v ext_silver
receiving full stream of zpool/projects@2016-01-22 into ext_silver/projects@2016-01-22
330GiB 1:30:11 [62.5MiB/s] [ <=> ]
received 330GB stream in 5423 seconds (62.4MB/sec)
receiving full stream of zpool/people@2016-01-22 into ext_silver/people@2016-01-22
389GiB 1:51:08 [59.8MiB/s] [ <=> ]
received 389GB stream in 6670 seconds (59.7MB/sec)
receiving full stream of zpool/private@2016-01-22 into ext_silver/private@2016-01-22
282GiB 1:15:35 [63.9MiB/s] [ <=> ]
received 283GB stream in 4584 seconds (63.2MB/sec)
receiving full stream of zpool/http@2016-01-22 into ext_silver/http@2016-01-22
28.8GiB 0:06:26 [76.3MiB/s] [ <=> ]
received 28.8GB stream in 390 seconds (75.6MB/sec)
receiving full stream of zpool/mysql@2016-01-22 into ext_silver/mysql@2016-01-22
193MiB 0:00:02 [82.5MiB/s] [ <=> ]
received 193MB stream in 2 seconds (96.6MB/sec)
performance tuning #
The performance of zfs can really suck when using the wrong block size for your devices, and a wrong block size can also waste disk space. This setting can only be chosen when creating a pool and cannot be modified afterwards.
zdb ;# the ashift value in the output tells you which block size your zpool uses
blkid -i /dev/sdX ;# at least under linux tells you the block size of your devices*
* some devices, mostly hard disk drives, 'lie' about their block sizes in order to preserve compatibility with legacy operating systems
Bugs and workarounds #
No nfs shares after reboot #
#2883: Currently a bug is preventing zfs from sharing datasets via nfs during boot. It has something to do with the order in which the zfs services are called by systemd, and someone has suggested dividing the existing systemd scripts into smaller portions in order to gain more control over the order in which commands are executed. I have had success in working around this with this ugly script:
[Unit]
Description=reshare zfs exports after the nfs server has started
After=nfs.target
[Service]
ExecStartPre=/usr/sbin/zfs unshare -a
ExecStart=/usr/sbin/zfs share -a
Type=oneshot
[Install]
WantedBy=multi-user.target
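Assuming the unit above is saved as /etc/systemd/system/zfs-reshare.service (the file name is arbitrary), it is activated like any other unit:
systemctl daemon-reload
systemctl enable zfs-reshare.service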
Permanent errors after scrub with silly names #
errors: Permanent errors have been detected in the following files:
zroot/var/log:<0x20>
zroot/var/log:<0x36>
zroot/var/log:<0x57>
These hex numbers represent the inode numbers of the broken files. We can locate them with find, but find expects decimal numbers. However, Bash can convert those:
find /var/log/ -inum $((16#20)) -or -inum $((16#36)) -or -inum $((16#57))
They must be deleted or restored from backup, and the error message will be gone after the following scrub.
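The follow-up, using the pool name from the error output above:
zpool scrub zroot ;# start a new scrub once the broken files are gone
zpool status -v zroot ;# the list of permanent errors should be empty after the scrub has finished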
Command cheat sheet #
To display the size currently occupied without snapshots (= refer); the advanced variant requires bc:
The simple way #
zfs list -o name,logicalused,usedbysnapshots -s logicalused
Or more advanced #
echo -e "scale=4\n(0"$(printf "+%d" $(zfs list -s refer -o refer -H -p))\)/1024^3"\n" | bc
Pool management #
In a mirror configuration all disk drives contain the same data. We can add new devices to the mirror configuration with:
zpool attach [pool] [device] [newdevice]
Note: Do not confuse zpool attach with zpool add, because the latter is a one way ticket. zpool add also adds the disk to the pool, but extends the pool and its storage instead of adding redundancy, and the device cannot simply be taken out again. We can remove individual disks from a mirror configuration with:
zpool detach [pool] [device]
and we can now also separate a device from the pool with all its data and make that device the first in a new pool:
zpool split [pool] [newpool] [device]
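A concrete sketch with hypothetical names (tank, /dev/sdb, /dev/sdc):
zpool attach tank /dev/sdb /dev/sdc ;# turn the single disk sdb into a mirror of sdb and sdc
zpool status tank ;# wait until the resilver has finished
zpool split tank tank_copy /dev/sdc ;# detach sdc with all its data and turn it into the new pool tank_copy
zpool import tank_copy ;# the new pool can now be imported like any other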
To be tried #
Just as a side note for me:
- does wipefs do the same thing as zfs labelclear?
ashift and sector sizes #
When creating a ZFS pool on a device with zpool create one can configure the physical sector size of a hard disk with -o ashift. A wrong value for ashift hurts the disk's write performance, but it sometimes happens that ZFS is not able to determine the value for ashift automatically, especially if you are not working with a device directly, but with an encryption layer under /dev/mapper/. How to determine the physical sector size of your devices?
lsblk -o +phy-sec
This displays the value of 2 ^ ashift, so we can use the basic calculator
echo 'l(4096) / l(2)' | bc -l
to determine the ashift value from the following table:
block size (bytes) | ashift value |
---|---|
16384 | 14 |
8192 | 13 * |
4096 | 12 * |
2048 | 11 |
1024 | 10 |
512 | 9 * |
256 | 8 |
128 | 7 |
64 | 6 |
32 | 5 |
16 | 4 |
8 | 3 |
4 | 2 |
2 | 1 |
…okay, we had some fun here, but I have marked the most common sizes with an asterisk. We will then create the zpool with something like
zpool create -o ashift=12 -O compression=lz4 -O mountpoint=/mnt/backup01 backup01 /dev/mapper/backup01
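To double check the result after creation (backup01 is the pool from the example; zpool get ashift requires a reasonably recent OpenZFS):
zpool get ashift backup01 ;# should report 12
zdb -C backup01 | grep ashift ;# the ashift value stored in the pool configuration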