Background
ZFS is a member of the newer generation of filesystems that include advanced features beyond simple file storage. Its capabilities are quite extensive, covering a wide range of pain points encountered with previous filesystems. The Wikipedia page details them all nicely, but for the purpose of this post we will be focusing on its ability to create N-way sets of disk mirrors.
Traditionally, mirrored disk sets in Linux and other operating systems have been limited to two devices (note: devices in this context could be disks, partitions or even other RAID groups, as is the case in RAID 10 setups). While mirroring has the benefit over other RAID levels that each mirrored device contains a complete copy of the data, the two-device limit became inadequate as disk sizes ballooned. In the age of multi-TB drives, simply rebuilding a degraded mirrored array could actually cause the surviving device to fail, eliminating the very redundancy one was expecting.
ZFS addresses this particular problem in several ways: data checksums, self-healing, and smart resilvering that copies only the data actually in use instead of blindly rebuilding full array members even when only 1% of the disk space is occupied.
And to top it off, ZFS also includes the ability to specify any number (N) of devices in a mirrored set. In this post we will create a sample 3-way mirrored set using loopback devices and run a series of test scenarios against it.
For those unfamiliar, a loopback device allows you to expose a file as a block device. Using loopback devices we can create file-based "disks" to use as mirror array members in our test.
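For example, attaching a file as a block device can be as simple as the following (a quick sketch; /tmp/example_disk is just a placeholder path, and losetup --find --show attaches the file to the first free loop device and prints its name):

# dd if=/dev/zero of=/tmp/example_disk bs=1M count=100
# losetup --find --show /tmp/example_disk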
Testbed Setup
For this exercise I am using a fresh Debian Jessie (8.1) x86_64 vanilla system installed into a KVM/QEMU virtual machine. The kernel currently shipped with Jessie is 3.16.0-4-amd64 and the ZFSOnLinux package currently available for Debian is 0.6.4-1.2-1.
It should be especially noted that ZFS should only be used on 64-bit hosts.
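You can quickly confirm the architecture of your host; on a suitable 64-bit machine uname should report x86_64:

# uname -m
x86_64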
Installation
Following the Debian instructions on the ZFSOnLinux website, the following commands were run:
$ su -
# apt-get install lsb-release
# wget http://archive.zfsonlinux.org/debian/pool/main/z/zfsonlinux/zfsonlinux_6_all.deb
# dpkg -i zfsonlinux_6_all.deb
# apt-get update
# apt-get install debian-zfs
This will add /etc/apt/sources.list.d/zfsonlinux.list, install the software and dependencies, then proceed to build the ZFS/SPL kernel modules.
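Once the build finishes, one quick sanity check (optional; the module is normally loaded automatically the first time the zpool/zfs tools are used) is to load the module and confirm it is present:

# modprobe zfs
# lsmod | grep -E 'zfs|spl'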
Preparing the loopback devices
Finding the first available loopback device
# losetup -a
If you see anything listed, change 1 2 3 in the commands below to start with the next available number and increment appropriately.
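For example, if loop1 through loop3 were already taken and loop4 was the next free device, the loops in the following commands would simply become (a sketch):

# for i in 4 5 6; do dd if=/dev/zero of=/tmp/zfsdisk_$i bs=1M count=250; done
# for i in 4 5 6; do losetup /dev/loop$i /tmp/zfsdisk_$i; done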
Creating the files
# for i in 1 2 3; do dd if=/dev/zero of=/tmp/zfsdisk_$i bs=1M count=250; done
250+0 records in
250+0 records out
262144000 bytes (262 MB) copied, 0.371318 s, 706 MB/s
250+0 records in
250+0 records out
262144000 bytes (262 MB) copied, 0.614396 s, 427 MB/s
250+0 records in
250+0 records out
262144000 bytes (262 MB) copied, 0.824889 s, 318 MB/s
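As an aside, if you would rather not write out 250MB per file, sparse backing files should work just as well for a test like this (a sketch using truncate instead of dd):

# for i in 1 2 3; do truncate -s 250M /tmp/zfsdisk_$i; done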
Set up the loopback mappings
# for i in 1 2 3; do losetup /dev/loop$i /tmp/zfsdisk_$i; done
Verify the mappings
# losetup -a
/dev/loop1: [65025]:399320 (/tmp/zfsdisk_1)
/dev/loop2: [65025]:399323 (/tmp/zfsdisk_2)
/dev/loop3: [65025]:399324 (/tmp/zfsdisk_3)
Create the ZFS 3-Way Mirror
# zpool \
    create \
    -o ashift=12 \
    -m /mnt/zfs/mymirror \
    mymirror \
    mirror \
    /dev/loop1 \
    /dev/loop2 \
    /dev/loop3
A couple of things to note:
- -o ashift=12
  This tells ZFS to align along 4KB sectors. It is generally a good idea to always set this option: modern disks use 4KB sectors, and once a pool has been created with a given sector size it cannot be changed later. The net result is that if you created a pool with 512-byte sectors using, say, 1TB drives, you could not later change the sector size to 4KB when adding 3TB drives (resulting in abysmal performance on the newer drives). So as a rule of thumb, always set -o ashift=12.
- -m /mnt/zfs/mymirror
  This indicates where this pool should be mounted.
- /dev/loopN
  The devices that make up the mirrored set. If these were physical disks you would likely want to use the appropriate disk symlinks under /dev/disk/by-id/ (see the sketch after this list).
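For example, on physical hardware the same pool might be created against stable by-id names along these lines (a sketch; the ata-DISK* names are placeholders for whatever ls -l /dev/disk/by-id/ shows on your system):

# ls -l /dev/disk/by-id/
# zpool \
    create \
    -o ashift=12 \
    -m /mnt/zfs/mymirror \
    mymirror \
    mirror \
    /dev/disk/by-id/ata-DISK1 \
    /dev/disk/by-id/ata-DISK2 \
    /dev/disk/by-id/ata-DISK3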
Verify The ZFS Pool
# zpool list
NAME       SIZE  ALLOC   FREE  EXPANDSZ   FRAG    CAP  DEDUP  HEALTH  ALTROOT
mymirror   244M   408K   244M         -     0%     0%  1.00x  ONLINE  -
# zpool status
  pool: mymirror
 state: ONLINE
  scan: none requested
config:

        NAME        STATE     READ WRITE CKSUM
        mymirror    ONLINE       0     0     0
          mirror-0  ONLINE       0     0     0
            loop1   ONLINE       0     0     0
            loop2   ONLINE       0     0     0
            loop3   ONLINE       0     0     0

errors: No known data errors
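Since ZFS mirrors are N-way, this pool could also be grown later, say into a 4-way mirror, by attaching another device to the existing mirror vdev (a sketch, assuming a hypothetical fourth loopback device /dev/loop4 prepared the same way as the others):

# zpool attach -o ashift=12 mymirror loop3 /dev/loop4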
Poking The Bear
So now that we have our test 3-way mirror running, let's test its resiliency.
!!! WARNING NOTE: ALTHOUGH ZFS IS BUILT TO RECOVER FROM ERRORS, ONLY RUN THE FOLLOWING COMMANDS IN A TEST ENVIRONMENT OTHERWISE YOU WILL SUFFER DATA LOSS!!!
Setting The Stage
Create a random file that takes up ~50% of the disk space:
# dd if=/dev/urandom of=/mnt/zfs/mymirror/test.dat bs=1M count=125
125+0 records in
125+0 records out
131072000 bytes (131 MB) copied, 16.8152 s, 7.8 MB/s
# zpool list
NAME       SIZE  ALLOC   FREE  EXPANDSZ   FRAG    CAP  DEDUP  HEALTH  ALTROOT
mymirror   244M   126M   118M         -    20%    51%  1.00x  ONLINE  -
# zpool scrub mymirror
# zpool status
  pool: mymirror
 state: ONLINE
  scan: scrub repaired 0 in 0h0m with 0 errors on Sun Jul 19 20:20:12 2015
config:

        NAME        STATE     READ WRITE CKSUM
        mymirror    ONLINE       0     0     0
          mirror-0  ONLINE       0     0     0
            loop1   ONLINE       0     0     0
            loop2   ONLINE       0     0     0
            loop3   ONLINE       0     0     0
Complete Corruption Of A Single Disk
Wipe the disk with all ones (to differentiate it from the /dev/zero initialization above and to demonstrate how ZFS resilvers):
# dd if=/dev/zero bs=1M count=250 | tr '\000' '\001' > /tmp/zfsdisk_3
250+0 records in
250+0 records out
262144000 bytes (262 MB) copied, 0.708197 s, 370 MB/s
This will wipe out the ZFS disk label along with everything else, simulating the state where a disk is alive but corrupt.
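As an aside, if you instead wanted to simulate a disk that has vanished entirely rather than one that is alive but corrupt, ZFS can take a mirror member offline and bring it back (a sketch, not part of the scenario below):

# zpool offline mymirror loop3
# zpool online mymirror loop3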
# zpool scrub mymirror
# zpool list
NAME       SIZE  ALLOC   FREE  EXPANDSZ   FRAG    CAP  DEDUP  HEALTH  ALTROOT
mymirror   244M   127M   117M         -    21%    51%  1.00x  ONLINE  -

# zpool status
  pool: mymirror
 state: ONLINE
status: One or more devices could not be used because the label is missing or
        invalid. Sufficient replicas exist for the pool to continue
        functioning in a degraded state.
action: Replace the device using 'zpool replace'.
   see: http://zfsonlinux.org/msg/ZFS-8000-4J
  scan: scrub repaired 0 in 0h0m with 0 errors on Sun Jul 19 20:39:45 2015
config:

        NAME        STATE     READ WRITE CKSUM
        mymirror    ONLINE       0     0     0
          mirror-0  ONLINE       0     0     0
            loop1   ONLINE       0     0     0
            loop2   ONLINE       0     0     0
            loop3   UNAVAIL      0     0     0  corrupted data

errors: No known data errors
Replacing the disk:
# zpool replace -o ashift=12 mymirror loop3
# zpool status
  pool: mymirror
 state: ONLINE
  scan: resilvered 126M in 0h0m with 0 errors on Sun Jul 19 20:42:51 2015
config:

        NAME        STATE     READ WRITE CKSUM
        mymirror    ONLINE       0     0     0
          mirror-0  ONLINE       0     0     0
            loop1   ONLINE       0     0     0
            loop2   ONLINE       0     0     0
            loop3   ONLINE       0     0     0
Note that only 126M needed to be resilvered. ZFS synchronizes only the blocks in use, skipping empty blocks and blocks that already match on the new drive (corrupting the disk with all ones rather than zeros makes this visible: only the data actually in use was copied back).
Complete Corruption Of 2 Out Of 3 Disks
Check the file first:
# md5sum /mnt/zfs/mymirror/test.dat
c253c4c5421d793f4fefe34af5a5ecc1  /mnt/zfs/mymirror/test.dat
Corrupt disks 2 and 3:
# dd if=/dev/zero bs=1M count=250 | tr '\000' '\001' > /tmp/zfsdisk_2
250+0 records in
250+0 records out
262144000 bytes (262 MB) copied, 0.660485 s, 397 MB/s

# dd if=/dev/zero bs=1M count=250 | tr '\000' '\001' > /tmp/zfsdisk_3
250+0 records in
250+0 records out
262144000 bytes (262 MB) copied, 0.718505 s, 365 MB/s
# zpool scrub mymirror
# zpool status
  pool: mymirror
 state: ONLINE
status: One or more devices could not be used because the label is missing or
        invalid. Sufficient replicas exist for the pool to continue
        functioning in a degraded state.
action: Replace the device using 'zpool replace'.
   see: http://zfsonlinux.org/msg/ZFS-8000-4J
  scan: scrub repaired 0 in 0h0m with 0 errors on Sun Jul 19 22:39:05 2015
config:

        NAME        STATE     READ WRITE CKSUM
        mymirror    ONLINE       0     0     0
          mirror-0  ONLINE       0     0     0
            loop1   ONLINE       0     0     0
            loop2   UNAVAIL      0     0     0  corrupted data
            loop3   UNAVAIL      0     0     0  corrupted data

errors: No known data errors
# md5sum /mnt/zfs/mymirror/test.dat
c253c4c5421d793f4fefe34af5a5ecc1  /mnt/zfs/mymirror/test.dat
The file still looks good. Now replace both drives (done in the following way so we can catch the replacement in progress):
# zpool replace -o ashift=12 mymirror loop2 & \
  zpool replace -o ashift=12 mymirror loop3 & \
  sleep 1 && \
  zpool status &

 state: ONLINE
  scan: resilvered 127M in 0h0m with 0 errors on Sun Jul 19 22:45:17 2015
config:

        NAME               STATE     READ WRITE CKSUM
        mymirror           ONLINE       0     0     0
          mirror-0         ONLINE       0     0     0
            loop1          ONLINE       0     0     0
            replacing-1    UNAVAIL      0     0     0
              old          UNAVAIL      0     0     0  corrupted data
              loop2        ONLINE       0     0     0
            replacing-2    UNAVAIL      0     0     0
              old          UNAVAIL      0     0     0  corrupted data
              loop3        ONLINE       0     0     0

errors: No known data errors
And finally, both drives are replaced:
# zpool status
  pool: mymirror
 state: ONLINE
  scan: resilvered 127M in 0h0m with 0 errors on Sun Jul 19 22:45:17 2015
config:

        NAME        STATE     READ WRITE CKSUM
        mymirror    ONLINE       0     0     0
          mirror-0  ONLINE       0     0     0
            loop1   ONLINE       0     0     0
            loop2   ONLINE       0     0     0
            loop3   ONLINE       0     0     0

errors: No known data errors
And finally check the file:
# md5sum /mnt/zfs/mymirror/test.dat
c253c4c5421d793f4fefe34af5a5ecc1  /mnt/zfs/mymirror/test.dat
Corrupting A File
In this test we'll inject bad data into the file on the drive using the zinject testing tool included with ZFS.
# zinject -t data -f 1 /mnt/zfs/mymirror/test.dat
Added handler 5 with the following properties:
  pool: mymirror
objset: 21
object: 24
  type: 0
 level: 0
 range: all
# zpool scrub mymirror
# zpool status
  pool: mymirror
 state: ONLINE
  scan: scrub in progress since Sun Jul 19 21:54:23 2015
    88.4M scanned out of 127M at 3.84M/s, 0h0m to go
    2.12M repaired, 69.51% done
config:

        NAME        STATE     READ WRITE CKSUM
        mymirror    ONLINE       0     0     0
          mirror-0  ONLINE       0     0     0
            loop1   ONLINE       0     0     0  (repairing)
            loop2   ONLINE       0     0     0  (repairing)
            loop3   ONLINE       0     0     0  (repairing)
ZFS found the bad data and is in the process of repairing it.
# zpool status
  pool: mymirror
 state: ONLINE
  scan: scrub repaired 3M in 0h0m with 0 errors on Sun Jul 19 21:54:55 2015
config:

        NAME        STATE     READ WRITE CKSUM
        mymirror    ONLINE       0     0     0
          mirror-0  ONLINE       0     0     0
            loop1   ONLINE       0     0     0
            loop2   ONLINE       0     0     0
            loop3   ONLINE       0     0     0

errors: No known data errors
Finished repairing 3M of bad data.
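As in the other tests, it is worth confirming that the file itself still checks out; its checksum should match the one recorded earlier:

# md5sum /mnt/zfs/mymirror/test.dat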
Cleanup: If you are testing this yourself, remember to cancel the zinject handler afterwards:
# zinject
 ID  POOL             OBJSET  OBJECT  TYPE      LVL  RANGE
---  ---------------  ------  ------  --------  ---  ---------------
  5  mymirror         21      24      -         0    all
# zinject -c 5
removed handler 5
Partial Drive Corruption
Inject random bytes into one of the files backing a loopback device (a mirrored array member) with dd:
# dd if=/dev/urandom of=/tmp/zfsdisk_3 bs=1K count=10 seek=200000
10+0 records in
10+0 records out
10240 bytes (10 kB) copied, 0.00324266 s, 3.2 MB/s
# zpool scrub mymirror
# zpool status
  pool: mymirror
 state: ONLINE
status: One or more devices has experienced an unrecoverable error. An
        attempt was made to correct the error. Applications are unaffected.
action: Determine if the device needs to be replaced, and clear the errors
        using 'zpool clear' or replace the device with 'zpool replace'.
   see: http://zfsonlinux.org/msg/ZFS-8000-9P
  scan: scrub in progress since Sun Jul 19 22:08:26 2015
    127M scanned out of 127M at 31.8M/s, 0h0m to go
    24.8M repaired, 99.91% done
config:

        NAME        STATE     READ WRITE CKSUM
        mymirror    ONLINE       0     0     0
          mirror-0  ONLINE       0     0     0
            loop1   ONLINE       0     0     0
            loop2   ONLINE       0     0     0
            loop3   ONLINE       0     0   260  (repairing)

errors: No known data errors
ZFS found the corruption and is fixing it.
# zpool status
  pool: mymirror
 state: ONLINE
status: One or more devices has experienced an unrecoverable error. An
        attempt was made to correct the error. Applications are unaffected.
action: Determine if the device needs to be replaced, and clear the errors
        using 'zpool clear' or replace the device with 'zpool replace'.
   see: http://zfsonlinux.org/msg/ZFS-8000-9P
  scan: scrub repaired 24.8M in 0h0m with 0 errors on Sun Jul 19 22:08:30 2015
config:

        NAME        STATE     READ WRITE CKSUM
        mymirror    ONLINE       0     0     0
          mirror-0  ONLINE       0     0     0
            loop1   ONLINE       0     0     0
            loop2   ONLINE       0     0     0
            loop3   ONLINE       0     0   260

errors: No known data errors
24.8M of drive corruption fixed.
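Once you are done experimenting, the status output above suggests clearing the error counters with zpool clear; after that, the whole testbed can be torn down (a sketch, assuming the pool and loopback devices created earlier in this post):

# zpool clear mymirror
# zpool destroy mymirror
# for i in 1 2 3; do losetup -d /dev/loop$i; done
# rm /tmp/zfsdisk_1 /tmp/zfsdisk_2 /tmp/zfsdisk_3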
Conclusion
Setting up 3-way mirrored arrays with ZFS provides robust error detection and recovery across a wide variety of damage scenarios. Its ability to target healing to only the affected data allows it to resilver efficiently and recover faster than traditional array configurations.