Three Buddhas

3-way Disk Mirrors With ZFSOnLinux

Background

ZFS is a member of the newer generation of filesystems that include advanced features beyond simple file storage. Its capabilities are quite extensive, covering a wide range of pain points hit with previous filesystems. The Wikipedia page details them all nicely, but for the purposes of this post we will focus on its ability to create N-way mirrored disk sets.

Traditionally, mirrored disk sets in Linux and other operating systems have typically been limited to two devices (note: devices in this context could be disks, partitions, or even other RAID groups, as is the case in RAID 10 setups). While mirroring has the benefit over other RAID levels that each mirrored device contains a complete copy of the data, the two-device limit became inadequate as disk sizes ballooned. In the age of multi-TB drives, simply rebuilding a degraded mirrored array can stress the surviving device enough to cause it to fail, eliminating the very redundancy one was expecting.

ZFS addresses this particular problem in several ways: data checksums, self-healing, and smart resilvering that copies only the blocks actually in use instead of blindly rebuilding entire array members even if only 1% of the disk space is occupied.

And to top it off, ZFS also lets you specify any number (N) of devices in a mirrored set. In this post we will create a sample 3-way mirrored set using loopback devices and run a series of test scenarios against it.

For those unfamiliar, a loopback device allows you to expose a file as a block device. Using loopback devices we can create file-based "disks" to use as mirror members in our tests.

Testbed Setup

For this exercise I am using a fresh Debian Jessie (8.1) x86_64 vanilla system installed into a KVM/QEMU virtual machine. The kernel currently shipped with Jessie is 3.16.0-4-amd64 and the ZFSOnLinux package currently available for Debian is 0.6.4-1.2-1.

Note in particular that ZFS should only be used on 64-bit hosts.
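
A quick sanity check (these are standard commands, not specific to ZFS): a 64-bit host will report x86_64 and amd64, respectively.

# uname -m
# dpkg --print-architecture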

Installation

Per the Debian instructions on the ZFSOnLinux website, the following commands were run:

$ su -
# apt-get install lsb-release
# wget http://archive.zfsonlinux.org/debian/pool/main/z/zfsonlinux/zfsonlinux_6_all.deb
# dpkg -i zfsonlinux_6_all.deb
# apt-get update
# apt-get install debian-zfs

This will add /etc/apt/sources.list.d/zfsonlinux.list, install the software and dependencies, then proceed to build the ZFS/SPL kernel modules.
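
If you want to verify the build, something like the following (standard DKMS and module tools) should show the spl and zfs modules built and loadable:

# dkms status
# modprobe zfs
# lsmod | grep zfs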

Preparing the loopback devices

Finding the first available loopback device

# losetup -a

If you see anything listed, change 1 2 3 in the commands below to start with the next available number and increment accordingly.
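
Alternatively, losetup can report the first unused loop device for you, which is a handy starting point:

# losetup -f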

Creating the files

# for i in 1 2 3; do dd if=/dev/zero of=/tmp/zfsdisk_$i bs=1M count=250; done
250+0 records in
250+0 records out
262144000 bytes (262 MB) copied, 0.371318 s, 706 MB/s
250+0 records in
250+0 records out
262144000 bytes (262 MB) copied, 0.614396 s, 427 MB/s
250+0 records in
250+0 records out
262144000 bytes (262 MB) copied, 0.824889 s, 318 MB/s

Set up the loopback mappings

# for i in 1 2 3; do losetup /dev/loop$i /tmp/zfsdisk_$i; done

Verify the mappings

# losetup -a
/dev/loop1: [65025]:399320 (/tmp/zfsdisk_1)
/dev/loop2: [65025]:399323 (/tmp/zfsdisk_2)
/dev/loop3: [65025]:399324 (/tmp/zfsdisk_3)

Create the ZFS 3-Way Mirror

# zpool \
    create \
    -o ashift=12 \
    -m /mnt/zfs/mymirror \
    mymirror \
    mirror \
    /dev/loop1 \
    /dev/loop2 \
    /dev/loop3

A couple of things to note:

  1.  -o ashift=12
    This tells ZFS to align on 4KB sectors. It is generally a good idea to set this option, because modern disks use 4KB physical sectors and a pool's sector size cannot be changed once the pool is created. If you built a pool aligned to 512-byte sectors on, say, 1TB drives, you could not later switch to 4KB alignment when adding 3TB drives, and the newer drives would suffer abysmal performance. So as a rule of thumb, always set -o ashift=12.
  2.  -m /mnt/zfs/mymirror
    This indicates where the pool should be mounted.
  3.  /dev/loopN
    The devices that make up the mirrored set. If these were physical disks you would likely want to use the stable symlinks under /dev/disk/by-id/ instead (see the quick check right after this list).
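
As a quick illustration of points 1 and 3 (standard commands; sda here is just a placeholder device name), you can inspect a disk's logical and physical sector sizes and list the stable by-id names like so:

# lsblk -o NAME,LOG-SEC,PHY-SEC /dev/sda
# ls -l /dev/disk/by-id/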

Verify The ZFS Pool

# zpool list
NAME       SIZE  ALLOC   FREE  EXPANDSZ   FRAG    CAP  DEDUP  HEALTH  ALTROOT
mymirror   244M   408K   244M         -     0%     0%  1.00x  ONLINE  -
# zpool status
  pool: mymirror
 state: ONLINE
  scan: none requested
config:

    NAME        STATE     READ WRITE CKSUM
    mymirror    ONLINE       0     0     0
      mirror-0  ONLINE       0     0     0
        loop1   ONLINE       0     0     0
        loop2   ONLINE       0     0     0
        loop3   ONLINE       0     0     0

errors: No known data errors

Poking The Bear

So now that we have our test 3-way mirror running, let's test its resiliency.

!!! WARNING NOTE: ALTHOUGH ZFS IS BUILT TO RECOVER FROM ERRORS, ONLY RUN THE FOLLOWING COMMANDS IN A TEST ENVIRONMENT OTHERWISE YOU WILL SUFFER DATA LOSS!!!

Setting The Stage

Create a random file that takes up ~50% of the pool's capacity:

# dd if=/dev/urandom of=/mnt/zfs/mymirror/test.dat bs=1M count=125
 125+0 records in
 125+0 records out
 131072000 bytes (131 MB) copied, 16.8152 s, 7.8 MB/s
# zpool list
 NAME       SIZE  ALLOC   FREE  EXPANDSZ   FRAG    CAP  DEDUP  HEALTH  ALTROOT
 mymirror   244M   126M   118M         -    20%    51%  1.00x  ONLINE  -
# zpool scrub mymirror
# zpool status
  pool: mymirror
 state: ONLINE
  scan: scrub repaired 0 in 0h0m with 0 errors on Sun Jul 19 20:20:12 2015
config:

    NAME        STATE     READ WRITE CKSUM
    mymirror    ONLINE       0     0     0
      mirror-0  ONLINE       0     0     0
        loop1   ONLINE       0     0     0
        loop2   ONLINE       0     0     0
        loop3   ONLINE       0     0     0

Complete Corruption Of A Single Disk

Wipe the disk with all ones (to differentiate it from the /dev/zero initialization above and to demonstrate how ZFS resilvers):

# dd if=/dev/zero bs=1M count=250 | tr '\000' '\001' > /tmp/zfsdisk_3
 250+0 records in
 250+0 records out
 262144000 bytes (262 MB) copied, 0.708197 s, 370 MB/s

This wipes out the ZFS disk label along with everything else, simulating a disk that is alive but corrupt.

# zpool scrub mymirror
# zpool list
 NAME       SIZE  ALLOC   FREE  EXPANDSZ   FRAG    CAP  DEDUP  HEALTH  ALTROOT
 mymirror   244M   127M   117M         -    21%    51%  1.00x  ONLINE  -

# zpool status
  pool: mymirror
 state: ONLINE
status: One or more devices could not be used because the label is missing or
    invalid.  Sufficient replicas exist for the pool to continue
    functioning in a degraded state.
action: Replace the device using 'zpool replace'.
   see: http://zfsonlinux.org/msg/ZFS-8000-4J
  scan: scrub repaired 0 in 0h0m with 0 errors on Sun Jul 19 20:39:45 2015
config:

    NAME        STATE     READ WRITE CKSUM
    mymirror    ONLINE       0     0     0
      mirror-0  ONLINE       0     0     0
        loop1   ONLINE       0     0     0
        loop2   ONLINE       0     0     0
        loop3   UNAVAIL      0     0     0  corrupted data

errors: No known data errors

Replacing the disk:

# zpool replace -o ashift=12 mymirror loop3
# zpool status
  pool: mymirror
 state: ONLINE
  scan: resilvered 126M in 0h0m with 0 errors on Sun Jul 19 20:42:51 2015
config:

    NAME        STATE     READ WRITE CKSUM
    mymirror    ONLINE       0     0     0
      mirror-0  ONLINE       0     0     0
        loop1   ONLINE       0     0     0
        loop2   ONLINE       0     0     0
        loop3   ONLINE       0     0     0

Note that only 126MB needed to be resilvered. ZFS only synchronizes blocks that are in use; it does not copy empty blocks or blocks that already match on the replacement device (which is why the disk was corrupted with ones rather than zeros, to make the difference visible).
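
If you want to watch a resilver while it runs, one option (zpool iostat with a one-second interval; the pool name matches this testbed) is:

# zpool iostat -v mymirror 1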

Complete Corruption Of 2 Out Of 3 Disks

Check the file first:

# md5sum /mnt/zfs/mymirror/test.dat 
c253c4c5421d793f4fefe34af5a5ecc1  /mnt/zfs/mymirror/test.dat

Corrupt disks 2 and 3:

# dd if=/dev/zero bs=1M count=250 | tr '\000' '\001' > /tmp/zfsdisk_2
250+0 records in
250+0 records out
262144000 bytes (262 MB) copied, 0.660485 s, 397 MB/s
# dd if=/dev/zero bs=1M count=250 | tr '\000' '\001' > /tmp/zfsdisk_3
250+0 records in
250+0 records out
262144000 bytes (262 MB) copied, 0.718505 s, 365 MB/s
# zpool scrub mymirror
# zpool status
  pool: mymirror
 state: ONLINE
status: One or more devices could not be used because the label is missing or
    invalid.  Sufficient replicas exist for the pool to continue
    functioning in a degraded state.
action: Replace the device using 'zpool replace'.
   see: http://zfsonlinux.org/msg/ZFS-8000-4J
  scan: scrub repaired 0 in 0h0m with 0 errors on Sun Jul 19 22:39:05 2015
config:

    NAME        STATE     READ WRITE CKSUM
    mymirror    ONLINE       0     0     0
      mirror-0  ONLINE       0     0     0
        loop1   ONLINE       0     0     0
        loop2   UNAVAIL      0     0     0  corrupted data
        loop3   UNAVAIL      0     0     0  corrupted data

errors: No known data errors
# md5sum /mnt/zfs/mymirror/test.dat 
c253c4c5421d793f4fefe34af5a5ecc1  /mnt/zfs/mymirror/test.dat

File still looks good. Now replace both drives (done this way so we can catch the replacement in progress):

# zpool replace -o ashift=12 mymirror loop2 & \
  zpool replace -o ashift=12 mymirror loop3 & \
  sleep 1 && \
    zpool status &

state: ONLINE
 scan: resilvered 127M in 0h0m with 0 errors on Sun Jul 19 22:45:17 2015
config:

    NAME             STATE     READ WRITE CKSUM
    mymirror         ONLINE       0     0     0
      mirror-0       ONLINE       0     0     0
        loop1        ONLINE       0     0     0
        replacing-1  UNAVAIL      0     0     0
          old        UNAVAIL      0     0     0  corrupted data
          loop2      ONLINE       0     0     0
        replacing-2  UNAVAIL      0     0     0
          old        UNAVAIL      0     0     0  corrupted data
          loop3      ONLINE       0     0     0

errors: No known data errors

And finally, the replacement completes:

# zpool status
  pool: mymirror
 state: ONLINE
  scan: resilvered 127M in 0h0m with 0 errors on Sun Jul 19 22:45:17 2015
config:

    NAME        STATE     READ WRITE CKSUM
    mymirror    ONLINE       0     0     0
      mirror-0  ONLINE       0     0     0
        loop1   ONLINE       0     0     0
        loop2   ONLINE       0     0     0
        loop3   ONLINE       0     0     0

errors: No known data errors

And finally check the file:

# md5sum /mnt/zfs/mymirror/test.dat 
c253c4c5421d793f4fefe34af5a5ecc1  /mnt/zfs/mymirror/test.dat

Corrupting A File

In this test we'll inject bad data into the file on disk using the zinject testing tool included with ZFS.

# zinject -t data -f 1 /mnt/zfs/mymirror/test.dat
Added handler 5 with the following properties:
  pool: mymirror
objset: 21
object: 24
  type: 0
 level: 0
 range: all
# zpool scrub mymirror
# zpool status
  pool: mymirror
 state: ONLINE
  scan: scrub in progress since Sun Jul 19 21:54:23 2015
    88.4M scanned out of 127M at 3.84M/s, 0h0m to go
    2.12M repaired, 69.51% done
config:

    NAME        STATE     READ WRITE CKSUM
    mymirror    ONLINE       0     0     0
      mirror-0  ONLINE       0     0     0
        loop1   ONLINE       0     0     0  (repairing)
        loop2   ONLINE       0     0     0  (repairing)
        loop3   ONLINE       0     0     0  (repairing)

ZFS found the bad data and is in the process of repairing it.

# zpool status
  pool: mymirror
 state: ONLINE
  scan: scrub repaired 3M in 0h0m with 0 errors on Sun Jul 19 21:54:55 2015
config:

    NAME        STATE     READ WRITE CKSUM
    mymirror    ONLINE       0     0     0
      mirror-0  ONLINE       0     0     0
        loop1   ONLINE       0     0     0
        loop2   ONLINE       0     0     0
        loop3   ONLINE       0     0     0

errors: No known data errors

Finished repairing 3MB of bad data.

Cleanup: if you are testing this yourself, remember to cancel the zinject handler afterwards:

# zinject 
 ID  POOL             OBJSET  OBJECT  TYPE      LVL  RANGE          
---  ---------------  ------  ------  --------  ---  ---------------
  5  mymirror         21      24      -           0  all
# zinject -c 5
removed handler 5

Partial Drive Corruption

Inject random bytes into one of the files backing a loopback device (a mirror member) with dd:

# dd if=/dev/urandom of=/tmp/zfsdisk_3 bs=1K count=10 seek=200000
10+0 records in
10+0 records out
10240 bytes (10 kB) copied, 0.00324266 s, 3.2 MB/s
# zpool scrub mymirror
# zpool status
  pool: mymirror
 state: ONLINE
status: One or more devices has experienced an unrecoverable error.  An
    attempt was made to correct the error.  Applications are unaffected.
action: Determine if the device needs to be replaced, and clear the errors
    using 'zpool clear' or replace the device with 'zpool replace'.
   see: http://zfsonlinux.org/msg/ZFS-8000-9P
  scan: scrub in progress since Sun Jul 19 22:08:26 2015
    127M scanned out of 127M at 31.8M/s, 0h0m to go
    24.8M repaired, 99.91% done
config:

    NAME        STATE     READ WRITE CKSUM
    mymirror    ONLINE       0     0     0
      mirror-0  ONLINE       0     0     0
        loop1   ONLINE       0     0     0
        loop2   ONLINE       0     0     0
        loop3   ONLINE       0     0   260  (repairing)

errors: No known data errors

ZFS found the corruption and is repairing it.

# zpool status
  pool: mymirror
 state: ONLINE
status: One or more devices has experienced an unrecoverable error.  An
    attempt was made to correct the error.  Applications are unaffected.
action: Determine if the device needs to be replaced, and clear the errors
    using 'zpool clear' or replace the device with 'zpool replace'.
   see: http://zfsonlinux.org/msg/ZFS-8000-9P
  scan: scrub repaired 24.8M in 0h0m with 0 errors on Sun Jul 19 22:08:30 2015
config:

    NAME        STATE     READ WRITE CKSUM
    mymirror    ONLINE       0     0     0
      mirror-0  ONLINE       0     0     0
        loop1   ONLINE       0     0     0
        loop2   ONLINE       0     0     0
        loop3   ONLINE       0     0   260

errors: No known data errors

24.8MB of drive corruption was repaired.
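
When you are done experimenting, the testbed can be torn down by destroying the pool, detaching the loop devices, and removing the backing files (these names match the ones used throughout this post):

# zpool destroy mymirror
# for i in 1 2 3; do losetup -d /dev/loop$i; done
# rm /tmp/zfsdisk_1 /tmp/zfsdisk_2 /tmp/zfsdisk_3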

Conclusion

Setting up 3-way disk mirrors with ZFS provides robust error detection and recovery from a wide variety of damage scenarios. Because ZFS can target healing at only the affected data, it resilvers efficiently and recovers faster than traditional array configurations.