High Availability RabbitMQ With Mirrored Queues

Background

RabbitMQ is a robust message queue which features high message throughput, configurable acknowledgements, an intuitive management GUI and wide client library support. Written in Erlang, RabbitMQ has built a reputation for outstanding stability which makes it a popular choice as a core infrastructure system.

As you plan your overall messaging architecture, a universal requirement is to minimize downtime from any single point of failure. Fortunately, RabbitMQ comes equipped with built-in high-availability facilities, and your tolerance for message loss will determine which of the available HA options and approaches fits best. This post will mainly focus on setting up RabbitMQ mirrored queues, which provide the strongest protection against message loss.

Clustering vs Mirrored Queues

For the purpose of this post, it’s sufficient to know that RabbitMQ clusters do in fact replicate items like users, exchanges, routing keys and queues among member nodes, but importantly they do not replicate the messages themselves [1]. In the event of a RabbitMQ node failure, clients can reconnect to a different node, declare their queues and begin to receive newly published messages from that point on; however, any messages that were still waiting in a queue located on the failed node will be lost if that host becomes unrecoverable.

Mirrored queues introduce an additional level of protection by replicating published messages to other nodes in the cluster as well. In the event of a node failure, another cluster node will assume the queue master role for mirrored queues that were present on the failed node and proceed to handle messaging operations for those queues thereafter, including serving messages that were still queued up at the time of the failure. This additional protection, however, comes at the cost of reduced message throughput and increased server load.

Requirements

Before we can configure mirrored queues we need to first have a RabbitMQ cluster in place.

RabbitMQ Cluster Requirements

  1. Two or more RabbitMQ nodes
  2. Identical Erlang cookie value in each node
  3. Peer discovery method

The identical Erlang cookie value is necessary for the nodes to authenticate with each other and form a cluster. Since we will be using docker-compose for this demo, we can specify this value as an environment variable and the default RabbitMQ docker-entrypoint.sh script will write it to the correct file. If you are deploying RabbitMQ to hosts directly or using some other configuration orchestration, see the official clustering guide for the location of the cookie file [2].

The peer discovery method we will use is a hard-coded list of the nodes that make up the cluster. This is the simplest method and avoids introducing additional dependencies, making it perfect for a short demonstration, but you’ll likely want to select one of the more dynamic peer discovery methods outlined in the cluster formation documentation [3].

Mirrored Queues Requirements

  1. RabbitMQ Cluster
  2. High-availability policy
  3. One or more queues to apply the HA policy to

As mentioned above, mirrored queue functionality is layered on top of the cluster mechanisms, so you’ll need to set up a cluster if you want to enable mirrored queues.

The high-availability policy defines the characteristics of the message replication, such as the replication batch size, along with a pattern to match against the names of the queues you want mirrored. In the demo below we will indicate that the policy should apply to any queue whose name starts with ‘ha.’.
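
As an aside, the policy does not have to come from a definitions file; it can also be created at runtime with rabbitmqctl. A minimal sketch, assuming the rabbit1 container from the demo below is up and running:

$ docker exec rabbit1 rabbitmqctl set_policy -p / ha-all "^ha\." \
    '{"ha-mode":"all","ha-sync-mode":"automatic","ha-sync-batch-size":1}'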

Demo

Demo Overview

For this demo we will spin up two RabbitMQ nodes from scratch in a cluster configuration using docker-compose. We will also leverage the management plugin to pre-define an HA policy, a single queue to be mirrored and other initial configurations so that our cluster has everything it needs to be fully functional from the start.

An important thing to note is that the data volume mapping has been commented out in the docker-compose.yml file. This was done since this demo is meant to highlight how to start fresh without any previous state information. When the cluster is running, state information will be persisted to the data volume and will be read in on subsequent starts. Since this demo does not map the data volume, any state information will be lost when the node is shut down. In real world use you’ll want to map the data volume to an appropriate location on your system to avoid losing data on restart.

Demo Files

Below are the files necessary to start up two RabbitMQ nodes using docker-compose. They are intentionally minimalist to highlight the required elements and provide a stable base to allow for experimentation.

rabbitmq.conf

The new-style rabbitmq.conf file below simply defines the peer discovery method and the nodes participating in the cluster, and indicates the definitions file to load on startup (this last piece of functionality is provided by the management plugin).

# Load definitions file that includes HA policy
management.load_definitions = /etc/rabbitmq/definitions.json

# Declare cluster via host list
cluster_formation.peer_discovery_backend = rabbit_peer_discovery_classic_config
cluster_formation.classic_config.nodes.1 = [email protected]
cluster_formation.classic_config.nodes.2 = [email protected]

rabbit1-definitions.json

The primary reason we need a definitions file is to pre-define the HA policy that will apply to queues created in the cluster. An interesting surprise, however, is that loading a definitions file on a fresh RabbitMQ instance skips the automatic creation of the default ‘guest’ user. It is therefore necessary to explicitly define one or more users in the definitions file.

{
  "vhosts": [
    {
      "name": "/"
    }
  ],

  "users": [
    {
      "name": "admin",
      "password": "admin",
      "tags": "administrator"
    }
  ],

  "permissions": [
    {
      "user":"admin",
      "vhost":"/",
      "configure":".*",
      "write":".*",
      "read":".*"
    }
  ],

  "policies": [
    {
      "name": "ha-all",
      "apply-to": "all",
      "definition": {
        "ha-mode": "all",
        "ha-sync-mode": "automatic",
        "ha-sync-batch-size": 1
      },
      "pattern": "^ha\.",
      "priority": 0,
      "vhost": "/"
    }
  ],

  "queues":
  [
    {
      "name": "ha.queue1",
      "vhost": "/",
      "durable": true,
      "auto_delete": false,
      "arguments": {}
    }
  ]
}

rabbit2-definitions.json

The rabbit2 definitions file can in fact just be an empty file, as rabbit2 will sync its configuration from the rabbit1 node when joining the cluster. However, to err on the side of caution, I prefer to include a user, permissions and the HA policy definition in this file as well, in case the cluster ends up in some kind of split-brain state that needs manual intervention.

Note that the ‘ha.queue1’ definition is deliberately not included in the rabbit2 file, since whichever node a queue is declared on becomes that queue’s master node.

{
  "vhosts": [
    {
      "name": "/"
    }
  ],

  "users": [
    {
      "name": "admin",
      "password": "admin",
      "tags": "administrator"
    }
  ],

  "permissions": [
    {
      "user":"admin",
      "vhost":"/",
      "configure":".*",
      "write":".*",
      "read":".*"
    }
  ],

  "policies": [
    {
      "name": "ha-all",
      "apply-to": "all",
      "definition": {
        "ha-mode": "all",
        "ha-sync-mode": "automatic",
        "ha-sync-batch-size": 1
      },
      "pattern": "^ha\.",
      "priority": 0,
      "vhost": "/"
    }
  ]
}

docker-compose.yml

The docker-compose.yml file brings everything together under one umbrella. It defines both RabbitMQ instances, the files and ports to map, and the shared Erlang cookie value.

One thing to note is that I’ve customized rabbit2’s ‘command’ value in order to stagger the startup of the two nodes and avoid a race condition where neither node believes it’s the primary [4].

version: "2"

services:
    rabbit1:
        image: rabbitmq:3-management
        container_name: rabbit1
        hostname: rabbit1
        ports:
            - 5672:5672
            - 15672:15672
        environment:
            - RABBITMQ_ERLANG_COOKIE=shared_cookie
        volumes:
            - $PWD/rabbit1-definitions.json:/etc/rabbitmq/definitions.json:ro
            - $PWD/rabbitmq.conf:/etc/rabbitmq/rabbitmq.conf:ro
          # - $PWD/rabbit1-data:/var/lib/rabbitmq

    rabbit2:
        image: rabbitmq:3-management
        container_name: rabbit2
        hostname: rabbit2
        ports:
            - 5673:5672
            - 15673:15672
        environment:
            - RABBITMQ_ERLANG_COOKIE=shared_cookie
        volumes:
            - $PWD/rabbit2-definitions.json:/etc/rabbitmq/definitions.json:ro
            - $PWD/rabbitmq.conf:/etc/rabbitmq/rabbitmq.conf:ro
          # - $PWD/rabbit2-data:/var/lib/rabbitmq
        command: sh -c 'sleep 10 && rabbitmq-server'

Running The Demo

Once the files are in place (and assuming you have docker-compose installed), it’s just a matter of running ‘docker-compose up’:

$ docker-compose up

Starting rabbit1 ... done
Starting rabbit2 ... done
Attaching to rabbit1, rabbit2
...
rabbit1    | 2019-08-15 01:59:39.320 [info] <0.271.0> 
rabbit1    |  Starting RabbitMQ 3.7.16 on Erlang 22.0.7
...
rabbit2    | 2019-08-15 01:59:49.050 [info] <0.296.0> 
rabbit2    |  Starting RabbitMQ 3.7.16 on Erlang 22.0.7
...
rabbit2    | 2019-08-15 01:59:49.476 [info] <0.447.0> Mirrored queue 'ha.queue1' in vhost '/': Adding mirror on node [email protected]: <8837.647.0>
rabbit2    | 2019-08-15 01:59:49.485 [info] <0.447.0> Mirrored queue 'ha.queue1' in vhost '/': Synchronising: 0 messages to synchronise
rabbit2    | 2019-08-15 01:59:49.485 [info] <0.447.0> Mirrored queue 'ha.queue1' in vhost '/': Synchronising: batch size: 4096
rabbit2    | 2019-08-15 01:59:49.486 [info] <0.296.0> Running boot step load_definitions defined by app rabbitmq_management
rabbit2    | 2019-08-15 01:59:49.486 [info] <0.466.0> Mirrored queue 'ha.queue1' in vhost '/': Synchronising: all slaves already synced
rabbit2    | 2019-08-15 01:59:49.486 [info] <0.296.0> Applying definitions from /etc/rabbitmq/definitions.json
...
rabbit1    | 2019-08-15 01:59:49.500 [info] <0.385.0> rabbit on node [email protected] up
rabbit2    | 2019-08-15 01:59:49.500 [info] <0.398.0> rabbit on node [email protected] up

Verify Setup

After browsing to the admin GUI at http://localhost:15672 and logging in with admin/admin you should see the following:

[Screenshots: nodes shown in the overview screen, the example mirrored queue, and the mirrored queue details]

Client code

The client code setup is slightly different when connecting to mirrored queues: you’ll want to indicate to the RabbitMQ/AMQP client library that there are multiple cluster nodes so that, in the event of a failure, it can attempt to connect to another node.

The shuffling of the node list in both code snippets is there to distribute clients across nodes. Keep in mind that the node where the queue was declared (i.e. the queue master) will still handle all message routing for that queue.

Python
#!/usr/bin/env python3

import pika
import random

creds = pika.PlainCredentials('admin', 'admin')

params = [
           pika.ConnectionParameters(host='localhost', port=5672, credentials=creds),
           pika.ConnectionParameters(host='localhost', port=5673, credentials=creds),
         ]

random.shuffle(params)

connection = pika.BlockingConnection(params)

channel = connection.channel()

# Publish message
channel.basic_publish(
                       exchange='',
                       routing_key='ha.queue1',
                       body='Test message'
                     )
# Receive message
resp_get_ok, resp_props, resp_body = channel.basic_get(queue='ha.queue1', auto_ack=True)
print(resp_body)

channel.close()
connection.close()

Java
import com.rabbitmq.client.ConnectionFactory;
import com.rabbitmq.client.Connection;
import com.rabbitmq.client.Channel;
import com.rabbitmq.client.Address;
import com.rabbitmq.client.AMQP.BasicProperties;
import com.rabbitmq.client.GetResponse;

import java.util.Arrays;
import java.util.List;
import java.util.Collections;

public class RabbitMQTest{

    public static void main(String[] argv) throws Exception {

        List<Address> nodes = Arrays.asList(
                                 new Address("localhost", 5672),
                                 new Address("localhost", 5673)
                              );

        Collections.shuffle(nodes);

        ConnectionFactory factory = new ConnectionFactory();
        factory.setUsername("admin");
        factory.setPassword("admin");

        Connection connection = factory.newConnection(nodes);

        Channel channel = connection.createChannel();

        // Publish message
        String exch = "";
        BasicProperties props = null;
        String routing_key = "ha.queue1";
        String msg = "Test message";

        channel.basicPublish(exch, routing_key, props, msg.getBytes());

        // Receive message
        String queue = "ha.queue1";
        boolean autoAck = true;

        GetResponse response = channel.basicGet(queue, autoAck);

        byte[] body = response.getBody();
        System.out.println(new String(body));

        channel.close();
        connection.close();
    }
}
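
To see the failover behavior in action, one quick exercise (a sketch, assuming the docker-compose setup above and that the Python snippet has been saved as a hypothetical client.py) is to stop the node currently holding the queue master and re-run a client; it should connect to the surviving node, and the mirrored queue should keep serving messages:

$ docker stop rabbit1      # take down the original queue master
$ python3 client.py        # hypothetical file containing the Python snippet above
$ docker start rabbit1     # bring the node back; it rejoins the cluster and re-mirrors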

References

  1. What is Replicated?: https://www.rabbitmq.com/clustering.html#overview-what-is-replicated
  2. Clustering Guide: https://www.rabbitmq.com/clustering.html
  3. Cluster Formation and Peer Discovery: https://www.rabbitmq.com/cluster-formation.html
  4. Race Conditions During Initial Cluster Formation: https://www.rabbitmq.com/cluster-formation.html#initial-formation-race-condition
  5. Highly Available (Mirrored) Queues: https://www.rabbitmq.com/ha.html
  6. Loading Definitions (Schema) at Startup: https://www.rabbitmq.com/management.html#load-definitions

Guaranteed Topic Delivery using ActiveMQ Virtual Destinations

Background

The original JMS spec first arrived in 2001 with JSR 914. At the time, several enterprise messaging systems were already widely available; however, each had its own unique features and mechanisms, which required software that wanted to talk on a given messaging bus to be tightly coupled to the specific messaging system implementation in use.

Given that “tightly coupled” is something you want to avoid in your enterprise systems, there was a push to abstract these implementation-specific aspects away from the messaging client code. To address this, the JMS spec was developed to define a standardized, consistent programming interface that would work across different JMS-provider implementations. By its abstracted nature, the spec could only define the lowest common denominator of messaging constructs so that they could apply to (or be adapted for) the widest number of messaging systems. JMS therefore defined two core message delivery mechanisms (also referred to as destinations): queues and topics.

  • Queues are used for point-to-point messaging (1-to-1), where a message sent to a given queue will only be delivered to a single consumer.

[Image: point-to-point queue delivery]

  • Topics on the other hand are used for pub/sub messaging (1-to-many), where a message sent to a topic will be delivered to all subscribers currently listening to that topic.

[Image: pub/sub topic delivery]

Beyond the producer/consumer topology differences between the two, there is another key aspect to consider: time dependency.

Time Dependency

A message sent to a queue will remain on that queue until a consumer removes it. In other words, a queue consumer can be brought online and all messages that were sent to the queue prior to the consumer coming online will then be delivered to it.

However, as hinted above, by default subscribers need to have an active subscription at the time a message is published in order to receive it. Any messages published while the subscriber is not listening will not be delivered to that subscriber. There is a slight exception to this rule, durable subscriptions, which allow subscribers to receive all messages from the time they subscribe until they unsubscribe, even if the subscriber was offline when a message was sent.
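
For those who have not used them, a durable subscription in plain JMS looks roughly like the following sketch (the broker URL, client ID, topic and subscription names are illustrative assumptions, not values from this post’s demo):

import javax.jms.*;
import org.apache.activemq.ActiveMQConnectionFactory;

public class DurableSubscriberSketch {
    public static void main(String[] args) throws Exception {
        ConnectionFactory factory = new ActiveMQConnectionFactory("tcp://localhost:61616");
        Connection connection = factory.createConnection();
        connection.setClientID("reports-app");   // a stable client ID is required for durable subscriptions
        connection.start();

        Session session = connection.createSession(false, Session.AUTO_ACKNOWLEDGE);
        Topic topic = session.createTopic("someTopic");

        // Messages published after this subscription is first created are retained
        // by the broker while "reports-app"/"reports-sub" is offline
        MessageConsumer subscriber = session.createDurableSubscriber(topic, "reports-sub");

        Message msg = subscriber.receive(5000);   // wait up to 5 seconds for a message
        if (msg instanceof TextMessage) {
            System.out.println(((TextMessage) msg).getText());
        }

        connection.close();
    }
}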

There are practical limitations to durable subscriptions which reduce the number of use cases where they are sufficient. One of the biggest constraints is that durable subscriptions don’t cover messages that were sent before the subscriber created the durable subscription. For example, if the sending system starts broadcasting messages at 7:00am and the receiving system registers a durable subscriber at 7:30am, all messages produced between 7:00am and 7:30am will never be delivered to the subscriber. Although there are some situations where this type of operation is sufficient, the more common requirement is that the receiving system get ALL messages from the sending system, whether the receiving system was alive at the time or not. What is needed is a pub/sub model with the delivery guarantees that queues provide. This is where ActiveMQ Virtual Destinations can help.

Virtual Destinations

ActiveMQ Virtual Destinations are a way of mapping logical destinations to one or more actual (also called physical) destinations. There are a couple of types of virtual destinations available, but the one we will focus on is referred to as a Composite Topic. This particular virtual destination behaves as a topic for publishers, but we can map that topic to a series of queues (and/or other topics) in order to achieve the particular delivery characteristics we are shooting for.

Side note: I did not use the phrase “virtual topic” above (which would seem to have been a better word choice) because “virtual topic” denotes a specific kind of ActiveMQ virtual destination! Virtual topics are similar to composite topics, except that virtual topics are simplified virtual destinations that can be created dynamically by messaging clients based on a naming convention (i.e. publishing to the ActiveMQ topic “VirtualTopic.FOO” will create a dynamic, virtual topic).

Composite Topics

Composite Topics are set up in the ActiveMQ broker configuration and allow for a great deal of flexibility when designing a messaging architecture.

Both of these aspects are key:

  • We want the flexibility to map a single topic to multiple queues, and
  • We want to define this setup in the broker’s configuration file so that the broker creates the physical queues on startup and can begin forwarding messages sent to the topic to them.

[Image: composite topic forwarding to multiple queues]

This is what it looks like inside conf/activemq.xml:

<broker ...>
...
    <!-- Optional: Pre-define topic/queues -->
    <destinations>
       <topic physicalName="myLogicalTopic" />
       <queue physicalName="queueA" />
       <queue physicalName="queueB" />
    </destinations>

    <destinationInterceptors>
        <virtualDestinationInterceptor>
            <virtualDestinations>
                <compositeTopic name="myLogicalTopic">
                    <forwardTo>
                        <queue physicalName="queueA"/>
                        <queue physicalName="queueB"/>
                        <queue physicalName="queueC"/> <!-- This queue was not defined above to 
                                                            demonstrate that ActiveMQ will create 
                                                            a target destination on the first message -->
                    </forwardTo>
                </compositeTopic>
            </virtualDestinations>
        </virtualDestinationInterceptor>
    </destinationInterceptors>
</broker>

Producer/Consumer Setup

With the broker squared away, the only thing left is to add the appropriate code to the producer and consumer programs. Fortunately, since the broker is doing the heavy lifting of forwarding messages behind the scenes, we can simply set up producers to publish to the topic and each consumer to connect to its assigned queue as you normally would. From the messaging client’s perspective there is nothing special or exotic that needs to be additionally configured.
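
As a rough sketch of what that client side could look like with the plain JMS API (the broker URL is an assumption; the destination names match the broker configuration above):

import javax.jms.*;
import org.apache.activemq.ActiveMQConnectionFactory;

public class CompositeTopicSketch {
    public static void main(String[] args) throws Exception {
        // Broker URL is an assumption; adjust to your environment
        ConnectionFactory factory = new ActiveMQConnectionFactory("tcp://localhost:61616");
        Connection connection = factory.createConnection();
        connection.start();
        Session session = connection.createSession(false, Session.AUTO_ACKNOWLEDGE);

        // The producer publishes to the logical (composite) topic
        MessageProducer producer = session.createProducer(session.createTopic("myLogicalTopic"));
        producer.send(session.createTextMessage("Test message"));

        // A consumer reads from one of the physical queues the broker forwards to
        MessageConsumer consumer = session.createConsumer(session.createQueue("queueA"));
        TextMessage received = (TextMessage) consumer.receive(5000);
        System.out.println(received != null ? received.getText() : "no message received");

        connection.close();
    }
}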

Trade-offs

Astute readers may have already guessed that there are some trade-offs involved in going this route. The most obvious one is that, in the way we’ve configured it, each receiving system requires its own queue, effectively losing the ability for subscribers to register dynamically. In practice this isn’t usually much of an issue since a) the number of receiving systems is normally known and limited, and b) we could easily have added a physical topic to the mapping so that messages would also be forwarded to a standard topic. In addition, placing messages in queues rather than, say, durable subscriptions makes the operation much more visible and easier to work with, and even allows receiving systems to implement load balancing.

Conclusion

Using virtual destinations the producer can enjoy the simplicity of a pub/sub model with the assurance that messages will be waiting for consumers to pull them from the broker when they are up and ready. Though any robust messaging architecture will require more design, planning, and configuration than we’ve touched on here, virtual destinations are an approach that may greatly simplify some otherwise very difficult use-cases.

Simple Time Series Graphs With Gnuplot

Introduction

Gnuplot is a feature-rich command-line graphing utility available for Windows, Linux and Mac OS X. Though capable of generating much more advanced formula-based plots, it’s also very handy for producing quick, ad-hoc time series graphs.

The Test Dataset

For our example we’ll use the following basic dataset, saved to the file ‘mydata.txt’:

2019-02-01  15
2019-03-01   8
2019-04-01  16
2019-05-01  20
2019-06-01  14

Gnuplot Commands

These commands were tested with gnuplot 5.0 patchlevel 5

set xdata time                           # Indicate that x-axis values are time values
set timefmt "%Y-%m-%d"                   # Indicate the pattern the time values will be in
set format x "%m/%y"                     # Set how the dates will be displayed on the plot

set xrange ["2019-01-01":"2019-12-31"]   # Set x-axis range of values
set yrange [0:30]                        # Set y-axis range of values

set key off                              # Turn off graph legend
set xtics rotate by -45                  # Rotate dates on x-axis 45deg for cleaner display
set title 'Squirrels Spotted'            # Set graph title

set terminal jpeg                        # Set the output format to jpeg
set output 'output.jpg'                  # Set output file to output.jpg

plot 'mydata.txt' using 1:2 with linespoints linetype 6 linewidth 2

Running Gnuplot

Although you can paste the commands above directly into gnuplot, it usually makes sense to put them in their own file for future reuse and reference. If we put them into the file ‘gpcommands.txt’ then we just need to run:

$ gnuplot < gpcommands.txt

This will generate the following plot:

[Image: time series gnuplot graph]
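
For quick one-off checks, the same commands can also be passed inline with gnuplot’s -e option instead of keeping a separate command file (a sketch; the jpeg terminal may not be available in every gnuplot build):

$ gnuplot -e "set xdata time; set timefmt '%Y-%m-%d'; set format x '%m/%y'; \
              set key off; set terminal jpeg; set output 'output.jpg'; \
              plot 'mydata.txt' using 1:2 with linespoints"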

Final Thoughts

Although there are other graphing libraries and applications that may plot time series graphs in a more visually appealing way, few if any match gnuplot’s efficiency. More often than not you’ll just need something that visualizes a set of data points that’s ‘good enough’, and for that reason gnuplot is a very effective tool to keep in mind.

3-way Disk Mirrors With ZFSOnLinux

Background

ZFS is a member of the newer generation of filesystems that include advanced features beyond simple file storage. Its capabilities are quite extensive, covering a wide range of pain points encountered with previous filesystems. The Wikipedia page details them all nicely, but for the purpose of this post we will be focusing on its ability to create N-way sets of disk mirrors.

Traditionally, mirrored disk sets in Linux and other operating systems have been limited to two devices (note: devices in this context could be disks, partitions or even other RAID groups, as is the case in RAID 10 setups). While mirroring has the benefit over other RAID levels that each mirrored device contains a complete copy of the data, the two-device limit became inadequate as disk sizes ballooned. In the age of multi-TB drives, simply rebuilding a degraded mirrored array could actually cause the surviving device to fail, eliminating the very redundancy one was expecting.

ZFS addresses this particular problem in several ways: data checksums, self-healing and smart resilvering, rather than blindly rebuilding full array members even if only 1% of the disk space is in use.

And to top it off, ZFS also includes the ability to specify any number of devices in a mirrored set. In this post we will create a sample 3-way mirrored set using loopback devices and run a series of test scenarios against it.

For those unfamiliar, a loopback device allows you to expose a file as a block device. Using loopback devices we can create file-based “disks” that we can use as mirror array members in our test.

Testbed Setup

For this exercise I am using a fresh Debian Jessie (8.1) x86_64 vanilla system installed into a KVM/QEMU virtual machine. The kernel currently shipped with Jessie is 3.16.0-4-amd64 and the ZFSOnLinux package currently available for Debian is 0.6.4-1.2-1.

It should be especially noted that ZFS should only be used on 64-bit hosts.

Installation

Following the Debian instructions on the ZFSOnLinux website, the following commands were run:

$ su -
# apt-get install lsb-release
# wget http://archive.zfsonlinux.org/debian/pool/main/z/zfsonlinux/zfsonlinux_6_all.deb
# dpkg -i zfsonlinux_6_all.deb
# apt-get update
# apt-get install debian-zfs

This will add /etc/apt/sources.list.d/zfsonlinux.list, install the software and dependencies, then proceed to build the ZFS/SPL kernel modules.

Preparing the loopback devices

Finding the first available loopback device

# losetup -a

If you see anything listed, change 1 2 3 in the commands below to start with the next available number and increment appropriately.

Creating the files

# for i in 1 2 3; do dd if=/dev/zero of=/tmp/zfsdisk_$i bs=1M count=250; done
250+0 records in
250+0 records out
262144000 bytes (262 MB) copied, 0.371318 s, 706 MB/s
250+0 records in
250+0 records out
262144000 bytes (262 MB) copied, 0.614396 s, 427 MB/s
250+0 records in
250+0 records out
262144000 bytes (262 MB) copied, 0.824889 s, 318 MB/s

Set up the loopback mappings

# for i in 1 2 3; do losetup /dev/loop$i /tmp/zfsdisk_$i; done

Verify the mappings

# losetup -a
/dev/loop1: [65025]:399320 (/tmp/zfsdisk_1)
/dev/loop2: [65025]:399323 (/tmp/zfsdisk_2)
/dev/loop3: [65025]:399324 (/tmp/zfsdisk_3)

Create the ZFS 3-Way Mirror

# zpool \
    create \
    -o ashift=12 \
    -m /mnt/zfs/mymirror \
    mymirror \
    mirror \
    /dev/loop1 \
    /dev/loop2 \
    /dev/loop3

A couple of things to note:

  1.  -o ashift=12
    This tells ZFS to align along 4KB sectors. It is generally a good idea to always set this option since modern disks use 4KB sectors, and once a pool has been created with a given sector size it cannot be changed later. The net result is that if you created a pool with 512B sectors using, let’s say, 1TB drives, you couldn’t later change the sector size to 4KB when adding 3TB drives (resulting in abysmal performance on the newer drives). So as a rule of thumb, always set -o ashift=12 (a quick way to verify the value is shown after this list).
  2.  -m /mnt/zfs/mymirror
    This indicates where this pool should be mounted.
  3.  /dev/loopN
    The devices that make up the mirrored set. If these were physical disks you would likely want to use the appropriate disk symlinks under /dev/disk/by-id/.
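
As referenced in the ashift note above, one way to confirm the sector alignment a pool was created with is to dump the cached pool configuration using the zdb utility that ships with ZFS and grep for ashift (a sketch; the exact output layout may vary by version, but for the pool created above it should report a value of 12 for the mirror vdev):

# zdb -C mymirror | grep ashift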

Verify The ZFS Pool

# zpool list
NAME       SIZE  ALLOC   FREE  EXPANDSZ   FRAG    CAP  DEDUP  HEALTH  ALTROOT
mymirror   244M   408K   244M         -     0%     0%  1.00x  ONLINE  -
# zpool status
  pool: mymirror
 state: ONLINE
  scan: none requested
config:

    NAME        STATE     READ WRITE CKSUM
    mymirror    ONLINE       0     0     0
      mirror-0  ONLINE       0     0     0
        loop1   ONLINE       0     0     0
        loop2   ONLINE       0     0     0
        loop3   ONLINE       0     0     0

errors: No known data errors

Poking The Bear

So now that we have our test 3-way mirror running, lets test the resiliency.

!!! WARNING NOTE: ALTHOUGH ZFS IS BUILT TO RECOVER FROM ERRORS, ONLY RUN THE FOLLOWING COMMANDS IN A TEST ENVIRONMENT OTHERWISE YOU WILL SUFFER DATA LOSS!!!

Setting The Stage

Create random file that takes up ~50% of disk space:

# dd if=/dev/urandom of=/mnt/zfs/mymirror/test.dat bs=1M count=125
 125+0 records in
 125+0 records out
 131072000 bytes (131 MB) copied, 16.8152 s, 7.8 MB/s
# zpool list
 NAME       SIZE  ALLOC   FREE  EXPANDSZ   FRAG    CAP  DEDUP  HEALTH  ALTROOT
 mymirror   244M   126M   118M         -    20%    51%  1.00x  ONLINE  -
# zpool scrub mymirror
# zpool status
 pool: mymirror
 state: ONLINE
 scan: scrub repaired 0 in 0h0m with 0 errors on Sun Jul 19 20:20:12 2015
 config:

NAME        STATE     READ WRITE CKSUM
 mymirror    ONLINE       0     0     0
   mirror-0  ONLINE       0     0     0
     loop1   ONLINE       0     0     0
     loop2   ONLINE       0     0     0
     loop3   ONLINE       0     0     0

Complete Corruption Of A Single Disk

Wipe the disk with all ones (to differentiate from the initialization above with /dev/zero and to demonstrate how ZFS resilvers):

# dd if=/dev/zero bs=1M count=250 | tr '\000' '\001' > /tmp/zfsdisk_3
 250+0 records in
 250+0 records out
 262144000 bytes (262 MB) copied, 0.708197 s, 370 MB/s

This will wipe out the ZFS disk label among everything else, simulating the state where a disk is alive but corrupt.

# zpool scrub mymirror
# zpool list
 NAME       SIZE  ALLOC   FREE  EXPANDSZ   FRAG    CAP  DEDUP  HEALTH  ALTROOT
 mymirror   244M   127M   117M         -    21%    51%  1.00x  ONLINE  -

# zpool status
 pool: mymirror
 state: ONLINE
 status: One or more devices could not be used because the label is missing or
 invalid.  Sufficient replicas exist for the pool to continue
 functioning in a degraded state.
 action: Replace the device using 'zpool replace'.
 see: http://zfsonlinux.org/msg/ZFS-8000-4J
 scan: scrub repaired 0 in 0h0m with 0 errors on Sun Jul 19 20:39:45 2015
 config:

NAME        STATE     READ WRITE CKSUM
 mymirror    ONLINE       0     0     0
   mirror-0  ONLINE       0     0     0
     loop1   ONLINE       0     0     0
     loop2   ONLINE       0     0     0
     loop3   UNAVAIL      0     0     0  corrupted data

errors: No known data errors

Replacing the disk:

# zpool replace -o ashift=12 mymirror loop3
# zpool status
 pool: mymirror
 state: ONLINE
 scan: resilvered 126M in 0h0m with 0 errors on Sun Jul 19 20:42:51 2015
 config:

NAME        STATE     READ WRITE CKSUM
 mymirror    ONLINE       0     0     0
   mirror-0  ONLINE       0     0     0
     loop1   ONLINE       0     0     0
     loop2   ONLINE       0     0     0
     loop3   ONLINE       0     0     0

Note that only 126MB needed to be resilvered. ZFS will only synchronize blocks that are in use, not empty blocks and not blocks that already match on the new drive (which is why we corrupted the disk with all ones rather than zeros).

Complete Corruption Of 2 Out Of 3 Disks

Check the file first:

# md5sum /mnt/zfs/mymirror/test.dat 
c253c4c5421d793f4fefe34af5a5ecc1  /mnt/zfs/mymirror/test.dat

Corrupt disks 2 and 3:

# dd if=/dev/zero bs=1M count=250 | tr '\000' '\001' > /tmp/zfsdisk_2
250+0 records in
250+0 records out
262144000 bytes (262 MB) copied, 0.660485 s, 397 MB/s
# dd if=/dev/zero bs=1M count=250 | tr '\000' '\001' > /tmp/zfsdisk_3
250+0 records in
250+0 records out
262144000 bytes (262 MB) copied, 0.718505 s, 365 MB/s
# zpool scrub mymirror
# zpool status
  pool: mymirror
 state: ONLINE
status: One or more devices could not be used because the label is missing or
    invalid.  Sufficient replicas exist for the pool to continue
    functioning in a degraded state.
action: Replace the device using 'zpool replace'.
   see: http://zfsonlinux.org/msg/ZFS-8000-4J
  scan: scrub repaired 0 in 0h0m with 0 errors on Sun Jul 19 22:39:05 2015
config:

    NAME        STATE     READ WRITE CKSUM
    mymirror    ONLINE       0     0     0
      mirror-0  ONLINE       0     0     0
        loop1   ONLINE       0     0     0
        loop2   UNAVAIL      0     0     0  corrupted data
        loop3   UNAVAIL      0     0     0  corrupted data

errors: No known data errors
# md5sum /mnt/zfs/mymirror/test.dat 
c253c4c5421d793f4fefe34af5a5ecc1  /mnt/zfs/mymirror/test.dat

The file still looks good. Now replace both drives (done in the following way so we can catch the replacement in progress):

# zpool replace -o ashift=12 mymirror loop2 & \
  zpool replace -o ashift=12 mymirror loop3 & \
  sleep 1 && \
    zpool status &

state: ONLINE
 scan: resilvered 127M in 0h0m with 0 errors on Sun Jul 19 22:45:17 2015
config:

    NAME             STATE     READ WRITE CKSUM
    mymirror         ONLINE       0     0     0
      mirror-0       ONLINE       0     0     0
        loop1        ONLINE       0     0     0
        replacing-1  UNAVAIL      0     0     0
          old        UNAVAIL      0     0     0  corrupted data
          loop2      ONLINE       0     0     0
        replacing-2  UNAVAIL      0     0     0
          old        UNAVAIL      0     0     0  corrupted data
          loop3      ONLINE       0     0     0

errors: No known data errors

And finally, both drives are replaced:

# zpool status
  pool: mymirror
 state: ONLINE
  scan: resilvered 127M in 0h0m with 0 errors on Sun Jul 19 22:45:17 2015
config:

    NAME        STATE     READ WRITE CKSUM
    mymirror    ONLINE       0     0     0
      mirror-0  ONLINE       0     0     0
        loop1   ONLINE       0     0     0
        loop2   ONLINE       0     0     0
        loop3   ONLINE       0     0     0

errors: No known data errors

And finally check the file:

# md5sum /mnt/zfs/mymirror/test.dat 
c253c4c5421d793f4fefe34af5a5ecc1  /mnt/zfs/mymirror/test.dat

Corrupting A File

In this test we’ll inject bad data into the file using the zinject testing tool included with ZFS.

# zinject -t data -f 1 /mnt/zfs/mymirror/test.dat
Added handler 5 with the following properties:
  pool: mymirror
objset: 21
object: 24
  type: 0
 level: 0
 range: all
# zpool scrub mymirror
# zpool status
  pool: mymirror
 state: ONLINE
  scan: scrub in progress since Sun Jul 19 21:54:23 2015
    88.4M scanned out of 127M at 3.84M/s, 0h0m to go
    2.12M repaired, 69.51% done
config:

    NAME        STATE     READ WRITE CKSUM
    mymirror    ONLINE       0     0     0
      mirror-0  ONLINE       0     0     0
        loop1   ONLINE       0     0     0  (repairing)
        loop2   ONLINE       0     0     0  (repairing)
        loop3   ONLINE       0     0     0  (repairing)

Bad data was found and is in the process of being repaired.

# zpool status
  pool: mymirror
 state: ONLINE
  scan: scrub repaired 3M in 0h0m with 0 errors on Sun Jul 19 21:54:55 2015
config:

    NAME        STATE     READ WRITE CKSUM
    mymirror    ONLINE       0     0     0
      mirror-0  ONLINE       0     0     0
        loop1   ONLINE       0     0     0
        loop2   ONLINE       0     0     0
        loop3   ONLINE       0     0     0

errors: No known data errors

Finished repairing 3M of bad data.

Cleanup: If you are testing this yourself, remember to cancel the zinject handler afterwards:

# zinject 
 ID  POOL             OBJSET  OBJECT  TYPE      LVL  RANGE          
---  ---------------  ------  ------  --------  ---  ---------------
  5  mymirror         21      24      -           0  all
# zinject -c 5
removed handler 5

Partial Drive Corruption

Inject random bytes into one of the files backing a loopback device (a mirrored array member) with dd:

# dd if=/dev/urandom of=/tmp/zfsdisk_3 bs=1K count=10 seek=200000
10+0 records in
10+0 records out
10240 bytes (10 kB) copied, 0.00324266 s, 3.2 MB/s
# zpool scrub mymirror
# zpool status
  pool: mymirror
 state: ONLINE
status: One or more devices has experienced an unrecoverable error.  An
    attempt was made to correct the error.  Applications are unaffected.
action: Determine if the device needs to be replaced, and clear the errors
    using 'zpool clear' or replace the device with 'zpool replace'.
   see: http://zfsonlinux.org/msg/ZFS-8000-9P
  scan: scrub in progress since Sun Jul 19 22:08:26 2015
    127M scanned out of 127M at 31.8M/s, 0h0m to go
    24.8M repaired, 99.91% done
config:

    NAME        STATE     READ WRITE CKSUM
    mymirror    ONLINE       0     0     0
      mirror-0  ONLINE       0     0     0
        loop1   ONLINE       0     0     0
        loop2   ONLINE       0     0     0
        loop3   ONLINE       0     0   260  (repairing)

errors: No known data errors

Corruption was found and is being fixed.

# zpool status
  pool: mymirror
 state: ONLINE
status: One or more devices has experienced an unrecoverable error.  An
    attempt was made to correct the error.  Applications are unaffected.
action: Determine if the device needs to be replaced, and clear the errors
    using 'zpool clear' or replace the device with 'zpool replace'.
   see: http://zfsonlinux.org/msg/ZFS-8000-9P
  scan: scrub repaired 24.8M in 0h0m with 0 errors on Sun Jul 19 22:08:30 2015
config:

    NAME        STATE     READ WRITE CKSUM
    mymirror    ONLINE       0     0     0
      mirror-0  ONLINE       0     0     0
        loop1   ONLINE       0     0     0
        loop2   ONLINE       0     0     0
        loop3   ONLINE       0     0   260

errors: No known data errors

24.8M of drive corruption was fixed.
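
When you are finished experimenting, a cleanup sketch for tearing down the throwaway pool and loopback devices created above (only run this against the test pool):

# zpool destroy mymirror
# for i in 1 2 3; do losetup -d /dev/loop$i; done
# rm /tmp/zfsdisk_1 /tmp/zfsdisk_2 /tmp/zfsdisk_3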

Conclusion

Setting up 3-way disk mirrors using ZFS provides robust error detection and recovery from a wide variety of damage scenarios. Its ability to target healing at only the affected data allows it to resilver efficiently and recover faster than traditional array configurations.

 

PostgreSQL Upserts

Background

SQL upserts are a combination of an INSERT and/or UPDATE in a single database operation, which allows rows to be added or modified in an atomic, concurrency-safe way.

For a time, upserts were somewhat of a sensitive subject in PostgreSQL circles. Several other DBMSs had some form of support for upserts/conflict resolution (albeit often with limited assurances), and refugees from those other DBMSs often got hung up when porting existing queries that relied heavily on this type of functionality.

Prior to PostgreSQL 9.5 there were various (and less-than-ideal) approaches to work around the lack of native support. The PostgreSQL manual provided a stored procedure example which:

  • attempts to update an existing row
  • detects if no row was updated
  • attempts to insert a new row instead
  • catches and discards duplicate key exceptions
  • repeats the steps above until a row has been either updated or inserted

Though this approach works as intended (and is more or less applicable to other DBMSs), it has a certain smell to it. Ignoring the performance aspects for a moment, one of the bigger practical pain points of this approach is that each table you might want to upsert into needs its own corresponding stored procedure.
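
For reference, the shape of that stored procedure approach is roughly the following (a sketch modeled on the manual’s merge_db() example; the table db(a, b) is hypothetical):

CREATE FUNCTION merge_db(key INT, data TEXT) RETURNS VOID AS
$$
BEGIN
    LOOP
        -- First, try to update an existing row
        UPDATE db SET b = data WHERE a = key;
        IF found THEN
            RETURN;
        END IF;
        -- No row was updated, so try to insert one; if another session
        -- inserted the same key concurrently, swallow the error and retry
        BEGIN
            INSERT INTO db (a, b) VALUES (key, data);
            RETURN;
        EXCEPTION WHEN unique_violation THEN
            -- loop back and try the UPDATE again
        END;
    END LOOP;
END;
$$
LANGUAGE plpgsql;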

However, thanks to the work of dedicated PostgreSQL developers, native support finally made it into the software starting with the 9.5 release. The INSERT statement was expanded with an ON CONFLICT clause that gives it the ability to recover from unique constraint violations.

Demo

Test system setup

If you don’t already have a PostgreSQL testbed, you can use the instructions listed here to quickly start up a PostgreSQL Docker container.

Create test table

CREATE TABLE upsert_test1 (
     name       TEXT PRIMARY KEY,
     fav_color  TEXT
);

Insert row using upsert

INSERT INTO upsert_test1 (name, fav_color) 
                  VALUES ('Sally', 'blue')
         ON CONFLICT (name) 
            DO UPDATE SET fav_color = 'blue';

Confirm row values

SELECT * FROM upsert_test1;

 name  | fav_color 
-------+-----------
 Sally | blue

Update row using same upsert statement

INSERT INTO upsert_test1 (name, fav_color) 
                  VALUES ('Sally', 'green')
         ON CONFLICT (name) 
            DO UPDATE SET fav_color = 'green';

Confirm row values

SELECT * FROM upsert_test1;

 name  | fav_color 
-------+-----------
 Sally | green
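
A variation worth knowing (not used in the demo above) is the EXCLUDED pseudo-table, which refers to the row proposed for insertion and saves repeating the value in the UPDATE clause:

INSERT INTO upsert_test1 (name, fav_color)
                  VALUES ('Sally', 'green')
         ON CONFLICT (name)
            DO UPDATE SET fav_color = EXCLUDED.fav_color;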

Conclusion

As you can see above, the same upsert statement was able to insert a new row or update an existing row as the circumstances dictated. Though you shouldn’t use upserts recklessly, as there is still a performance hit, the additional flexibility they provide makes migrating from other databases much easier.