Pacemaker cluster

Pacemaker is a piece of software for running a cluster of computers. In most deployments, I believe, it uses corosync to provide cluster communication and membership management. The command line tool on Ubuntu Linux is pcs. pcs status tells you what the cluster is doing and which nodes are currently members. pcs cluster start node2 starts node2, even if the command is executed on another server in the cluster.

What makes Pacemaker a bit different from other cluster management systems is that it handles those awkward mutual exclusion scenarios quite well. You may have heard of Keepalived, which is used to provide high availability, and Kubernetes is a popular cluster platform today. But neither can safely handle resources that require mutual exclusion. It’s perfectly possible for a Kubernetes node running a MariaDB instance to freeze, at which point Kubernetes starts up the same instance on another server, only for the original to suddenly come alive again.

Pacemaker supports fencing, which makes sure that something that held a unique resource cannot run it again without approval from the cluster. This is typically done by killing the previous runtime (virtual machine, physical server or whatever it may be) and waiting for a response message that assures us the previous runtime isn’t going to do something unexpected.

For physical hardware, IPMI is typically used for fencing: it cuts the power to all the parts of the computer that aren’t the IPMI board itself. The IPMI unit has to keep operating because fencing requires both the kill action and some verification that the kill action worked. A common method is to let IPMI reset the power to the server; the verification is then simply the server waking up and asking the cluster software if it can join again.

If implemented properly, fencing should give complete assurance that two instances of a database server won’t accept writes concurrently, and that two instances of a file system won’t write to shared storage without being synchronized. But that means that if we lose contact with both a server and its IPMI module, the node can’t be fenced and we can’t safely bring up a new instance of whatever software failed along with the server. A human being can typically resolve the situation, but that may take time.
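The rule in the paragraph above can be boiled down to a few lines of shell. This is only a sketch of the idea, not how Pacemaker implements it: recover_after_fence and the two commands passed to it are hypothetical names made up for illustration.

```shell
# Sketch of "no confirmed fence, no recovery".
# Both arguments are hypothetical commands supplied by the caller.
recover_after_fence() {
  local fence_cmd="$1"   # must exit 0 only when the old node is confirmed dead
  local start_cmd="$2"   # starts the replacement instance
  if "$fence_cmd"; then
    "$start_cmd"
  else
    echo "fencing not confirmed, refusing to start a second instance" >&2
    return 1
  fi
}
```

If the fence command cannot confirm the kill (for example because the IPMI module is unreachable), the replacement is never started and a human has to step in.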

Note that it is highly desirable to have clusters that do not rely on mutual exclusion. Ceph is a storage system with no need for fencing, as it is quorum based (a majority of nodes have to agree on things, leaving some wonky node that freezes for a minute every now and then out in the cold) and each node has its own local storage (so none of that troublesome shared storage). Similarly, Galera is an addition to MySQL-based databases that allows for multiple masters simultaneously, thanks to a system of mutual certification of each proposed transaction so that no two changes to the contents of the database are ever in conflict.
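The majority rule mentioned above is easy to state in code. A minimal sketch, assuming one vote per node; has_quorum is a made-up helper name, not part of Ceph, corosync or Pacemaker.

```shell
# Majority rule: a partition has quorum only if it holds more than half
# of all votes. One vote per node is assumed.
has_quorum() {
  local total_nodes="$1"
  local reachable_nodes="$2"
  local needed=$(( total_nodes / 2 + 1 ))
  [ "$reachable_nodes" -ge "$needed" ]
}
```

With three nodes, two that can see each other keep running while a lone node does not. With four nodes, a 2/2 split leaves neither half with quorum, which is one reason odd node counts are popular.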

Install

First we need to set up corosync. I have three nodes, as specified below, and the file /etc/corosync/corosync.conf shown below needs to be the same on all three:

totem {
  version: 2
  secauth: off
  cluster_name: sveaha
  transport: udpu
}

nodelist {
  node {
        ring0_addr: pcmk1.svealiden.se
        nodeid: 1
  }
  node {
        ring0_addr: pcmk2.svealiden.se
        nodeid: 2
  }
  node {
        ring0_addr: pcmk3.svealiden.se
        nodeid: 3
  }
}

quorum {
  provider: corosync_votequorum
}

logging {
  # Log to the system log daemon. When in doubt, set to yes.
  to_syslog: yes
  to_logfile: yes
  logfile: /var/log/cluster/corosync.log
}

Then we need to generate an authkey, which by default is expected to be in /etc/corosync/authkey. It’s sufficient to generate it using this command (you don’t need to cd into any specific directory; the command knows where to place the key):

corosync-keygen

To make sure the configuration is identical on all three nodes I just used scp’s three-way copy (-3), with pcmk2 as the source:

[cjp@amdlinux /h/cjp ] $ scp -3 root@pcmk2.svealiden.se:/etc/corosync/corosync.conf root@pcmk1.svealiden.se:/etc/corosync
[cjp@amdlinux /h/cjp ] $ scp -3 root@pcmk2.svealiden.se:/etc/corosync/corosync.conf root@pcmk3.svealiden.se:/etc/corosync
[cjp@amdlinux /h/cjp ] $ scp -3 root@pcmk2.svealiden.se:/etc/corosync/authkey root@pcmk1.svealiden.se:/etc/corosync
[cjp@amdlinux /h/cjp ] $ scp -3 root@pcmk2.svealiden.se:/etc/corosync/authkey root@pcmk3.svealiden.se:/etc/corosync
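If you want to double-check that the copies really are identical, comparing checksums is one way to do it. The helper below is a sketch under the assumption that sha256sum is available; same_digest is a made-up name and the file names in the usage note are hypothetical.

```shell
# Returns 0 when every file given has the same SHA-256 digest,
# 1 when they differ, 2 when a file cannot be read.
same_digest() {
  local first=""
  local digest file
  for file in "$@"; do
    [ -r "$file" ] || return 2
    # Reading from stdin keeps the file name out of the digest line.
    digest=$(sha256sum < "$file")
    if [ -z "$first" ]; then
      first="$digest"
    elif [ "$digest" != "$first" ]; then
      return 1
    fi
  done
  return 0
}
```

For example, after fetching each node's corosync.conf into local files called pcmk1.conf, pcmk2.conf and pcmk3.conf, running same_digest pcmk1.conf pcmk2.conf pcmk3.conf succeeds only if all three match.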

After restarting corosync we can check its log:

Jul 06 17:23:21.251 [2247] pcmk1 corosync notice [MAIN ] Completed service synchronization, ready to provide service.
Jul 06 17:23:30.425 [2247] pcmk1 corosync notice [QUORUM] Sync members[1]: 1
Jul 06 17:23:30.425 [2247] pcmk1 corosync notice [TOTEM ] A new membership (1.1f) was formed. Members
Jul 06 17:23:30.426 [2247] pcmk1 corosync notice [QUORUM] Members[1]: 1
Jul 06 17:23:30.426 [2247] pcmk1 corosync notice [MAIN ] Completed service synchronization, ready to provide service.
Jul 06 17:23:34.809 [2247] pcmk1 corosync notice [QUORUM] Sync members[2]: 1 3
Jul 06 17:23:34.809 [2247] pcmk1 corosync notice [QUORUM] Sync joined[1]: 3
Jul 06 17:23:34.809 [2247] pcmk1 corosync notice [TOTEM ] A new membership (1.23) was formed. Members joined: 3
Jul 06 17:23:34.814 [2247] pcmk1 corosync notice [QUORUM] Sync members[3]: 1 2 3
Jul 06 17:23:34.814 [2247] pcmk1 corosync notice [QUORUM] Sync joined[2]: 2 3
Jul 06 17:23:34.814 [2247] pcmk1 corosync notice [TOTEM ] A new membership (1.27) was formed. Members joined: 2
Jul 06 17:23:34.821 [2247] pcmk1 corosync notice [QUORUM] This node is within the primary component and will provide service.
Jul 06 17:23:34.821 [2247] pcmk1 corosync notice [QUORUM] Members[3]: 1 2 3
Jul 06 17:23:34.821 [2247] pcmk1 corosync notice [MAIN ] Completed service synchronization, ready to provide service.

And finally we get pcs to start the cluster, and we can verify it with crm_node -l:

[root@pcmk1 ~]# pcs cluster start
Starting Cluster...
[root@pcmk1 ~]# crm_node -l
1 pcmk1.svealiden.se member
2 pcmk2.svealiden.se member
3 pcmk3.svealiden.se member

Note that I’m switching between different nodes but since this is a cluster it doesn’t really matter which node I’m using to call pcs or other cluster-related functions.
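The membership check is easy to script if you want something that can run unattended. A small sketch, assuming the output keeps the "id name state" layout shown above; count_members is a made-up helper name.

```shell
# Counts the lines of `crm_node -l` style output (read from stdin)
# whose last field is "member".
count_members() {
  awk '$NF == "member" { n++ } END { print n + 0 }'
}
```

Used as crm_node -l | count_members, it should print 3 for the cluster above when all nodes have joined.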

Set a password for the hacluster user and authorize yourself against all nodes using that user:

[root@pcmk1 ~]# pcs host auth pcmk1.svealiden.se
Username: hacluster
Password:
pcmk1.svealiden.se: Authorized
[root@pcmk1 ~]# pcs host auth pcmk2.svealiden.se
Username: hacluster
Password:
pcmk2.svealiden.se: Authorized
[root@pcmk1 ~]# pcs host auth pcmk3.svealiden.se
Username: hacluster
Password:
pcmk3.svealiden.se: Authorized
[root@pcmk1 ~]# pcs cluster status
Cluster Status:
 Cluster Summary:
   * Stack: corosync
   * Current DC: pcmk1.svealiden.se (version 2.1.0-0.2.rc1.fc34-b9ac0a9329) - partition with quorum
   * Last updated: Tue Jul  6 17:51:19 2021
   * Last change:  Tue Jul  6 17:26:31 2021 by hacluster via crmd on pcmk1.svealiden.se
   * 3 nodes configured
   * 0 resource instances configured
 Node List:
   * Node pcmk3.svealiden.se: pending
   * Online: [ pcmk1.svealiden.se pcmk2.svealiden.se ]

PCSD Status:
Warning: Some nodes are missing names in corosync.conf, those nodes were omitted
[root@pcmk1 ~]# pcs cluster start pcmk3.svealiden.se
pcmk3.svealiden.se: Starting Cluster...
[root@pcmk1 ~]# pcs cluster status
Cluster Status:
 Cluster Summary:
   * Stack: corosync
   * Current DC: pcmk1.svealiden.se (version 2.1.0-0.2.rc1.fc34-b9ac0a9329) - partition with quorum
   * Last updated: Tue Jul  6 17:51:36 2021
   * Last change:  Tue Jul  6 17:26:31 2021 by hacluster via crmd on pcmk1.svealiden.se
   * 3 nodes configured
   * 0 resource instances configured
 Node List:
   * Online: [ pcmk1.svealiden.se pcmk2.svealiden.se pcmk3.svealiden.se ]

PCSD Status:
Warning: Some nodes are missing names in corosync.conf, those nodes were omitted

It’s also good to set pcsd to start automatically (not a cluster command, so this needs to be executed on each server individually):

systemctl enable pcsd

Configuration

There is good documentation on setting up a Pacemaker cluster on the website for the Pacemaker project: https://clusterlabs.org/pacemaker/doc/en-US/Pacemaker/2.0/html/Clusters_from_Scratch/index.html

But creating somewhat elaborate configurations isn’t super-easy. Let’s look at creating a MySQL cluster. I’m no expert on this but I’ll explain it as best I can.

# pcs cluster cib sveaha.xml
# pcs -f sveaha.xml resource create db_master_ip ocf:heartbeat:IPaddr2 ip=192.168.1.67 cidr_netmask=32 op monitor interval=30s on-fail=restart
# pcs -f sveaha.xml resource create mysql ocf:heartbeat:mysql  \
   binary="/usr/bin/mysqld_safe"   config="/etc/mysql/my.cnf" \
   datadir="/var/lib/mysql"   pid="/var/lib/mysql/mysql.pid" \
   socket="/var/lib/mysql/mysql.sock"  \
   replication_user="replication_user" replication_passwd="MYPASSWORD" \
   --master \
   additional_parameters="--bind-address=0.0.0.0"   op start timeout=60s \
   op stop timeout=60s   op monitor interval=20s timeout=30s \
   on-fail=standby 
# pcs -f sveaha.xml resource clone mysql clone-max=3 clone-node-max=1
# pcs -f sveaha.xml constraint colocation add master mysql with db_master_ip score=INFINITY
# pcs cluster cib-push sveaha.xml --config

First we have pcs cluster cib sveaha.xml, which simply dumps the configuration of the cluster to an XML file called sveaha.xml. We’ll make pcs store our changes in that file for now (that is what -f sveaha.xml does in the commands that follow), and when we’re done we load it all into the running cluster as a single transaction.

Next pcs -f sveaha.xml resource create db_master_ip ocf:heartbeat:IPaddr2 ip=192.168.1.67 cidr_netmask=32 op monitor interval=30s on-fail=restart creates an IP-address 192.168.1.67 that we call db_master_ip. We’re not telling the cluster how and when to use it yet, merely that it exists and should use the IPaddr2 resource script from the ocf:heartbeat library.

Now the big one:


# pcs -f sveaha.xml resource create mysql ocf:heartbeat:mysql  \
   binary="/usr/bin/mysqld_safe"   config="/etc/mysql/my.cnf" \
   datadir="/var/lib/mysql"   pid="/var/lib/mysql/mysql.pid" \
   socket="/var/lib/mysql/mysql.sock"  \
   replication_user="replication_user" replication_passwd="MYPASSWORD" \
   --master \
   additional_parameters="--bind-address=0.0.0.0"   op start timeout=60s \
   op stop timeout=60s   op monitor interval=20s timeout=30s \
   on-fail=standby

This creates a mysql database resource named mysql based on the ocf:heartbeat:mysql resource script. We specify the path of the binary, the mysql config file, the directory that mysql should use to store the data files, what pid-file to use and what path should be used for the socket for local communication. Phew, next bit: we have to specify the replication user and the password used. The ocf:heartbeat:mysql script uses these values when turning server nodes into slaves, executing queries like “CHANGE MASTER TO”. The --master part is very important and declares that this resource shall be divided into master/slave instances. Otherwise all nodes would be the same, either masters or slaves. Finally we say that the servers should listen for traffic intended for any IP-address (--bind-address=0.0.0.0) and give some reasonable values for starting, stopping and so on.

So far we’ve basically said that these MySQL instances should be present on all nodes but we need to be more precise. That’s the next bit: pcs -f sveaha.xml resource clone mysql clone-max=3 clone-node-max=1

This states that the mysql resource which is made up of clones should only have a total of three active clones at any time and no node should have more than one clone. My cluster has three nodes so maybe I could have skipped the clone-max setting but it seemed better to include.

What about that IP-address? pcs -f sveaha.xml constraint colocation add master mysql with db_master_ip score=INFINITY

This states that the master mysql node and the db_master_ip should always be colocated on the same node, no exceptions. Had the score been 5 instead, and some other rule with score 6 said that db_master_ip should be somewhere else, then that other rule would win.

Finally, let’s state that the IP-address for the mysql master should not be started until after the master node is ready: pcs -f sveaha.xml constraint order start mysql-master then db_master_ip

I also run three DNS-servers on these nodes and they aren’t divided into master/slave so the configuration looks slightly different:

# pcs -f sveaha.xml resource create bind ocf:heartbeat:named named_user="bind" named_config="/etc/bind/named.conf"
# pcs -f sveaha.xml resource clone bind clone-max=3 clone-node-max=1
# pcs -f sveaha.xml resource create bind_service_ip ocf:heartbeat:IPaddr2 ip=192.168.1.68 cidr_netmask=32 op monitor interval=30s on-fail=restart
# pcs -f sveaha.xml constraint order start bind-clone then bind_service_ip
 Adding bind-clone bind_service_ip (kind: Mandatory) (Options: first-action=start then-action=start)

So nothing here about --master or saying that the IP-address used to access the DNS-service has to be placed on a specific node. As long as some bind-clone runs on a node Pacemaker can put the IP-address there. But no IP-address is allowed to run on multiple nodes at the same time by design.

Let’s add some fencing. I rewrote the demonstration fencing-script that uses SSH to have my server hypervisors kill the cluster nodes. It’s not super-pretty but it works: https://deref.se/wp-content/uploads/2019/11/proxmox

You will need to set up passwordless SSH to the hypervisor nodes and place the script linked above in the directory for external scripts, on Ubuntu: /usr/lib/stonith/plugins/external/

To add these fencing devices to your Pacemaker-cluster:

# pcs -f sveaha.xml stonith create proxmox_fence_pcmk1 external/proxmox proxmoxhost=pve1 vmid=117 pcmk_host_list="pacemaker1.svealiden.se"
# pcs -f sveaha.xml stonith create proxmox_fence_pcmk2 external/proxmox proxmoxhost=pve2 vmid=120 pcmk_host_list="pacemaker2.svealiden.se"
# pcs -f sveaha.xml stonith create proxmox_fence_pcmk3 external/proxmox proxmoxhost=pve3 vmid=122 pcmk_host_list="pacemaker3.svealiden.se"
# pcs -f sveaha.xml property set stonith-enabled=true
# pcs cluster cib-push sveaha.xml --config

This just says that the fencing device proxmox_fence_pcmk1, based on the script external/proxmox, shall fence the virtual machine with vmid 117 and that this has to be done on the Proxmox host pve1. The host list is what Pacemaker will look at when it tries to fence something, so the names in pcmk_host_list have to match the node names the cluster uses. So pacemaker1.svealiden.se is now fenced by proxmox_fence_pcmk1 by running a command on Proxmox host pve1 using 117 as an argument to that command. Similarly for pcmk2 and pcmk3. Finally we enable stonith (Shoot The Other Node In The Head) and load the configuration into the cluster.

I’ve been running a live MySQL and bind cluster like this for some months and by and large it works. I think I’m going to set Pacemaker to autostart itself on each node after power on though. By default it requires nodes to be started manually. One interesting and infuriating thing is that MySQL sometimes ends up with split brain requiring manual intervention. You can see if a slave is broken by running:

MariaDB [(none)]> SHOW SLAVE STATUS\G

If you see error messages about duplicate keys and the like, then you’re in trouble. I don’t understand why this happens. I’ve read through the scripts used by Pacemaker and they all seem to make sense. If you’re sure nothing important got written to the node that Pacemaker can’t bring back into functioning slave-mode, you can just delete everything in the MySQL data directory and create a blank template (the example has node3 as the master, node2 as a broken slave that can’t catch up and node1 as a functioning slave):

node2 # pcs cluster stop
node2 # rm -rf /var/lib/mysql/*
node2 # mysql_install_db

You then create a dump of the database from one of the functioning slaves:

node1 # mysqldump --dump-slave=1 --all-databases > dumpfromslave.sql

Then import it into the blank MySQL install on the broken slave, making sure to start the MySQL instance yourself, not through Pacemaker:

node2 # systemctl start mysql
node2 # mysql < dumpfromslave.sql

Next you need to manually get this node to catch up with the rest of the cluster before getting Pacemaker to manage the node. Not sure why this is, but I spent hours with split brain issues even when using this new “reset broken slave with SQL dump file” method until I followed this advice. Let’s find out what the coordinates for synchronization were in the SQL dump file:

node2 # mysql
MariaDB [(none)]> SHOW MASTER STATUS;
 +-------------------+----------+--------------+------------------+
 | File              | Position | Binlog_Do_DB | Binlog_Ignore_DB |
 +-------------------+----------+--------------+------------------+
 | mysqld-bin.000015 |      328 |              |                  |
 +-------------------+----------+--------------+------------------+
 1 row in set (0.00 sec)

Then run the whole command for making this node synchronize with the master:

MariaDB [(none)]> CHANGE MASTER '' TO MASTER_LOG_FILE='mysqld-bin.000015', MASTER_LOG_POS=328, MASTER_HOST='node3', MASTER_USER='replication_user', MASTER_PASSWORD='MYPASSWORD';

Note that in my example node2 was the broken slave, node1 was a functioning slave and node3 was the master. We didn’t get an SQL dump from node3 because that would lock tables, preventing writes. But once the dump was read into node2 we set it to synchronize data from node3. It is possible to get a slave to synchronize from a slave but Pacemaker will try to set things up with all slaves synchronizing from the master so that’s how I prepare node2.
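Since the dump was taken with --dump-slave=1, mysqldump should also have written the master’s binlog file and position into the dump itself as a CHANGE MASTER statement. A small helper like the one below can pull that line out so it can be compared with the values used above. dump_coordinates is a made-up name, and the exact formatting of the statement may differ between MySQL and MariaDB versions, so treat the pattern as an assumption.

```shell
# Prints the first CHANGE MASTER statement found in a dump file.
# A leading "-- " is accepted in case the statement was written as a comment.
dump_coordinates() {
  local dump_file="$1"
  grep -m1 -E '^(-- )?CHANGE MASTER .*MASTER_LOG_FILE=' "$dump_file"
}
```

Running dump_coordinates dumpfromslave.sql should print a single line containing MASTER_LOG_FILE and MASTER_LOG_POS, or nothing (with a non-zero exit status) if the dump has no such statement.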

With this done I can close the MySQL instance started via systemd and let Pacemaker take over again:

node2 # systemctl stop mysql
node2 # pcs cluster start

This makes node2 start its own Pacemaker instance and join the cluster. Note that pcs status and pcs resource won’t tell you if MySQL slaves are running correctly, merely if they are live and labelled as slaves by Pacemaker. I use a Zabbix plugin to keep track of whether or not slaves are actually synchronizing. It warns “Slave SQL Not Running” if a slave isn’t keeping up with the master.
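If you don’t have Zabbix at hand, the same check can be approximated with a few lines of shell. This is a sketch, not the Zabbix plugin’s logic: slave_healthy is a made-up name, and it only looks at the Slave_IO_Running and Slave_SQL_Running fields of the SHOW SLAVE STATUS output.

```shell
# Reads `SHOW SLAVE STATUS\G` output on stdin and succeeds only when
# both replication threads report "Yes".
slave_healthy() {
  local status
  status=$(cat)
  printf '%s\n' "$status" |
    grep -Eq '^[[:space:]]*Slave_IO_Running:[[:space:]]*Yes[[:space:]]*$' &&
  printf '%s\n' "$status" |
    grep -Eq '^[[:space:]]*Slave_SQL_Running:[[:space:]]*Yes[[:space:]]*$'
}
```

A cron job could then run something like mysql -e 'SHOW SLAVE STATUS\G' | slave_healthy || echo "replication broken on $(hostname)". Note that a node that isn’t configured as a slave at all produces empty output and is therefore reported as unhealthy too.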

I’d like to try ClusterControl and MySQL Orchestrator at some point. We’ll see what time allows.