Ceph quorum shenanigans

God I love the word “shenanigans”. I can’t tell you how many files on my work laptop include that word.

Anyway… So I moved some VMs around to upgrade pve2 and get rid of an annoying issue with UEFI where the console won’t work. But I overloaded pve1, which promptly went and died. Well, rebooted… And then mon.pve1 couldn’t join the cluster again. I figured I could take the opportunity to upgrade Ceph, and actually the whole pve1 server, while I was at it.

No dice. Okey, delete mon.pve1 and add it again? No dice. Proxmox’s Ceph tools don’t like things being wonky. After (looking at the time) roughly an hour and a half I’d deleted and re-added mon.pve1 several times and finally got it to work by also running “ceph mon add pve1 192.168.1.21” on one of the quorate nodes.
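
For my own future reference, the sequence that finally worked looked roughly like this. It’s reconstructed from memory, so treat it as a sketch; in particular the pveceph subcommand names are an assumption (older Proxmox releases call them destroymon/createmon).

# On the broken node: tear the monitor down with Proxmox's own tooling
pveceph mon destroy pve1

# On one of the still-quorate nodes: add pve1 to the monmap explicitly
ceph mon add pve1 192.168.1.21

# Back on pve1: recreate the monitor so it joins at the address above
pveceph mon create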

That might seem obvious but can you find that in the guide for adding and removing mons? https://docs.ceph.com/en/latest/rados/operations/add-or-rm-mons/

I can’t. Also, why is it ceph mon add pve1 192.168.1.21 and not ceph mon add mon.pve1 192.168.1.21? The mon-part is included everywhere else. But this is why I run Ceph at home: to learn this stuff when it’s only me getting annoyed. My bosses are running Ceph in production, the poor dears. Obviously not on my rinky-dink setup crammed into a cupboard, but still… Nerve-wracking stuff.

Well, I guess I’d better get to upgrading pve2 then… Slightly behind schedule.

MariaDB slave lag

Just a few notes on MariaDB replication lag. My own backup program is an interesting generator of database traffic as we can see below:

But the slaves catch up in a very jerky fashion:

On the face of it both nodes suddenly fell 1800 seconds behind in a matter of 60 seconds. I argue this would only be possible if 1800 seconds of updates were suddenly sent to or acknowledged by the slaves. The sending theory isn’t entirely unreasonable based on this graph:

Commits on the master are relatively evenly spaced:

And inserts are spread out over the whole intensive period:

I suspect this sudden lag increase is a result of changes being grouped together in “replication transactions”:

Global transaction ID introduces a new event attached to each event group in the binlog. (An event group is a collection of events that are always applied as a unit. They are best thought of as a “transaction”,[…]

Let’s check the relay log on mutex02 to see if this intuition is correct. Beginning of relevant segment:

#211215  2:31:06 server id 11  end_log_pos 674282324 CRC32 0xddf8eb3a   GTID 1-11-35599776 trans
/*!100001 SET @@session.gtid_seq_no=35599776*//*!*/;
START TRANSACTION
/*!*/;
# at 674282625
#211215  2:01:54 server id 11  end_log_pos 674282356 CRC32 0x8e673045   Intvar
SET INSERT_ID=22263313/*!*/;
# at 674282657
#211215  2:01:54 server id 11  end_log_pos 674282679 CRC32 0x9c098efd   Query   thread_id=517313        exec_time=0     error_code=0    xid=0
use `backuptool`/*!*/;
SET TIMESTAMP=1639530114/*!*/;
SET @@session.sql_mode=1411383304/*!*/;
/*!\C utf8mb4 *//*!*/;
SET @@session.character_set_client=224,@@session.collation_connection=224,@@session.collation_server=8/*!*/;
insert into FileObservation (hashsum, indexJob_id, mtime, path, size) values ('e182c2a36d73098ca92aed5a39206de151190a047befb14d2eb9e7992ea8e324', 284, '2018-06-08 22:21:16.638', '/srv/storage/Backup/2018-06-08-20-img-win7-laptop/Info-dmi.txt', 21828)

Ending with:

SET INSERT_ID=22458203/*!*/;
# at 761931263
#211215  2:31:05 server id 11  end_log_pos 761931294 CRC32 0x54704ba3   Query   thread_id=517313        exec_time=0     error_code=0    xid=0
SET TIMESTAMP=1639531865/*!*/;
insert into FileObservation (hashsum, indexJob_id, mtime, path, size) values ('e9b3dc7dac6e9f8098444a5a57cb55ac9e97b20162924cda9d292b10e6949482', 284, '2021-12-14 08:28:00.23', '/srv/storage/Backup/Lenovo/Path/LENOVO/Configuration/Catalog1.edb', 23076864)
/*!*/;
# at 761931595
#211215  2:31:05 server id 11  end_log_pos 761931326 CRC32 0x584a7652   Intvar
SET INSERT_ID=22458204/*!*/;
# at 761931627
#211215  2:31:05 server id 11  end_log_pos 761931659 CRC32 0x6a9c8f8a   Query   thread_id=517313        exec_time=0     error_code=0    xid=0
SET TIMESTAMP=1639531865/*!*/;
insert into FileObservation (hashsum, indexJob_id, mtime, path, size) values ('84be690c4ff5aaa07adc052b15e814598ba4aad57ff819f58f34ee2e8d61b8a5', 284, '2021-12-14 08:30:58.372', '/srv/storage/Backup/Lenovo/Path/LENOVO/Configuration/Catalog2.edb', 23076864)
/*!*/;
# at 761931960
#211215  2:31:06 server id 11  end_log_pos 761931690 CRC32 0x98e12680   Xid = 27234912
COMMIT/*!*/;
# at 761931991
#211215  2:31:06 server id 11  end_log_pos 761931734 CRC32 0x90f792f6   GTID 1-11-35599777 cid=27722058 trans
/*!100001 SET @@session.gtid_seq_no=35599777*//*!*/;

So it seems like 1-11-35599776 stretches from 02:01:54 to 2:31:06 and it’s somewhat reasonable for mutex02 to suddenly report a lag of 30 minutes. I wonder what that means for actual data transfer. Could I query intermediate results from 1-11-35599776 before 02:31? :thinking_face:
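
My tentative answer to that last question is no: the whole event group is applied as a single transaction on the slave, so none of those inserts should be visible until the commit at 02:31. A couple of slave-side checks I’d use to poke at this — a sketch, where the relay log path and file name are assumptions:

# Which GTIDs the SQL thread has actually committed so far -- a big event
# group only shows up here once its COMMIT has been applied
mysql -e "SELECT @@GLOBAL.gtid_slave_pos;"

# Decode the relay log itself to look at the event group boundaries
mysqlbinlog --verbose /var/lib/mysql/mutex02-relay-bin.000002 | less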

Bonus:

The tiny slave lag caused on the localbackup node when this is run:

 mysql -e "STOP SLAVE;" && sleep 8 && cd $SCRIPT_PATH && source bin/activate && python snapshots.py hourly >> hourly.log && mysql -e "START SLAVE;"

It’s a really hacky way to let the localbackup node finish processing the relay log before making a Btrfs snapshot. Seems to work. Technically you can make snapshots while MariaDB is running full tilt, but this seems a bit nicer. I’ve had some very rare lockups of unknown origin on these kinds of Btrfs snapshot-nodes for database backups.
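
Conceptually the snapshot step presumably boils down to something like the line below; the subvolume and destination paths are assumptions, not what snapshots.py actually uses:

# Read-only snapshot of the MariaDB datadir while the slave is stopped
btrfs subvolume snapshot -r /var/lib/mysql /srv/snapshots/mysql-$(date +%Y%m%d-%H%M)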

Pacemaker failure

Dang it… Pacemaker wigged out during pacemaker03’s Btrfs snapshot run. Well, it was probably the cleanup job that clears out old snapshots that did it. Yeah, I know. “Don’t run OLTP workloads on CoW filesystems.”

But I’m running like 30 transactions per second with short bursts of a few hundred per second. CoW works fine. But Pacemaker wigged out for some reason and seems to have fiddled with the GTID:

Nov 29 04:01:05.616 pacemaker03.svealiden.se pacemaker-based     [1201] (cib_process_request)   info: Completed cib_modify operation for section status: OK (rc=0, origin=pacemaker03.svealiden.se/crmd/49, version=0.173.120)
Nov 29 04:01:05.619 pacemaker03.svealiden.se pacemaker-controld  [1206] (process_lrm_event)     info: Result of monitor operation for mariadb_server on pacemaker03.svealiden.se: Cancelled | call=41 key=mariadb_server_monitor_20000 confirmed=true
Nov 29 04:01:10.624 pacemaker03.svealiden.se pacemaker-based     [1201] (cib_process_ping)      info: Reporting our current digest to pacemaker01.svealiden.se: bfbe319be87955b6424bb9b041600d5e for 0.173.120 (0x5646a9a5eac0 0)
Nov 29 04:01:27.325 pacemaker03.svealiden.se pacemaker-controld  [1206] (throttle_check_thresholds)     notice: High CPU load detected: 4.050000
Nov 29 04:01:27.325 pacemaker03.svealiden.se pacemaker-controld  [1206] (throttle_send_command)         info: New throttle mode: high load (was medium)
Nov 29 04:01:57.327 pacemaker03.svealiden.se pacemaker-controld  [1206] (throttle_check_thresholds)     notice: High CPU load detected: 4.060000
Nov 29 04:02:03  mariadb(mariadb_server)[1456545]:    INFO: MySQL stopped
Nov 29 04:02:03.343 pacemaker03.svealiden.se pacemaker-execd     [1203] (log_finished)  info: mariadb_server stop (call 44, PID 1456545) exited with status 0 (execution time 57728ms, queue time 0ms)
Nov 29 04:02:03.346 pacemaker03.svealiden.se pacemaker-controld  [1206] (process_lrm_event)     notice: Result of stop operation for mariadb_server on pacemaker03.svealiden.se: ok | rc=0 call=44 key=mariadb_server_stop_0 confirmed=true cib-update=50
Nov 29 04:02:03.346 pacemaker03.svealiden.se pacemaker-based     [1201] (cib_process_request)   info: Forwarding cib_modify operation for section status to all (origin=local/crmd/50)
Nov 29 04:02:03.351 pacemaker03.svealiden.se pacemaker-based     [1201] (cib_perform_op)        info: Diff: --- 0.173.120 2
Nov 29 04:02:03.351 pacemaker03.svealiden.se pacemaker-based     [1201] (cib_perform_op)        info: Diff: +++ 0.173.121 (null)
Nov 29 04:02:03.351 pacemaker03.svealiden.se pacemaker-based     [1201] (cib_perform_op)        info: +  /cib:  @num_updates=121
Nov 29 04:02:03.351 pacemaker03.svealiden.se pacemaker-based     [1201] (cib_perform_op)        info: +  /cib/status/node_state[@id='3']/lrm[@id='3']/lrm_resources/lrm_resource[@id='mariadb_server']/lrm_rsc_op[@id='mariadb_server_last_0']:  @transition-magic=0:0;5:3308:0:7c487611-27b4-49ce-b931-c548d64ecc98, @call-id=44, @rc-code=0, @op-status=0, @exec-time=57728
Nov 29 04:02:03.352 pacemaker03.svealiden.se pacemaker-based     [1201] (cib_process_request)   info: Completed cib_modify operation for section status: OK (rc=0, origin=pacemaker03.svealiden.se/crmd/50, version=0.173.121)
Nov 29 04:02:03.890 pacemaker03.svealiden.se pacemaker-controld  [1206] (do_lrm_rsc_op)         notice: Requesting local execution of start operation for mariadb_server on pacemaker03.svealiden.se | transition_key=16:3308:0:7c487611-27b4-49ce-b931-c548d64ecc98 op_key=mariadb_server_start_0
Nov 29 04:02:03.891 pacemaker03.svealiden.se pacemaker-execd     [1203] (log_execute)   info: executing - rsc:mariadb_server action:start call_id:45
Nov 29 04:02:03.891 pacemaker03.svealiden.se pacemaker-based     [1201] (cib_process_request)   info: Forwarding cib_modify operation for section status to all (origin=local/crmd/51)
Nov 29 04:02:03.893 pacemaker03.svealiden.se pacemaker-based     [1201] (cib_perform_op)        info: Diff: --- 0.173.121 2
Nov 29 04:02:03.894 pacemaker03.svealiden.se pacemaker-based     [1201] (cib_perform_op)        info: Diff: +++ 0.173.122 (null)
Nov 29 04:02:03.894 pacemaker03.svealiden.se pacemaker-based     [1201] (cib_perform_op)        info: +  /cib:  @num_updates=122
Nov 29 04:02:03.894 pacemaker03.svealiden.se pacemaker-based     [1201] (cib_perform_op)        info: +  /cib/status/node_state[@id='3']/lrm[@id='3']/lrm_resources/lrm_resource[@id='mariadb_server']/lrm_rsc_op[@id='mariadb_server_last_0']:  @operation_key=mariadb_server_start_0, @operation=start, @transition-key=16:3308:0:7c487611-27b4-49ce-b931-c548d64ecc98, @transition-magic=-1:193;16:3308:0:7c487611-27b4-49ce-b931-c548d64ecc98, @call-id=-1, @rc-code=193, @op-status=-1, @last-rc-change=1638154923, @exec-time=0
Nov 29 04:02:03.894 pacemaker03.svealiden.se pacemaker-based     [1201] (cib_process_request)   info: Completed cib_modify operation for section status: OK (rc=0, origin=pacemaker03.svealiden.se/crmd/51, version=0.173.122)
Nov 29 04:02:04  mariadb(mariadb_server)[1456733]:    INFO: MySQL is not running
Nov 29 04:02:05  mariadb(mariadb_server)[1456733]:    INFO: MySQL is not running
Nov 29 04:02:08.906 pacemaker03.svealiden.se pacemaker-based     [1201] (cib_process_ping)      info: Reporting our current digest to pacemaker01.svealiden.se: 506f4f6824d1cd4857592724a902db4b for 0.173.122 (0x5646a9a5eac0 0)
Nov 29 04:02:09  mariadb(mariadb_server)[1456733]:    INFO: Changing MariaDB configuration to replicate from pacemaker02.svealiden.se.
Nov 29 04:02:10  mariadb(mariadb_server)[1456733]:    ERROR: MariaDB slave io has failed (1236): Got fatal error 1236 from master when reading data from binary log: 'Error: connecting slave requested to start from GTID 1-3-30418952, which is not in the master's binlog. Since the master's binlog contains GTIDs with higher sequence numbers, it probably means that the slave has diverged due to executing extra erroneous transactions'
Nov 29 04:02:10.888 pacemaker03.svealiden.se pacemaker-execd     [1203] (log_op_output)         notice: mariadb_server_start_0[1456733] error output [ ocf-exit-reason:MariaDB slave io has failed (1236): Got fatal error 1236 from master when reading data from binary log: 'Error: connecting slave requested to start from GTID 1-3-30418952, which is not in the master's binlog. Since the master's binlog contains GTIDs with higher sequence numbers, it probably means that the slave has diverged due to executing extra erroneous transactions' ]

So it seems Pacemaker stopped MariaDB and then started it in a wonky state. Haven’t seen that before. But it’s not the first thing to make me go “Is this setup really solid?”

’cause I ran Galera without a hitch for, like, a year and a half at least. Sure, when quorum is lost you’re in a world of hurt, but there are recovery methods that work, at least in artificially generated worst-case scenarios: https://deref.se/2019/10/29/percona-xtradb-cluster-again/

I’m thinking of writing my own master-slave runner. It sounds almost as bad as writing your own encryption algorithm but I see benefits in creating something that is written with the sole purpose of dealing with MariaDB master-slave setups. Also something that I can debug. That’s a big plus. Now I’m not quite crazy enough to try to implement my own consistency protocol. Obviously corosync or etcd will have to serve as a coordinator for any code I write. I’ll dive into that part right now so I know if this is even halfway workable.

Addendum:

Okey, so that seemed less than ideal. I think I’m going to go with keepalived. It’s not super-safe, but if the slave doesn’t get a heartbeat from the master then it can fence the master and start in master mode. The master won’t fence the slave (why would I give it that ability? If we’re concerned about the slave assuming the master role while the master is still running, the slave fencing the master “solves” that), so we can avoid mutual killing in case of a glitch. The worst-case scenario should be that the slave kills the master unnecessarily.
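
A rough sketch of what the failover hook might look like, assuming keepalived calls a notify_master script on the slave when the master’s heartbeat disappears. The host, VM id and promotion commands are assumptions in the style of the configs further down in this post:

#!/bin/bash
# keepalived notify_master hook on the slave (a sketch, not battle-tested)
set -euo pipefail

# 1. Fence the old master first so it can't keep accepting writes.
#    fence_pve would be the proper tool; plain qm over SSH works for the sketch.
ssh root@192.168.1.21 'qm stop 139'

# 2. Promote the local MariaDB instance.
mysql -e "STOP SLAVE; SET GLOBAL read_only = OFF;"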

Kubernetes and clusters in general

I’ve hated Kubernetes for a long time. Must be nigh on seven years at this point. I got in pretty early when Traefik wasn’t around and it was generally not viable to run a Kubernetes cluster outside the cloud due to the lack of LoadBalancer implementations. StatefulSets weren’t a thing. But everyone else seems to have been crazy for it.

Maybe what made me hostile to Kubernetes was its way of pawning off all the difficult parts of clustering to someone else. When you don’t have to deal with state and mutual exclusion, clustering becomes way easier. But that’s not really solving the underlying problem, just saying “If you figured out the hard parts, feel free to use Kubernetes to simplify the other parts”.

It also doesn’t work in its favor that it is so complex and obtuse. I’ve spent years tinkering with it (on and off, naturally; I don’t engage that frequently with things I hate) and over the past few weeks I’ve got a working setup using microk8s and Rook that lets me create persistent volumes in my external Ceph cluster running on Proxmox. I now run my web UI for the pdns authoritative DNS servers in that cluster using a Deployment that can be scaled quite easily:

[root@kube01 ~]# mkctl get pods -o wide
NAME                                    READY   STATUS    RESTARTS   AGE   IP              NODE                  NOMINATED NODE   READINESS GATES
traefik-ingress-controller-gmf7k        1/1     Running   1          20h   192.168.1.172   kube02.svealiden.se   <none>           <none>
traefik-ingress-controller-94clq        1/1     Running   1          20h   192.168.1.173   kube03.svealiden.se   <none>           <none>
traefik-ingress-controller-77fxr        1/1     Running   1          20h   192.168.1.171   kube01.svealiden.se   <none>           <none>
whoami-78447d957f-t82sd                 1/1     Running   1          20h   10.1.154.29     kube01.svealiden.se   <none>           <none>
whoami-78447d957f-bwg7p                 1/1     Running   1          20h   10.1.154.30     kube01.svealiden.se   <none>           <none>
pdnsadmin-deployment-856dcfdfd8-5d45p   1/1     Running   0          50s   10.1.173.137    kube02.svealiden.se   <none>           <none>
pdnsadmin-deployment-856dcfdfd8-rdv88   1/1     Running   0          50s   10.1.246.237    kube03.svealiden.se   <none>           <none>
pdnsadmin-deployment-856dcfdfd8-vmzws   1/1     Running   0          50s   10.1.154.44     kube01.svealiden.se   <none>           <none>

[root@kube01 ~]# mkctl scale deployment.v1.apps/pdnsgui-deployment --replicas=2
Error from server (NotFound): deployments.apps "pdnsgui-deployment" not found

[root@kube01 ~]# mkctl get deployments
NAME                   READY   UP-TO-DATE   AVAILABLE   AGE
whoami                 2/2     2            2           20h
pdnsadmin-deployment   3/3     3            3           2m23s

[root@kube01 ~]# mkctl scale deployment.v1.apps/pdnsadmin-deployment --replicas=2
deployment.apps/pdnsadmin-deployment scaled

[root@kube01 ~]# mkctl get pods -o wide
NAME                                    READY   STATUS        RESTARTS   AGE     IP              NODE                  NOMINATED NODE   READINESS GATES
traefik-ingress-controller-gmf7k        1/1     Running       1          20h     192.168.1.172   kube02.svealiden.se   <none>           <none>
traefik-ingress-controller-94clq        1/1     Running       1          20h     192.168.1.173   kube03.svealiden.se   <none>           <none>
traefik-ingress-controller-77fxr        1/1     Running       1          20h     192.168.1.171   kube01.svealiden.se   <none>           <none>
whoami-78447d957f-t82sd                 1/1     Running       1          20h     10.1.154.29     kube01.svealiden.se   <none>           <none>
whoami-78447d957f-bwg7p                 1/1     Running       1          20h     10.1.154.30     kube01.svealiden.se   <none>           <none>
pdnsadmin-deployment-856dcfdfd8-rdv88   1/1     Running       0          2m33s   10.1.246.237    kube03.svealiden.se   <none>           <none>
pdnsadmin-deployment-856dcfdfd8-vmzws   1/1     Running       0          2m33s   10.1.154.44     kube01.svealiden.se   <none>           <none>
pdnsadmin-deployment-856dcfdfd8-5d45p   0/1     Terminating   0          2m33s   10.1.173.137    kube02.svealiden.se   <none>           <none>

[root@kube01 ~]# mkctl get pods -o wide
NAME                                    READY   STATUS    RESTARTS   AGE     IP              NODE                  NOMINATED NODE   READINESS GATES
traefik-ingress-controller-gmf7k        1/1     Running   1          20h     192.168.1.172   kube02.svealiden.se   <none>           <none>
traefik-ingress-controller-94clq        1/1     Running   1          20h     192.168.1.173   kube03.svealiden.se   <none>           <none>
traefik-ingress-controller-77fxr        1/1     Running   1          20h     192.168.1.171   kube01.svealiden.se   <none>           <none>
whoami-78447d957f-t82sd                 1/1     Running   1          20h     10.1.154.29     kube01.svealiden.se   <none>           <none>
whoami-78447d957f-bwg7p                 1/1     Running   1          20h     10.1.154.30     kube01.svealiden.se   <none>           <none>
pdnsadmin-deployment-856dcfdfd8-rdv88   1/1     Running   0          2m39s   10.1.246.237    kube03.svealiden.se   <none>           <none>
pdnsadmin-deployment-856dcfdfd8-vmzws   1/1     Running   0          2m39s   10.1.154.44     kube01.svealiden.se   <none>           <none>
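
The Rook part is similarly hands-off: a persistent volume in the external Ceph cluster is just a PersistentVolumeClaim against the storage class Rook set up. A minimal sketch, using the same mkctl command as in the listings above; the storage class name is an assumption from my setup:

mkctl apply -f - <<'EOF'
apiVersion: v1
kind: PersistentVolumeClaim
metadata:
  name: pdnsadmin-data
spec:
  accessModes:
    - ReadWriteOnce
  storageClassName: rook-ceph-block   # assumption: whatever Rook created
  resources:
    requests:
      storage: 1Gi
EOF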

Yet I have only the flimsiest idea of how it works and how to fix it if something breaks. Maybe I’ll learn how Calico uses vxlan to connect everything magically and why I had to reset my entire cluster on Thursday to remove a custom resource definition.

By the way, try creating a custom resource definition without using the beta API! You’ll have to provide a schema: https://kubernetes.io/docs/tasks/extend-kubernetes/custom-resources/custom-resource-definitions/
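
For anyone who hasn’t hit this yet: with the non-beta API a CRD must carry an openAPIV3Schema for every version. A minimal sketch — the group and kind are made up purely for illustration:

mkctl apply -f - <<'EOF'
apiVersion: apiextensions.k8s.io/v1
kind: CustomResourceDefinition
metadata:
  name: widgets.example.com
spec:
  group: example.com
  scope: Namespaced
  names:
    plural: widgets
    singular: widget
    kind: Widget
  versions:
    - name: v1
      served: true
      storage: true
      schema:
        openAPIV3Schema:      # mandatory in the v1 API
          type: object
          properties:
            spec:
              type: object
              properties:
                size:
                  type: integer
EOF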

I suspect many people will go mad trying to make heads or tails of that page. Anyway, if I had to make a decision about how to run a set of microservices in production I’d still go with something like Docker containers run as systemd services and HAProxy to load balance the traffic. Less automation for rolling upgrades, scaling and so on, but I wouldn’t be worried about relying entirely on a system where it’s not even clear whether I can find a way to describe the malfunction that keeps all services from running! I mean, I added podAntiAffinity to my deployment before:

affinity:
        podAntiAffinity:
          requiredDuringSchedulingIgnoredDuringExecution:
          - labelSelector:
              matchExpressions:
              - key: app
                operator: In
                values:
                - pdns-admin-gui
            topologyKey: "kubernetes.io/hostname"

And this is what all three pods logged when I started the deployment:

  Warning  FailedScheduling  39m   default-scheduler  0/3 nodes are available: 3 node(s) didn't match pod affinity/anti-affinity rules, 3 node(s) didn't match pod anti-affinity rules.
  Warning  FailedScheduling  39m   default-scheduler  0/3 nodes are available: 3 node(s) didn't match pod affinity/anti-affinity rules, 3 node(s) didn't match pod anti-affinity rules.
  Normal   Scheduled         39m   default-scheduler  Successfully assigned default/pdnsadmin-deployment-856dcfdfd8-vmzws to kube01.svealiden.se
  Normal   Pulled            39m   kubelet            Container image "cacher.svealiden.se:5000/pdnsadmin:20210925" already present on machine
  Normal   Created           39m   kubelet            Created container pdnsadmin
  Normal   Started           39m   kubelet            Started container pdnsadmin

Obviously “Successfully assigned default/pdnsadmin-deployment-856dcfdfd8-vmzws to kube01.svealiden.se” was a good thing but Kubernetes telling me that zero nodes were initially available? When the only requirement I set out was that pdns-admin-gui pods shouldn’t be run on the same node? And there are three nodes? And I asked for three pods? That’s the sort of stuff that would make me very nervous if this was used for production. What if Kubernetes gets stuck in the “All pods forbidden because reasons”-mode?

This is also why I’m terrified of running Ceph for production applications. Multiple independent Ceph clusters? Okey, now we’re talking, but a single cluster? It’s just a matter of time before Ceph locks up on you and you have a week’s downtime while trying to figure out what the hell is going on.

The keen observer will ask “Aren’t you a massive fan of clusters?” and that’s entirely correct. I’ve run Ceph for my own applications for just over two years and have Elasticsearch, MongoDB and MariaDB clusters set up. But the key point is disaster recovery. Clusters can be great for high availability which is what I’m really a fan of but clusters where there isn’t an override in case the complex logic of the distributed system goes on the fritz are a huge gamble. If MongoDB gets confused I can choose a node and force it to become the primary. If I can get the others to join afterwards that’s fine, otherwise I’ll just have to blank them and rebuild. Same with MariaDB, I can kill the two nodes and make the remaining one a master and take it from there. I don’t need any distributed algorithm to give me permission to bring systems back in a diminished capacity.

By the way, nothing essential runs on my Ceph cluster. Recursive DNS servers, file shares, backups of file shares and so on are all running on local storage in a multi-master or master/slave configuration. Ceph going down will disable some convenience functions, my Zabbix monitoring server, Prometheus, Grafana and so on, but I can live without them for a couple of hours while I swear angrily at Ceph. In fairness I haven’t had a serious Ceph issue for (checking…) about a year now!

Galera to pacemaker switch

So I know in theory how to switch a database from one MariaDB server to another with extremely little downtime. In reality it turns out to be trickier than I had thought when the source is a Galera cluster, but let’s go through it step by step:

Step 1

Choose a Galera node and make sure it has binary logging activated and log_slave_updates = ON. Otherwise it won’t log things in a manner that can be sent to a slave that isn’t a Galera node, and the log_slave_updates part is what makes it actually send updates to the slave. I added log_slave_updates = ON to the slave as well, because it’s not entirely clear how updates from the Galera node are viewed. I mean, in a sense all Galera nodes are both slaves and masters…

Also make sure the server id of the target node (temporary slave) doesn’t match any of the Galera nodes. Otherwise any Galera node will think that server_id = 1 is already caught up! That took me a couple of hours to figure out.
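
A quick way to sanity-check both settings on the chosen Galera node and on the target slave (standard MariaDB variables, nothing setup-specific):

mysql -e "SHOW VARIABLES WHERE Variable_name IN ('log_bin','log_slave_updates','server_id');"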

Step 2

Dump the database to be transferred from the Galera node you chose earlier:

mysqldump --single-transaction --master-data=1 --triggers --events --routines zabbix > zabbix.sql

Copy it to the target server.

Step 3

Create the necessary users and an empty database on the target server and then import the SQL-dump.

mysql zabbix < zabbix.sql

Step 4

Extract the replication coordinates from the top of the sql dump:

root@cluster3:~# head -30 zabbix.sql | grep MASTER
CHANGE MASTER TO MASTER_LOG_FILE='mysql-bin.000150', MASTER_LOG_POS=691605499;

Create an SQL-file with the full information required to start streaming updates from the Galera node using this data:

CHANGE MASTER TO MASTER_HOST='cluster3.svealiden.se', MASTER_PORT=3306, MASTER_USER="replication_user", MASTER_PASSWORD="SECRETREPLICATIONPASSWORD", MASTER_LOG_FILE='mysql-bin.000150', MASTER_LOG_POS=691605499;
SET GLOBAL replicate_do_db='zabbix';
START SLAVE;

The SET GLOBAL replicate_do_db=’zabbix’; thing is for situations where there is just one database that you want to bring from a source that has multiple databases.

Step 5

Wait for slave to catch up:

MariaDB [(none)]> SHOW SLAVE STATUS\G 
*************************** 1. row ***************************               
                Slave_IO_State: Waiting for master to send event  
                   Master_Host: cluster3.svealiden.se         
                   Master_User: replication_user                             
                   Master_Port: 3306                                         
                 Connect_Retry: 60                                           
               Master_Log_File: mysql-bin.000150
           Read_Master_Log_Pos: 797684680                                    
                Relay_Log_File: pacemaker01-relay-bin.000002
                 Relay_Log_Pos: 10010385                                     
         Relay_Master_Log_File: mysql-bin.000150
              Slave_IO_Running: Yes                                          
             Slave_SQL_Running: Yes
               Replicate_Do_DB: zabbix                                                                                                                    
           Replicate_Ignore_DB: 
            Replicate_Do_Table:                                              
        Replicate_Ignore_Table:      
       Replicate_Wild_Do_Table: 
   Replicate_Wild_Ignore_Table:                                              
                    Last_Errno: 0    
                    Last_Error: 
                  Skip_Counter: 0                                            
           Exec_Master_Log_Pos: 707563585                                                                                                                 
               Relay_Log_Space: 66709818                                                                                                                  
               Until_Condition: None                                         
                Until_Log_File:                                              
                 Until_Log_Pos: 0
            Master_SSL_Allowed: No                                           
            Master_SSL_CA_File:       
            Master_SSL_CA_Path:                                              
               Master_SSL_Cert: 
             Master_SSL_Cipher:                                              
                Master_SSL_Key:                                              
         Seconds_Behind_Master: 21398

Note the Seconds_Behind_Master-part that tells you how far the slave has been able to catch up. As you get close to zero, get ready to stop writing data to the Galera cluster.
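
Since this can take a while, I find it easiest to just poll the interesting fields on the slave (the five-second interval is arbitrary):

watch -n 5 'mysql -e "SHOW SLAVE STATUS\G" | grep -E "Seconds_Behind_Master|Exec_Master_Log_Pos"'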

Step 6

Stop writing data to Galera. In my case this entailed changing two configuration files for Zabbix (zabbix-server and the Zabbix web UI) and stopping the associated services.

root@monitor:~# systemctl stop zabbix-server && systemctl stop nginx

Then stop replicating data from the Galera cluster. Since I had set the slave to read_only mode I had to disable that as well:

MariaDB [(none)]> STOP SLAVE;
Query OK, 0 rows affected (0.014 sec)

MariaDB [(none)]> SET GLOBAL read_only = false;
Query OK, 0 rows affected (0.001 sec)

Then just start up the services now that they have been told to contact the new database server.

root@monitor:~# systemctl start zabbix-server && systemctl start nginx

Can you spot where the switch happened?

Can you spot where the two failed attempts at making this switch seamlessly happened?

Sidenote: this is an example of a sentence that would stand to gain from English having more syntactic options. Like “Can you spot where the two failed attempts at [making this switch seamlessly] happened?” Here it’s more clear that we are talking about X = making a switch seamlessly and how there were two failed attempts at making X happen. Compare that to the more nonsensical reading “Can you spot where the two failed attempts at making this switch [seamlessly happened]?” No, we’re not talking about failures seamlessly happening; the switch was supposed to be seamless.

So anyway… Seconds_Behind_Master is a great piece of data to read when you’re about to switch over to a replica that you think is caught up.

Use cases

This is useful for a lot of situations. Maybe you want to convert a 180GB database from MyISAM to InnoDB? Maybe you need to perform an upgrade where it’s safest to do a mysqldump->import?

The lower limit for how fast this switchover can be is the replication lag. If the slave is always 10 seconds behind, that is the absolute lowest limit for how fast you can make the switch. If you do all the steps manually that adds maybe 20 seconds. Script it and we’re down to adding a few seconds on top of the replication lag. Short of banks and amazon.com, downtime measured in seconds (maybe multiples of ten seconds) isn’t a cause for great concern.
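
If I were to script it, the cutover would boil down to something like the sketch below. The host names and services are the ones from this post, but treat it as pseudocode rather than something I actually ran:

#!/bin/bash
# Run on the new master-to-be (the former slave)
set -euo pipefail

# Stop the writers on the application host
ssh monitor 'systemctl stop zabbix-server && systemctl stop nginx'

# Give the last transactions a moment to replicate, then cut over
sleep 5
mysql -e "STOP SLAVE; SET GLOBAL read_only = OFF;"

# The services were already reconfigured to point at the new server,
# so just start them again
ssh monitor 'systemctl start zabbix-server && systemctl start nginx'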

ProxySQL for MariaDB master/slave

I’ve set up a Pacemaker cluster with MariaDB and fence_pve so that we have 1 writeable master and 2 read-only slaves. But writes are not made to either of them directly, instead all queries are routed through ProxySQL and it distributes queries across the three nodes. I first didn’t understand why the developers of ProxySQL advised against a pure read/write split of queries. Surely we don’t want SELECTs to go to the writeable node? It’s busy with INSERTs and UPDATEs. But now I get it; it’s an issue of replication lag. If a client makes an INSERT and then shortly thereafter a SELECT that needs to display the changes that were just made, things can fall apart if the SELECT goes to a slave that hasn’t received the changes from the master yet. So we only want to send queries that we know aren’t millisecond-sensitive to slaves. Well, the replication lag sets the time element… If your slaves are five seconds behind the master then you should only send queries that can tolerate a five second discrepancy to the slaves.

Update 2021-07-30: I realized it was waaaay better to set read-only=1 in the conf files for MariaDB to avoid situations where ProxySQL accidentally sends INSERT operations to a non-master because that node happens to start out read-write.

Aaaanyway… So installing ProxySQL is pretty much fire-and-forget, so the interesting thing is how to feed it the right data. I let each server in the cluster run ProxySQL, and whichever node happens to host the master instance of MariaDB also gets to route the traffic. You could let one of the slaves host the active ProxySQL instance, but this was easier for me to implement. Here’s a diagram of how things work:

The dashed lines indicate replication of data and the solid lines represent database queries. The ellipse shows where the virtual IP for the database cluster is located (always on the master node). So with this setup I can use the exact same config for all three ProxySQL instances, which I’ve turned into a single SQL file, proxyinit.sql:

INSERT INTO mysql_servers(hostgroup_id,hostname,port) VALUES (1,'192.168.1.181',3306);
INSERT INTO mysql_servers(hostgroup_id,hostname,port) VALUES (1,'192.168.1.182',3306);
INSERT INTO mysql_servers(hostgroup_id,hostname,port) VALUES (1,'192.168.1.183',3306);

-- Since hostgroup 0 is the default it makes sense to make it the writer hostgroup
-- To avoid a situation where INSERTs are accidentally sent to a read-only node
INSERT INTO mysql_replication_hostgroups (writer_hostgroup,reader_hostgroup,comment) VALUES (0,1,'cluster1');

-- These are two queries I've identified that take more time than usual and are safe to run on replicas
INSERT INTO mysql_query_rules (rule_id,active,digest,destination_hostgroup,apply) VALUES (1,1,'0xA1D7DC02F2B414BB',1,1);
INSERT INTO mysql_query_rules (rule_id,active,digest,destination_hostgroup,apply) VALUES (2,1,'0x112A8E68C94FFE30',1,1);

-- By default all queries should go to the master so default hostgroup should be 0
INSERT INTO mysql_users(username,password,default_hostgroup) VALUES ('zabbix','MYZABBIXPASSWORD',0);

-- Now we have to load our data into the active runtime
LOAD MYSQL USERS TO RUNTIME;
LOAD MYSQL SERVERS TO RUNTIME;
LOAD MYSQL QUERY RULES TO RUNTIME;

-- And I don't want to have to enter this data all the time so better save it to disk
SAVE MYSQL USERS TO DISK;
SAVE MYSQL SERVERS TO DISK;
SAVE MYSQL QUERY RULES TO DISK;

To load this into each ProxySQL instance I just run:

mysql -u admin -pMYADMINPASSWORD -h 0.0.0.0 -P6032 < proxyinit.sql

Testing

Let’s try to take pcmk2 out of the cluster so that some other node has to take over the role of primary:

[root@pcmk2 ~]# pcs resource
  * db_primary_ip       (ocf::heartbeat:IPaddr2):        Started pcmk2.svealiden.se
  * Clone Set: mariadb_server-clone [mariadb_server] (promotable):
    * Masters: [ pcmk2.svealiden.se ]
    * Slaves: [ pcmk1.svealiden.se pcmk3.svealiden.se ]
[root@pcmk2 ~]# pcs cluster stop
Stopping Cluster (pacemaker)...
Stopping Cluster (corosync)...
[root@pcmk2 ~]# 

Okey, let’s see who is the new master.

[root@pcmk1 ~]# pcs resource
  * db_primary_ip       (ocf::heartbeat:IPaddr2):        Started pcmk1.svealiden.se
  * Clone Set: mariadb_server-clone [mariadb_server] (promotable):
    * Masters: [ pcmk1.svealiden.se ]
    * Slaves: [ pcmk3.svealiden.se ]
    * Stopped: [ pcmk2.svealiden.se ]

Okey, pcmk1. Let’s see if we see writes being made from ProxySQL to pcmk1:

[root@pcmk1 ~]# tcpdump -nnX -i lo port 3306 | grep -i insert
 dropped privs to tcpdump
 tcpdump: verbose output suppressed, use -v[v]… for full protocol decode
 listening on lo, link-type EN10MB (Ethernet), snapshot length 262144 bytes
         0x0030:  54af e6dc 9500 0000 0369 6e73 6572 7420  T……..insert.
         0x0030:  54af e31e 4d01 0000 0369 6e73 6572 7420  T…M….insert.
         0x0030:  54af e83e 8302 0000 0369 6e73 6572 7420  T..>…..insert.
         0x0030:  54af ee9f 1b01 0000 0369 6e73 6572 7420  T……..insert.
         0x0030:  54af f287 5401 0000 0369 6e73 6572 7420  T…T….insert.
         0x0030:  54af eac0 ff01 0000 0369 6e73 6572 7420  T……..insert.
         0x0030:  54af f3f9 1601 0000 0369 6e73 6572 7420  T……..insert.
         0x0030:  54af eabc a900 0000 0369 6e73 6572 7420  T……..insert.
         0x0030:  54af f29f c301 0000 0369 6e73 6572 7420  T……..insert.
         0x0030:  54af f6b0 a501 0000 0369 6e73 6572 7420  T……..insert.
         0x0030:  54af fa9b 0b02 0000 0369 6e73 6572 7420  T……..insert.

Seems about right. We can check that ProxySQL correctly identifies pcmk1 as master now:

ProxySQL Admin> SELECT hostgroup_id,hostname,status FROM mysql_servers;
+--------------+---------------+--------+
| hostgroup_id | hostname      | status |
+--------------+---------------+--------+
| 0            | 192.168.1.181 | ONLINE | <-- Yes, pcmk1 is the only writable node
| 1            | 192.168.1.183 | ONLINE |
| 1            | 192.168.1.182 | ONLINE |
| 1            | 192.168.1.181 | ONLINE |
+--------------+---------------+--------+
4 rows in set (0.001 sec)

Let’s see what happens when I bring pcmk2 back up again. It should be a slave.

[root@pcmk1 ~]# pcs resource
  * db_primary_ip       (ocf::heartbeat:IPaddr2):        Started pcmk1.svealiden.se
  * Clone Set: mariadb_server-clone [mariadb_server] (promotable):
    * mariadb_server    (ocf::heartbeat:mariadb):        Starting pcmk2.svealiden.se
    * Masters: [ pcmk1.svealiden.se ]
    * Slaves: [ pcmk3.svealiden.se ]
[root@pcmk1 ~]# pcs resource
  * db_primary_ip       (ocf::heartbeat:IPaddr2):        Started pcmk1.svealiden.se
  * Clone Set: mariadb_server-clone [mariadb_server] (promotable):
    * Masters: [ pcmk1.svealiden.se ]
    * Slaves: [ pcmk2.svealiden.se pcmk3.svealiden.se ]

So far so good. Let’s just double-check to see that pcmk2 isn’t out of sync and emitting errors:

MariaDB [(none)]> SHOW SLAVE STATUS\G
*************************** 1. row ***************************
                Slave_IO_State: Waiting for master to send event
                   Master_Host: pcmk1
                   Master_User: replication_user
                   Master_Port: 3306
                 Connect_Retry: 60
               Master_Log_File: pcmk1-bin.000005
           Read_Master_Log_Pos: 931920
                Relay_Log_File: pcmk2-relay-bin.000002
                 Relay_Log_Pos: 932262
         Relay_Master_Log_File: pcmk1-bin.000005
              Slave_IO_Running: Yes
             Slave_SQL_Running: Yes
               Replicate_Do_DB:
           Replicate_Ignore_DB:
            Replicate_Do_Table:
        Replicate_Ignore_Table:
       Replicate_Wild_Do_Table:
   Replicate_Wild_Ignore_Table:
                    Last_Errno: 0
                    Last_Error:
                  Skip_Counter: 0
           Exec_Master_Log_Pos: 931920
               Relay_Log_Space: 932571
               Until_Condition: None
                Until_Log_File:
                 Until_Log_Pos: 0
            Master_SSL_Allowed: No
            Master_SSL_CA_File:
            Master_SSL_CA_Path:
               Master_SSL_Cert:
             Master_SSL_Cipher:
                Master_SSL_Key:
         Seconds_Behind_Master: 0
 Master_SSL_Verify_Server_Cert: No
                 Last_IO_Errno: 0
                 Last_IO_Error:
                Last_SQL_Errno: 0
                Last_SQL_Error:
   Replicate_Ignore_Server_Ids:
              Master_Server_Id: 1
                Master_SSL_Crl:
            Master_SSL_Crlpath:
                    Using_Gtid: Current_Pos
                   Gtid_IO_Pos: 0-1-309263
       Replicate_Do_Domain_Ids:
   Replicate_Ignore_Domain_Ids:
                 Parallel_Mode: optimistic
                     SQL_Delay: 0
           SQL_Remaining_Delay: NULL
       Slave_SQL_Running_State: Slave has read all relay log; waiting for more updates
              Slave_DDL_Groups: 0
Slave_Non_Transactional_Groups: 0
    Slave_Transactional_Groups: 1512
1 row in set (0.000 sec)

All good!

Purpose

I think this is pretty attractive as a way of merging high availability with load balancing. I started tinkering with this when fiddling with Kubernetes and WordPress. No point in having multiple pods spread out over multiple servers if you can’t run multiple database instances. Well, not in my book anyway. So we’re left to choose between Galera multi-master or the more traditional master-slave setup. There are constraints placed on what can be run in Galera (like a hard primary key requirement), so I think master-slave is more reliable when you yourself might not have complete control over database schemas. ProxySQL would work just fine with Galera too, but I wanted to test out the slightly more challenging setup.

A big downside to ProxySQL is that it doesn’t do pass-through authentication. So any user added to MariaDB, and any user password change, would also need to be made on all ProxySQL instances. That’s fine when the setup is meant for a large WordPress webshop which has one (1) database user that you as an admin manage, but for a shared hosting setup it’s a nightmare.

MariaDB MaxScale is an alternative to ProxySQL that solves the issue of replication lag using some fancy footwork and therefore claims to be able to do read/write split automagically and it offers pass-through authentication. Not sure if it has the fancy statistics function that ProxySQL has though.

Behind the scenes

I had to have data to test out this setup and that’s not trivial. Lots of data is sent to my Galera-cluster from Zabbix though… Hmm… Can’t ProxySQL do data mirroring? Indeed it can! First we need to add both the usual database cluster virtual IP and the new testing-cluster’s virtual IP as separate hostgroups:

ProxySQL Admin> SELECT hostgroup_id,hostname,port,status FROM mysql_servers;
+--------------+---------------+------+--------+
| hostgroup_id | hostname      | port | status |
+--------------+---------------+------+--------+
| 0            | 192.168.1.62  | 3306 | ONLINE |
| 1            | 192.168.1.180 | 6033 | ONLINE |
+--------------+---------------+------+--------+
2 rows in set (0.000 sec)

And then a query rule that sends all traffic for the zabbix user to both hostgroups (not load-balanced; every query goes to both):

ProxySQL Admin> SELECT rule_id,active,username,destination_hostgroup,mirror_hostgroup,apply FROM mysql_query_rules;
+---------+--------+----------+-----------------------+------------------+-------+
| rule_id | active | username | destination_hostgroup | mirror_hostgroup | apply |
+---------+--------+----------+-----------------------+------------------+-------+
| 1       | 1      | zabbix   | 0                     | 1                | 1     |
+---------+--------+----------+-----------------------+------------------+-------+
1 row in set (0.000 sec)
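
For completeness, the two tables above boil down to something like this fed to the admin interface — reconstructed from the SELECTs, with the same admin password placeholder as earlier:

mysql -u admin -pMYADMINPASSWORD -h 0.0.0.0 -P6032 <<'EOF'
-- hostgroup 0: the existing Galera cluster's virtual IP
-- hostgroup 1: the new test cluster's ProxySQL on port 6033
INSERT INTO mysql_servers (hostgroup_id,hostname,port) VALUES (0,'192.168.1.62',3306);
INSERT INTO mysql_servers (hostgroup_id,hostname,port) VALUES (1,'192.168.1.180',6033);

-- Mirror everything the zabbix user sends to hostgroup 0 onto hostgroup 1
INSERT INTO mysql_query_rules (rule_id,active,username,destination_hostgroup,mirror_hostgroup,apply)
  VALUES (1,1,'zabbix',0,1,1);

LOAD MYSQL SERVERS TO RUNTIME;
LOAD MYSQL QUERY RULES TO RUNTIME;
SAVE MYSQL SERVERS TO DISK;
SAVE MYSQL QUERY RULES TO DISK;
EOF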

Pacemaker for MariaDB

It’s been two days now and I feel like this lady (from an interesting article at CNET, who are apparently still in business). But now I’ve got a MariaDB master-slave cluster set up through Pacemaker. I thought it was going to be a breeze since I wrote down how to do it for MySQL 5.7 on this very website! But no…

First off, the resource agent which worked for MySQL 5.7 does _not_ seem to work for MariaDB 10.x. And pulling the latest scripts from the ClusterLabs GitHub repo into the correct folder on an Ubuntu 20.04 install is bananas. So I abandoned my dear testbench and installed Fedora 33 because their resource-agents package seemed to contain the right stuff. And it does!

But the documentation has some catching up to do. For my three nodes pacemaker01, pacemaker02, pacemaker03 the following sets up the cluster correctly once I set the same password for the hacluster-user on all nodes:

pcs host auth pacemaker01 pacemaker02 pacemaker03
pcs cluster setup mariaha pacemaker01 pacemaker02 pacemaker03
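
The surrounding plumbing, for completeness: hacluster password and pcsd on every node before the auth step, then starting the cluster afterwards (the password is of course a placeholder):

# On all three nodes: give the hacluster user a password and make sure pcsd runs
echo "hacluster:HACLUSTERPASSWORD" | chpasswd
systemctl enable --now pcsd

# After pcs cluster setup has run, start and enable the cluster everywhere
pcs cluster start --all
pcs cluster enable --all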

Note that this requires that the hostnames can be resolved to the correct IPs on all three nodes, either via DNS or /etc/hosts. I did this through /etc/hosts since it’s a testbench.

You’ll also need these packages (plus pcs and resource-agents if they don’t get pulled in as dependencies):

yum install pacemaker pacemaker-cli mariadb-server

Here’s my MariaDB conf file for pacemaker01:

[mariadb]
log-bin
server_id=1
log-basename=pacemaker01
binlog-format=mixed
expire_logs_days=2
relay_log_purge=0 # I want relay logs in case I need to debug duplicate entries etc.
relay_log_space_limit=1094967295 # But only like 1 GB of relay logs.
read-only=1 # Best to start in read-only mode by default to avoid duplicate entries accidents

I got some weird synchronization errors when the mysql database was included in the replication, so I tried locking replication down to this one database, “sysbench”. log_bin=something is important for binary logging to be enabled and that’s a prerequisite for replication. Obviously server_id needs to be unique to each server, and pacemaker01 is replaced by the hostname of the other two servers in their conf files.

pcs cluster cib mariadb_cfg
pcs -f mariadb_cfg resource create db_primary_ip ocf:heartbeat:IPaddr2 ip=192.168.1.209 cidr_netmask=21 op monitor interval=30s on-fail=restart
pcs -f mariadb_cfg resource create mariadb_server ocf:heartbeat:mariadb binary="/usr/libexec/mysqld" \
replication_user="replication_user" replication_passwd="randompassword" \
node_list="pacemaker01 pacemaker02 pacemaker03" \
op start timeout=120 interval=0 \
op stop timeout=120 interval=0 \
op promote timeout=120 interval=0 \
op demote timeout=120 interval=0 \
op monitor role=Master timeout=30 interval=10 \
op monitor role=Slave timeout=30 interval=20 \
op notify timeout="60s" interval="0s"
pcs -f mariadb_cfg resource promotable mariadb_server master-max=1 master-node-max=1 clone-max=3 clone-node-max=1 notify=true
pcs -f mariadb_cfg constraint colocation add master mariadb_server-clone with db_primary_ip score=INFINITY
pcs -f mariadb_cfg constraint order promote mariadb_server-clone then db_primary_ip symmetrical=true
pcs cluster cib-push mariadb_cfg

Here’s a nice thing: the command for creating a master-slave setup used to be something like “pcs -f mariadb_cfg resource master mariadb_server master-max=1 master-node-max=1 clone-max=3 clone-node-max=1 notify=true”.

Yeah, the word master has been replaced by “promotable”. MariaDB are also working to expunge politically incorrect language in case you were wondering. Here was me thinking that slavery was a problem when applied to humans but I guess we’ll just purge these terms no matter what they refer to.

So anyway… What took all Sunday was figuring out why Pacemaker kept saying that zero nodes participated in the election of a new master. There were clearly three nodes registered as Online in the cluster. After equipping the file /usr/lib/ocf/resource.d/heartbeat/mariadb with some additional debug statements, it became clear that the issue was that none of the servers were announcing their gtid_current_pos. Admittedly, that makes it difficult to elect which server is most advanced and should be promoted.

For whatever reason my practice of dumping the master DB like so:

mysqldump --master-data -A > master.sql

is okey but I neglected to consider that Pacemaker assumes that the “gtid_current_pos” value is usable. But this doesn’t get updated until after the replication gets going. So this ended up in the log:

Dec 13 21:14:17 mariadb(mariadb_server)[43240]: INFO: Value to be set as gtid for pacemaker03: 0-1-1
Set r/w permissions for uid=189, gid=189 on /var/log/pacemaker/pacemaker.log
Dec 13 21:14:17 mariadb(mariadb_server)[43240]: INFO: /usr/sbin/attrd_updater -p -N pacemaker03 -n mariadb_server-gtid -Q
Set r/w permissions for uid=189, gid=189 on /var/log/pacemaker/pacemaker.log
Dec 13 21:14:17 mariadb(mariadb_server)[43240]: INFO: Result: name="mariadb_server-gtid" host="pacemaker03" value=""
Set r/w permissions for uid=189, gid=189 on /var/log/pacemaker/pacemaker.log
Dec 13 21:14:17 mariadb(mariadb_server)[43240]: INFO: Query result: name="mariadb_server-gtid" host="pacemaker01" value="0-1-1"
Dec 13 21:14:17 mariadb(mariadb_server)[43240]: INFO: Value in read_gtid: 0-1-1
Set r/w permissions for uid=189, gid=189 on /var/log/pacemaker/pacemaker.log
Dec 13 21:14:17 mariadb(mariadb_server)[43240]: INFO: Query result: name="mariadb_server-gtid" host="pacemaker02" value=""
Dec 13 21:14:17 mariadb(mariadb_server)[43240]: INFO: Value in read_gtid:
Dec 13 21:14:17 mariadb(mariadb_server)[43240]: INFO: No reply from pacemaker02
Set r/w permissions for uid=189, gid=189 on /var/log/pacemaker/pacemaker.log
Dec 13 21:14:17 mariadb(mariadb_server)[43240]: INFO: Query result: name="mariadb_server-gtid" host="pacemaker03" value=""
Dec 13 21:14:17 mariadb(mariadb_server)[43240]: INFO: Value in read_gtid:
Dec 13 21:14:17 mariadb(mariadb_server)[43240]: INFO: No reply from pacemaker03
Dec 13 21:14:17 mariadb(mariadb_server)[43240]: INFO: Not enough nodes (1) contributed to select a master, need 3 nodes.

But logging in to the slaves and running SET GLOBAL gtid_slave_pos='0-1-1'; followed by START SLAVE; did the trick.

A few quick restarts (both pull-the-plug style and orderly) later it seems to work fine. The problem I encountered before was that a master could not rejoin the cluster automatically after it was demoted, even if it was done nicely with “pcs cluster stop pacemaker01”. There would be “Failed to replicate, duplicate entry yada-yada” errors.

Not seeing that now, which is nice! I’ll rewrite my own sysbench script with transactions and “SELECT X FROM Y FOR UPDATE” and things that strain the DB a bit more to work with MariaDB and do some more tests.

Oh, and for some reason the ocf:heartbeat:mariadb script enables semi-synchronous replication. You might want to edit that segment (search for rpl_semi) if you don’t want that tradeoff.

Update 2021-07-08

Went through this again for some ProxySQL-testing and ended up writing a bash-script for setting up the whole thing with fencing and all:

#!/bin/bash

# Create a virtual IP for the database cluster
pcs -f mariadb_cfg resource create db_primary_ip ocf:heartbeat:IPaddr2 ip=192.168.1.180 cidr_netmask=21 op monitor interval=30s on-fail=restart
# Create the MariaDB clone on all three nodes
pcs -f mariadb_cfg resource create mariadb_server ocf:heartbeat:mariadb binary="/usr/sbin/mysqld" \
replication_user="replication_user" replication_passwd="REPLICATIONPASSWORD" \
node_list="pcmk1.svealiden.se pcmk2.svealiden.se pcmk3.svealiden.se" \
        op start timeout=120 interval=0 \
        op stop timeout=120 interval=0 \
        op promote timeout=120 interval=0 \
        op demote timeout=120 interval=0 \
        op monitor role=Master timeout=30 interval=10 \
        op monitor role=Slave timeout=30 interval=20 \
        op notify timeout="60s" interval="0s"
echo MariaDB Server clone set created

# Indicate that there must only ever be one(1) master among the MariaDB clones
pcs -f mariadb_cfg resource promotable mariadb_server master-max=1 master-node-max=1 clone-max=3 clone-node-max=1 notify=true
# The virtual IP MUST be on the same node as the MariaDB master
pcs -f mariadb_cfg constraint colocation add master mariadb_server-clone with db_primary_ip score=INFINITY
pcs -f mariadb_cfg constraint order promote mariadb_server-clone then db_primary_ip symmetrical=true
echo MariaDB Server clone set constraints set

# Let pcmk1.svealiden.se be fenced by the fence_pve agent that needs to connect to
# 192.168.1.21 with the provided credentials to kill VM 139 (pcmk1)
pcs -f mariadb_cfg stonith create p_fence_pcmk1 fence_pve \
        ip="192.168.1.21" nodename="pve1.svealiden.se" \
        login="root@pam" password="PROXMOXROOTPASSWORD" plug="139" delay="15" vmtype="qemu" \
        pcmk_host_list="pcmk1.svealiden.se" \
        op monitor interval="60s" \
        meta target-role="Started" is-managed="true"
pcs -f mariadb_cfg stonith create p_fence_pcmk2 fence_pve \
        ip="192.168.1.22" nodename="pve2.svealiden.se" \
        login="root@pam" password="PROXMOXROOTPASSWORD" plug="140" delay="15" vmtype="qemu" \
        pcmk_host_list="pcmk2.svealiden.se" \
        op monitor interval="60s" \
        meta target-role="Started" is-managed="true"
pcs -f mariadb_cfg stonith create p_fence_pcmk3 fence_pve \
        ip="192.168.1.23" nodename="pve3.svealiden.se" \
        login="root@pam" password="PROXMOXROOTPASSWORD" plug="141" delay="15" vmtype="qemu" \
        pcmk_host_list="pcmk3.svealiden.se" \
        op monitor interval="60s" \
        meta target-role="Started" is-managed="true"
echo Fence devices created

# Make sure each fencing device is located away from the node it is supposed to 
# fence. It makes no sense to expect a node to be able to fence itself in the event
# of some malfunction.
pcs -f mariadb_cfg constraint location p_fence_pcmk1 avoids pcmk1.svealiden.se=INFINITY
pcs -f mariadb_cfg constraint location p_fence_pcmk2 avoids pcmk2.svealiden.se=INFINITY
pcs -f mariadb_cfg constraint location p_fence_pcmk3 avoids pcmk3.svealiden.se=INFINITY
echo Fence location constraints added

# Activate fencing
pcs -f mariadb_cfg property set stonith-enabled=true
pcs cluster cib-push mariadb_cfg

Had to increase the rights of the replication_user for some reason (I gave it SUPER for the sake of simplicity).

Oh, a nicer mysqldump variant:

mysqldump -A --single-transaction --gtid --master-data=1 --triggers --events --routines > pcmk2.sql

Update 2021-07-30

pcs resource move for the master role doesn’t work in the latest Fedora due to a promoted/master inconsistency between crm_resource and pcs. How surprising that changing keywords used to describe specific states causes problems… Anyway, this works:

[root@pcmk2 ~]# crm_resource --move --resource mariadb_server-clone --node pcmk1.svealiden.se --promoted --lifetime=P300S
Migration will take effect until: 2021-07-30 00:38:26 +02:00
[root@pcmk2 ~]# pcs constraint list --full
Location Constraints:
  Resource: mariadb_server-clone
    Constraint: cli-prefer-mariadb_server-clone
      Rule: boolean-op=and score=INFINITY (id:cli-prefer-rule-mariadb_server-clone)
        Expression: #uname eq string pcmk1.svealiden.se (id:cli-prefer-expr-mariadb_server-clone)
        Expression: date lt 2021-07-30 00:38:26 +02:00 (id:cli-prefer-lifetime-end-mariadb_server-clone)
  Resource: p_fence_pcmk1
    Disabled on:
      Node: pcmk1.svealiden.se (score:-INFINITY) (id:location-p_fence_pcmk1-pcmk1.svealiden.se--INFINITY)
  Resource: p_fence_pcmk2
    Disabled on:
      Node: pcmk2.svealiden.se (score:-INFINITY) (id:location-p_fence_pcmk2-pcmk2.svealiden.se--INFINITY)
  Resource: p_fence_pcmk3
    Disabled on:
      Node: pcmk3.svealiden.se (score:-INFINITY) (id:location-p_fence_pcmk3-pcmk3.svealiden.se--INFINITY)
Ordering Constraints:
  promote mariadb_server-clone then start db_primary_ip (kind:Mandatory) (id:order-mariadb_server-clone-db_primary_ip-mandatory)
Colocation Constraints:
  mariadb_server-clone with db_primary_ip (score:INFINITY) (rsc-role:Master) (with-rsc-role:Started) (id:colocation-mariadb_server-clone-db_primary_ip-INFINITY)
Ticket Constraints:
[root@pcmk2 ~]# pcs resource
  * db_primary_ip       (ocf::heartbeat:IPaddr2):        Started pcmk1.svealiden.se
  * Clone Set: mariadb_server-clone [mariadb_server] (promotable):
    * Masters: [ pcmk1.svealiden.se ]
    * Slaves: [ pcmk2.svealiden.se pcmk3.svealiden.se ]

Update 2021-08-23

Forgot to include a really fast way of creating a new clone. No need for mysqldump or locking the source DB:

mariabackup -u root --backup --stream=xbstream | ssh NEWSLAVE "mbstream -x -C /var/lib/mysql/data"
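
One step that one-liner glosses over: the streamed backup normally has to be prepared on the new slave before MariaDB is started on top of it (same directory as above):

# On NEWSLAVE: apply the copied redo log so the datadir is consistent
mariabackup --prepare --target-dir=/var/lib/mysql/data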

Then chown -R mysql: /var/lib/mysql/data and, using the output from the mariabackup command:

[00] 2021-08-23 20:40:11         ...done
[00] 2021-08-23 20:40:11 Executing FLUSH NO_WRITE_TO_BINLOG ENGINE LOGS...
[00] 2021-08-23 20:40:11 mariabackup: The latest check point (for incremental): '5352582709'
[00] 2021-08-23 20:40:11 >> log scanned up to (5356297811)
mariabackup: Stopping log copying thread
[00] 2021-08-23 20:40:11 >> log scanned up to (5356297811)
[00] 2021-08-23 20:40:11 Executing BACKUP STAGE END
[00] 2021-08-23 20:40:11 All tables unlocked
[00] 2021-08-23 20:40:11 Streaming ib_buffer_pool to <STDOUT>
[00] 2021-08-23 20:40:11         ...done
[00] 2021-08-23 20:40:11 Backup created in directory '/root/xtrabackup_backupfiles/'
[00] 2021-08-23 20:40:11 MySQL binlog position: filename 'pacemaker01-bin.000007', position '968336', GTID of the last change '0-2-59839204,1-1-176656'
[00] 2021-08-23 20:40:11 Streaming backup-my.cnf
[00] 2021-08-23 20:40:11         ...done
[00] 2021-08-23 20:40:11 Streaming xtrabackup_info
[00] 2021-08-23 20:40:11         ...done
[00] 2021-08-23 20:40:11 Redo log (from LSN 5352582709 to 5356297811) was copied.
[00] 2021-08-23 20:40:11 completed OK!

Then log on to the new slave and run CHANGE MASTER TO MASTER_HOST='pacemaker01.svealiden.se', MASTER_USER='replication_user', MASTER_PASSWORD='SECRETREPLICATIONPASSWORD', MASTER_PORT=6603, MASTER_LOG_FILE='pacemaker01-bin.000007', MASTER_LOG_POS=968336;

Update 2021-08-23 – part 2

Running things and promoting and demoting nodes seems to be working fine. Consistency has thus far been maintained, which was otherwise the big issue the last time I ran a MariaDB Pacemaker cluster. read_only = ON in the config file is probably the number one contributor, but maybe the dedicated MariaDB resource agent also plays a role.

Update 2021-08-29

New config for setting up fencing:

pcs cluster cib > conf_old
cp conf_old fence_cfg
# Let pacemaker01.svealiden.se be fenced by the fence_pve agent, which needs to connect to
# 192.168.1.21 with the provided credentials to kill VM 127 (pacemaker01.svealiden.se)
pcs -f fence_cfg stonith create p_fence_pacemaker01 fence_pve \
        ip="192.168.1.21" nodename="pve1" \
        login="root@pam" password="PROXMOXPASSWORD" plug="127" delay="15" vmtype="qemu" \
        pcmk_host_check="static-list" pcmk_host_list="pacemaker01.svealiden.se" \
        op monitor interval="60s" \
        meta target-role="Started" is-managed="true"
pcs -f fence_cfg stonith create p_fence_pacemaker02 fence_pve \
        ip="192.168.1.22" nodename="pve2" \
        login="root@pam" password="PROXMOXPASSWORD" plug="128" delay="15" vmtype="qemu" \
        pcmk_host_check="static-list" pcmk_host_list="pacemaker02.svealiden.se" \
        op monitor interval="60s" \
        meta target-role="Started" is-managed="true"
pcs -f fence_cfg stonith create p_fence_pacemaker03 fence_pve \
        ip="192.168.1.23" nodename="pve3" \
        login="root@pam" password="PROXMOXPASSWORD" plug="129" delay="15" vmtype="qemu" \
        pcmk_host_check="static-list" pcmk_host_list="pacemaker03.svealiden.se" \
        op monitor interval="60s" \
        meta target-role="Started" is-managed="true"
echo Fence devices created

# Make sure each fencing device is located away from the node it is supposed to
# fence. It makes no sense to expect a node to be able to fence itself in the event
# of some malfunction.
pcs -f fence_cfg constraint location p_fence_pacemaker01 avoids pacemaker01.svealiden.se=INFINITY
pcs -f fence_cfg constraint location p_fence_pacemaker02 avoids pacemaker02.svealiden.se=INFINITY
pcs -f fence_cfg constraint location p_fence_pacemaker03 avoids pacemaker03.svealiden.se=INFINITY
echo Fence location constraints added

# Activate fencing
pcs cluster cib-push fence_cfg

Super-important details: the part pcmk_host_list="pacemaker02.svealiden.se" is how Pacemaker figures out which fencing "device" can fence which node. But this is only done if pcmk_host_check="static-list". Otherwise Pacemaker tries to figure things out itself, which I don’t necessarily object to, but I feel more confident putting in the right values myself. Note for instance that my VMs don’t use FQDN names, so pacemaker02.svealiden.se is actually called pacemaker02 in Proxmox.
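
Once the constraints are in, it’s worth testing the mapping on purpose (on a node you can afford to have rebooted, obviously). Something like this should make the matching fence device power-cycle the victim via its Proxmox host:

# Ask the cluster to fence a specific node and see which device fires
pcs stonith fence pacemaker03.svealiden.se
# Afterwards, pcs status and journalctl -u pacemaker should show which
# fence device carried out the reboot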

nodename="pve2" has to be the correct name of the Proxmox host as the cluster knows it. More specifically, if you don’t provide nodename at all, Proxmox will try to find an SSL cert under /etc/pve/nodes/pacemaker02.svealiden.se, which is just all kinds of wrong:

root@pve3:~# ls -lh /etc/pve/nodes/pacemaker02.svealiden.se
ls: cannot access '/etc/pve/nodes/pacemaker02.svealiden.se': No such file or directory

Note that I might think of pve3 as pve3.svealiden.se but in the Proxmox GUI it’s just pve3. So nodename has to be exactly pve3. I spent a lot of time figuring out why things weren’t working only to find myself here:

Aug 29 21:49:42 pve3 pveproxy[3137968]: '/etc/pve/nodes/pve3.svealiden.se/pve-ssl.pem' does not exist!

So that was fun!

Now to another problem. The mariadb resource agent doesn’t understand gtids that involve more than one gtid domain for the purposes of figuring out which node should be elected master:

Aug 29 22:01:30 pacemaker02 pacemaker-attrd[80247]: notice: Setting mariadb_server-gtid[pacemaker02.svealiden.se]: 0-2-59851595,1-1-1672237 -> 0-2-59851595,1-1-1672310
Aug 29 22:01:30 pacemaker02 mariadb(mariadb_server)[124915]: INFO: Not enough nodes (1) contributed to select a master, need 2 nodes.
Aug 29 22:01:30 pacemaker02 pacemaker-execd[80246]: notice: mariadb_server_monitor_20000[124915] error output [ (standard_in) 1: syntax error ]
Aug 29 22:01:30 pacemaker02 pacemaker-execd[80246]: notice: mariadb_server_monitor_20000[124915] error output [ /usr/lib/ocf/resource.d/heartbeat/mariadb: line 276: [: : integer expression expected ]

Turns out line 276 is a utility function which compares numbers. It’s called by a function that compares gtids, and it apparently doesn’t work so well when there are two gtids in the string. So now I need to figure out how to purge the errant domain id. I’m leaning towards shutting down Pacemaker, stopping the slaves, blanking all gtid data from the master and then reinitializing the slaves.
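
I haven’t picked the agent apart line by line, but the failure mode is easy to reproduce in plain bash: once a second domain sneaks into the GTID string, the “sequence number” you cut out of it is no longer an integer. Illustration only, not the actual resource agent code:

seq_single=$(echo "0-2-59851595" | cut -d- -f3)              # 59851595
seq_multi=$(echo "0-2-59851595,1-1-1672237" | cut -d- -f3)   # 59851595,1
[ "$seq_multi" -gt 0 ] && echo "newer"
# bash: [: 59851595,1: integer expression expected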

Can I blank all GTID data? Oh, even better! I can blank the domain id I don’t want: https://jira.mariadb.org/browse/MDEV-12012

MariaDB [(none)]> SELECT @@global.gtid_binlog_state;
+--------------------------------------+
| @@global.gtid_binlog_state           |
+--------------------------------------+
| 0-2-59851595,1-1-1678111,1-2-1678300 |
+--------------------------------------+
1 row in set (0.000 sec)

MariaDB [(none)]> SELECT @@global.gtid_binlog_state;
+--------------------------------------+
| @@global.gtid_binlog_state           |
+--------------------------------------+
| 0-2-59851595,1-1-1678111,1-2-1678337 |
+--------------------------------------+
1 row in set (0.000 sec)

MariaDB [(none)]> SELECT @@global.gtid_binlog_state;
+--------------------------------------+
| @@global.gtid_binlog_state           |
+--------------------------------------+
| 0-2-59851595,1-1-1678111,1-2-1678338 |
+--------------------------------------+
1 row in set (0.000 sec)

MariaDB [(none)]> FLUSH BINARY LOGS DELETE_DOMAIN_ID=(0);
Query OK, 0 rows affected (0.029 sec)

MariaDB [(none)]> SELECT @@global.gtid_binlog_state;
+----------------------------+
| @@global.gtid_binlog_state |
+----------------------------+
| 1-1-1678111,1-2-1678626    |
+----------------------------+
1 row in set (0.000 sec)

MariaDB [(none)]> SELECT @@global.gtid_binlog_state;
+----------------------------+
| @@global.gtid_binlog_state |
+----------------------------+
| 1-1-1678111,1-2-1678629    |
+----------------------------+
1 row in set (0.000 sec)

MariaDB [(none)]>

Seems to have worked. Switching on Pacemaker messed up the synchronization, but running good old mariabackup -u root --backup --stream=xbstream | ssh NEWSLAVE "mbstream -x -C /var/lib/mysql/data" again got things back into the right groove.

HA WooCommerce on a budget

There is a draft on this website called “HA WordPress on a slightly bigger budget” which details the futility of trying to use Keepalived and Cloudflare Load Balancer to make a working primary/backup high availability server setup. I do not consider this to be any failing on the part of Keepalived, as it quite clearly isn’t meant to be used for things like databases where split brain is a concern.

Cloudflare could have been a bit more helpful by providing a “no failback” option, so that if traffic was ever directed to the backup it wouldn’t flow back to the primary automatically when that server came back online. But I’ve implemented that using their REST API instead. So it’s still Cloudflare Load Balancer, MariaDB, Nginx, PHP-FPM, lsyncd and some janky bash scripts. It’s just that Keepalived isn’t being asked to do the work of Pacemaker/Corosync.

A planned switchover

Important numbers:

Thu 24 Dec 2020 11:37:11 PM CET Read worked. | Write worked. 185.20.12.24
Thu 24 Dec 2020 11:37:30 PM CET Read worked. | Write worked. 172.31.35.57

We had 19 seconds during which writes didn’t work due to the switchover process. Reads worked throughout. Full log:

cjp@util:~$ ./wootest.sh
Thu 24 Dec 2020 11:36:53 PM CET Read worked. | Write worked. 185.20.12.24

Thu 24 Dec 2020 11:36:54 PM CET Read worked. | Write worked. 185.20.12.24

Thu 24 Dec 2020 11:36:56 PM CET Read worked. | Write worked. 185.20.12.24

Thu 24 Dec 2020 11:36:58 PM CET Read worked. | Write worked. 185.20.12.24

Thu 24 Dec 2020 11:36:59 PM CET Read worked. | Write worked. 185.20.12.24

Thu 24 Dec 2020 11:37:01 PM CET Read worked. | Write worked. 185.20.12.24

Thu 24 Dec 2020 11:37:03 PM CET Read worked. | Write worked. 185.20.12.24

Thu 24 Dec 2020 11:37:04 PM CET Read worked. | Write worked. 185.20.12.24

Thu 24 Dec 2020 11:37:06 PM CET Read worked. | Write worked. 185.20.12.24

Thu 24 Dec 2020 11:37:07 PM CET Read worked. | Write worked. 185.20.12.24

Thu 24 Dec 2020 11:37:09 PM CET Read worked. | Write worked. 185.20.12.24

Thu 24 Dec 2020 11:37:11 PM CET Read worked. | Write worked. 185.20.12.24

Thu 24 Dec 2020 11:37:12 PM CET Read worked. | Write failed. 185.20.12.24

Thu 24 Dec 2020 11:37:14 PM CET Read worked. | Write failed. 185.20.12.24

Thu 24 Dec 2020 11:37:15 PM CET Read worked. | Write failed. 185.20.12.24

Thu 24 Dec 2020 11:37:17 PM CET Read worked. | Write failed. 185.20.12.24

Thu 24 Dec 2020 11:37:19 PM CET Read worked. | Write failed. 185.20.12.24

Thu 24 Dec 2020 11:37:20 PM CET Read worked. | Write failed. 185.20.12.24

Thu 24 Dec 2020 11:37:22 PM CET Read worked. | Write failed. 185.20.12.24

Thu 24 Dec 2020 11:37:23 PM CET Read worked. | Write failed. 172.31.35.57

Thu 24 Dec 2020 11:37:25 PM CET Read worked. | Write failed. 172.31.35.57

Thu 24 Dec 2020 11:37:27 PM CET Read worked. | Write failed. 172.31.35.57

Thu 24 Dec 2020 11:37:30 PM CET Read worked. | Write worked. 172.31.35.57

Thu 24 Dec 2020 11:37:32 PM CET Read worked. | Write worked. 172.31.35.57

Thu 24 Dec 2020 11:37:34 PM CET Read worked. | Write worked. 172.31.35.57

Thu 24 Dec 2020 11:37:36 PM CET Read worked. | Write worked. 172.31.35.57

Thu 24 Dec 2020 11:37:38 PM CET Read worked. | Write worked. 172.31.35.57

Thu 24 Dec 2020 11:37:40 PM CET Read worked. | Write worked. 172.31.35.57

Thu 24 Dec 2020 11:37:42 PM CET Read worked. | Write worked. 172.31.35.57

Thu 24 Dec 2020 11:37:44 PM CET Read worked. | Write worked. 172.31.35.57

Yes, I _am_ an idiot. But not for the reasons that I thought…

Installing Zabbix Agent on RHEL was difficult for me somehow

Failover test

Okey, so now for the (hopefully) rarer case of the primary going down all of a sudden:

Sat 26 Dec 2020 06:44:18 PM CET Read worked. | Write worked. 172.31.35.57

Sat 26 Dec 2020 06:44:21 PM CET Read worked. | Write worked. 172.31.35.57

Sat 26 Dec 2020 06:44:23 PM CET Read worked. | Write worked. 172.31.35.57

Sat 26 Dec 2020 06:44:26 PM CET Read worked. | Write worked. 172.31.35.57

Sat 26 Dec 2020 06:44:27 PM CET Read failed. | Write failed. <html>
<head><title>521 Origin Down</title></head>
<body bgcolor="white">
<center><h1>521 Origin Down</h1></center>
<hr><center>cloudflare-nginx</center>
</body>
</html>

Sat 26 Dec 2020 06:44:28 PM CET Read failed. | Write failed. <html>
<head><title>521 Origin Down</title></head>
<body bgcolor="white">
<center><h1>521 Origin Down</h1></center>
<hr><center>cloudflare-nginx</center>
</body>
</html>

Sat 26 Dec 2020 06:45:30 PM CET Read failed. | Write failed. 185.20.12.24

Sat 26 Dec 2020 06:45:32 PM CET Read worked. | Write failed. 185.20.12.24

Sat 26 Dec 2020 06:45:33 PM CET Read worked. | Write failed. 185.20.12.24

Sat 26 Dec 2020 06:45:35 PM CET Read worked. | Write failed. 185.20.12.24

Sat 26 Dec 2020 06:45:36 PM CET Read worked. | Write failed. 185.20.12.24

Sat 26 Dec 2020 06:45:38 PM CET Read worked. | Write failed. 185.20.12.24

Sat 26 Dec 2020 06:45:39 PM CET Read worked. | Write failed. 185.20.12.24

Sat 26 Dec 2020 06:45:40 PM CET Read worked. | Write failed. 185.20.12.24

Sat 26 Dec 2020 06:45:42 PM CET Read worked. | Write failed. 185.20.12.24

Sat 26 Dec 2020 06:45:43 PM CET Read worked. | Write failed. 185.20.12.24

Sat 26 Dec 2020 06:45:45 PM CET Read worked. | Write failed. 185.20.12.24

Sat 26 Dec 2020 06:45:46 PM CET Read worked. | Write failed. 185.20.12.24

Sat 26 Dec 2020 06:45:48 PM CET Read worked. | Write failed. 185.20.12.24

Sat 26 Dec 2020 06:45:49 PM CET Read worked. | Write failed. 185.20.12.24

Sat 26 Dec 2020 06:45:50 PM CET Read worked. | Write failed. 185.20.12.24

Sat 26 Dec 2020 06:45:52 PM CET Read worked. | Write failed. 185.20.12.24

Sat 26 Dec 2020 06:45:53 PM CET Read worked. | Write failed. 185.20.12.24

Sat 26 Dec 2020 06:45:55 PM CET Read worked. | Write failed. 185.20.12.24

Sat 26 Dec 2020 06:45:56 PM CET Read worked. | Write worked. 185.20.12.24

Sat 26 Dec 2020 06:45:58 PM CET Read worked. | Write worked. 185.20.12.24

Sat 26 Dec 2020 06:45:59 PM CET Read worked. | Write worked. 185.20.12.24

Sat 26 Dec 2020 06:46:00 PM CET Read worked. | Write worked. 185.20.12.24

Sat 26 Dec 2020 06:46:02 PM CET Read worked. | Write worked. 185.20.12.24

Sat 26 Dec 2020 06:46:03 PM CET Read worked. | Write worked. 185.20.12.24

Sat 26 Dec 2020 06:46:04 PM CET Read worked. | Write worked. 185.20.12.24

Sat 26 Dec 2020 06:46:06 PM CET Read worked. | Write worked. 185.20.12.24

Sat 26 Dec 2020 06:46:07 PM CET Read worked. | Write worked. 185.20.12.24

Important numbers:

Sat 26 Dec 2020 06:44:26 PM CET Read worked. | Write worked. 172.31.35.57
Sat 26 Dec 2020 06:45:32 PM CET Read worked. | Write failed. 185.20.12.24
Sat 26 Dec 2020 06:45:56 PM CET Read worked. | Write worked. 185.20.12.24

So about a minute with the site being down for reads and writes, and another half a minute during which only writes didn’t work. Not too shabby. And Zabbix is suitably concerned:

Crashed server rehabilitation

Problems getting a crashed primary to sync with backup:

[root@ip-172-31-35-57 ~]# mysql -e "SHOW SLAVE STATUS\G;"
*************************** 1. row ***************************
                Slave_IO_State:
                   Master_Host: backup
                   Master_User: replication_user
                   Master_Port: 3306
                 Connect_Retry: 60
               Master_Log_File:
           Read_Master_Log_Pos: 4
                Relay_Log_File: woo1-relay-log.000001
                 Relay_Log_Pos: 4
         Relay_Master_Log_File:
              Slave_IO_Running: No
             Slave_SQL_Running: Yes
               Replicate_Do_DB:
           Replicate_Ignore_DB:
            Replicate_Do_Table:
        Replicate_Ignore_Table:
       Replicate_Wild_Do_Table:
   Replicate_Wild_Ignore_Table:
                    Last_Errno: 0
                    Last_Error:
                  Skip_Counter: 0
           Exec_Master_Log_Pos: 4
               Relay_Log_Space: 256
               Until_Condition: None
                Until_Log_File:
                 Until_Log_Pos: 0
            Master_SSL_Allowed: No
            Master_SSL_CA_File:
            Master_SSL_CA_Path:
               Master_SSL_Cert:
             Master_SSL_Cipher:
                Master_SSL_Key:
         Seconds_Behind_Master: NULL
 Master_SSL_Verify_Server_Cert: No
                 Last_IO_Errno: 1236
                 Last_IO_Error: Got fatal error 1236 from master when reading data from binary log: 'Error: connecting slave requested to start from GTID 0-1-58257, which is not in the master's binlog. Since the master's binlog contains GTIDs with higher sequence numbers, it probably means that the slave has diverged due to executing extra erroneous transactions'
                Last_SQL_Errno: 0
                Last_SQL_Error:
   Replicate_Ignore_Server_Ids:
              Master_Server_Id: 2
                Master_SSL_Crl:
            Master_SSL_Crlpath:
                    Using_Gtid: Current_Pos
                   Gtid_IO_Pos: 0-1-58257
       Replicate_Do_Domain_Ids:
   Replicate_Ignore_Domain_Ids:
                 Parallel_Mode: conservative
                     SQL_Delay: 0
           SQL_Remaining_Delay: NULL
       Slave_SQL_Running_State: Slave has read all relay log; waiting for the slave I/O thread to update it
              Slave_DDL_Groups: 2
Slave_Non_Transactional_Groups: 14585
    Slave_Transactional_Groups: 8746

Oh, dear! What have we missed? I.e. what happened on the primary when it was disconnected from the backup and _should_ have been read-only?

[root@ip-172-31-35-57 ~]# mysqlbinlog /var/lib/mysql/woo1-bin.000054 | less
<snip>
#201225 17:32:10 server id 1  end_log_pos 8224049 CRC32 0x780272cf      GTID 0-1-58257

/*!100001 SET @@session.gtid_seq_no=58257*//*!*/;
START TRANSACTION
/*!*/;
# at 8224049
#201225 17:32:10 server id 1  end_log_pos 8224198 CRC32 0x07e3910e      Query   thread_id=3456  exec_time=0     error_code=0
SET TIMESTAMP=1608917530/*!*/;
DELETE FROM `wp9k_options` WHERE `option_name` = '_transient_doing_cron'
/*!*/;
# at 8224198
#201225 17:32:10 server id 1  end_log_pos 8224281 CRC32 0xb1ecec93      Query   thread_id=3456  exec_time=0     error_code=0
SET TIMESTAMP=1608917530/*!*/;
COMMIT
/*!*/;
DELIMITER ;
# End of log file
ROLLBACK /* added by mysqlbinlog */;
/*!50003 SET COMPLETION_TYPE=@OLD_COMPLETION_TYPE*/;
/*!50530 SET @@SESSION.PSEUDO_SLAVE_MODE=0*/;

Oh, DELETE _transient_doing_cron… Well that’s no biggie. I was feigning concern that lots of orders to my entirely not-real web shop had somehow been written to the wrong database.

Well, MariaDB is very stubborn about these sorts of things. The only way I know of to get the primary back in sync with the backup is:

root@woo2:~# mysqldump -A --master-data > backup_2020-12-25_1854.sql
root@woo2:~# scp backup_2020-12-25_1854.sql primary:/root
backup_2020-12-25_1854.sql

Back to woo1 (primary):

[root@ip-172-31-35-57 ~]# systemctl stop mariadb
[root@ip-172-31-35-57 ~]# rm -rf /var/lib/mysql/* # <--- EXTREMELY DANGEROUS IF YOU DON'T KNOW PRECISELY WHAT YOU ARE DOING. Lost hours of work doing this on the wrong server once...
[root@ip-172-31-35-57 ~]# mysql_install_db
Installing MariaDB/MySQL system tables in '/var/lib/mysql' ...
OK

To start mysqld at boot time you have to copy
support-files/mysql.server to the right place for your system


PLEASE REMEMBER TO SET A PASSWORD FOR THE MariaDB root USER !
To do so, start the server, then issue the following commands:

'/usr/bin/mysqladmin' -u root password 'new-password'
'/usr/bin/mysqladmin' -u root -h ip-172-31-35-57.eu-north-1.compute.internal password 'new-password'

Alternatively you can run:
'/usr/bin/mysql_secure_installation'

which will also give you the option of removing the test
databases and anonymous user created by default.  This is
strongly recommended for production servers.

See the MariaDB Knowledgebase at http://mariadb.com/kb or the
MySQL manual for more instructions.

You can start the MariaDB daemon with:
cd '/usr' ; /usr/bin/mysqld_safe --datadir='/var/lib/mysql'

You can test the MariaDB daemon with mysql-test-run.pl
cd '/usr/mysql-test' ; perl mysql-test-run.pl

Please report any problems at http://mariadb.org/jira

The latest information about MariaDB is available at http://mariadb.org/.
You can find additional information about the MySQL part at:
http://dev.mysql.com
Consider joining MariaDB's strong and vibrant community:
https://mariadb.org/get-involved/

[root@ip-172-31-35-57 ~]# systemctl restart mariadb
Job for mariadb.service failed because the control process exited with error code.
See "systemctl status mariadb.service" and "journalctl -xe" for details.
[root@ip-172-31-35-57 ~]# journalctl -u mariadb
-- Logs begin at Wed 2020-12-23 00:40:59 UTC, end at Fri 2020-12-25 17:55:55 UTC. --
Dec 23 00:44:22 ip-172-31-35-57.eu-north-1.compute.internal systemd[1]: Starting MariaDB 10.3 database server...
Dec 23 00:44:22 ip-172-31-35-57.eu-north-1.compute.internal mysql-check-socket[1506]: Socket file /var/lib/mysql/mysql.sock exists.
Dec 23 00:44:22 ip-172-31-35-57.eu-north-1.compute.internal mysql-check-socket[1506]: No process is using /var/lib/mysql/mysql.sock, which means it is a garbage, so it will be removed automatically.
Dec 23 00:44:22 ip-172-31-35-57.eu-north-1.compute.internal mysql-prepare-db-dir[1535]: Database MariaDB is probably initialized in /var/lib/mysql already, nothing is done.
Dec 23 00:44:22 ip-172-31-35-57.eu-north-1.compute.internal mysql-prepare-db-dir[1535]: If this is not the case, make sure the /var/lib/mysql is empty before running mysql-prepare-db-dir.
Dec 23 00:44:23 ip-172-31-35-57.eu-north-1.compute.internal mysqld[1573]: 2020-12-23  0:44:23 0 [Note] /usr/libexec/mysqld (mysqld 10.3.27-MariaDB-log) starting as process 1573 ...
Dec 23 00:44:23 ip-172-31-35-57.eu-north-1.compute.internal systemd[1]: Started MariaDB 10.3 database server.
Dec 25 17:55:26 ip-172-31-35-57.eu-north-1.compute.internal systemd[1]: Stopping MariaDB 10.3 database server...
Dec 25 17:55:28 ip-172-31-35-57.eu-north-1.compute.internal systemd[1]: mariadb.service: Succeeded.
Dec 25 17:55:28 ip-172-31-35-57.eu-north-1.compute.internal systemd[1]: Stopped MariaDB 10.3 database server.
Dec 25 17:55:55 ip-172-31-35-57.eu-north-1.compute.internal systemd[1]: Starting MariaDB 10.3 database server...
Dec 25 17:55:55 ip-172-31-35-57.eu-north-1.compute.internal mysql-prepare-db-dir[66558]: Database MariaDB is probably initialized in /var/lib/mysql already, nothing is done.
Dec 25 17:55:55 ip-172-31-35-57.eu-north-1.compute.internal mysql-prepare-db-dir[66558]: If this is not the case, make sure the /var/lib/mysql is empty before running mysql-prepare-db-dir.
Dec 25 17:55:55 ip-172-31-35-57.eu-north-1.compute.internal mysqld[66596]: 2020-12-25 17:55:55 0 [Note] /usr/libexec/mysqld (mysqld 10.3.27-MariaDB-log) starting as process 66596 ...
Dec 25 17:55:55 ip-172-31-35-57.eu-north-1.compute.internal mysqld[66596]: /usr/libexec/mysqld: Can't create file './woo1.err' (errno: 13 "Permission denied")
Dec 25 17:55:55 ip-172-31-35-57.eu-north-1.compute.internal mysqld[66596]: 2020-12-25 17:55:55 0 [ERROR] mysqld: File './woo1-bin.index' not found (Errcode: 13 "Permission denied")
Dec 25 17:55:55 ip-172-31-35-57.eu-north-1.compute.internal mysqld[66596]: 2020-12-25 17:55:55 0 [ERROR] Aborting
Dec 25 17:55:55 ip-172-31-35-57.eu-north-1.compute.internal systemd[1]: mariadb.service: Main process exited, code=exited, status=1/FAILURE
Dec 25 17:55:55 ip-172-31-35-57.eu-north-1.compute.internal systemd[1]: mariadb.service: Failed with result 'exit-code'.
Dec 25 17:55:55 ip-172-31-35-57.eu-north-1.compute.internal systemd[1]: Failed to start MariaDB 10.3 database server.
[root@ip-172-31-35-57 ~]# ls -lh /var/lib/
total 4.0K
drwxr-xr-x. 2 root    root      85 Dec 18 20:17 alternatives
drwxr-xr-x. 3 root    root     209 Oct 31 05:07 authselect
drwxr-xr-x. 2 chrony  chrony    19 Dec 25 17:49 chrony
drwxr-xr-x. 8 root    root     105 Dec 23 00:41 cloud
drwx------. 2 apache  apache     6 Jun 15  2020 dav
drwxr-xr-x. 2 root    root      24 Oct 31 05:02 dbus
drwxr-xr-x. 2 root    root       6 May  7  2020 dhclient
drwxr-xr-x. 4 root    root     115 Dec 25 16:32 dnf
drwxr-xr-x. 2 root    root       6 Apr 23  2020 games
drwx------. 2 apache  apache     6 Jun 15  2020 httpd
drwxr-xr-x. 2 root    root       6 Aug  4 12:03 initramfs
drwxr-xr-x. 2 root    root      30 Dec 25 03:51 logrotate
drwxr-xr-x. 2 root    root       6 Apr 23  2020 misc
drwxr-xr-x. 5 mysql   mysql    264 Dec 25 17:55 mysql
drwxr-xr-x. 4 root    root      45 Dec 18 22:13 net-snmp
drwx------. 2 root    root     169 Dec 25 17:51 NetworkManager
drwxrwx---. 3 nginx   root      33 Dec 18 23:20 nginx
drwxr-xr-x. 2 root    root       6 Aug 12  2018 os-prober
drwxr-xr-x. 6 root    root      68 Dec 18 21:37 php
drwxr-x---. 3 root    polkitd   28 Oct 31 05:02 polkit-1
drwx------. 2 root    root       6 Oct 31 05:02 portables
drwx------. 2 root    root       6 Oct 31 05:02 private
drwxr-x---. 6 root    root      71 Oct 31 05:03 rhsm
drwxr-xr-x. 2 root    root    4.0K Dec 23 00:51 rpm
drwxr-xr-x. 3 root    root      21 Dec 18 20:30 rpm-state
drwx------. 2 root    root      29 Dec 25 17:55 rsyslog
drwxr-xr-x. 5 root    root      46 Oct 31 05:03 selinux
drwxr-xr-x. 9 root    root     105 Oct 31 05:03 sss
drwxr-xr-x. 5 root    root      70 Nov  6 13:06 systemd
drwx------. 2 tss     tss        6 Jun 10  2019 tpm
drwxr-xr-x. 2 root    root       6 Oct  2 16:12 tuned
drwxr-xr-x. 2 unbound unbound   22 Dec 25 00:00 unbound
[root@ip-172-31-35-57 ~]# ls -lh /var/lib/mysql/
total 109M
-rw-rw---- 1 root root  16K Dec 25 17:55 aria_log.00000001
-rw-rw---- 1 root root   52 Dec 25 17:55 aria_log_control
-rw-rw---- 1 root root  972 Dec 25 17:55 ib_buffer_pool
-rw-rw---- 1 root root  12M Dec 25 17:55 ibdata1
-rw-rw---- 1 root root  48M Dec 25 17:55 ib_logfile0
-rw-rw---- 1 root root  48M Dec 25 17:55 ib_logfile1
drwx------ 2 root root 4.0K Dec 25 17:55 mysql
drwx------ 2 root root   20 Dec 25 17:55 performance_schema
drwx------ 2 root root   20 Dec 25 17:55 test
-rw-rw---- 1 root root  28K Dec 25 17:55 woo1-bin.000001
-rw-rw---- 1 root root   18 Dec 25 17:55 woo1-bin.index
-rw-rw---- 1 root root    7 Dec 25 17:55 woo1-bin.state
-rw-rw---- 1 root root  642 Dec 25 17:55 woo1.err
[root@ip-172-31-35-57 ~]# chown -R mysql:mysql /var/lib/mysql/
[root@ip-172-31-35-57 ~]# systemctl restart mariadb
[root@ip-172-31-35-57 ~]# journalctl -u mariadb
-- Logs begin at Wed 2020-12-23 00:40:59 UTC, end at Fri 2020-12-25 17:56:22 UTC. --
Dec 23 00:44:22 ip-172-31-35-57.eu-north-1.compute.internal systemd[1]: Starting MariaDB 10.3 database server...
Dec 23 00:44:22 ip-172-31-35-57.eu-north-1.compute.internal mysql-check-socket[1506]: Socket file /var/lib/mysql/mysql.sock exists.
Dec 23 00:44:22 ip-172-31-35-57.eu-north-1.compute.internal mysql-check-socket[1506]: No process is using /var/lib/mysql/mysql.sock, which means it is a garbage, so it will be removed automatically.
Dec 23 00:44:22 ip-172-31-35-57.eu-north-1.compute.internal mysql-prepare-db-dir[1535]: Database MariaDB is probably initialized in /var/lib/mysql already, nothing is done.
Dec 23 00:44:22 ip-172-31-35-57.eu-north-1.compute.internal mysql-prepare-db-dir[1535]: If this is not the case, make sure the /var/lib/mysql is empty before running mysql-prepare-db-dir.
Dec 23 00:44:23 ip-172-31-35-57.eu-north-1.compute.internal mysqld[1573]: 2020-12-23  0:44:23 0 [Note] /usr/libexec/mysqld (mysqld 10.3.27-MariaDB-log) starting as process 1573 ...
Dec 23 00:44:23 ip-172-31-35-57.eu-north-1.compute.internal systemd[1]: Started MariaDB 10.3 database server.
Dec 25 17:55:26 ip-172-31-35-57.eu-north-1.compute.internal systemd[1]: Stopping MariaDB 10.3 database server...
Dec 25 17:55:28 ip-172-31-35-57.eu-north-1.compute.internal systemd[1]: mariadb.service: Succeeded.
Dec 25 17:55:28 ip-172-31-35-57.eu-north-1.compute.internal systemd[1]: Stopped MariaDB 10.3 database server.
Dec 25 17:55:55 ip-172-31-35-57.eu-north-1.compute.internal systemd[1]: Starting MariaDB 10.3 database server...
Dec 25 17:55:55 ip-172-31-35-57.eu-north-1.compute.internal mysql-prepare-db-dir[66558]: Database MariaDB is probably initialized in /var/lib/mysql already, nothing is done.
Dec 25 17:55:55 ip-172-31-35-57.eu-north-1.compute.internal mysql-prepare-db-dir[66558]: If this is not the case, make sure the /var/lib/mysql is empty before running mysql-prepare-db-dir.
Dec 25 17:55:55 ip-172-31-35-57.eu-north-1.compute.internal mysqld[66596]: 2020-12-25 17:55:55 0 [Note] /usr/libexec/mysqld (mysqld 10.3.27-MariaDB-log) starting as process 66596 ...
Dec 25 17:55:55 ip-172-31-35-57.eu-north-1.compute.internal mysqld[66596]: /usr/libexec/mysqld: Can't create file './woo1.err' (errno: 13 "Permission denied")
Dec 25 17:55:55 ip-172-31-35-57.eu-north-1.compute.internal mysqld[66596]: 2020-12-25 17:55:55 0 [ERROR] mysqld: File './woo1-bin.index' not found (Errcode: 13 "Permission denied")
Dec 25 17:55:55 ip-172-31-35-57.eu-north-1.compute.internal mysqld[66596]: 2020-12-25 17:55:55 0 [ERROR] Aborting
Dec 25 17:55:55 ip-172-31-35-57.eu-north-1.compute.internal systemd[1]: mariadb.service: Main process exited, code=exited, status=1/FAILURE
Dec 25 17:55:55 ip-172-31-35-57.eu-north-1.compute.internal systemd[1]: mariadb.service: Failed with result 'exit-code'.
Dec 25 17:55:55 ip-172-31-35-57.eu-north-1.compute.internal systemd[1]: Failed to start MariaDB 10.3 database server.
Dec 25 17:56:22 ip-172-31-35-57.eu-north-1.compute.internal systemd[1]: Starting MariaDB 10.3 database server...
Dec 25 17:56:22 ip-172-31-35-57.eu-north-1.compute.internal mysql-prepare-db-dir[66636]: Database MariaDB is probably initialized in /var/lib/mysql already, nothing is done.
Dec 25 17:56:22 ip-172-31-35-57.eu-north-1.compute.internal mysql-prepare-db-dir[66636]: If this is not the case, make sure the /var/lib/mysql is empty before running mysql-prepare-db-dir.
Dec 25 17:56:22 ip-172-31-35-57.eu-north-1.compute.internal mysqld[66675]: 2020-12-25 17:56:22 0 [Note] /usr/libexec/mysqld (mysqld 10.3.27-MariaDB-log) starting as process 66675 ...
Dec 25 17:56:22 ip-172-31-35-57.eu-north-1.compute.internal mysql-check-upgrade[66707]: The datadir located at /var/lib/mysql needs to be upgraded using 'mysql_upgrade' tool. This can be done using the following steps:
Dec 25 17:56:22 ip-172-31-35-57.eu-north-1.compute.internal mysql-check-upgrade[66707]:   1. Back-up your data before with 'mysql_upgrade'
Dec 25 17:56:22 ip-172-31-35-57.eu-north-1.compute.internal mysql-check-upgrade[66707]:   2. Start the database daemon using 'service mariadb start'
Dec 25 17:56:22 ip-172-31-35-57.eu-north-1.compute.internal mysql-check-upgrade[66707]:   3. Run 'mysql_upgrade' with a database user that has sufficient privileges
Dec 25 17:56:22 ip-172-31-35-57.eu-north-1.compute.internal mysql-check-upgrade[66707]: Read more about 'mysql_upgrade' usage at:
Dec 25 17:56:22 ip-172-31-35-57.eu-north-1.compute.internal mysql-check-upgrade[66707]: https://mariadb.com/kb/en/mariadb/documentation/sql-commands/table-commands/mysql_upgrade/
Dec 25 17:56:22 ip-172-31-35-57.eu-north-1.compute.internal systemd[1]: Started MariaDB 10.3 database server.
[root@ip-172-31-35-57 ~]# mysql_upgrade
Phase 1/7: Checking and upgrading mysql database
Processing databases
mysql
mysql.column_stats                                 OK
mysql.columns_priv                                 OK
mysql.db                                           OK
mysql.event                                        OK
mysql.func                                         OK
mysql.gtid_slave_pos                               OK
mysql.help_category                                OK
mysql.help_keyword                                 OK
mysql.help_relation                                OK
mysql.help_topic                                   OK
mysql.host                                         OK
mysql.index_stats                                  OK
mysql.innodb_index_stats                           OK
mysql.innodb_table_stats                           OK
mysql.plugin                                       OK
mysql.proc                                         OK
mysql.procs_priv                                   OK
mysql.proxies_priv                                 OK
mysql.roles_mapping                                OK
mysql.servers                                      OK
mysql.table_stats                                  OK
mysql.tables_priv                                  OK
mysql.time_zone                                    OK
mysql.time_zone_leap_second                        OK
mysql.time_zone_name                               OK
mysql.time_zone_transition                         OK
mysql.time_zone_transition_type                    OK
mysql.transaction_registry                         OK
mysql.user                                         OK
Phase 2/7: Installing used storage engines... Skipped
Phase 3/7: Fixing views
Phase 4/7: Running 'mysql_fix_privilege_tables'
Phase 5/7: Fixing table and database names
Phase 6/7: Checking and upgrading tables
Processing databases
information_schema
performance_schema
test
Phase 7/7: Running 'FLUSH PRIVILEGES'
OK
[root@ip-172-31-35-57 ~]# mysql -e "SHOW SLAVE STATUS\G;"
[root@ip-172-31-35-57 ~]# mysql < backup_2020-12-25_1854.sql
[root@ip-172-31-35-57 ~]# grep gtid_slave_pos backup_2020-12-25_1854.sql  | head -1
-- SET GLOBAL gtid_slave_pos='0-2-61170';

Welcome to the MariaDB monitor.  Commands end with ; or \g.
Your MariaDB connection id is 32
Server version: 10.3.27-MariaDB-log MariaDB Server

Copyright (c) 2000, 2018, Oracle, MariaDB Corporation Ab and others.

Type 'help;' or '\h' for help. Type '\c' to clear the current input statement.

MariaDB [(none)]> GRANT ALL PRIVILEGES ON `johante2_wp508`.* TO `johante2_wp508`@`localhost`;
Query OK, 0 rows affected (0.001 sec)

MariaDB [(none)]> GRANT USAGE,REPLICATION CLIENT,PROCESS,SHOW DATABASES,SHOW VIEW ON *.* TO 'zbx_monitor'@'%';
Query OK, 0 rows affected (0.000 sec)


MariaDB [(none)]> SET GLOBAL gtid_slave_pos='0-2-61170';
Query OK, 0 rows affected (0.021 sec)

MariaDB [(none)]> Bye
[root@ip-172-31-35-57 ~]#

Oki, let’s try to get the primary to replicate from the backup now.



MariaDB [(none)]> CHANGE MASTER TO master_host="backup", master_port=3306, master_user="replication_user", master_password="REPLICATION_PASSWORD", master_use_gtid=slave_pos;
Query OK, 0 rows affected (0.005 sec)

MariaDB [(none)]> START SLAVE;
Query OK, 0 rows affected (0.009 sec)

MariaDB [(none)]> SHOW SLAVE STATUS;
+----------------------------------+-------------+------------------+-------------+---------------+-----------------+---------------------+-----------------------+---------------+-----------------------+------------------+-------------------+-----------------+---------------------+--------------------+------------------------+-------------------------+-----------------------------+------------+------------+--------------+---------------------+-----------------+-----------------+----------------+---------------+--------------------+--------------------+--------------------+-----------------+-------------------+----------------+-----------------------+-------------------------------+---------------+---------------+----------------+----------------+-----------------------------+------------------+----------------+--------------------+------------+-------------+-------------------------+-----------------------------+---------------+-----------+---------------------+-----------------------------------------------------------------------------+------------------+--------------------------------+----------------------------+
| Slave_IO_State                   | Master_Host | Master_User      | Master_Port | Connect_Retry | Master_Log_File | Read_Master_Log_Pos | Relay_Log_File        | Relay_Log_Pos | Relay_Master_Log_File | Slave_IO_Running | Slave_SQL_Running | Replicate_Do_DB | Replicate_Ignore_DB | Replicate_Do_Table | Replicate_Ignore_Table | Replicate_Wild_Do_Table | Replicate_Wild_Ignore_Table | Last_Errno | Last_Error | Skip_Counter | Exec_Master_Log_Pos | Relay_Log_Space | Until_Condition | Until_Log_File | Until_Log_Pos | Master_SSL_Allowed | Master_SSL_CA_File | Master_SSL_CA_Path | Master_SSL_Cert | Master_SSL_Cipher | Master_SSL_Key | Seconds_Behind_Master | Master_SSL_Verify_Server_Cert | Last_IO_Errno | Last_IO_Error | Last_SQL_Errno | Last_SQL_Error | Replicate_Ignore_Server_Ids | Master_Server_Id | Master_SSL_Crl | Master_SSL_Crlpath | Using_Gtid | Gtid_IO_Pos | Replicate_Do_Domain_Ids | Replicate_Ignore_Domain_Ids | Parallel_Mode | SQL_Delay | SQL_Remaining_Delay | Slave_SQL_Running_State                                                     | Slave_DDL_Groups | Slave_Non_Transactional_Groups | Slave_Transactional_Groups |
+----------------------------------+-------------+------------------+-------------+---------------+-----------------+---------------------+-----------------------+---------------+-----------------------+------------------+-------------------+-----------------+---------------------+--------------------+------------------------+-------------------------+-----------------------------+------------+------------+--------------+---------------------+-----------------+-----------------+----------------+---------------+--------------------+--------------------+--------------------+-----------------+-------------------+----------------+-----------------------+-------------------------------+---------------+---------------+----------------+----------------+-----------------------------+------------------+----------------+--------------------+------------+-------------+-------------------------+-----------------------------+---------------+-----------+---------------------+-----------------------------------------------------------------------------+------------------+--------------------------------+----------------------------+
| Waiting for master to send event | backup      | replication_user |        3306 |            60 | woo2-bin.000002 |             6356025 | woo1-relay-log.000002 |         95872 | woo2-bin.000002       | Yes              | Yes               |                 |                     |                    |                        |                         |                             |          0 |            |            0 |             6356025 |           96180 | None            |                |             0 | No                 |                    |                    |                 |                   |                |                     0 | No                            |             0 |               |              0 |                |                             |                2 |                |                    | Slave_Pos  | 0-2-61237   |                         |                             | conservative  |         0 |                NULL | Slave has read all relay log; waiting for the slave I/O thread to update it |                0 |                             40 |                         27 |
+----------------------------------+-------------+------------------+-------------+---------------+-----------------+---------------------+-----------------------+---------------+-----------------------+------------------+-------------------+-----------------+---------------------+--------------------+------------------------+-------------------------+-----------------------------+------------+------------+--------------+---------------------+-----------------+-----------------+----------------+---------------+--------------------+--------------------+--------------------+-----------------+-------------------+----------------+-----------------------+-------------------------------+---------------+---------------+----------------+----------------+-----------------------------+------------------+----------------+--------------------+------------+-------------+-------------------------+-----------------------------+---------------+-----------+---------------------+-----------------------------------------------------------------------------+------------------+--------------------------------+----------------------------+
1 row in set (0.000 sec)

MariaDB [(none)]> Bye

From now on we can start replication by saying master_use_gtid=current_pos; at the end but this time around we had to use slave_pos as the starting point. Because reasons.
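
My guess at the “reasons”: gtid_current_pos also takes the server’s own binlog into account, and this freshly rebuilt node had just written a few local transactions (the GRANTs above) under its own server id, so current_pos wouldn’t have matched what the backup expected, while slave_pos only uses the gtid_slave_pos we set from the dump. You can compare the positions on the node itself:

# Illustrative: see what the server thinks its various GTID positions are
mysql -e "SELECT @@global.gtid_slave_pos, @@global.gtid_binlog_pos, @@global.gtid_current_pos\G"
# current_pos is roughly slave_pos merged with the server's own binlog GTIDs,
# which is why local writes on a rebuilt node can throw it off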

Failover test

Whatever system you set up for high availability – even a less janky one – needs to be tested a lot before being used live. Example: did you know that Amazon EC2 will assign a VM a different IP address after it has been stopped? Well I sure didn’t! Nor did I realize that Cloudflare’s Origin Server editing UI can’t handle spaces:

Took me a while to figure out I had copied the string “13.49.145.244 “.

The main issue with this test was that I had forgotten to turn on the failover service on the backup… But that made it clear there should be a Zabbix item counting the number of failover processes running and a trigger to warn if it is zero.
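
The item itself doesn’t need to be fancy; counting processes matching the script name is enough. A sketch (monitor_primary.sh is my script from further down, and hooking it into Zabbix via proc.num[] or a UserParameter is left out):

# Counts running copies of the failover monitor; 0 means it is not running
pgrep -fc monitor_primary.sh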

Trigger only if 5 consecutive measurements indicate that the process is down (there can be some noise):

Implementation

So now to the details of how it works. Let’s start with the main script that runs on the backup monitoring the primary:

root@woo2:~/ha_scripts# cat monitor_primary.sh
#!/bin/bash

HTTP_FAIL="0"

while true; do
        mysql -e "SHOW VARIABLES LIKE 'read_only';" | grep -iq off
        if [ $? -eq 0 ]
        then
                # We're apparently accepting writes so we are in the
                # master already, little point in promoting this node.
                exit 1
        fi
        curl --connect-timeout 5 -fks -H "Host: woo.deref.se" https://primary -o /dev/null
        if [ $? -ne 0 ]
        then
                ((HTTP_FAIL=HTTP_FAIL+1))
                echo "Failures: $HTTP_FAIL."
        else
                ((HTTP_FAIL=0))
        fi
        if [ $HTTP_FAIL -gt 3 ]
        then
                ./get_primary_health.sh | grep "healthy" | grep -q "true"
                if [ $? -ne 0 ]
                then

                        # Not sure primary is reachable, but let's try getting it to
                        # stop sending changes to us
                        ssh primary "systemctl stop lsyncd"
                        # If the primary is still ticking along but not reachable
                        # we need to prevent it from overwriting our file system
                        unlink /root/.ssh/authorized_keys
                        ln -s /root/.ssh/authorized_keys_no_primary /root/.ssh/authorized_keys

                        # primary cut off from sending changes to us, we can push changes
                        # in the other direction then. Might work, might not.
                        systemctl start lsyncd

                        # Wait for replication to catch up
                        ./checkslave.sh
                        # We've caught up, stop replicating.
                        mysql -e "STOP SLAVE;"
                        sleep 1
                        mysql -e "RESET SLAVE ALL;"

                        sleep 5
                        # It's now safe to accept writes
                        mysql -e 'SET GLOBAL read_only = OFF;'

                        # Got to tell Cloudflare not to switch back to primary.
                        ./set_backup_alone_in_lb.sh

                        # Let's give it a shot to get the primary to replicate data from us
                        # Started as a background job since it's not critical
                        ./make_primary_slave.sh &
                        exit 0
                else
                        # Okey, if Cloudflare says the server is okey, let's give it another
                        # chance by starting the counting from zero
                        echo "Cloudflare says primary is healthy, even though we failed to reach it. Resetting fail-counter."
                        ((HTTP_FAIL=0))
                fi
        fi
        sleep 5
done

Obviously curl --connect-timeout 5 -fks -H "Host: woo.deref.se" https://primary -o /dev/null needs to be modified to your environment.

There are some other scripts being called from this one. checkslave.sh is a chunk of code I copied straight from the ocf:heartbeat:mariadb resource agent:

#!/bin/bash

# Swiped from ocf:heartbeat:mariadb resource agent
SLAVESTATUS=$(mysql -e "SHOW SLAVE STATUS;" | wc -l)
if [ $SLAVESTATUS -gt 0 ]; then
        tmpfile="/tmp/processlist.txt"
        while true; do
                mysql -e 'SHOW PROCESSLIST\G' > $tmpfile
                if grep -i 'Has read all relay log' $tmpfile >/dev/null; then
                    echo "MariaDB slave has finished processing relay log"
                    break
                fi
                if ! grep -q 'system user' $tmpfile; then
                    echo "Slave not running - not waiting to finish"
                    break
                fi
                echo "Waiting for MariaDB slave to finish processing relay log"
                sleep 1
        done
fi

Figuring out how best to test whether a MariaDB server is still busy catching up to the master it was replicating from is tricky. If you can get your hands on logic that has worked well for other people in a lot of scenarios, then you should use that.
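
For comparison, the naive version I would otherwise have written just polls Seconds_Behind_Master. It mostly works, but the value can bounce around and turns into NULL while the IO thread reconnects, which is exactly why the relay-log check above is nicer. A sketch:

#!/bin/bash
# Naive alternative: poll Seconds_Behind_Master until the slave reports 0.
# Beware: it reports NULL while the IO thread is (re)connecting.
while true; do
        lag=$(mysql -e "SHOW SLAVE STATUS\G" | awk '/Seconds_Behind_Master/ {print $2}')
        [ "$lag" = "0" ] && break
        echo "Slave is ${lag:-NULL} seconds behind, waiting..."
        sleep 1
done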

set_backup_alone_in_lb.sh is an API call to Cloudflare:

#!/bin/bash

source cloudflare_vars.sh
source site_vars.sh

# Put the backup pool alone in the Load Balancer
# It takes a few seconds

make_payload()
{
      cat <<EOF
      {
      "description": "",
      "id": "$LB_ID",
      "proxied": true,
      "enabled": true,
      "name": "$DOMAIN_NAME",
      "session_affinity": "none",
      "session_affinity_attributes": {
        "samesite": "Auto",
        "secure": "Auto",
        "drain_duration": 0
      },
      "steering_policy": "off",
      "fallback_pool": "$BACKUP_POOL_ID",
      "default_pools": [
        "$BACKUP_POOL_ID"
      ],
      "pop_pools": {},
      "region_pools": {}
      }
EOF
}

curl -fX PUT "https://api.cloudflare.com/client/v4/zones/$ZONE_ID/load_balancers/$LB_ID" -H "Authorization: Bearer $API_TOKEN" -H "Content-Type:application/json" --data "$(make_payload)"

cloudflare_vars.sh and site_vars.sh are just files assigning the right values to variables used throughout:

# Cloudflare vars

PRIMARY_POOL_ID="5a3ccSECRETSECRETSECRET0797e2d9ce"
BACKUP_POOL_ID="0ef15eSECRETSECRETSECRETf06686ac1e"
LB_ID="93ff3da14adSECRETSECRETSECRETe6ad1ee3d"
CLIENT_ID="a7e9f6SECRETSECRETSECRETb864cb4a5"
ZONE_ID="5e98d4dSECRETSECRETSECRET87cb6ad8"
API_TOKEN="noMamSECRETSECRETSECRETZaO1X"

And:

#!/bin/bash

DOMAIN_NAME="woo.deref.se"
REPLICATION_PASSWORD="SECRETSECRET"

It’s not guaranteed that the primary stopped working because it crashed and won’t restart. So we should at least try getting the primary to take on the role of slave, hence the invocation of make_primary_slave.sh:

#!/bin/bash

source site_vars.sh

# We want the primary to be read only for as much of this process
# as possible because we can't proceed until the backup has caught
# up with all the writes performed on the master
ssh primary "mysql -e 'SET GLOBAL read_only = ON;'"

# Let's try to get the primary to replicate from us
ssh primary "mysql -e 'CHANGE MASTER TO master_host=\"backup\", master_port=3306, master_user=\"replication_user\", master_password=\"$REPLICATION_PASSWORD\", master_use_gtid=current_pos;'"
ssh primary "mysql -e 'START SLAVE;'"

Scripts for switchover

It’s good to sanity check the database of the server we’re trying to promote to master:

#!/bin/bash

tmpfile="/tmp/remoteslavestatus"
ssh $1 "mysql -e 'SHOW SLAVE STATUS\G;'" > $tmpfile

SLAVE_IO_STATE=$(grep -i "slave_io_state" $tmpfile | wc -w)
# Check the Yes/No flags from the status output explicitly
grep -iq "slave_io_running: yes" $tmpfile
SLAVE_IO_RUNNING=$?
grep -iq "slave_sql_running: yes" $tmpfile
SLAVE_SQL_RUNNING=$?
SLAVE_SECONDS_BEHIND=$(grep -i seconds_behind_master $tmpfile | awk '{print $2}')

if [ $SLAVE_IO_STATE -lt 2 ]
then
        echo "Slave IO State is empty. Is $1 properly synchronized? Bailing out."
        exit 1
else
        echo "Slave IO state is not null. Good."
fi

if [ $SLAVE_IO_RUNNING -ne 0 ]
then
        echo "Slave IO seems not to be running. Bailing out."
        exit 1
else
        echo "Slave SQL running. Good."
fi

if [ $SLAVE_SQL_RUNNING -ne 0 ]
then
        echo "Slave SQL seems not to be running. Bailing out."
        exit 1
else
        echo "Slave SQL running. Good."
fi

if [ $SLAVE_SECONDS_BEHIND == "NULL" ]
then
        echo "Slave is not processing relay log. Bailing out."
        exit 1
fi

if [ $SLAVE_SECONDS_BEHIND -gt 5 ]
then
        echo "Slave is more than 5 seconds behind master. Won't have time to catch up! Bailing out."
        exit 1
else
        echo "Slave is reasonably caught up, only $SLAVE_SECONDS_BEHIND s behind."
fi

For instance, let’s check if it’s okey to switch back over to the primary after it rebooted and failover occurred:

root@woo2:~/ha_scripts# ./sanity_check.sh primary
Slave IO State is empty. Is primary properly synchronized? Bailing out.

Nope. Not at all. Have to make it replicate changes first:

root@woo2:~/ha_scripts# ./make_primary_slave.sh
root@woo2:~/ha_scripts# ./sanity_check.sh primary
 Slave IO state is not null. Good.
 Slave SQL running. Good.
 Slave SQL running. Good.
 Slave is reasonably caught up, only 0 s behind.
root@woo2:~/ha_scripts#

Let’s just check how things are going with lsyncd:

root@woo2:~/ha_scripts# systemctl status lsyncd
● lsyncd.service - LSB: lsyncd daemon init script
     Loaded: loaded (/etc/init.d/lsyncd; generated)
     Active: active (exited) since Sat 2020-12-26 18:45:50 CET; 28min ago
       Docs: man:systemd-sysv-generator(8)
    Process: 1081833 ExecStart=/etc/init.d/lsyncd start (code=exited, status=0/SUCCESS)

Dec 26 18:45:50 woo2 systemd[1]: Starting LSB: lsyncd daemon init script...
Dec 26 18:45:50 woo2 lsyncd[1081833]:  * Starting synchronization daemon lsyncd
Dec 26 18:45:50 woo2 lsyncd[1081838]: 18:45:50 Normal: --- Startup, daemonizing ---
Dec 26 18:45:50 woo2 lsyncd[1081833]:    ...done.
Dec 26 18:45:50 woo2 lsyncd[1081838]: Normal, --- Startup, daemonizing ---
Dec 26 18:45:50 woo2 systemd[1]: Started LSB: lsyncd daemon init script.
Dec 26 18:45:50 woo2 lsyncd[1081839]: Normal, recursive startup rsync: /var/www/html/ -> primary:/var/www/html/
Dec 26 18:45:52 woo2 lsyncd[1081839]: Error, Temporary or permanent failure on startup of "/var/www/html/". Terminating since "insist" is not set.
root@woo2:~/ha_scripts# sudo -u www-data date >> /var/www/html/testfile
root@woo2:~/ha_scripts# systemctl restart lsyncd

The test file didn’t show up on the primary, so the error message was indeed right about lsyncd having given up on synchronizing to the primary. A restart did the trick. So let’s switch over using this script called demote_backup.sh:

#!/bin/bash

source site_vars.sh

ssh primary "systemctl status mariadb"

# If either we couldn't reach the primary or mariadb was down
# then we really shouldn't switch over to it
if [ $? -ne 0 ];
then
        echo "Error! Couldn't verify that primary has MariaDB running! Bailing out."
        exit 1
fi

curl -fks -H "Host: woo.deref.se" https://primary/

if [ $? -ne 0 ];
then
        echo "Error! Couldn't verify that primary has Nginx running! Bailing out."
        exit 1
fi

mysql -e "SET GLOBAL read_only = ON;"
systemctl stop lsyncd

# Wait for replication to catch up, if necessary
ssh primary "./checkslave.sh"

# Promote primary in the Load Balancer
# It takes a few seconds
./set_primary_first_in_lb.sh

# Just in case there was replication active on primary
ssh primary "mysql -e 'STOP SLAVE;'"
sleep 1
ssh primary "mysql -e 'RESET SLAVE ALL;'"

# Time for us to stop reading updates from primary
mysql -e "STOP SLAVE;"


# Let's allow primary's SSH key so lsyncd will work
unlink /root/.ssh/authorized_keys
ln -s /root/.ssh/authorized_keys_primary_allowed /root/.ssh/authorized_keys

# primary can start accepting writes now that it's no longer replicating
ssh primary "mysql -e 'SET GLOBAL read_only = OFF;'"
ssh primary "systemctl start lsyncd"

# Time for us to start replicating
mysql -e "CHANGE MASTER TO master_host='primary', master_port=3306, master_user='replication_user', master_password='$REPLICATION_PASSWORD', master_use_gtid=current_pos;"
mysql -e "START SLAVE;"

Again, curl -fks -H "Host: woo.deref.se" https://primary/ isn’t neatly generated based on the contents of the _vars-files but whatever. Let’s see if it works out. I’ll start the same timing script as before on another server.

cjp@util:~$ ./wootest.sh | tee test_woo_scripted_switchover6.txt
Sat 26 Dec 2020 07:18:28 PM CET Read worked. | Write worked. 185.20.12.24

Sat 26 Dec 2020 07:18:30 PM CET Read worked. | Write worked. 185.20.12.24

Sat 26 Dec 2020 07:18:31 PM CET Read worked. | Write worked. 185.20.12.24

Sat 26 Dec 2020 07:18:33 PM CET Read worked. | Write worked. 185.20.12.24

Sat 26 Dec 2020 07:18:34 PM CET Read worked. | Write worked. 185.20.12.24

Sat 26 Dec 2020 07:18:36 PM CET Read worked. | Write worked. 185.20.12.24

Sat 26 Dec 2020 07:18:37 PM CET Read worked. | Write failed. 185.20.12.24

Sat 26 Dec 2020 07:18:39 PM CET Read worked. | Write failed. 185.20.12.24

Sat 26 Dec 2020 07:18:41 PM CET Read worked. | Write worked. 172.31.35.57

Sat 26 Dec 2020 07:18:43 PM CET Read worked. | Write worked. 172.31.35.57

Sat 26 Dec 2020 07:18:44 PM CET Read worked. | Write worked. 172.31.35.57

Sat 26 Dec 2020 07:18:46 PM CET Read worked. | Write worked. 172.31.35.57

Boy, that was fast.

The output on the backup shows that we get a good response from the primary regarding mariadb (just another safety check), followed by the contents of a call to Nginx on the primary, and finally the new state of the load balancer at Cloudflare:

root@woo2:~/ha_scripts# ./demote_backup.sh
● mariadb.service - MariaDB 10.3 database server
   Loaded: loaded (/usr/lib/systemd/system/mariadb.service; enabled; vendor preset: disabled)
   Active: active (running) since Sat 2020-12-26 18:10:16 UTC; 8min ago
     Docs: man:mysqld(8)
           https://mariadb.com/kb/en/library/systemd/
  Process: 932 ExecStartPost=/usr/libexec/mysql-check-upgrade (code=exited, status=0/SUCCESS)
  Process: 828 ExecStartPre=/usr/libexec/mysql-prepare-db-dir mariadb.service (code=exited, status=0/SUCCESS)
  Process: 779 ExecStartPre=/usr/libexec/mysql-check-socket (code=exited, status=0/SUCCESS)
 Main PID: 870 (mysqld)
   Status: "Taking your SQL requests now..."
    Tasks: 36 (limit: 10984)
   Memory: 120.0M
   CGroup: /system.slice/mariadb.service
           └─870 /usr/libexec/mysqld --basedir=/usr

Dec 26 18:10:15 ip-172-31-35-57.eu-north-1.compute.internal systemd[1]: Starting MariaDB 10.3 database server...
Dec 26 18:10:15 ip-172-31-35-57.eu-north-1.compute.internal mysql-prepare-db-dir[828]: Database MariaDB is probably initialized in /var/lib/mysql already, nothing is done.
Dec 26 18:10:15 ip-172-31-35-57.eu-north-1.compute.internal mysql-prepare-db-dir[828]: If this is not the case, make sure the /var/lib/mysql is empty before running mysql-prepare-db-dir.
Dec 26 18:10:15 ip-172-31-35-57.eu-north-1.compute.internal mysqld[870]: 2020-12-26 18:10:15 0 [Note] /usr/libexec/mysqld (mysqld 10.3.27-MariaDB-log) starting as process 870 ...
Dec 26 18:10:16 ip-172-31-35-57.eu-north-1.compute.internal systemd[1]: Started MariaDB 10.3 database server.

<huge html blob deleted>

MariaDB slave has finished processing relay log
{
  "result": {
    "description": "",
    "created_on": "0001-01-01T00:00:00Z",
    "modified_on": "2020-12-26T18:18:38.282592Z",
    "id": "93ff3SECRETSECRETSECRET9e6ad1ee3d",
    "ttl": 30,
    "proxied": true,
    "enabled": true,
    "name": "woo.deref.se",
    "session_affinity": "none",
    "session_affinity_attributes": {
      "samesite": "Auto",
      "secure": "Auto",
      "drain_duration": 0
    },
    "steering_policy": "off",
    "fallback_pool": "0ef15SECRETSECRETSECRET6686ac1e",
    "default_pools": [
      "5a3cc32SECRETSECRETSECRET7e2d9ce",
      "0ef15SECRETSECRETSECRET6686ac1e"
    ],
    "pop_pools": {},
    "region_pools": {}
  },
  "success": true,
  "errors": [],
  "messages": []
}

promote_backup.sh is good for when the primary needs to be rebooted or otherwise undergo maintenance:

#!/bin/bash

source site_vars.sh

# We want the primary to be read only for as much of this process
# as possible because we can't proceed until the backup has caught
# up with all the writes performed on the master
ssh primary "mysql -e 'SET GLOBAL read_only = ON;'"

# Wait for replication to catch up
./checkslave.sh

# We've caught up, stop replicating.
mysql -e "STOP SLAVE;"
sleep 1
mysql -e "RESET SLAVE ALL;"

sleep 5
# It's now safe to accept writes
mysql -e 'SET GLOBAL read_only = OFF;'

# Switch roles for who writes file system updates
# We're the only one expected to see any writes
# but no point in having lsyncd running on the primary
ssh primary "systemctl stop lsyncd"
systemctl start lsyncd


# Time for the primary to start replicating from us
ssh primary "mysql -e \"CHANGE MASTER TO master_host='backup', master_port=3306, master_user='replication_user', master_password='$REPLICATION_PASSWORD', master_use_gtid=current_pos;\""
ssh primary "mysql -e \"START SLAVE;\""

# Time to tell Cloudflare that traffic should be routed to this node.
# The primary is left as a fallback to this node just in case.
# We can do this when doing an orderly switch. If this was a failover
# we'd have to get the primary out of the load balancer entirely
# until we had time to inspect whether it could safely resume master
# status.
./set_backup_first_in_lb.sh

Settings

It’s going to be useful for the two servers to both believe themselves to be woo.deref.se so that they can make curl calls to that address to check their own status. The master node’s /etc/hosts therefore gets two additional lines:

10.0.0.2 backup # VPN IP for backup node
13.53.118.162 woo.deref.se # Public IP of master node, which Amazon EC2 keeps changing on reboot...

The backup node gets something similar:

10.0.0.1 primary
185.20.12.24 woo.deref.se

You will also need passwordless SSH between the root accounts of the two servers.
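
Something along these lines does the trick. A sketch, where the key type and file names are my choices, but the two authorized_keys variants are the ones the failover scripts above flip between on the backup:

# On the primary, as root: create a key and push it to the backup
ssh-keygen -t ed25519 -N "" -f /root/.ssh/id_ed25519
ssh-copy-id root@backup
# ...and the same in the other direction from the backup

# On the backup, authorized_keys is a symlink that the scripts flip between
# a variant that contains the primary's key and one that does not
cp /root/.ssh/authorized_keys /root/.ssh/authorized_keys_primary_allowed
# authorized_keys_no_primary is the same file with the primary's key line removed
ln -sf /root/.ssh/authorized_keys_primary_allowed /root/.ssh/authorized_keys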

MariaDB on second node:

user                    = mysql
pid-file                = /run/mysqld/mysqld.pid
socket                  = /run/mysqld/mysqld.sock
#port                   = 3306
basedir                 = /usr
datadir                 = /var/lib/mysql
tmpdir                  = /tmp
lc-messages-dir         = /usr/share/mysql
#skip-external-locking
server_id=2
log-basename=woo2
binlog-format=mixed
relay_log=woo2-relay-log
log_bin=woo2-bin
read-only=0
bind-address            = 0.0.0.0

Default site config file for Nginx on second node:


server {
        listen 80 default_server;
        listen [::]:80 default_server;


        server_name woo.deref.se;
        return 301 https://woo.deref.se$request_uri;
}

server {

        listen 443 ssl default_server;
        listen [::]:443 ssl default_server;

        root /var/www/html;

        # Add index.php to the list if you are using PHP
        index index.php index.html index.htm index.nginx-debian.html;

        server_name woo.deref.se;
        include snippets/cloudflaressl.conf;

        location / {
                # First attempt to serve request as file, then
                # as directory, then fall back to displaying a 404.
                #try_files $uri $uri/ =404;
                try_files $uri $uri/ /index.php?$args ;
        }


        location ~ \.php$ {
                #NOTE: You should have "cgi.fix_pathinfo = 0;" in php.ini
                include fastcgi_params;
                fastcgi_intercept_errors on;
                fastcgi_pass unix:/run/php/php7.4-fpm.sock;
                #The following parameter can be also included in fastcgi_params file
                fastcgi_param  SCRIPT_FILENAME $document_root$fastcgi_script_name;
        }

        location ~* \.(js|css|png|jpg|jpeg|gif|ico)$ {
                expires max;
                log_not_found off;
        }

        location ~ /\.ht {
                deny all;
        }
}

I wonder why I had to add this manually to the first node to get Zabbix calls to /basic_status to work? Maybe it’s enabled by default on Ubuntu?

server {
        listen 80 default_server;

        server_name _;
        location = /basic_status {
            stub_status;
        }
    }
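Easy enough to verify once the block is in place; this is the same URL the Zabbix checks hit:

curl -s http://127.0.0.1/basic_status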

lsyncd.conf is tricky because on Ubuntu it’s supposed to be installed as /etc/lsyncd/lsyncd.conf.lua while it’s just plain /etc/lsyncd.conf on RHEL.

 sync {
         default.rsyncssh,
         source="/var/www/html",
         host="primary",
         targetdir="/var/www/html/",
         rsync = {
             owner = true,
             group = true,
             chown = "nginx:nginx"
         }
 }

This was from Ubuntu; on RHEL it says host=”backup” and chown = ”www-data:www-data” instead. This is confusing at first until you realize the chown is executed on the other host. So it makes sense for the RHEL node to run ”chown www-data:www-data” on the Ubuntu node, which is the one that uses the www-data user and group. Not sure that was made super clear but it’s easy enough to test out. Or you can just run a single distro and not have to deal with it.
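For reference, the corresponding block on the RHEL node should then read like this (assuming nothing else differs between the two configs):

 sync {
         default.rsyncssh,
         source="/var/www/html",
         host="backup",
         targetdir="/var/www/html/",
         rsync = {
             owner = true,
             group = true,
             chown = "www-data:www-data"
         }
 }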

Sensible questions

Why not Galera? Multi-master without any need for Pacemaker and fencing! Galera doesn’t work too well when the response time between nodes is highly variable: https://www.percona.com/blog/2018/11/15/how-not-to-do-mysql-high-availability-geographic-node-distribution-with-galera-based-replication-misuse/

Why two different Linux distros? I want to minimize common points of failure. I could have tried this with a RHEL/FreeBSD combo, but vulnerabilities that hit all Linux distros are rare. This is my balance between ease-of-use and availability.

Would I use this for real? With some more testing, sure. And keep in mind this isn’t intended to be the ultimate high availability solution for WooCommerce. That would be something like three datacenters with dedicated interconnect running three servers with the Pacemaker cluster manager, MariaDB with Galera for multi-master writes within each data center, LiteSpeed or maybe Nginx+Varnish for high speed serving of data. Though I have a bit of a hard time understanding why someone with that kind of money would use WooCommerce. It has a low barrier for entry – sure – but it’s ecommerce functionality inelegantly taped onto a blogging engine.

Basic failover for WordPress

So you’re not content with the 99.9% uptime that you get from your premium hosting provider? Well, that’s 8h 45m 56s of downtime per year so I guess I can understand that. For a lot of providers 99.6 or 99.7 is more to be expected, and that’s more than a day of downtime per year. If you have a website for your company in the fields of engineering, finance, medicine or IT this might not feel great. First impressions last a lifetime and “The IT company that can’t keep their own web page up” isn’t good PR.

Well, you can set up a halfway decent backup to your website, ready to take over if the primary server goes down. I’ll be assuming that you use Cloudflare’s Load Balancer but other solutions will work too, I guess. What we’re interested in here is how to keep the backup properly synchronized with the version of the website that is publicly accessible when things are running smoothly. Now we have to get kind of “engineeringy” and specify what we can and cannot fit into this box.

First off, what’s presented below is mostly suitable for information websites that don’t see updates every few minutes the way an ecommerce site does. Even a website that sees a lot of comments on articles will seem kind of wonky after a failover when set up as below.

It’s also important for the people running the site to keep track of failovers happening. This is trivial since Cloudflare will send out emails about the primary and the backup going up or down, but that doesn’t mean people can’t screw it up. Like having the emails go to one person and then that person goes on vacation and well, you can see where this is going.

So the idea here is that we simply use rsync and MySQL command line tools to sync the primary web site to the backup every couple of hours. The crontab entry running on my backup server looks like this:

0 */4 * * * rsync -rlptD --chown=www-data:www-data USERNAME@ker.oderland.com:/home/USERNAME/deref.se/ /var/www/html

A run every four hours is fine since I update this page like every couple of months : )
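If you’re setting this up for the first time it’s worth doing a dry run with -n first, to see what rsync would actually transfer before letting cron loose on it:

rsync -rlptDn --chown=www-data:www-data USERNAME@ker.oderland.com:/home/USERNAME/deref.se/ /var/www/html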

But this only covers the files, not the database. For that I have this script (it keeps database dumps for 5 days):

#!/bin/bash

TIMESTAMP=$(date +"%Y_%m_%d_%H%M")

# Dump the live database over SSH into a timestamped file, then load it locally
ssh USERNAME@ker.oderland.com "mysqldump -u db_wp32 -p'MY_SECRET_DB_PASSWORD' db_wp32" > deref_se_$TIMESTAMP.sql
mysql db_wp32 < deref_se_$TIMESTAMP.sql

# Prune dumps older than 5 days (paths are relative to the cron job's working directory)
find . -maxdepth 1 -mtime +5 -name "*.sql" -exec rm {} \;

I’ve actually packaged both commands as separate scripts so my crontab looks like this:

0 */4 * * * /root/rsync_from_ker.sh
0 */4 * * * /root/dump_and_import.sh

Here is an important detail: the sync runs unconditionally, not just when the backup hasn’t received any writes of its own. If this was a live site that failed over and happy-go-lucky editors made changes to the backup web site thinking that it’s the primary, they would have at most four hours before their dreams were crushed.

You could try to make this contingent on there not having been any successful POSTs made recently, and send out an email if there have been. Something like “Can’t update backup website. It has recent changes that would be lost!”
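Something along these lines could work as a wrapper around the two cron scripts; the crontab would then call this instead of the two scripts directly. An entirely untested sketch, and the log path, state file and mail address are assumptions on my part:

#!/bin/bash
# Hypothetical guard: only sync if the backup hasn't served any POST
# requests since the last successful sync. Log path, state file and the
# notification address are assumptions; also assumes a working 'mail' command.
LOG=/var/log/nginx/access.log
STATE=/root/.last_synced_line

last=$(cat "$STATE" 2>/dev/null || echo 0)
total=$(wc -l < "$LOG")

# If the log was rotated since the last run, start counting from the top
[ "$total" -lt "$last" ] && last=0

if tail -n +"$((last + 1))" "$LOG" | grep -q '"POST '; then
    echo "Can't update backup website. It has recent changes that would be lost!" \
        | mail -s "Backup sync skipped" you@example.com
    exit 1
fi

/root/rsync_from_ker.sh
/root/dump_and_import.sh
echo "$total" > "$STATE"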

But it’s more robust to just keep track of failovers happening as notified by Cloudflare. High availability always comes at some cost, whether it be features, money, usability, overhead etc. Here the cost is mostly overhead in the form of people needing to keep an eye out for emails like this:

Fri, 04 Dec 2020 19:28:59 UTC | DOWN | Origin ker.oderland.com from Pool Main_server | Response code mismatch error

Now, could you use this same setup for an ecommerce site? Yes… But that’s pushing your luck. I would strongly recommend against it. There are assumptions made here, about the database dumps being small, the updates being spaced out over time and so on, that don’t fit ecommerce. I have a better example of how one can set that up but the complexity goes up a bit.

Extra security

To limit the backup server to accepting HTTP/HTTPS requests only from Cloudflare’s servers and SSH requests only from your own IP:

iptables -A INPUT -i eth0 -p tcp -s YOURIPADDRESS --dport 22 -m state --state NEW,ESTABLISHED -j ACCEPT
iptables -A INPUT -i eth0 -p tcp --sport 22 -m state --state ESTABLISHED -j ACCEPT
iptables -A OUTPUT -o eth0 -p tcp --sport 22 -m state --state ESTABLISHED -j ACCEPT
iptables -A OUTPUT -o eth0 -p tcp --dport 22 -m state --state NEW,ESTABLISHED -j ACCEPT

iptables -A INPUT -i eth0 -p tcp -s 173.245.48.0/20 --dport 80 -m state --state NEW,ESTABLISHED -j ACCEPT
iptables -A INPUT -i eth0 -p tcp -s 103.21.244.0/22 --dport 80 -m state --state NEW,ESTABLISHED -j ACCEPT
iptables -A INPUT -i eth0 -p tcp -s 103.22.200.0/22 --dport 80 -m state --state NEW,ESTABLISHED -j ACCEPT
iptables -A INPUT -i eth0 -p tcp -s 103.31.4.0/22 --dport 80 -m state --state NEW,ESTABLISHED -j ACCEPT
iptables -A INPUT -i eth0 -p tcp -s 141.101.64.0/18 --dport 80 -m state --state NEW,ESTABLISHED -j ACCEPT
iptables -A INPUT -i eth0 -p tcp -s 108.162.192.0/18 --dport 80 -m state --state NEW,ESTABLISHED -j ACCEPT
iptables -A INPUT -i eth0 -p tcp -s 190.93.240.0/20 --dport 80 -m state --state NEW,ESTABLISHED -j ACCEPT
iptables -A INPUT -i eth0 -p tcp -s 188.114.96.0/20 --dport 80 -m state --state NEW,ESTABLISHED -j ACCEPT
iptables -A INPUT -i eth0 -p tcp -s 197.234.240.0/22 --dport 80 -m state --state NEW,ESTABLISHED -j ACCEPT
iptables -A INPUT -i eth0 -p tcp -s 198.41.128.0/17 --dport 80 -m state --state NEW,ESTABLISHED -j ACCEPT
iptables -A INPUT -i eth0 -p tcp -s 162.158.0.0/15 --dport 80 -m state --state NEW,ESTABLISHED -j ACCEPT
iptables -A INPUT -i eth0 -p tcp -s 104.16.0.0/12 --dport 80 -m state --state NEW,ESTABLISHED -j ACCEPT
iptables -A INPUT -i eth0 -p tcp -s 172.64.0.0/13 --dport 80 -m state --state NEW,ESTABLISHED -j ACCEPT
iptables -A INPUT -i eth0 -p tcp -s 131.0.72.0/22 --dport 80 -m state --state NEW,ESTABLISHED -j ACCEPT
iptables -A INPUT -i eth0 -p tcp -s 10.0.0.2/32  --dport 80 -m state --state NEW,ESTABLISHED -j ACCEPT
iptables -A INPUT -i eth0 -p tcp -s 173.245.48.0/20 --dport 443 -m state --state NEW,ESTABLISHED -j ACCEPT
iptables -A INPUT -i eth0 -p tcp -s 103.21.244.0/22 --dport 443 -m state --state NEW,ESTABLISHED -j ACCEPT
iptables -A INPUT -i eth0 -p tcp -s 103.22.200.0/22 --dport 443 -m state --state NEW,ESTABLISHED -j ACCEPT
iptables -A INPUT -i eth0 -p tcp -s 103.31.4.0/22 --dport 443 -m state --state NEW,ESTABLISHED -j ACCEPT
iptables -A INPUT -i eth0 -p tcp -s 141.101.64.0/18 --dport 443 -m state --state NEW,ESTABLISHED -j ACCEPT
iptables -A INPUT -i eth0 -p tcp -s 108.162.192.0/18 --dport 443 -m state --state NEW,ESTABLISHED -j ACCEPT
iptables -A INPUT -i eth0 -p tcp -s 190.93.240.0/20 --dport 443 -m state --state NEW,ESTABLISHED -j ACCEPT
iptables -A INPUT -i eth0 -p tcp -s 188.114.96.0/20 --dport 443 -m state --state NEW,ESTABLISHED -j ACCEPT
iptables -A INPUT -i eth0 -p tcp -s 197.234.240.0/22 --dport 443 -m state --state NEW,ESTABLISHED -j ACCEPT
iptables -A INPUT -i eth0 -p tcp -s 198.41.128.0/17 --dport 443 -m state --state NEW,ESTABLISHED -j ACCEPT
iptables -A INPUT -i eth0 -p tcp -s 162.158.0.0/15 --dport 443 -m state --state NEW,ESTABLISHED -j ACCEPT
iptables -A INPUT -i eth0 -p tcp -s 104.16.0.0/12 --dport 443 -m state --state NEW,ESTABLISHED -j ACCEPT
iptables -A INPUT -i eth0 -p tcp -s 172.64.0.0/13 --dport 443 -m state --state NEW,ESTABLISHED -j ACCEPT
iptables -A INPUT -i eth0 -p tcp -s 131.0.72.0/22 --dport 443 -m state --state NEW,ESTABLISHED -j ACCEPT
iptables -A INPUT -i eth0 -p tcp -s 10.0.0.2/32 --dport 443 -m state --state NEW,ESTABLISHED -j ACCEPT

iptables -A OUTPUT -o eth0 -p tcp --sport 80 -m state --state NEW,ESTABLISHED -j ACCEPT
iptables -A OUTPUT -o eth0 -p tcp --sport 443 -m state --state NEW,ESTABLISHED -j ACCEPT
iptables -A OUTPUT -o eth0 -p tcp --dport 80 -m state --state NEW,ESTABLISHED -j ACCEPT
iptables -A OUTPUT -o eth0 -p tcp --dport 443 -m state --state NEW,ESTABLISHED -j ACCEPT

iptables -A INPUT -i eth0 -p udp --sport 53 -m state --state NEW,ESTABLISHED -j ACCEPT
iptables -A INPUT -i eth0 -p tcp --sport 80 -m state --state ESTABLISHED -j ACCEPT
iptables -A INPUT -i eth0 -p tcp --sport 443 -m state --state ESTABLISHED -j ACCEPT
iptables -A INPUT -i eth0 -p udp --dport 51820 -j ACCEPT

iptables -A OUTPUT -o eth0 -p udp --dport 53 -m state --state NEW,ESTABLISHED -j ACCEPT
iptables -A OUTPUT -o eth0 -p udp --sport 51820 -j ACCEPT


iptables -P INPUT DROP
iptables -P FORWARD DROP
iptables -P OUTPUT DROP

This allows outgoing DNS, HTTP and HTTPS connections so you can update the server as per usual, and keeps the VPN port (51820/udp) open. If you want to disable these rules for whatever reason, run this:

iptables -P INPUT ACCEPT
iptables -P FORWARD ACCEPT
iptables -P OUTPUT ACCEPT

iptables -F

Note that iptables -F only flushes the explicit rules; it doesn’t change the default policy for INPUT, OUTPUT and FORWARD, so if you run it on its own those stay at DROP and you can’t access the server remotely any more. Learnt that the hard way!
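One more thing: these rules only live in the kernel and are gone after a reboot. If you want them to stick around you need to persist them, for example like this (the package names are the usual suspects per distro):

# Ubuntu/Debian, with the iptables-persistent package installed
iptables-save > /etc/iptables/rules.v4

# RHEL, with the iptables-services package installed
service iptables save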

Cloudflare failover

Cloudflare has a Load Balancer service that you can also use as a failover service. The base version costs $5 a month (as of 2020-12-05) and allows for two origins, which goes a long way. Configure your domain to use their name servers and set it up in the client area to be proxied through Cloudflare before proceeding with the load balancer setup.

Some nomenclature:

Origin => Server
Pool => Collection of Origins
Fallback pool => Pool of last resort (not relevant when there are only two pools)

To get failover functionality only we create one pool for the primary server and another pool for the backup. This is a working setup using the server ker.oderland.com as the primary and a VPS external.svealiden.se as the backup:

Let’s look at how each origin is configured. First the primary:

Note how we have only one Origin and that it is referred to not by the website name deref.se (which points to Cloudflare, remember?) but by the server name. Don’t worry about virtual hosting: Cloudflare understands that it needs to preserve the Host header when forwarding traffic to origin servers.

Basically it’s the same for the backup, just with different information for the Origin: same Health Check Region, same Monitor and the same Notification settings.

Health Check Region should be pretty self-explanatory but Monitoring isn’t. I have a single Monitor that checks the start page of my web page for status code 200 and the string “Digging”:

Note the advanced settings, where the check on the Response Body and the Host header specification at the bottom are non-standard. This is a WordPress page so the Host header needs to be correct even for my VPS that doesn’t rely on virtual hosting. Checking more frequently than every 60 seconds costs more money, but 60 seconds already seems reasonable to me. The monitor should be assigned to both origins, as we saw earlier.

If you want to avoid even a few minutes of downtime when planned maintenance is scheduled for your primary origin you can just promote the backup manually by changing the order of the pools:

There are a number of gotchas to consider for any primary/secondary failover setup and I’ll go through some alternatives in separate posts (see Basic “information website” failover and “the other one I’m going to write maybe”), but here are some specific to Cloudflare’s UI:

  1. Do not just have one pool in the “Origin Pools” category even if the backup pool is chosen as your fallback. That’s not how it works. It’s the contents of the Origin Pools section that determines which pools are used and what priority they have. It makes no sense to leave the Fallback Pool out of the Origin Pools section.
  2. The fallback pool will always show “No health”. Don’t worry about it. If any pool in the Origin Pools section shows “No health”, then you have a problem. But all you need to do is add a monitor to that pool.
  3. Don’t add multiple origins to a single pool if you don’t actually want to spread traffic across multiple servers. And you probably don’t, because that’s a multi-master setup and that means huge headaches.

SSL

What to do about SSL? We can’t use Let’s Encrypt on the backup (the HTTP challenge will only ever reach whichever origin is currently serving traffic) and it might even be tricky on the primary. Not to worry, just install Cloudflare’s self-signed origin certificates: https://support.cloudflare.com/hc/en-us/articles/115000479507

They have 15 year validity so while it’s slightly awkward to install you don’t have to do it very often. Cloudflare of course shows a valid certificate for the ordinary visitor who connects to the Cloudflare proxies(click on the certificate for this page to see for yourself) but the communication between Cloudflare and the origin servers uses these 15-year self-signed certificates.

Drawbacks

So is Cloudflare Load Balancer a good failover solution? Well… I’m not 100% sold on it. Cloudflare has had global outages lasting from 30 minutes to an hour. Not every year, but often enough for me to be skeptical of trusting them with the role of high availability load balancer.

I would have been okay with using external name servers pointing to the Cloudflare proxies with short TTLs. Then – if Cloudflare goes down – I can change the records and we’re talking (my response time) + (TTL) of downtime. But that complicates issuing SSL certificates a bit, so it’s not ideal.

Note that the two servers used in this setup are hosted by the same company. That’s fine for a test setup but it’s a bad idea in a live environment. Don’t trust Amazon Availability Zones to be independent either. Have your primary and your backup with different companies using different infrastructure!