Percona XtraDB Cluster again

So I got fed up with Percona XtraDB Cluster some time back. It was the backend for Zabbix and a bad crash left me with a cluster that just wouldn’t form a quorum and that was it for me. Other problems had included lots of errors on commits due to inconsistencies between nodes which I felt was difficult to justify since only one of the nodes actually received any writes…

Well I have since written a python script that can write two multiple master servers compatible with MySQL. It complies with Percona’s Strict Mode and after some grinding I got it to maintain consistency of data. The code is absolutely awful but it works! One of the neat tricks is that it can at any point go through the database to determine that the “business logic” is maintained. That’s how I dug up tricky corner cases where a customer could have a purchase registered to him after he had been removed from the Customer-table.

So what happens when I run the script at full tilt(it’s not fast but 15-30 transactions per second isn’t bad given only three queries can run in parallel due to the aforementioned awful code) and I pull the plug on all three Percona XtraDB Cluster VMs? Well the script stops printing out information about completed operations because it can’t continue. So I kill the script.

Next I reboot all three nodes and they just automatically get back into working order. No fiddling with extraneous tables or having to set innodb_force_recovery. I was surprised. I’ve tried it a few times and it just works. Running the consistency-checker shows that atomicity has been maintained. I even took minute-by-minute snapshots of the mysql partition(running Btrfs) and forced a node back in time so it looked like it had lost five minutes of updates. Still no problem!

Now I find myself in the surprising position of having to roll up my sleeves and meticulously engineer a scenario to replicate what I once encountered in the wild. I will basically have to:

Reset the database back to a shared state X
Advance the cluster by writing from my script
Snapshot node 1
Reset the database back to the previous state X
Write some more
Snapshot node 2
Repeat for node 3

Eventually I will have three snapshots – one on each node – which constitutes a completely split brain. No quorum and no automatic syncing possible. Then I can test how to recover manually. Hmm,. I should probably snapshot all three nodes in each of the split brain versions. Then I can try a 2-1 situation and see how Percona XtraDB Cluster will handle it. It should let the 2 nodes that are in agreement start up but the 1 should be kept out, not resynced or overwritten – just excluded.

Well, that’s what I think anyway.

Addendum 1: Now something interesting happened for some reason. I had shut down the cluster cleanly and yet standard start via systemctl didn’t work:

root@pcmk1:~# systemctl start mysql
Job for mysql.service failed because the control process exited with error code.
See “systemctl status mysql.service” and “journalctl -xe” for details.

I had to cat the contents of /var/lib/mysql/data/grastate.dat to find the node with either the highest sequence number or safe_to_bootstrap: 1. I found the latter:

root@pcmk1:~# cat /var/lib/mysql/data/grastate.dat
GALERA saved state
version: 2.1
uuid: ee445958-ed07-11e9-a650-a6dcefe074b9
seqno: 458661
safe_to_bootstrap: 1

The other two nodes had the same seqno. Anyway, with this information in hand pcmk1 could be started with /etc/init.d/mysql bootstrap-pxc and then I could start the other two nodes with systemctl start mysql and then all was well again.

I’m getting the idea that the startup script for Percona XtraDB Cluster does some self-bootstrapping that isn’t just done by running systemctl start mysql. Or this is just one more mystery.

Notes to self:
Point X: Snapshots/data/data@auto-2019-10-29-2111_48h
Start of brain 1: Snapshots/data/data@auto-2019-10-29-2123_48h
End of brain 1: Snapshots/data/data@auto-2019-10-29-2126_48h
Start of brain 2: Snapshots/data/data@auto-2019-10-29-2134_48h
End of brain 2: Snapshots/data/data@auto-2019-10-29-2137_48h
Start of brain 3: Snapshots/data/data@auto-2019-10-29-2143_48h
End of brain 3: Snapshots/data/data@auto-2019-10-29-2146_48h

Okey, testing brain 2 on pcmk3 and pcmk2 and let pcmk1 use brain 1. pcmk2 and pcmk3 had slightly different seqno which isn’t so weird since the snapshots can’t be initiated at the exact same microsecond and can’t really be expected to complete in exactly the same number of hard drive writes. So I had to edit the grastate.dat file for pcmk3 so that safe_to_bootstrap was set to 1 and then ran /etc/init.d/mysql bootstrap-pxc. I needed to run systemctl start mysql a couple of times to get pcmk2 to sync but eventually it got into place.

As expected pcmk1 couldn’t join and was left in its own partition. So far so good. Now to let each node use their own brain.

2019-10-30T21:21:22.923454Z 0 [Note] WSREP: Waiting for SST/IST to complete.
2019-10-30T21:21:22.923806Z 0 [Note] WSREP: STATE_EXCHANGE: sent state UUID: 390ca26a-fb5b-11e9-805a-9fd389d6c554
2019-10-30T21:21:22.924357Z 0 [Note] WSREP: STATE EXCHANGE: sent state msg: 390ca26a-fb5b-11e9-805a-9fd389d6c554
2019-10-30T21:21:22.924830Z 0 [Note] WSREP: STATE EXCHANGE: got state msg: 390ca26a-fb5b-11e9-805a-9fd389d6c554 from 0 (pxc-cluster-node-1)
2019-10-30T21:21:22.925159Z 0 [Note] WSREP: STATE EXCHANGE: got state msg: 390ca26a-fb5b-11e9-805a-9fd389d6c554 from 1 (pxc-cluster-node-2)
2019-10-30T21:21:22.925188Z 0 [ERROR] WSREP: gcs/src/gcs_group.cpp:group_post_state_exchange():324: Reversing history: 464082 -> 462077, this member has applied 2005 more events than the primary component.Data loss is possible. Aborting.
2019-10-30T21:21:22.925212Z 0 [Note] WSREP: /usr/sbin/mysqld: Terminated.

Seems reasonable. One node bailed out real quick but two kept trying.

| wsrep_cluster_size | 2 |
| wsrep_cluster_state_uuid | ee445958-ed07-11e9-a650-a6dcefe074b9 |
| wsrep_cluster_status | non-Primary

But the cluster didn’t process queries during this attempt to form a quorum. Just as it intended. So let’s say we know for a fact that brain 3 is the “best”. Brain 1 and brain 2 are just wrong, not actually holding information that needs to be preserved. Percona XtraDB Cluster shouldn’t make this determination, as it hasn’t. But we need to be able to do this when we’ve reviewed the situation(and need to get things back up and running). So let’s run this on pcmk2:

rm -rf /var/lib/mysql/data/*
systemctl start mysql

And a check of the STATUS:

wsrep_local_state 4
wsrep_local_state_comment Synced
wsrep_cert_index_size 0
wsrep_cert_bucket_count 22
wsrep_gcache_pool_size 1456
wsrep_causal_reads 0
wsrep_cert_interval 0.000000
wsrep_open_transactions 0
wsrep_open_connections 0
wsrep_ist_receive_status
wsrep_ist_receive_seqno_start 0
wsrep_ist_receive_seqno_current 0
wsrep_ist_receive_seqno_end 0
wsrep_incoming_addresses 192.168.1.65:3306,192.168.1.66:3306
wsrep_cluster_weight 2
wsrep_desync_count 0
wsrep_evs_delayed
wsrep_evs_evict_list
wsrep_evs_repl_latency 0/0/0/0/0
wsrep_evs_state OPERATIONAL
wsrep_gcomm_uuid 0a119cbd-fb5e-11e9-a854-a317f06d04a7
wsrep_cluster_conf_id 2
wsrep_cluster_size 2
wsrep_cluster_state_uuid ee445958-ed07-11e9-a650-a6dcefe074b9
wsrep_cluster_status Primary
wsrep_connected ON

And the logic check passes when running my Python script. Okey, so let’s join the last node… And yeah, works the same. So if one is patient all you need to do is completely blank the datadir(which in my case is /var/lib/mysql/data but for most people will be /var/lib/mysql) and start the node again. It will do state transfer to initialize itself and then join.

wsrep_local_state 4
wsrep_local_state_comment Synced
wsrep_cert_index_size 0
wsrep_cert_bucket_count 22
wsrep_gcache_pool_size 1456
wsrep_causal_reads 0
wsrep_cert_interval 0.000000
wsrep_open_transactions 0
wsrep_open_connections 0
wsrep_ist_receive_status
wsrep_ist_receive_seqno_start 0
wsrep_ist_receive_seqno_current 0
wsrep_ist_receive_seqno_end 0
wsrep_incoming_addresses 192.168.1.65:3306,192.168.1.66:3306,192.168.1.64:3306
wsrep_cluster_weight 3
wsrep_desync_count 0
wsrep_evs_delayed
wsrep_evs_evict_list
wsrep_evs_repl_latency 0/0/0/0/0
wsrep_evs_state OPERATIONAL
wsrep_gcomm_uuid ce76d2c9-fb5e-11e9-a04e-926b86eb0f23
wsrep_cluster_conf_id 3
wsrep_cluster_size 3
wsrep_cluster_state_uuid ee445958-ed07-11e9-a650-a6dcefe074b9
wsrep_cluster_status Primary
wsrep_connected ON

What to do if you have a split brain scenario and no node is actually “wrong” but there actually is a discrepancy in valuable data? Well then you’re boned! Databases do not have a convenient way of merging data like that. I’m of the opinion that any application that writes to a database should have some fallback mechanism whereby it can scan versions of the database that diverged away from the master and then inject the changes into the master.

In my dummy application this would involve retrieving all purchases and all deposits of credit from the diverged copy. Any purchase or deposit of credit unique to the diverged copy would then be processed as if though they were brand new transactions coming in from customers, with the customers’ credit balance being changed accordingly. In some systems this may not be possible but it’s really a desirable feature when you don’t want a split brain to be an insurmountable obstacle to getting your records straight.