There is a draft on this website called “HA WordPress on a slightly bigger budget” and details the futility of trying to use Keepalived and Cloudflare Load Balancer to make a working primary/backup high availability server setup. I do not consider this to be any failing on the part of Keepalived as it quite clearly isn’t meant to be used for things like databases where split brain is a concern.
Cloudflare could have been a bit more helpful by providing a “no failback” option so that if traffic was ever directed to the backup, traffic wouldn’t go over to the primary automatically if that server came back online again. But I’ve implemented that using their REST API instead. So it’s still Cloudflare Load Balancer, MariaDB, Nginx, PHP-FPM, lsyncd and some janky bash scripts. It’s just that Keepalived isn’t being asked to do the work of Pacemaker/Corosync.
A planned switchover
Important numbers:
Thu 24 Dec 2020 11:37:11 PM CET Read worked. | Write worked. 185.20.12.24
Thu 24 Dec 2020 11:37:30 PM CET Read worked. | Write worked. 172.31.35.57
We had 19 seconds during which writes didn’t work due to the switchover process. Reads worked throughout. Full log:
cjp@util:~$ ./wootest.sh
Thu 24 Dec 2020 11:36:53 PM CET Read worked. | Write worked. 185.20.12.24
Thu 24 Dec 2020 11:36:54 PM CET Read worked. | Write worked. 185.20.12.24
Thu 24 Dec 2020 11:36:56 PM CET Read worked. | Write worked. 185.20.12.24
Thu 24 Dec 2020 11:36:58 PM CET Read worked. | Write worked. 185.20.12.24
Thu 24 Dec 2020 11:36:59 PM CET Read worked. | Write worked. 185.20.12.24
Thu 24 Dec 2020 11:37:01 PM CET Read worked. | Write worked. 185.20.12.24
Thu 24 Dec 2020 11:37:03 PM CET Read worked. | Write worked. 185.20.12.24
Thu 24 Dec 2020 11:37:04 PM CET Read worked. | Write worked. 185.20.12.24
Thu 24 Dec 2020 11:37:06 PM CET Read worked. | Write worked. 185.20.12.24
Thu 24 Dec 2020 11:37:07 PM CET Read worked. | Write worked. 185.20.12.24
Thu 24 Dec 2020 11:37:09 PM CET Read worked. | Write worked. 185.20.12.24
Thu 24 Dec 2020 11:37:11 PM CET Read worked. | Write worked. 185.20.12.24
Thu 24 Dec 2020 11:37:12 PM CET Read worked. | Write failed. 185.20.12.24
Thu 24 Dec 2020 11:37:14 PM CET Read worked. | Write failed. 185.20.12.24
Thu 24 Dec 2020 11:37:15 PM CET Read worked. | Write failed. 185.20.12.24
Thu 24 Dec 2020 11:37:17 PM CET Read worked. | Write failed. 185.20.12.24
Thu 24 Dec 2020 11:37:19 PM CET Read worked. | Write failed. 185.20.12.24
Thu 24 Dec 2020 11:37:20 PM CET Read worked. | Write failed. 185.20.12.24
Thu 24 Dec 2020 11:37:22 PM CET Read worked. | Write failed. 185.20.12.24
Thu 24 Dec 2020 11:37:23 PM CET Read worked. | Write failed. 172.31.35.57
Thu 24 Dec 2020 11:37:25 PM CET Read worked. | Write failed. 172.31.35.57
Thu 24 Dec 2020 11:37:27 PM CET Read worked. | Write failed. 172.31.35.57
Thu 24 Dec 2020 11:37:30 PM CET Read worked. | Write worked. 172.31.35.57
Thu 24 Dec 2020 11:37:32 PM CET Read worked. | Write worked. 172.31.35.57
Thu 24 Dec 2020 11:37:34 PM CET Read worked. | Write worked. 172.31.35.57
Thu 24 Dec 2020 11:37:36 PM CET Read worked. | Write worked. 172.31.35.57
Thu 24 Dec 2020 11:37:38 PM CET Read worked. | Write worked. 172.31.35.57
Thu 24 Dec 2020 11:37:40 PM CET Read worked. | Write worked. 172.31.35.57
Thu 24 Dec 2020 11:37:42 PM CET Read worked. | Write worked. 172.31.35.57
Thu 24 Dec 2020 11:37:44 PM CET Read worked. | Write worked. 172.31.35.57
Yes, I _am_ an idiot. But not for the reasons that I thought…
Installing Zabbix Agent on RHEL was difficult for me somehow
Failover test
Okey, so now for the (hopefully) more rare case of the primary going down all of a sudden:
Sat 26 Dec 2020 06:44:18 PM CET Read worked. | Write worked. 172.31.35.57
Sat 26 Dec 2020 06:44:21 PM CET Read worked. | Write worked. 172.31.35.57
Sat 26 Dec 2020 06:44:23 PM CET Read worked. | Write worked. 172.31.35.57
Sat 26 Dec 2020 06:44:26 PM CET Read worked. | Write worked. 172.31.35.57
Sat 26 Dec 2020 06:44:27 PM CET Read failed. | Write failed. <html>
<head><title>521 Origin Down</title></head>
<body bgcolor="white">
<center><h1>521 Origin Down</h1></center>
<hr><center>cloudflare-nginx</center>
</body>
</html>
Sat 26 Dec 2020 06:44:28 PM CET Read failed. | Write failed. <html>
<head><title>521 Origin Down</title></head>
<body bgcolor="white">
<center><h1>521 Origin Down</h1></center>
<hr><center>cloudflare-nginx</center>
</body>
</html>
Sat 26 Dec 2020 06:45:30 PM CET Read failed. | Write failed. 185.20.12.24
Sat 26 Dec 2020 06:45:32 PM CET Read worked. | Write failed. 185.20.12.24
Sat 26 Dec 2020 06:45:33 PM CET Read worked. | Write failed. 185.20.12.24
Sat 26 Dec 2020 06:45:35 PM CET Read worked. | Write failed. 185.20.12.24
Sat 26 Dec 2020 06:45:36 PM CET Read worked. | Write failed. 185.20.12.24
Sat 26 Dec 2020 06:45:38 PM CET Read worked. | Write failed. 185.20.12.24
Sat 26 Dec 2020 06:45:39 PM CET Read worked. | Write failed. 185.20.12.24
Sat 26 Dec 2020 06:45:40 PM CET Read worked. | Write failed. 185.20.12.24
Sat 26 Dec 2020 06:45:42 PM CET Read worked. | Write failed. 185.20.12.24
Sat 26 Dec 2020 06:45:43 PM CET Read worked. | Write failed. 185.20.12.24
Sat 26 Dec 2020 06:45:45 PM CET Read worked. | Write failed. 185.20.12.24
Sat 26 Dec 2020 06:45:46 PM CET Read worked. | Write failed. 185.20.12.24
Sat 26 Dec 2020 06:45:48 PM CET Read worked. | Write failed. 185.20.12.24
Sat 26 Dec 2020 06:45:49 PM CET Read worked. | Write failed. 185.20.12.24
Sat 26 Dec 2020 06:45:50 PM CET Read worked. | Write failed. 185.20.12.24
Sat 26 Dec 2020 06:45:52 PM CET Read worked. | Write failed. 185.20.12.24
Sat 26 Dec 2020 06:45:53 PM CET Read worked. | Write failed. 185.20.12.24
Sat 26 Dec 2020 06:45:55 PM CET Read worked. | Write failed. 185.20.12.24
Sat 26 Dec 2020 06:45:56 PM CET Read worked. | Write worked. 185.20.12.24
Sat 26 Dec 2020 06:45:58 PM CET Read worked. | Write worked. 185.20.12.24
Sat 26 Dec 2020 06:45:59 PM CET Read worked. | Write worked. 185.20.12.24
Sat 26 Dec 2020 06:46:00 PM CET Read worked. | Write worked. 185.20.12.24
Sat 26 Dec 2020 06:46:02 PM CET Read worked. | Write worked. 185.20.12.24
Sat 26 Dec 2020 06:46:03 PM CET Read worked. | Write worked. 185.20.12.24
Sat 26 Dec 2020 06:46:04 PM CET Read worked. | Write worked. 185.20.12.24
Sat 26 Dec 2020 06:46:06 PM CET Read worked. | Write worked. 185.20.12.24
Sat 26 Dec 2020 06:46:07 PM CET Read worked. | Write worked. 185.20.12.24
Important numbers:
Sat 26 Dec 2020 06:44:26 PM CET Read worked. | Write worked. 172.31.35.57
Sat 26 Dec 2020 06:45:32 PM CET Read worked. | Write failed. 185.20.12.24
Sat 26 Dec 2020 06:45:56 PM CET Read worked. | Write worked. 185.20.12.24
So a minute with site being down for reads and writes, and another half a minute during which only writes didn’t work. Not too shabby. And Zabbix is suitably concerned:

Crashed server rehabilitation
Problems getting a crashed primary to sync with backup:
[root@ip-172-31-35-57 ~]# mysql -e "SHOW SLAVE STATUS\G;"
*************************** 1. row ***************************
Slave_IO_State:
Master_Host: backup
Master_User: replication_user
Master_Port: 3306
Connect_Retry: 60
Master_Log_File:
Read_Master_Log_Pos: 4
Relay_Log_File: woo1-relay-log.000001
Relay_Log_Pos: 4
Relay_Master_Log_File:
Slave_IO_Running: No
Slave_SQL_Running: Yes
Replicate_Do_DB:
Replicate_Ignore_DB:
Replicate_Do_Table:
Replicate_Ignore_Table:
Replicate_Wild_Do_Table:
Replicate_Wild_Ignore_Table:
Last_Errno: 0
Last_Error:
Skip_Counter: 0
Exec_Master_Log_Pos: 4
Relay_Log_Space: 256
Until_Condition: None
Until_Log_File:
Until_Log_Pos: 0
Master_SSL_Allowed: No
Master_SSL_CA_File:
Master_SSL_CA_Path:
Master_SSL_Cert:
Master_SSL_Cipher:
Master_SSL_Key:
Seconds_Behind_Master: NULL
Master_SSL_Verify_Server_Cert: No
Last_IO_Errno: 1236
Last_IO_Error: Got fatal error 1236 from master when reading data from binary log: 'Error: connecting slave requested to start from GTID 0-1-58257, which is not in the master's binlog. Since the master's binlog contains GTIDs with higher sequence numbers, it probably means that the slave has diverged due to executing extra erroneous transactions'
Last_SQL_Errno: 0
Last_SQL_Error:
Replicate_Ignore_Server_Ids:
Master_Server_Id: 2
Master_SSL_Crl:
Master_SSL_Crlpath:
Using_Gtid: Current_Pos
Gtid_IO_Pos: 0-1-58257
Replicate_Do_Domain_Ids:
Replicate_Ignore_Domain_Ids:
Parallel_Mode: conservative
SQL_Delay: 0
SQL_Remaining_Delay: NULL
Slave_SQL_Running_State: Slave has read all relay log; waiting for the slave I/O thread to update it
Slave_DDL_Groups: 2
Slave_Non_Transactional_Groups: 14585
Slave_Transactional_Groups: 8746
Oh, dear! What have we missed? I.e. what happened on primary when it was disconnected from the backup and _should_ have been read-only?
[root@ip-172-31-35-57 ~]# mysqlbinlog /var/lib/mysql/woo1-bin.000054 | less
<snip>
#201225 17:32:10 server id 1 end_log_pos 8224049 CRC32 0x780272cf GTID 0-1-58257
/*!100001 SET @@session.gtid_seq_no=58257*//*!*/;
START TRANSACTION
/*!*/;
# at 8224049
#201225 17:32:10 server id 1 end_log_pos 8224198 CRC32 0x07e3910e Query thread_id=3456 exec_time=0 error_code=0
SET TIMESTAMP=1608917530/*!*/;
DELETE FROM `wp9k_options` WHERE `option_name` = '_transient_doing_cron'
/*!*/;
# at 8224198
#201225 17:32:10 server id 1 end_log_pos 8224281 CRC32 0xb1ecec93 Query thread_id=3456 exec_time=0 error_code=0
SET TIMESTAMP=1608917530/*!*/;
COMMIT
/*!*/;
DELIMITER ;
# End of log file
ROLLBACK /* added by mysqlbinlog */;
/*!50003 SET COMPLETION_TYPE=@OLD_COMPLETION_TYPE*/;
/*!50530 SET @@SESSION.PSEUDO_SLAVE_MODE=0*/;
Oh, DELETE _transient_doing_cron… Well that’s no biggie. I was feigning concern that somehow lots of orders to my entirely not-real web shop had somehow happened to be written to the wrong database.
Well MariaDB is very stubborn about these sorts of things. The only way I know how to get primary back in sync with backup is:
root@woo2:~# mysqldump -A --master-data > backup_2020-12-25_1854.sql
root@woo2:~# scp backup_2020-12-25_1854.sql primary:/root
backup_2020-12-25_1854.sql
Back to woo1(primary):
[root@ip-172-31-35-57 ~]# systemctl stop mariadb
[root@ip-172-31-35-57 ~]# rm -rf /var/lib/mysql/* # <--- EXTREMELY DANGER IF YOU DON'T KNOW PRECISELY WHAT YOU ARE DOING. Lost hours of work doing this on the wrong server once...
[root@ip-172-31-35-57 ~]# mysql_install_db
Installing MariaDB/MySQL system tables in '/var/lib/mysql' ...
OK
To start mysqld at boot time you have to copy
support-files/mysql.server to the right place for your system
PLEASE REMEMBER TO SET A PASSWORD FOR THE MariaDB root USER !
To do so, start the server, then issue the following commands:
'/usr/bin/mysqladmin' -u root password 'new-password'
'/usr/bin/mysqladmin' -u root -h ip-172-31-35-57.eu-north-1.compute.internal password 'new-password'
Alternatively you can run:
'/usr/bin/mysql_secure_installation'
which will also give you the option of removing the test
databases and anonymous user created by default. This is
strongly recommended for production servers.
See the MariaDB Knowledgebase at http://mariadb.com/kb or the
MySQL manual for more instructions.
You can start the MariaDB daemon with:
cd '/usr' ; /usr/bin/mysqld_safe --datadir='/var/lib/mysql'
You can test the MariaDB daemon with mysql-test-run.pl
cd '/usr/mysql-test' ; perl mysql-test-run.pl
Please report any problems at http://mariadb.org/jira
The latest information about MariaDB is available at http://mariadb.org/.
You can find additional information about the MySQL part at:
http://dev.mysql.com
Consider joining MariaDB's strong and vibrant community:
https://mariadb.org/get-involved/
[root@ip-172-31-35-57 ~]# systemctl restart mariadb
Job for mariadb.service failed because the control process exited with error code.
See "systemctl status mariadb.service" and "journalctl -xe" for details.
[root@ip-172-31-35-57 ~]# journalctl -u mariadb
-- Logs begin at Wed 2020-12-23 00:40:59 UTC, end at Fri 2020-12-25 17:55:55 UTC. --
Dec 23 00:44:22 ip-172-31-35-57.eu-north-1.compute.internal systemd[1]: Starting MariaDB 10.3 database server...
Dec 23 00:44:22 ip-172-31-35-57.eu-north-1.compute.internal mysql-check-socket[1506]: Socket file /var/lib/mysql/mysql.sock exists.
Dec 23 00:44:22 ip-172-31-35-57.eu-north-1.compute.internal mysql-check-socket[1506]: No process is using /var/lib/mysql/mysql.sock, which means it is a garbage, so it will be removed automatically.
Dec 23 00:44:22 ip-172-31-35-57.eu-north-1.compute.internal mysql-prepare-db-dir[1535]: Database MariaDB is probably initialized in /var/lib/mysql already, nothing is done.
Dec 23 00:44:22 ip-172-31-35-57.eu-north-1.compute.internal mysql-prepare-db-dir[1535]: If this is not the case, make sure the /var/lib/mysql is empty before running mysql-prepare-db-dir.
Dec 23 00:44:23 ip-172-31-35-57.eu-north-1.compute.internal mysqld[1573]: 2020-12-23 0:44:23 0 [Note] /usr/libexec/mysqld (mysqld 10.3.27-MariaDB-log) starting as process 1573 ...
Dec 23 00:44:23 ip-172-31-35-57.eu-north-1.compute.internal systemd[1]: Started MariaDB 10.3 database server.
Dec 25 17:55:26 ip-172-31-35-57.eu-north-1.compute.internal systemd[1]: Stopping MariaDB 10.3 database server...
Dec 25 17:55:28 ip-172-31-35-57.eu-north-1.compute.internal systemd[1]: mariadb.service: Succeeded.
Dec 25 17:55:28 ip-172-31-35-57.eu-north-1.compute.internal systemd[1]: Stopped MariaDB 10.3 database server.
Dec 25 17:55:55 ip-172-31-35-57.eu-north-1.compute.internal systemd[1]: Starting MariaDB 10.3 database server...
Dec 25 17:55:55 ip-172-31-35-57.eu-north-1.compute.internal mysql-prepare-db-dir[66558]: Database MariaDB is probably initialized in /var/lib/mysql already, nothing is done.
Dec 25 17:55:55 ip-172-31-35-57.eu-north-1.compute.internal mysql-prepare-db-dir[66558]: If this is not the case, make sure the /var/lib/mysql is empty before running mysql-prepare-db-dir.
Dec 25 17:55:55 ip-172-31-35-57.eu-north-1.compute.internal mysqld[66596]: 2020-12-25 17:55:55 0 [Note] /usr/libexec/mysqld (mysqld 10.3.27-MariaDB-log) starting as process 66596 ...
Dec 25 17:55:55 ip-172-31-35-57.eu-north-1.compute.internal mysqld[66596]: /usr/libexec/mysqld: Can't create file './woo1.err' (errno: 13 "Permission denied")
Dec 25 17:55:55 ip-172-31-35-57.eu-north-1.compute.internal mysqld[66596]: 2020-12-25 17:55:55 0 [ERROR] mysqld: File './woo1-bin.index' not found (Errcode: 13 "Permission denied")
Dec 25 17:55:55 ip-172-31-35-57.eu-north-1.compute.internal mysqld[66596]: 2020-12-25 17:55:55 0 [ERROR] Aborting
Dec 25 17:55:55 ip-172-31-35-57.eu-north-1.compute.internal systemd[1]: mariadb.service: Main process exited, code=exited, status=1/FAILURE
Dec 25 17:55:55 ip-172-31-35-57.eu-north-1.compute.internal systemd[1]: mariadb.service: Failed with result 'exit-code'.
Dec 25 17:55:55 ip-172-31-35-57.eu-north-1.compute.internal systemd[1]: Failed to start MariaDB 10.3 database server.
[root@ip-172-31-35-57 ~]# ls -lh /var/lib/
total 4.0K
drwxr-xr-x. 2 root root 85 Dec 18 20:17 alternatives
drwxr-xr-x. 3 root root 209 Oct 31 05:07 authselect
drwxr-xr-x. 2 chrony chrony 19 Dec 25 17:49 chrony
drwxr-xr-x. 8 root root 105 Dec 23 00:41 cloud
drwx------. 2 apache apache 6 Jun 15 2020 dav
drwxr-xr-x. 2 root root 24 Oct 31 05:02 dbus
drwxr-xr-x. 2 root root 6 May 7 2020 dhclient
drwxr-xr-x. 4 root root 115 Dec 25 16:32 dnf
drwxr-xr-x. 2 root root 6 Apr 23 2020 games
drwx------. 2 apache apache 6 Jun 15 2020 httpd
drwxr-xr-x. 2 root root 6 Aug 4 12:03 initramfs
drwxr-xr-x. 2 root root 30 Dec 25 03:51 logrotate
drwxr-xr-x. 2 root root 6 Apr 23 2020 misc
drwxr-xr-x. 5 mysql mysql 264 Dec 25 17:55 mysql
drwxr-xr-x. 4 root root 45 Dec 18 22:13 net-snmp
drwx------. 2 root root 169 Dec 25 17:51 NetworkManager
drwxrwx---. 3 nginx root 33 Dec 18 23:20 nginx
drwxr-xr-x. 2 root root 6 Aug 12 2018 os-prober
drwxr-xr-x. 6 root root 68 Dec 18 21:37 php
drwxr-x---. 3 root polkitd 28 Oct 31 05:02 polkit-1
drwx------. 2 root root 6 Oct 31 05:02 portables
drwx------. 2 root root 6 Oct 31 05:02 private
drwxr-x---. 6 root root 71 Oct 31 05:03 rhsm
drwxr-xr-x. 2 root root 4.0K Dec 23 00:51 rpm
drwxr-xr-x. 3 root root 21 Dec 18 20:30 rpm-state
drwx------. 2 root root 29 Dec 25 17:55 rsyslog
drwxr-xr-x. 5 root root 46 Oct 31 05:03 selinux
drwxr-xr-x. 9 root root 105 Oct 31 05:03 sss
drwxr-xr-x. 5 root root 70 Nov 6 13:06 systemd
drwx------. 2 tss tss 6 Jun 10 2019 tpm
drwxr-xr-x. 2 root root 6 Oct 2 16:12 tuned
drwxr-xr-x. 2 unbound unbound 22 Dec 25 00:00 unbound
[root@ip-172-31-35-57 ~]# ls -lh /var/lib/mysql/
total 109M
-rw-rw---- 1 root root 16K Dec 25 17:55 aria_log.00000001
-rw-rw---- 1 root root 52 Dec 25 17:55 aria_log_control
-rw-rw---- 1 root root 972 Dec 25 17:55 ib_buffer_pool
-rw-rw---- 1 root root 12M Dec 25 17:55 ibdata1
-rw-rw---- 1 root root 48M Dec 25 17:55 ib_logfile0
-rw-rw---- 1 root root 48M Dec 25 17:55 ib_logfile1
drwx------ 2 root root 4.0K Dec 25 17:55 mysql
drwx------ 2 root root 20 Dec 25 17:55 performance_schema
drwx------ 2 root root 20 Dec 25 17:55 test
-rw-rw---- 1 root root 28K Dec 25 17:55 woo1-bin.000001
-rw-rw---- 1 root root 18 Dec 25 17:55 woo1-bin.index
-rw-rw---- 1 root root 7 Dec 25 17:55 woo1-bin.state
-rw-rw---- 1 root root 642 Dec 25 17:55 woo1.err
[root@ip-172-31-35-57 ~]# chown -R mysql:mysql /var/lib/mysql/
[root@ip-172-31-35-57 ~]# systemctl restart mariadb
[root@ip-172-31-35-57 ~]# journalctl -u mariadb
-- Logs begin at Wed 2020-12-23 00:40:59 UTC, end at Fri 2020-12-25 17:56:22 UTC. --
Dec 23 00:44:22 ip-172-31-35-57.eu-north-1.compute.internal systemd[1]: Starting MariaDB 10.3 database server...
Dec 23 00:44:22 ip-172-31-35-57.eu-north-1.compute.internal mysql-check-socket[1506]: Socket file /var/lib/mysql/mysql.sock exists.
Dec 23 00:44:22 ip-172-31-35-57.eu-north-1.compute.internal mysql-check-socket[1506]: No process is using /var/lib/mysql/mysql.sock, which means it is a garbage, so it will be removed automatically.
Dec 23 00:44:22 ip-172-31-35-57.eu-north-1.compute.internal mysql-prepare-db-dir[1535]: Database MariaDB is probably initialized in /var/lib/mysql already, nothing is done.
Dec 23 00:44:22 ip-172-31-35-57.eu-north-1.compute.internal mysql-prepare-db-dir[1535]: If this is not the case, make sure the /var/lib/mysql is empty before running mysql-prepare-db-dir.
Dec 23 00:44:23 ip-172-31-35-57.eu-north-1.compute.internal mysqld[1573]: 2020-12-23 0:44:23 0 [Note] /usr/libexec/mysqld (mysqld 10.3.27-MariaDB-log) starting as process 1573 ...
Dec 23 00:44:23 ip-172-31-35-57.eu-north-1.compute.internal systemd[1]: Started MariaDB 10.3 database server.
Dec 25 17:55:26 ip-172-31-35-57.eu-north-1.compute.internal systemd[1]: Stopping MariaDB 10.3 database server...
Dec 25 17:55:28 ip-172-31-35-57.eu-north-1.compute.internal systemd[1]: mariadb.service: Succeeded.
Dec 25 17:55:28 ip-172-31-35-57.eu-north-1.compute.internal systemd[1]: Stopped MariaDB 10.3 database server.
Dec 25 17:55:55 ip-172-31-35-57.eu-north-1.compute.internal systemd[1]: Starting MariaDB 10.3 database server...
Dec 25 17:55:55 ip-172-31-35-57.eu-north-1.compute.internal mysql-prepare-db-dir[66558]: Database MariaDB is probably initialized in /var/lib/mysql already, nothing is done.
Dec 25 17:55:55 ip-172-31-35-57.eu-north-1.compute.internal mysql-prepare-db-dir[66558]: If this is not the case, make sure the /var/lib/mysql is empty before running mysql-prepare-db-dir.
Dec 25 17:55:55 ip-172-31-35-57.eu-north-1.compute.internal mysqld[66596]: 2020-12-25 17:55:55 0 [Note] /usr/libexec/mysqld (mysqld 10.3.27-MariaDB-log) starting as process 66596 ...
Dec 25 17:55:55 ip-172-31-35-57.eu-north-1.compute.internal mysqld[66596]: /usr/libexec/mysqld: Can't create file './woo1.err' (errno: 13 "Permission denied")
Dec 25 17:55:55 ip-172-31-35-57.eu-north-1.compute.internal mysqld[66596]: 2020-12-25 17:55:55 0 [ERROR] mysqld: File './woo1-bin.index' not found (Errcode: 13 "Permission denied")
Dec 25 17:55:55 ip-172-31-35-57.eu-north-1.compute.internal mysqld[66596]: 2020-12-25 17:55:55 0 [ERROR] Aborting
Dec 25 17:55:55 ip-172-31-35-57.eu-north-1.compute.internal systemd[1]: mariadb.service: Main process exited, code=exited, status=1/FAILURE
Dec 25 17:55:55 ip-172-31-35-57.eu-north-1.compute.internal systemd[1]: mariadb.service: Failed with result 'exit-code'.
Dec 25 17:55:55 ip-172-31-35-57.eu-north-1.compute.internal systemd[1]: Failed to start MariaDB 10.3 database server.
Dec 25 17:56:22 ip-172-31-35-57.eu-north-1.compute.internal systemd[1]: Starting MariaDB 10.3 database server...
Dec 25 17:56:22 ip-172-31-35-57.eu-north-1.compute.internal mysql-prepare-db-dir[66636]: Database MariaDB is probably initialized in /var/lib/mysql already, nothing is done.
Dec 25 17:56:22 ip-172-31-35-57.eu-north-1.compute.internal mysql-prepare-db-dir[66636]: If this is not the case, make sure the /var/lib/mysql is empty before running mysql-prepare-db-dir.
Dec 25 17:56:22 ip-172-31-35-57.eu-north-1.compute.internal mysqld[66675]: 2020-12-25 17:56:22 0 [Note] /usr/libexec/mysqld (mysqld 10.3.27-MariaDB-log) starting as process 66675 ...
Dec 25 17:56:22 ip-172-31-35-57.eu-north-1.compute.internal mysql-check-upgrade[66707]: The datadir located at /var/lib/mysql needs to be upgraded using 'mysql_upgrade' tool. This can be done using the following steps:
Dec 25 17:56:22 ip-172-31-35-57.eu-north-1.compute.internal mysql-check-upgrade[66707]: 1. Back-up your data before with 'mysql_upgrade'
Dec 25 17:56:22 ip-172-31-35-57.eu-north-1.compute.internal mysql-check-upgrade[66707]: 2. Start the database daemon using 'service mariadb start'
Dec 25 17:56:22 ip-172-31-35-57.eu-north-1.compute.internal mysql-check-upgrade[66707]: 3. Run 'mysql_upgrade' with a database user that has sufficient privileges
Dec 25 17:56:22 ip-172-31-35-57.eu-north-1.compute.internal mysql-check-upgrade[66707]: Read more about 'mysql_upgrade' usage at:
Dec 25 17:56:22 ip-172-31-35-57.eu-north-1.compute.internal mysql-check-upgrade[66707]: https://mariadb.com/kb/en/mariadb/documentation/sql-commands/table-commands/mysql_upgrade/
Dec 25 17:56:22 ip-172-31-35-57.eu-north-1.compute.internal systemd[1]: Started MariaDB 10.3 database server.
[root@ip-172-31-35-57 ~]# mysql_upgrade
Phase 1/7: Checking and upgrading mysql database
Processing databases
mysql
mysql.column_stats OK
mysql.columns_priv OK
mysql.db OK
mysql.event OK
mysql.func OK
mysql.gtid_slave_pos OK
mysql.help_category OK
mysql.help_keyword OK
mysql.help_relation OK
mysql.help_topic OK
mysql.host OK
mysql.index_stats OK
mysql.innodb_index_stats OK
mysql.innodb_table_stats OK
mysql.plugin OK
mysql.proc OK
mysql.procs_priv OK
mysql.proxies_priv OK
mysql.roles_mapping OK
mysql.servers OK
mysql.table_stats OK
mysql.tables_priv OK
mysql.time_zone OK
mysql.time_zone_leap_second OK
mysql.time_zone_name OK
mysql.time_zone_transition OK
mysql.time_zone_transition_type OK
mysql.transaction_registry OK
mysql.user OK
Phase 2/7: Installing used storage engines... Skipped
Phase 3/7: Fixing views
Phase 4/7: Running 'mysql_fix_privilege_tables'
Phase 5/7: Fixing table and database names
Phase 6/7: Checking and upgrading tables
Processing databases
information_schema
performance_schema
test
Phase 7/7: Running 'FLUSH PRIVILEGES'
OK
[root@ip-172-31-35-57 ~]# mysql -e "SHOW SLAVE STATUS\G;"
[root@ip-172-31-35-57 ~]# mysql < backup_2020-12-25_1854.sql
[root@ip-172-31-35-57 ~]# grep gtid_slave_pos backup_2020-12-25_1854.sql | head -1
-- SET GLOBAL gtid_slave_pos='0-2-61170';
Welcome to the MariaDB monitor. Commands end with ; or \g.
Your MariaDB connection id is 32
Server version: 10.3.27-MariaDB-log MariaDB Server
Copyright (c) 2000, 2018, Oracle, MariaDB Corporation Ab and others.
Type 'help;' or '\h' for help. Type '\c' to clear the current input statement.
MariaDB [(none)]> GRANT ALL PRIVILEGES ON `johante2_wp508`.* TO `johante2_wp508`@`localhost`;
Query OK, 0 rows affected (0.001 sec)
MariaDB [(none)]> GRANT USAGE,REPLICATION CLIENT,PROCESS,SHOW DATABASES,SHOW VIEW ON *.* TO 'zbx_monitor'@'%';
Query OK, 0 rows affected (0.000 sec)
MariaDB [(none)]> SET GLOBAL gtid_slave_pos='0-2-61170';
Query OK, 0 rows affected (0.021 sec)
MariaDB [(none)]> Bye
[root@ip-172-31-35-57 ~]#
Oki, let’s try to get primary to replicate from backup now.
MariaDB [(none)]> CHANGE MASTER TO master_host="backup", master_port=3306, master_user="replication_user", master_password="REPLICATION_PASSWORD", master_use_gtid=slave_pos;
Query OK, 0 rows affected (0.005 sec)
MariaDB [(none)]> START SLAVE;
Query OK, 0 rows affected (0.009 sec)
MariaDB [(none)]> SHOW SLAVE STATUS;
+----------------------------------+-------------+------------------+-------------+---------------+-----------------+---------------------+-----------------------+---------------+-----------------------+------------------+-------------------+-----------------+---------------------+--------------------+------------------------+-------------------------+-----------------------------+------------+------------+--------------+---------------------+-----------------+-----------------+----------------+---------------+--------------------+--------------------+--------------------+-----------------+-------------------+----------------+-----------------------+-------------------------------+---------------+---------------+----------------+----------------+-----------------------------+------------------+----------------+--------------------+------------+-------------+-------------------------+-----------------------------+---------------+-----------+---------------------+-----------------------------------------------------------------------------+------------------+--------------------------------+----------------------------+
| Slave_IO_State | Master_Host | Master_User | Master_Port | Connect_Retry | Master_Log_File | Read_Master_Log_Pos | Relay_Log_File | Relay_Log_Pos | Relay_Master_Log_File | Slave_IO_Running | Slave_SQL_Running | Replicate_Do_DB | Replicate_Ignore_DB | Replicate_Do_Table | Replicate_Ignore_Table | Replicate_Wild_Do_Table | Replicate_Wild_Ignore_Table | Last_Errno | Last_Error | Skip_Counter | Exec_Master_Log_Pos | Relay_Log_Space | Until_Condition | Until_Log_File | Until_Log_Pos | Master_SSL_Allowed | Master_SSL_CA_File | Master_SSL_CA_Path | Master_SSL_Cert | Master_SSL_Cipher | Master_SSL_Key | Seconds_Behind_Master | Master_SSL_Verify_Server_Cert | Last_IO_Errno | Last_IO_Error | Last_SQL_Errno | Last_SQL_Error | Replicate_Ignore_Server_Ids | Master_Server_Id | Master_SSL_Crl | Master_SSL_Crlpath | Using_Gtid | Gtid_IO_Pos | Replicate_Do_Domain_Ids | Replicate_Ignore_Domain_Ids | Parallel_Mode | SQL_Delay | SQL_Remaining_Delay | Slave_SQL_Running_State | Slave_DDL_Groups | Slave_Non_Transactional_Groups | Slave_Transactional_Groups |
+----------------------------------+-------------+------------------+-------------+---------------+-----------------+---------------------+-----------------------+---------------+-----------------------+------------------+-------------------+-----------------+---------------------+--------------------+------------------------+-------------------------+-----------------------------+------------+------------+--------------+---------------------+-----------------+-----------------+----------------+---------------+--------------------+--------------------+--------------------+-----------------+-------------------+----------------+-----------------------+-------------------------------+---------------+---------------+----------------+----------------+-----------------------------+------------------+----------------+--------------------+------------+-------------+-------------------------+-----------------------------+---------------+-----------+---------------------+-----------------------------------------------------------------------------+------------------+--------------------------------+----------------------------+
| Waiting for master to send event | backup | replication_user | 3306 | 60 | woo2-bin.000002 | 6356025 | woo1-relay-log.000002 | 95872 | woo2-bin.000002 | Yes | Yes | | | | | | | 0 | | 0 | 6356025 | 96180 | None | | 0 | No | | | | | | 0 | No | 0 | | 0 | | | 2 | | | Slave_Pos | 0-2-61237 | | | conservative | 0 | NULL | Slave has read all relay log; waiting for the slave I/O thread to update it | 0 | 40 | 27 |
+----------------------------------+-------------+------------------+-------------+---------------+-----------------+---------------------+-----------------------+---------------+-----------------------+------------------+-------------------+-----------------+---------------------+--------------------+------------------------+-------------------------+-----------------------------+------------+------------+--------------+---------------------+-----------------+-----------------+----------------+---------------+--------------------+--------------------+--------------------+-----------------+-------------------+----------------+-----------------------+-------------------------------+---------------+---------------+----------------+----------------+-----------------------------+------------------+----------------+--------------------+------------+-------------+-------------------------+-----------------------------+---------------+-----------+---------------------+-----------------------------------------------------------------------------+------------------+--------------------------------+----------------------------+
1 row in set (0.000 sec)
MariaDB [(none)]> Bye
From now on we can start replication by saying master_use_gtid=current_pos; at the end but this time around we had to use slave_pos as the starting point. Because reasons.
Failover test
Whatever system you set up for high availability – even a less janky one – needs to be tested a lot before being used live. Example: did you know that Amazon EC2 will assign a VM a different IP address after it was Stopped? Well I sure didn’t! Nor did I realize that Cloudflare’s Origin Server editing UI can’t handle spaces:

Took me a while to figure out I had copied the string “13.49.145.244 “.
The main issue with this test was that I had forgot to turn on the failover service on backup… But that made it clear there should be a Zabbix item counting the number of failover processes running and a trigger to warn if it is zero.

Trigger only if 5 consequtive measurements indicate that process is down(there can be some noise):

Implementation
So now to the details of how it works. Let’s start with the main script that runs on the backup monitoring the primary:
root@woo2:~/ha_scripts# cat monitor_primary.sh
#!/bin/bash
HTTP_FAIL="0"
while true; do
mysql -e "SHOW VARIABLES LIKE 'read_only';" | grep -iq off
if [ $? -eq 0 ]
then
# We're apparently accepting writes so we are in the
# master already, little point in promoting this node.
exit 1
fi
curl --connect-timeout 5 -fks -H "Host: woo.deref.se" https://primary -o /dev/null
if [ $? -ne 0 ]
then
((HTTP_FAIL=HTTP_FAIL+1))
echo "Failures: $HTTP_FAIL."
else
((HTTP_FAIL=0))
fi
if [ $HTTP_FAIL -gt 3 ]
then
./get_primary_health.sh | grep "healthy" | grep -q "true"
if [ $? -ne 0 ]
then
# Not sure primary is reachable, but let's try getting it to
# stop sending changes to us
ssh primary "systemctl stop lsyncd"
# If the primary is still ticking along but not reachable
# we need to prevent it from overwriting our file system
unlink /root/.ssh/authorized_keys
ln -s /root/.ssh/authorized_keys_no_primary /root/.ssh/authorized_keys
# primary cut off from sending changes to us, we can push changes
# in the other direction then. Might work, might not.
systemctl start lsyncd
# Wait for replication to catch up
./checkslave.sh
# We've caught up, stop replicating.
mysql -e "STOP SLAVE;"
sleep 1
mysql -e "RESET SLAVE ALL;"
sleep 5
# It's now safe to accept writes
mysql -e 'SET GLOBAL read_only = OFF;'
# Got to tell Cloudflare not to switch back to primary.
./set_backup_alone_in_lb.sh
# Let's give it a shot to get the primary to replicate data from us
# Started as a background job since it's not critical
./make_primary_slave.sh &
exit 0
else
# Okey, if Cloudflare says the server is okey, let's give it another
# chance by starting the counting from zero
echo "Cloudflare says primary is healthy, even though we failed to reach it. Resetting fail-counter."
((HTTP_FAIL=0))
fi
fi
sleep 5
done
Obviously “curl –connect-timeout 5 -fks -H “Host: woo.deref.se” https://primary -o /dev/null” needs to be modified to your environment.
There are some other scripts being called from this one, checkslave.sh is a chunk of code I copied straight from ocf:heartbeat:mariadb:
#!/bin/bash
# Swiped from ocf:heartbeat:mariadb resource agent
SLAVESTATUS=$(mysql -e "SHOW SLAVE STATUS;" | wc -l)
if [ $SLAVESTATUS -gt 0 ]; then
tmpfile="/tmp/processlist.txt"
while true; do
mysql -e 'SHOW PROCESSLIST\G' > $tmpfile
if grep -i 'Has read all relay log' $tmpfile >/dev/null; then
echo "MariaDB slave has finished processing relay log"
break
fi
if ! grep -q 'system user' $tmpfile; then
echo "Slave not running - not waiting to finish"
break
fi
echo "Waiting for MariaDB slave to finish processing relay log"
sleep 1
done
fi
How best to test if a MariaDB server is busy catching up to the master it was replicating from is tricky. If you can get your hands on the results that worked well for other people in a lot of scenarios then you should use that.
set_backup_alone_in_lb.sh is an API-call to Cloudflare:
#!/bin/bash
source cloudflare_vars.sh
source site_vars.sh
# Promote primary in the Load Balancer
# It takes a few seconds
make_payload()
{
cat <<EOF
{
"description": "",
"id": "$LB_ID",
"proxied": true,
"enabled": true,
"name": "$DOMAIN_NAME",
"session_affinity": "none",
"session_affinity_attributes": {
"samesite": "Auto",
"secure": "Auto",
"drain_duration": 0
},
"steering_policy": "off",
"fallback_pool": "$BACKUP_POOL_ID",
"default_pools": [
"$BACKUP_POOL_ID"
],
"pop_pools": {},
"region_pools": {}
}
EOF
}
curl -fX PUT "https://api.cloudflare.com/client/v4/zones/$ZONE_ID/load_balancers/$LB_ID" -H "Authorization: Bearer $API_TOKEN" -H "Content-Type:application/json" --data "$(make_payload)"
Cloudflare_vars and site_vars are just files assigning the right values to variables used throughout:
# Cloudflare vars
PRIMARY_POOL_ID="5a3ccSECRETSECRETSECRET0797e2d9ce"
BACKUP_POOL_ID="0ef15eSECRETSECRETSECRETf06686ac1e"
LB_ID="93ff3da14adSECRETSECRETSECRETe6ad1ee3d"
CLIENT_ID="a7e9f6SECRETSECRETSECRETb864cb4a5"
ZONE_ID="5e98d4dSECRETSECRETSECRET87cb6ad8"
API_TOKEN="noMamSECRETSECRETSECRETZaO1X"
And:
#!/bin/bash
DOMAIN_NAME="woo.deref.se"
REPLICATION_PASSWORD="SECRETSECRET"
It’s not guaranteed that the primary stopped working because it crashed and won’t restart. So we should at least try getting the primary to take on the roll as slave, hence the invocation of make_primary_slave.sh:
#!/bin/bash
source site_vars.sh
# We want the primary to be read only for as much of this process
# as possible because we can't proceed until the backup has caught
# up with all the writes performed on the master
ssh primary "mysql -e 'SET GLOBAL read_only = ON;'"
# Let's try to get the primary to replicate from us
ssh primary "mysql -e 'CHANGE MASTER TO master_host=\"backup\", master_port=3306, master_user=\"replication_user\", master_password=\"$REPLICATION_PASSWORD\", master_use_gtid=current_pos;'"
ssh primary "mysql -e 'START SLAVE;'"
Scripts for switchover
It’s good to sanity check the database of the server we’re trying to promote to master:
#!/bin/bash
tmpfile="/tmp/remoteslavestatus"
ssh $1 "mysql -e 'SHOW SLAVE STATUS\G;'" > $tmpfile
SLAVE_IO_STATE=$(grep -i "slave_io_state" $tmpfile | wc -w)
SLAVE_IO_RUNNING=$?
SLAVE_SQL_RUNNING=$?
SLAVE_SECONDS_BEHIND=$(grep -i seconds_behind_master $tmpfile | awk '{print $2}')
if [ $SLAVE_IO_STATE -lt 2 ]
then
echo "Slave IO State is empty. Is $1 properly synchronized? Bailing out."
exit 1
else
echo "Slave IO state is not null. Good."
fi
if [ $SLAVE_IO_RUNNING -ne 0 ]
then
echo "Slave IO seems not to be running. Bailing out."
exit 1
else
echo "Slave SQL running. Good."
fi
if [ $SLAVE_SQL_RUNNING -ne 0 ]
then
echo "Slave SQL seems not to be running. Bailing out."
exit 1
else
echo "Slave SQL running. Good."
fi
if [ $SLAVE_SECONDS_BEHIND == "NULL" ]
then
echo "Slave is not processing relay log. Bailing out."
exit 1
fi
if [ $SLAVE_SECONDS_BEHIND -gt 5 ]
then
echo "Slave is more than 5 seconds behind master. Won't have time to catch up! Bailing out."
exit 1
else
echo "Slave is reasonably caught up, only $SLAVE_SECONDS_BEHIND s behind."
fi
For instance, let’s check if it’s okey to switch over the primary after it rebooted and failover occurred:
root@woo2:~/ha_scripts# ./sanity_check.sh primary
Slave IO State is empty. Is primary properly synchronized? Bailing out.
Nope. Not at all. Have to make it replicate changes first:
root@woo2:~/ha_scripts# ./make_primary_slave.sh
root@woo2:~/ha_scripts# ./sanity_check.sh primary
Slave IO state is not null. Good.
Slave SQL running. Good.
Slave SQL running. Good.
Slave is reasonably caught up, only 0 s behind.
root@woo2:~/ha_scripts#
Let’s just check how things are going with lsyncd:
root@woo2:~/ha_scripts# systemctl status lsyncd
● lsyncd.service - LSB: lsyncd daemon init script
Loaded: loaded (/etc/init.d/lsyncd; generated)
Active: active (exited) since Sat 2020-12-26 18:45:50 CET; 28min ago
Docs: man:systemd-sysv-generator(8)
Process: 1081833 ExecStart=/etc/init.d/lsyncd start (code=exited, status=0/SUCCESS)
Dec 26 18:45:50 woo2 systemd[1]: Starting LSB: lsyncd daemon init script...
Dec 26 18:45:50 woo2 lsyncd[1081833]: * Starting synchronization daemon lsyncd
Dec 26 18:45:50 woo2 lsyncd[1081838]: 18:45:50 Normal: --- Startup, daemonizing ---
Dec 26 18:45:50 woo2 lsyncd[1081833]: ...done.
Dec 26 18:45:50 woo2 lsyncd[1081838]: Normal, --- Startup, daemonizing ---
Dec 26 18:45:50 woo2 systemd[1]: Started LSB: lsyncd daemon init script.
Dec 26 18:45:50 woo2 lsyncd[1081839]: Normal, recursive startup rsync: /var/www/html/ -> primary:/var/www/html/
Dec 26 18:45:52 woo2 lsyncd[1081839]: Error, Temporary or permanent failure on startup of "/var/www/html/". Terminating since "insist" is not set.
root@woo2:~/ha_scripts# sudo -u www-data date >> /var/www/html/testfile
root@woo2:~/ha_scripts# systemctl restart lsyncd
Testfile didn’t show up on primary so indeed the error message was right about lsyncd having given up on synchronizing to primary. A restart did the trick. So let’s switch over using this script called demote_backup.sh :
#!/bin/bash
source site_vars.sh
ssh primary "systemctl status mariadb"
# If either we couldn't reach primary or mariadb was down
# then we really should switch over
if [ $? -ne 0 ];
then
echo "Error! Couldn't verify that primary has MariaDB running! Bailing out."
exit 1
fi
curl -fks -H "Host: woo.deref.se" https://primary/
if [ $? -ne 0 ];
then
echo "Error! Couldn't verify that primary has Nginx running! Bailing out."
exit 1
fi
mysql -e "SET GLOBAL read_only = ON;"
systemctl stop lsyncd
# Wait for replication to catch up, if necessary
ssh primary "./checkslave.sh"
# Promote primary in the Load Balancer
# It takes a few seconds
./set_primary_first_in_lb.sh
# Just in case there was replication active on primary
ssh primary "mysql -e 'STOP SLAVE;'"
sleep 1
ssh primary "mysql -e 'RESET SLAVE ALL;'"
# Time for us to stop reading updates from primary
mysql -e "STOP SLAVE;"
# Let's allow primary's SSH key so lsyncd will work
unlink /root/.ssh/authorized_keys
ln -s /root/.ssh/authorized_keys_primary_allowed /root/.ssh/authorized_keys
# primary can start accepting writes now that it's no longer replicating
ssh primary "mysql -e 'SET GLOBAL read_only = OFF;'"
ssh primary "systemctl start lsyncd"
# Time for us to start replicating
mysql -e "CHANGE MASTER TO master_host='primary', master_port=3306, master_user='replication_user', master_password='$REPLICATION_PASSWORD', master_use_gtid=current_pos;"
mysql -e "START SLAVE;"
Again, “curl -fks -H “Host: woo.deref.se” https://primary/” isn’t neatly generated based on the contents of the _vars-files but whatever. Let’s see if it works out. I’ll start the same timing script as before on another server.
cjp@util:~$ ./wootest.sh | tee test_woo_scripted_switchover6.txt
Sat 26 Dec 2020 07:18:28 PM CET Read worked. | Write worked. 185.20.12.24
Sat 26 Dec 2020 07:18:30 PM CET Read worked. | Write worked. 185.20.12.24
Sat 26 Dec 2020 07:18:31 PM CET Read worked. | Write worked. 185.20.12.24
Sat 26 Dec 2020 07:18:33 PM CET Read worked. | Write worked. 185.20.12.24
Sat 26 Dec 2020 07:18:34 PM CET Read worked. | Write worked. 185.20.12.24
Sat 26 Dec 2020 07:18:36 PM CET Read worked. | Write worked. 185.20.12.24
Sat 26 Dec 2020 07:18:37 PM CET Read worked. | Write failed. 185.20.12.24
Sat 26 Dec 2020 07:18:39 PM CET Read worked. | Write failed. 185.20.12.24
Sat 26 Dec 2020 07:18:41 PM CET Read worked. | Write worked. 172.31.35.57
Sat 26 Dec 2020 07:18:43 PM CET Read worked. | Write worked. 172.31.35.57
Sat 26 Dec 2020 07:18:44 PM CET Read worked. | Write worked. 172.31.35.57
Sat 26 Dec 2020 07:18:46 PM CET Read worked. | Write worked. 172.31.35.57
Boy, that was fast.
Output on backup shows that we get a good response from the primary regarding mariadb(just another safety check) followed by the contents of a call to Nginx on the primary, finally showing the new state of the load balancer at Cloudflare:
root@woo2:~/ha_scripts# ./demote_backup.sh
● mariadb.service - MariaDB 10.3 database server
Loaded: loaded (/usr/lib/systemd/system/mariadb.service; enabled; vendor preset: disabled)
Active: active (running) since Sat 2020-12-26 18:10:16 UTC; 8min ago
Docs: man:mysqld(8)
https://mariadb.com/kb/en/library/systemd/
Process: 932 ExecStartPost=/usr/libexec/mysql-check-upgrade (code=exited, status=0/SUCCESS)
Process: 828 ExecStartPre=/usr/libexec/mysql-prepare-db-dir mariadb.service (code=exited, status=0/SUCCESS)
Process: 779 ExecStartPre=/usr/libexec/mysql-check-socket (code=exited, status=0/SUCCESS)
Main PID: 870 (mysqld)
Status: "Taking your SQL requests now..."
Tasks: 36 (limit: 10984)
Memory: 120.0M
CGroup: /system.slice/mariadb.service
└─870 /usr/libexec/mysqld --basedir=/usr
Dec 26 18:10:15 ip-172-31-35-57.eu-north-1.compute.internal systemd[1]: Starting MariaDB 10.3 database server...
Dec 26 18:10:15 ip-172-31-35-57.eu-north-1.compute.internal mysql-prepare-db-dir[828]: Database MariaDB is probably initialized in /var/lib/mysql already, nothing is done.
Dec 26 18:10:15 ip-172-31-35-57.eu-north-1.compute.internal mysql-prepare-db-dir[828]: If this is not the case, make sure the /var/lib/mysql is empty before running mysql-prepare-db-dir.
Dec 26 18:10:15 ip-172-31-35-57.eu-north-1.compute.internal mysqld[870]: 2020-12-26 18:10:15 0 [Note] /usr/libexec/mysqld (mysqld 10.3.27-MariaDB-log) starting as process 870 ...
Dec 26 18:10:16 ip-172-31-35-57.eu-north-1.compute.internal systemd[1]: Started MariaDB 10.3 database server.
<huge html blog deleted>
MariaDB slave has finished processing relay log
{
"result": {
"description": "",
"created_on": "0001-01-01T00:00:00Z",
"modified_on": "2020-12-26T18:18:38.282592Z",
"id": "93ff3SECRETSECRETSECRET9e6ad1ee3d",
"ttl": 30,
"proxied": true,
"enabled": true,
"name": "woo.deref.se",
"session_affinity": "none",
"session_affinity_attributes": {
"samesite": "Auto",
"secure": "Auto",
"drain_duration": 0
},
"steering_policy": "off",
"fallback_pool": "0ef15SECRETSECRETSECRET6686ac1e",
"default_pools": [
"5a3cc32SECRETSECRETSECRET7e2d9ce",
"0ef15SECRETSECRETSECRET6686ac1e"
],
"pop_pools": {},
"region_pools": {}
},
"success": true,
"errors": [],
"messages": []
}
promote_backup.sh is good for when primary needs to be rebooted or otherwise undergo maintenance:
#!/bin/bash
source site_vars.sh
# We want the primary to be read only for as much of this process
# as possible because we can't proceed until the backup has caught
# up with all the writes performed on the master
ssh primary "mysql -e 'SET GLOBAL read_only = ON;'"
# Wait for replication to catch up
./checkslave.sh
# We've caught up, stop replicating.
mysql -e "STOP SLAVE;"
sleep 1
mysql -e "RESET SLAVE ALL;"
sleep 5
# It's now safe to accept writes
mysql -e 'SET GLOBAL read_only = OFF;'
# Switch roles for who writes file system updates
# We're the only one expected to see any writes
# but no point in having lsyncd running on the primary
ssh primary "systemctl stop lsyncd"
systemctl start lsyncd
# Time for us to start replicating
ssh primary "mysql -e \"CHANGE MASTER TO master_host='primary', master_port=3306, master_user='replication_user', master_password='$REPLICATION_PASSWORD', master_use_gtid=current_pos;\""
ssh primary "mysql -e \"START SLAVE;\""
# Time to tell Cloudflare that traffic should be routed to this node.
# The primary is left as a fallback to this node just in case.
# We can do this doing an orderly switch. If this was a failover
# we'd have to get the primary out of the load balancer entirely
# until we had time to inspect whether it could safely resume master
# status.
./set_backup_first_in_lb.sh
Settings
It’s going to be useful for the two servers to both believe themselves to be woo.deref.se so that they can make curl-calls to that address to check their own status. Master node /etc/hosts therefore gets two additional rows:
10.0.0.2 backup # VPN IP for backup node
13.53.118.162 woo.deref.se # Public IP of master node, which Amazon EC2 keeps changing on reboot...
The backup node gets something similar:
10.0.0.1 primary
185.20.12.24 woo.deref.se
You will also need passwordless SSH between the root accounts of the two servers.
MariaDB on second node:
user = mysql
pid-file = /run/mysqld/mysqld.pid
socket = /run/mysqld/mysqld.sock
#port = 3306
basedir = /usr
datadir = /var/lib/mysql
tmpdir = /tmp
lc-messages-dir = /usr/share/mysql
#skip-external-locking
server_id=2
log-basename=woo2
binlog-format=mixed
relay_log=woo2-relay-log
log_bin=woo2-bin
read-only=0
bind-address = 0.0.0.0
Default site config file for Nginx on second node:
server {
listen 80 default_server;
listen [::]:80 default_server;
server_name woo.deref.se;
return 301 https://woo.deref.se$request_uri;
}
server {
listen 443 ssl default_server;
listen [::]:443 ssl default_server;
root /var/www/html;
# Add index.php to the list if you are using PHP
index index.php index.html index.htm index.nginx-debian.html;
server_name woo.deref.se;
include snippets/cloudflaressl.conf;
location / {
# First attempt to serve request as file, then
# as directory, then fall back to displaying a 404.
#try_files $uri $uri/ =404;
try_files $uri $uri/ /index.php?$args ;
}
location ~ \.php$ {
#NOTE: You should have "cgi.fix_pathinfo = 0;" in php.ini
include fastcgi_params;
fastcgi_intercept_errors on;
fastcgi_pass unix:/run/php/php7.4-fpm.sock;
#The following parameter can be also included in fastcgi_params file
fastcgi_param SCRIPT_FILENAME $document_root$fastcgi_script_name;
}
location ~* \.(js|css|png|jpg|jpeg|gif|ico)$ {
expires max;
log_not_found off;
}
location ~ /\.ht {
deny all;
}
}
I wonder why I had to add this manually to the first node to get Zabbix calls to /basic_status to work? Maybe it’s enabled by default on Ubuntu?
server {
listen 80 default_server;
server_name _;
location = /basic_status {
stub_status;
}
}
lsyncd.conf is tricky because on Ubuntu it’s supposed to be installed as /etc/lsyncd/lsyncd.conf.lua while it’s just plain /etc/lsyncd.conf on RHEL.
sync {
default.rsyncssh,
source="/var/www/html",
host="primary",
targetdir="/var/www/html/",
rsync = {
owner = true,
group = true,
chown = "nginx:nginx"
}
}
This was from Ubuntu, on RHEL it says host=”backup” and chown = “www-data:www-data” instead. This is confusing at first until your realize chown is executed on the other host. So it makes sense for RHEL to run “chown www-data:www-data” on the Ubuntu node which is the one that uses the user/group www-data. Not sure that was made super clear but it’s easy enough to test out. Or you can just run a single distro and not have to deal with that.
Sensible questions
Why not Galera? Multi-master without any need for pacemaker and fencing! Galera don’t work too well when the response time between nodes is highly variable: https://www.percona.com/blog/2018/11/15/how-not-to-do-mysql-high-availability-geographic-node-distribution-with-galera-based-replication-misuse/
Why two different Linux distros? I want to minimize common points of faiure. I could have tried this with a RHEL/FreeBSD combo but vulnerabilities that hit all Linux distros are rare. This is my balance between ease-of-use and availability.
Would I use this for real? With some more testing, sure. And keep in mind this isn’t intended to be the ultimate high availability solution for WooCommerce. That would be something like three datacenters with dedicated interconnect running three servers with the Pacemaker cluster manager, MariaDB with Galera for multi-master writes within each data center, LiteSpeed or maybe Nginx+Varnish for high speed serving of data. Though I have a bit of a hard time understanding why someone with that kind of money would use WooCommerce. It has a low barrier for entry – sure – but it’s ecommerce functionality inelegantly taped onto a blogging engine.