Galera to pacemaker switch

So I know in theory how to switch a database from one MariaDB server to another with extremely little downtime. It turns out to be trickier than I had thought in reality when the source is a Galera cluster but let’s go through it step by step:

Step 1

Choose a Galera node and make sure it has binary logging activated and log_slaves_updates = ON. Otherwise it won’t log things in a manner that can be sent to a slave that isn’t a Galera node and the log_slaves_updates-thing is to make it actually do send updates to the slave. I added log_slave_updates = ON to the slave as well because it’s not entirely clear how updates from the Galera node is viewed. I mean, in a sense all Galera nodes are both slaves and masters…

Also make sure the server id of the target node(temporary slave) doesn’t match any of the Galera nodes. Otherwise any Galera node will think that server_id = 1 is already caught up! That took me a couple of hours to figure out.

Step 2

Dump the database to be transferred from the Galera node you chose earlier:

mysqldump --single-transaction --master-data=1 --triggers --events --routines zabbix > zabbix.sql

Copy it to the target server.

Step 3

Create the necessary users and an empty database on the target server and then import the SQL-dump.

mysql zabbix < zabbix.sql

Step 4

Extract the replication coordinates from the top of the sql dump:

root@cluster3:~# head -30 zabbix.sql | grep MASTER
CHANGE MASTER TO MASTER_LOG_FILE='mysql-bin.000150', MASTER_LOG_POS=691605499;

Create an SQL-file with the full information required to start streaming updates from the Galera node using this data:

CHANGE MASTER TO MASTER_HOST='cluster3.svealiden.se', MASTER_PORT=3306, MASTER_USER="replication_user", MASTER_PASSWORD="SECRETREPLICATIONPASSWORD", MASTER_LOG_FILE='mysql-bin.000150', MASTER_LOG_POS=691605499;
SET GLOBAL replicate_do_db='zabbix';
START SLAVE;

The SET GLOBAL replicate_do_db=’zabbix’; thing is for situations where there is just one database that you want to bring from a source that has multiple databases.

Step 5

Wait for slave to catch up:

MariaDB [(none)]> SHOW SLAVE STATUS\G 
*************************** 1. row ***************************               
                Slave_IO_State: Waiting for master to send event  
                   Master_Host: cluster3.svealiden.se         
                   Master_User: replication_user                             
                   Master_Port: 3306                                         
                 Connect_Retry: 60                                           
               Master_Log_File: mysql-bin.000150
           Read_Master_Log_Pos: 797684680                                    
                Relay_Log_File: pacemaker01-relay-bin.000002
                 Relay_Log_Pos: 10010385                                     
         Relay_Master_Log_File: mysql-bin.000150
              Slave_IO_Running: Yes                                          
             Slave_SQL_Running: Yes
               Replicate_Do_DB: zabbix                                                                                                                    
           Replicate_Ignore_DB: 
            Replicate_Do_Table:                                              
        Replicate_Ignore_Table:      
       Replicate_Wild_Do_Table: 
   Replicate_Wild_Ignore_Table:                                              
                    Last_Errno: 0    
                    Last_Error: 
                  Skip_Counter: 0                                            
           Exec_Master_Log_Pos: 707563585                                                                                                                 
               Relay_Log_Space: 66709818                                                                                                                  
               Until_Condition: None                                         
                Until_Log_File:                                              
                 Until_Log_Pos: 0                                                                                                                                     Master_SSL_Allowed: No                                           
            Master_SSL_CA_File:       
            Master_SSL_CA_Path:                                              
               Master_SSL_Cert: 
             Master_SSL_Cipher:                                              
                Master_SSL_Key:                                              
         Seconds_Behind_Master: 21398

Note the Seconds_Behind_Master-part that tells you how far the slave has been able to catch up. As you get close to zero, get ready to stop writing data to the Galera cluster.

Step 6

Stop writing data to Galera. In my case this entailed changing two configuration files for Zabbix(zabbix-server and the Zabbix Web UI) and stopping the associated services.

root@monitor:~# systemctl stop zabbix-server && systemctl stop nginx

Then stop replicating data from the Galera cluster. Since I had set the slave to read_only mode I had to disable that as well:

MariaDB [(none)]> STOP SLAVE;
Query OK, 0 rows affected (0.014 sec)

MariaDB [(none)]> SET GLOBAL read_only = false;
Query OK, 0 rows affected (0.001 sec)

Then just start up the services now that they have been told to contact the new database server.

root@monitor:~# systemctl start zabbix-server && systemctl start nginx

Can you spot where the switch happened?

Can you spot where the two failed attempts at making this switch seamlessly happened?

Sidenote: this is an example of a sentence that would stand to gain from English having more syntactic options. Like Can you spot where the two failed attempts at making this switch seamlessly happened. Here it’s more clear that we are talking about X = making a switch seamlessly and how there were two failed attempts at making X happen. Compare that to the more nonsensical Can you spot where the two failed attempts at making this switch seamlessly happened. No, we’re not talking about failures seemlessly happening, the switch was supposed to be seemless.

So anyway… Seconds_Behind_Master is a great piece of data to read when you’re about to switch over to a replica that you think is caught up.

Use cases

This is useful for a lot of situations. Maybe you want to convert a 180GB database from MyISAM to InnoDB? Maybe you need to perform an upgrade where it’s safest to do a mysqldump->import?

The lower limit for how fast this switchover can be is the replication lag. If the slave is always 10 seconds behind, that is the absolute lowest limit for how fast you can make the switch. If you do all the steps manually that adds maybe 20 seconds. Script it and we’re down to a adding a few seconds on top of the replication lag. Short of banks and amazon.com, downtime measured in seconds(maybe multiples of ten seconds) isn’t a cause for great concern.