MySQL errors of my own making

Trying to break MySQL to give my colleagues some stuff to exercise their skills on. But it’s going… so-so.

I tried to reduce the size of the file ibdata1:

innodb_data_file_path = ibdata1:1M

No go. Turns out the minimum size is 5M.

2021-03-26 22:49:37 0 [ERROR] InnoDB: Tablespace size must be at least 5 MB
Fine, 5M it is then:

innodb_data_file_path = ibdata1:5M

Nope:

2021-03-26 22:50:08 0 [ERROR] InnoDB: The innodb_system data file './ibdata1' is of a different size 768 pages than the 320 pages specified in the .cnf file!

Well yeah… I want it to be smaller. Seems I have to enable autoextend again: https://dev.mysql.com/doc/refman/5.7/en/innodb-system-tablespace.html. Sounds weird, and sure enough: that only makes MariaDB ignore my 5M limit and go with the old size of 12M. Nice.

Oh, I also encountered a problem with MariaDB wanting to have 32000 open files with a small tablespace:

22:49:25 0 [Warning] Could not increase number of max_open_files to more than 16384 (request: 32184)

Sigh… Had to modify the systemd file /etc/systemd/system/mysql.service to say

LimitNOFILE=65536

And then reload systemd:

systemctl daemon-reload
systemctl restart mysql

Then I realized it was innodb_log_file_size that I wanted to change. Good times.
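
For the record, resizing the redo log is just a my.cnf change along these lines (the sizes here are made-up examples); on reasonably recent MySQL/MariaDB versions a clean shutdown and restart should be enough for InnoDB to recreate the log files at the new size:

[mysqld]
# the redo log, which is what I actually wanted to shrink -- not the system tablespace
innodb_log_file_size = 48M
innodb_log_files_in_group = 2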

MySQL safeguards

As we saw in part 1 of this series, MySQL is pretty robust in the face of hard resets as long as you use InnoDB as the underlying engine rather than MyISAM. Heck, InnoDB even held together with the cache setting “Writeback (unsafe)” chosen in Proxmox for the underlying disk. But an ib_logfile getting corrupted or going missing somehow leaves us in a pickle: some percentage of tables are likely to be impossible to export. So we had better set things up so that we have something to fall back on if that happens.

Let’s start with my favorite, Btrfs snapshots. Some background: Ext4 is pretty much the default file system for Linux today and it’s pretty darned good. I remember back in the day when we only had Ext2 and it was a big deal when Ext3 introduced journals that made recovery from crashes much faster. Now we’re at Ext4. Btrfs, on the other hand, is a completely new file system heavily inspired by ZFS. It’s copy-on-write, amongst other things, which is good for snapshots. So Ext4 is tried and tested, while Btrfs is feature-rich.

Let’s switch from Ext4 to Btrfs for a MySQL deployment. Note that I’m completely wiping the data stored in MySQL here! This is a test rig.
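
The switch itself is nothing fancy. Roughly what I do (device name, mount point and the exact initialization step are assumptions that depend on your distro and version):

systemctl stop mysql
umount /var/lib/mysql                        # assuming the datadir lives on its own disk/partition
mkfs.btrfs -f /dev/vdb                       # goodbye, old data
mount /dev/vdb /var/lib/mysql
btrfs subvolume create /var/lib/mysql/data   # a dedicated subvolume is what we will be snapshotting
chown -R mysql:mysql /var/lib/mysql
mysqld --initialize-insecure --user=mysql --datadir=/var/lib/mysql/data   # or mysql_install_db on MariaDB
systemctl start mysql

with datadir = /var/lib/mysql/data in my.cnf, of course.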

I’ll make cron run a script that makes snapshots of the MySQL data directory every five minutes. It could be done in a simpler way, but my script cleans up old snapshots automatically based on an assigned lifespan, which is handy. This is what we end up with after slightly less than an hour of running the standard workload script while taking snapshots every five minutes:

root@mysqlrecov1:/var/lib/mysql# ls -lh Snapshots/data 
total 0 
drwxr-xr-x 1 mysql mysql 388 Mar 21 22:54 data@auto-2020-03-21-2340_12h 
drwxr-xr-x 1 mysql mysql 388 Mar 21 22:54 data@auto-2020-03-21-2345_12h 
drwxr-xr-x 1 mysql mysql 388 Mar 21 22:54 data@auto-2020-03-21-2350_12h 
drwxr-xr-x 1 mysql mysql 388 Mar 21 22:54 data@auto-2020-03-21-2355_12h 
drwxr-xr-x 1 mysql mysql 388 Mar 21 22:54 data@auto-2020-03-22-0000_12h 
drwxr-xr-x 1 mysql mysql 388 Mar 21 22:54 data@auto-2020-03-22-0005_12h 
drwxr-xr-x 1 mysql mysql 388 Mar 21 22:54 data@auto-2020-03-22-0010_12h 
drwxr-xr-x 1 mysql mysql 388 Mar 21 22:54 data@auto-2020-03-22-0015_12h 
drwxr-xr-x 1 mysql mysql 388 Mar 21 22:54 data@auto-2020-03-22-0020_12h 
drwxr-xr-x 1 mysql mysql 388 Mar 21 22:54 data@auto-2020-03-22-0025_12h
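
My script does a bit more than this, but a stripped-down sketch of the idea looks something like the following (paths and the 12-hour lifespan encoded in the name match the listing above; the cleanup naively uses directory mtime where my real script parses the timestamp out of the snapshot name):

#!/bin/sh
# /usr/local/bin/mysql-snap.sh -- run from cron every five minutes
SRC=/var/lib/mysql/data                      # must itself be a Btrfs subvolume
DST=/var/lib/mysql/Snapshots/data
STAMP=$(date +%Y-%m-%d-%H%M)
# read-only snapshot of the live datadir
btrfs subvolume snapshot -r "$SRC" "$DST/data@auto-${STAMP}_12h"
# drop snapshots older than their 12-hour lifespan
find "$DST" -maxdepth 1 -name 'data@auto-*_12h' -mmin +720 | while read -r snap; do
    btrfs subvolume delete "$snap"
done

And the crontab entry:

*/5 * * * * root /usr/local/bin/mysql-snap.sh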

So let’s say something bad happened after the last snapshot was taken and we want to get data out of it. Can we just start a MySQL instance with the snapshot directory as the datadir? No.

I make all my snapshots read-only because they’re meant to be backups. The more immutable they are the better. MySQL can’t run against a read-only copy, but all we need to do is make a writable copy of the read-only snapshot we’re interested in.
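
Something along these lines, using one of the snapshots from the listing above (the socket, port and target path are just what I happened to pick, and the port only matters if the live instance is still running on 3306):

btrfs subvolume snapshot /var/lib/mysql/Snapshots/data/data@auto-2020-03-22-0025_12h /var/lib/mysql/restore   # no -r, so writable
chown -R mysql:mysql /var/lib/mysql/restore
mysqld --daemonize --user=mysql --datadir=/var/lib/mysql/restore --socket=/tmp/mysqld-restore.sock \
       --pid-file=/tmp/mysqld-restore.pid --port=3307
mysql -S /tmp/mysqld-restore.sock sysbench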

I guess I could have run mysqlcheck to demonstrate that things were okey, but everything was just fine. We could have taken whichever of these snapshots we wanted and gone back to that point in time. As much as I like this setup, I do recognize some weaknesses.

First, the backups are necessarily part of the same file system as the original, so file system corruption can leave us completely boned. There is a mechanism for sending snapshots from one Btrfs file system to another – even on another server – which can remedy that situation to some extent. But Btrfs snapshots are not great to write to tape, however you twist and turn things.

Second, we’re not actually saving snapshots that MySQL gives guarantees for. MySQL has to do a sort of crash recovery process when starting up these snapshots because they look just like what you would get if the server had crashed unexpectedly and left lots of databases open.

One way to remedy this is to use a backup tool like xtrabackup. There’s a version for MariaDB called mariabackup that’s basically the same; I’ll use xtrabackup in this case. We’ll make three backups, consisting of one full backup and two incremental backups. Then we’ll do a restore by applying each incremental backup to the full backup and then moving the result into the place MySQL expects data to be.
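
A rough sketch of that cycle (target directories are made up; mariabackup takes essentially the same options):

# one full backup, then two incrementals each based on the previous one
xtrabackup --backup --target-dir=/backup/full
xtrabackup --backup --target-dir=/backup/inc1 --incremental-basedir=/backup/full
xtrabackup --backup --target-dir=/backup/inc2 --incremental-basedir=/backup/inc1

# restore: prepare the full backup, apply the incrementals in order
# (--apply-log-only on every step except the last one)
xtrabackup --prepare --apply-log-only --target-dir=/backup/full
xtrabackup --prepare --apply-log-only --target-dir=/backup/full --incremental-dir=/backup/inc1
xtrabackup --prepare --target-dir=/backup/full --incremental-dir=/backup/inc2

# move the result into place with MySQL stopped and the datadir emptied
systemctl stop mysql
xtrabackup --copy-back --target-dir=/backup/full
chown -R mysql:mysql /var/lib/mysql
systemctl start mysql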

This has the benefit of saving cleaner backups, and you get them as distinct files that are easier to write to tape or back up to some external storage provider.

Now let’s try something a bit more robust, replication! I have that set up for my own databases but with an older method of keeping the various copies in sync. I’d like to try out GTID. I wonder if it’s enabled by default on this version of MySQL?

mysql> SHOW GLOBAL VARIABLES LIKE '%gtid%'; 
+----------------------------------+-------+ 
| Variable_name                    | Value | 
+----------------------------------+-------+ 
| binlog_gtid_simple_recovery      | ON    | 
| enforce_gtid_consistency         | OFF   | 
| gtid_executed                    |       | 
| gtid_executed_compression_period | 1000  | 
| gtid_mode                        | OFF   | 
| gtid_owned                       |       | 
| gtid_purged                      |       | 
| session_track_gtids              | OFF   | 
+----------------------------------+-------+ 
8 rows in set (0.00 sec)

Nope. Okey, I wonder if I can add it after the fact? Activating the feature after the fact can obviously be done, but will it work? Will slaves only see the modifications that happened after GTID was activated? We can test it. Let’s enable GTID and create a replication user.
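
On MySQL 5.7 that’s roughly a config change plus a user (user name and password here are made up, and if you can’t restart the master you can instead walk gtid_mode through OFF_PERMISSIVE and ON_PERMISSIVE online):

# my.cnf on the master, followed by a restart
[mysqld]
server-id                = 1
log-bin                  = mysql-bin
gtid_mode                = ON
enforce_gtid_consistency = ON

-- and in the mysql client on the master:
CREATE USER 'repl'@'%' IDENTIFIED BY 'secret';
GRANT REPLICATION SLAVE ON *.* TO 'repl'@'%';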

Now for a tricky move. It’s not strictly necessary since we know no writes are going on, but for the sake of doing things by the book, we flush all tables and lock them in one terminal window and then dump all databases in another. As long as the “FLUSH TABLES WITH READ LOCK;” session is open in MySQL, the lock is held and the dump is consistent.
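
That is, roughly:

-- terminal 1: take the global read lock and leave this session open
FLUSH TABLES WITH READ LOCK;

# terminal 2: dump everything while the lock is held
mysqldump --all-databases --triggers --routines --events > full_dump.sql

-- back in terminal 1 once the dump is done
UNLOCK TABLES;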

Then we copy the dump to mysqlrecov2, where we import it into a blank setup. Then we configure that server for replication with its own server-id, set up the replication and run a test to see if it works. We should be able to run our test script and find that the maximum id in the Purchase table (for instance) is the same on mysqlrecov1 and mysqlrecov2.
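
The slave side boils down to something like this (same made-up replication credentials as above; the dump from a GTID-enabled server contains a SET @@GLOBAL.GTID_PURGED statement, which is what lets auto-positioning pick up from the right spot):

# my.cnf on mysqlrecov2: its own server-id, plus the same GTID settings as on the master
[mysqld]
server-id                = 2
gtid_mode                = ON
enforce_gtid_consistency = ON

-- after importing the dump, point it at the master:
CHANGE MASTER TO
  MASTER_HOST='mysqlrecov1',
  MASTER_USER='repl',
  MASTER_PASSWORD='secret',
  MASTER_AUTO_POSITION=1;
START SLAVE;
SHOW SLAVE STATUS\G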

That seems to work fine. Let’s level this up one more notch. I’ll schedule snapshots to be made of the /var/lib/mysql/data directory every five minutes, just like on mysqlrecov1 before. This will be a variant of what we did before: the test script is running against mysqlrecov1 for a while and mysqlrecov2 is replicating all the changes in real time. Snapshots are being stored in /var/lib/mysql/Snapshots/data/ on mysqlrecov2. There are actually good reasons for not storing snapshots on mysqlrecov1, which we can get to later. For now, let’s just note that the relay log from mysqlrecov1 is ticking up nicely on mysqlrecov2.

Now let’s simulate a bonehead mistake. Something like deleting all customers in the database with an id below 5000? That should leave people quite sad. As we saw before, we can restore from a Btrfs snapshot, but we can lose up to 5 minutes of transactions. In fact, if we don’t recognize our bonehead mistake immediately and take MySQL down on mysqlrecov1, we could rack up even more transactions that will be lost when we go back to our earlier state.

Replication worked just as advertised. Within milliseconds of me deleting a bunch of customers at random on mysqlrecov1, mysqlrecov2 followed suit. As SHOW SLAVE STATUS illustrated, I kept the test script running for a while longer, so more and more GTID sets full of legitimate transactions were stored on mysqlrecov2.

The test script started printing a lot more error messages after some random customers got deleted, but it kept running anyway. So what are we going to do now? Well, restoring the last Btrfs snapshot from before the deletion is a good start. We’ll turn off mysqlrecov1 and set up mysqlrecov2 so that it can only be reached from the server itself, to keep people from tinkering with the data while we’re working on it.

Seems I chose the right snapshot: we have customers with ids below 5000. Now let’s look at the relay log from the data_borked directory. We don’t want MySQL to use that directory any more, because the data in the ibdata/ib_logfile files and the sysbench directory reflects the state where thousands of customers are missing. But the relay log file is still very valuable. If we run mysqlbinlog --verbose /var/lib/mysql/data_borked/mysqlrecov1-relay-log.00005 we see all the changes made to the database since replication began. This data gets rotated out eventually, but here neither time nor space constraints caused any of the changes to be lost before the “mistake” was discovered. To make this work, though, we only really need the changes made after the point in time the latest good snapshot covers.

So how does the binlog look?

SET @@SESSION.GTID_NEXT= 'b6efad3f-6c71-11ea-9af3-eaa32fa23308:1586'/*!*/; 
# at 472 
#200328 22:41:23 server id 1  end_log_pos 337 CRC32 0x633797ec  Query   thread_id=3     exec_time=0     error_code=0 
SET TIMESTAMP=1585435283/*!*/; 
SET @@session.pseudo_thread_id=3/*!*/; 
SET @@session.foreign_key_checks=1, @@session.sql_auto_is_null=0, @@session.unique_checks=1, @@session.autocommit=1/*!*/; 
SET @@session.sql_mode=1436549152/*!*/; 
SET @@session.auto_increment_increment=1, @@session.auto_increment_offset=1/*!*/; 
/*!\C utf8 *//*!*/; 
SET @@session.character_set_client=33,@@session.collation_connection=33,@@session.collation_server=8/*!*/; 
SET @@session.lc_time_names=0/*!*/; 
SET @@session.collation_database=DEFAULT/*!*/; 
BEGIN 
/*!*/; 
# at 550 
#200328 22:41:23 server id 1  end_log_pos 399 CRC32 0x2b1e9803  Table_map: `sysbench`.`Customer` mapped to number 108 
# at 612 
#200328 22:41:23 server id 1  end_log_pos 480 CRC32 0xc9c162b2  Write_rows: table id 108 flags: STMT_END_F 
 
BINLOG ' 
k9J/XhMBAAAAPgAAAI8BAAAAAGwAAAAAAAMACHN5c2JlbmNoAAhDdXN0b21lcgAEAw8DDwT/AP8A 
CAOYHis= 
k9J/Xh4BAAAAUQAAAOABAAAAAGwAAAAAAAEAAgAE//CF4AMAFGNiZnVkbmJmbWxtd2h1YWptbHht 
AAAAAA94c3dqZnh1ZmhwZ3NpbWGyYsHJ 
'/*!*/; 
### INSERT INTO `sysbench`.`Customer`
 ### SET 
###   @1=254085 
###   @2='cbfudnbfmlmwhuajmlxm' 
###   @3=0 
###   @4='xswjfxufhpgsima' 
# at 693 
#200328 22:41:23 server id 1  end_log_pos 511 CRC32 0x0a071603  Xid = 46 
COMMIT/*!*/; 
# at 724 
#200328 22:41:24 server id 1  end_log_pos 576 CRC32 0x05680f0e  GTID    last_committed=1        sequence_number=2       rbr_only=yes 
/*!50718 SET TRANSACTION ISOLATION LEVEL READ COMMITTED*//*!*/; 
SET @@SESSION.GTID_NEXT= 'b6efad3f-6c71-11ea-9af3-eaa32fa23308:1587'/*!*/; 
# at 789 
#200328 22:41:24 server id 1  end_log_pos 654 CRC32 0x79a206c9  Query   thread_id=12    exec_time=0     error_code=0 
SET TIMESTAMP=1585435284/*!*/; 
BEGIN 
/*!*/; 
# at 867 
#200328 22:41:24 server id 1  end_log_pos 716 CRC32 0xa2bb0bb0  Table_map: `sysbench`.`Customer` mapped to number 108 
# at 929 
#200328 22:41:24 server id 1  end_log_pos 797 CRC32 0xf9f2cb8b  Write_rows: table id 108 flags: STMT_END_F
<...>

Okey, slightly obscure. Can we find our DELETE FROM Customer WHERE id < 5000? No. We only find thousands of these:

SET @@SESSION.GTID_NEXT= 'b6efad3f-6c71-11ea-9af3-eaa32fa23308:26067'/*!*/;
# at 14852899
#200328 23:02:33 server id 1  end_log_pos 14852762 CRC32 0xb87efd39     Query   thread_id=4     exec_time=0     error_code=0
SET TIMESTAMP=1585436553/*!*/;
BEGIN
/*!*/;
# at 14852975
#200328 23:02:33 server id 1  end_log_pos 14852825 CRC32 0x6a5f0dd7     Table_map: `sysbench`.`Purchase_list` mapped to number 110
# at 14853038
#200328 23:02:33 server id 1  end_log_pos 14852894 CRC32 0x53b1b9b6     Delete_rows: table id 110 flags: STMT_END_F

BINLOG '
idd/XhMBAAAAPwAAANmi4gAAAG4AAAAAAAEACHN5c2JlbmNoAA1QdXJjaGFzZV9saXN0AAQDAwMD
AADXDV9q
idd/XiABAAAARQAAAB6j4gAAAG4AAAAAAAEAAgAE//Dw5AEAlMQAAKJQAAAEAAAA8PLkAQCUxAAA
n3MAAAYAAAC2ubFT
'/*!*/;
### DELETE FROM `sysbench`.`Purchase_list`
### WHERE
###   @1=124144
###   @2=50324
###   @3=20642
###   @4=4
### DELETE FROM `sysbench`.`Purchase_list`

So what is the latest transaction before the first of these thousands of DELETE operations? Well, I know approximately what time we’re interested in, and then we can just look for DELETE FROM `sysbench`.`Customer` statements.

So this probably contains the data we’re interested in:

###   @2='trirwseikshnmaeuqgzr' 
###   @3=3522 
###   @4='peahbpigywdapfo' 
# at 14574860 
#200328 23:01:06 server id 1  end_log_pos 14574678 CRC32 0xab48065f     Xid = 758546 
COMMIT/*!*/;
# at 14574891 
#200328 23:01:07 server id 1  end_log_pos 14574743 CRC32 0x18eab10c     GTID    last_committed=24333    sequence_number=24334   rbr_only=yes 
/*!50718 SET TRANSACTION ISOLATION LEVEL READ COMMITTED*//*!*/; 
SET @@SESSION.GTID_NEXT= 'b6efad3f-6c71-11ea-9af3-eaa32fa23308:25919'/*!*/; 
# at 14574956 
#200328 23:01:07 server id 1  end_log_pos 14574821 CRC32 0xe10dc87f     Query   thread_id=16    exec_time=0     error_code=0 
SET TIMESTAMP=1585436467/*!*/; 
SET @@session.foreign_key_checks=0/*!*/; 
BEGIN 
/*!*/; 
# at 14575034 
#200328 23:01:07 server id 1  end_log_pos 14574883 CRC32 0x6e71054b     Table_map: `sysbench`.`Customer` mapped to number 108 
# at 14575096 
#200328 23:01:07 server id 1  end_log_pos 14583060 CRC32 0x65200e62     Delete_rows: table id 108

All right, what is the last state found in the last good snapshot? SHOW SLAVE STATUS says that Read_Master_Log_Pos is 14507560. That’s a reasonable number, given that we just found the marker “end_log_pos 14574678” just above where all the bad deletes started. It’s important to know whether 14507560 is an end_log_pos or a start position. Searching through the binlog output reveals that it is an end_log_pos:

###   @3=1682 
###   @4='vlntfrggrwhykju' 
# at 14507742 
#200328 23:00:00 server id 1  end_log_pos 14507560 CRC32 0x333e7f0b     Xid = 753352 
COMMIT/*!*/; 
# at 14507773

So we will want to replay from the start position 14507773, which comes directly after the number we found in SHOW SLAVE STATUS, and the final position to include is 14574678.
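
So, something like this (same relay log file and --skip-gtids trick as in the command further down; the file name is my assumption):

mysqlbinlog --skip-gtids --start-position=14507773 --stop-position=14574678 \
    /var/lib/mysql/data_borked/mysqlrecov2-relay-bin.000005 > before_deletes.sql
mysql < before_deletes.sql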

Okey, so we were able to apply parts of the binlog without running into “DUPLICATE KEY” errors or losing lots of customer data. Shall we try to apply the transactions after the erroneous deletes? Should be as simple as replaying the binlog from position 14766269 to the end.

root@mysqlrecov2:/var/lib/mysql/data_borked# mysqlbinlog --skip-gtids --start-position=14766269 mysqlrecov2-relay-bin.000005 > after_deletes.sql                                   
root@mysqlrecov2:/var/lib/mysql/data_borked# mysql < after_deletes.sql  
root@mysqlrecov2:/var/lib/mysql/data_borked#

Seems okey. According to the output from the test script, one of the last transactions was a purchase registered to customer 8786 with a cost of 351.

root@mysqlrecov2:/var/lib/mysql/data_borked# mysql 
Welcome to the MySQL monitor.  Commands end with ; or \g. 
Your MySQL connection id is 9 
Server version: 5.7.29-0ubuntu0.18.04.1-log (Ubuntu) 
 
Copyright (c) 2000, 2020, Oracle and/or its affiliates. All rights reserved. 
 
Oracle is a registered trademark of Oracle Corporation and/or its 
affiliates. Other names may be trademarks of their respective 
owners. 
 
Type 'help;' or '\h' for help. Type '\c' to clear the current input statement. 
 
mysql> USE sysbench;        
Reading table information for completion of table and column names 
You can turn off this feature to get a quicker startup with -A 
 
Database changed 
mysql> SELECT * FROM Purchase WHERE customer_id = 8786; 
+-------+-------------+------+--------+ 
| id    | customer_id | cost | weight | 
+-------+-------------+------+--------+ 
| 58333 |        8786 |  351 |    745 | 
+-------+-------------+------+--------+ 
1 row in set (0.00 sec) 
 
mysql> 

If this had been a live system I would have been okey with bringing this online again. Since this was a small database, a mysqldump and restore back on the master would work. For a bigger system, promoting the fixed slave and making the previous master into the new slave is maybe the better option. Anyway, the test script agrees that the data is correct:

cjp@kdemain:~/sysbench$ bin/python3.7 ownsysbench.py --hosts 192.168.1.155 --check true 
192.168.1.155 
Connection to 192.168.1.155: <mysql.connector.connection.MySQLConnection object at 0x7f05b26329d0> by X 
self.conn.in_transaction: False 
Checking correctness 
Numbers check out on a high level. Checking each customer.
cjp@kdemain:~/sysbench$

Oh right, I promised to get back to why we would want to avoid making Btrfs snapshots on the master server. Btrfs and ZFS are not considered ideal for database storage. See for instance the wonderfully titled blog post https://blog.pgaddict.com/posts/friends-dont-let-friends-use-btrfs-for-oltp

It’s a few years old by now, but I don’t doubt that Btrfs is still terrible for your database storage if you want to maximize throughput. My real-world databases see maybe 10-20 queries per second. My backup software can peak at hundreds per second (I blame the Java ORM layer, not me, the programmer) for short bursts. Btrfs doesn’t slow that down noticeably. If someone asks me to set up a system that sees sustained loads of 5000 transactions per second, I’ll ditch Btrfs in an instant on the master and only use it on the slaves. Slaves typically don’t do much work; they just apply the changes that result from all the work done on the master.

Anyway, I think those are the building blocks for safeguarding MySQL databases. There are some additional tweaks, like having one slave intentionally trail behind the master a certain number of minutes to make recovery from bonehead mistakes easier. There are also more synchronous solutions like Galera (I’ve used Percona’s XtraDB Cluster variant) to minimize the window during which a write to the master might be lost because it hasn’t reached any slaves yet.

MySQL replication notes

NOTE! These are mostly for my own reference while working on this post.

root@mysqlrecov2:~# mysqlbinlog /var/lib/mysql/data/mysqlrecov2-relay-bin.000002 | less
<gibberish>

Well that didn’t produce very readable information. This turned out to be better for viewing:

root@mysqlrecov2:~# mysqlbinlog --verbose /var/lib/mysql/data/mysqlrecov2-relay-bin.000002 | less
#200322 21:14:57 server id 1  end_log_pos 894 CRC32 0x8df90820  Query   thread_id=9     exec_time=0     error_code=0 
SET TIMESTAMP=1584911697/*!*/; 
BEGIN 
/*!*/; 
# at 1107 
#200322 21:14:57 server id 1  end_log_pos 956 CRC32 0x18b0e814  Table_map: `sysbench`.`Customer` mapped to number 116 
# at 1169 
#200322 21:14:57 server id 1  end_log_pos 1037 CRC32 0xa46a830a         Write_rows: table id 116 flags: STMT_END_F  
BINLOG ' 
UdV3XhMBAAAAPgAAALwDAAAAAHQAAAAAAAMACHN5c2JlbmNoAAhDdXN0b21lcgAEAw8DDwT/AP8A 
CBTosBg= 
UdV3Xh4BAAAAUQAAAA0EAAAAAHQAAAAAAAEAAgAE//B64AMAFHZ5YmdveWd0bWhwdHd6eGd3bWZy 
AAAAAA9maWhmeW9sYndha21wYm0Kg2qk 
'/*!*/; 
### INSERT INTO `sysbench`.`Customer` 
### SET 
###   @1=254074 
###   @2='vybgoygtmhptwzxgwmfr' 
###   @3=0 
###   @4='fihfyolbwakmpbm' 
# at 1250

How do I find the current log position? Or some way of figuring out which position in the binlog a specific file system snapshot matches. This?

mysql> SHOW MASTER STATUS; 
+------------------+----------+--------------+------------------+---------------------------------------------------------------------------------------+ 
| File             | Position | Binlog_Do_DB | Binlog_Ignore_DB | Executed_Gtid_Set                                                                     | 
+------------------+----------+--------------+------------------+---------------------------------------------------------------------------------------+ 
| mysql-bin.000001 |      786 |              |                  | 1176771d-6c7e-11ea-a68b-1a389890e7c2:1-3, 
b6efad3f-6c71-11ea-9af3-eaa32fa23308:1-2769 | 
+------------------+----------+--------------+------------------+---------------------------------------------------------------------------------------+ 
1 row in set (0.00 sec) 

But it says this at the end of mysqlrecov2-relay-bin.000002…

#200322 21:23:07 server id 1  end_log_pos 1739591 CRC32 0xa5d8a74e      Xid = 82080

So what’s all this about mysql-bin.000001? And position 786 is suspect. Well, I guess File and Position are separate from GTID, so I shouldn’t be looking at File and Position because I use GTID? But I can’t very well make out anything sequential from a GTID. They’re UUIDs.

Okey, okey. Now I get it. So we have a GTID for the data synced from the master, that’s _b6efad3f-6c71-11ea-9af3-eaa32fa23308:1-2769_ and we can find matching information in mysqlrecov2-relay-bin.000002. Indeed the last segment in that file reads:

last_committed=2766     sequence_number=2769

So that checks out. The other GTID refers to the local machine, and so we find that information in /var/log/mysql/mysql-bin.000001. Notice that this is not a binlog we replicated in from the master, so there’s nothing “relay” about this file. This GTID set only contains the part where I added the sysbench user so that I could run the script against the slave to make sure it was consistent. I’m not replicating the mysql database since I don’t know if that’s safe.

All right, let’s break the database and try to recover using a combination of snapshots and binlog.

mysql> SET FOREIGN_KEY_CHECKS=0; 
Query OK, 0 rows affected (0.00 sec) 
 
mysql> DROP FROM Customer WHERE id < 5000;                 
ERROR 1064 (42000): You have an error in your SQL syntax; check the manual that corresponds to your MySQL server version for the right syntax to use near 'FROM Customer WHERE id < 5000' at l
ine 1 
mysql> DELETE FROM Customer WHERE id < 5000;     
Query OK, 4285 rows affected (4.87 sec)

Let’s see what the sanity check says:

192.168.1.154 
Connection to 192.168.1.154: <mysql.connector.connection.MySQLConnection object at 0x7fa1487cf8d0> by X 
self.conn.in_transaction: False 
Checking correctness 
Sums of credit(753818484), costs(-20278036 and deposits(787452535) don't check out. Something is wrong. 
Credit exists for 1 but no such customer exists. 
Credit exists for 2 but no such customer exists. 
Credit exists for 4 but no such customer exists. 
Credit exists for 5 but no such customer exists. 
Credit exists for 6 but no such customer exists. 
Credit exists for 9 but no such customer exists. 
Credit exists for 10 but no such customer exists. 
Credit exists for 11 but no such customer exists. 
Credit exists for 14 but no such customer exists. 
Credit exists for 15 but no such customer exists.
 <...>

And so on for thousands of lines. Found the last bit before the big delete block:

### INSERT INTO `sysbench`.`Product` 
### SET 
###   @1=57331 
###   @2='awcioutbwumewtopjnhx' 
###   @3=54 
###   @4=9 
# at 15852327 
#200323 21:54:32 server id 1  end_log_pos 15852145 CRC32 0x4918e08a     Xid = 731899 
COMMIT/*!*/; 
# at 15852358 
#200323 21:54:27 server id 1  end_log_pos 15852210 CRC32 0x2b6858df     GTID    last_committed=32482    sequence_number=32483   rbr_only=yes 
/*!50718 SET TRANSACTION ISOLATION LEVEL READ COMMITTED*//*!*/; 
SET @@SESSION.GTID_NEXT= 'b6efad3f-6c71-11ea-9af3-eaa32fa23308:32483'/*!*/; 
# at 15852423 
#200323 21:54:27 server id 1  end_log_pos 15852288 CRC32 0xe0637123     Query   thread_id=93    exec_time=0     error_code=0 
SET TIMESTAMP=1585000467/*!*/; 
SET @@session.foreign_key_checks=0/*!*/; 
BEGIN 
/*!*/; 
# at 15852501

So the sequence number is 32483. Let’s start up the latest snapshot and see what the last sequence number was there.
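
Starting the snapshot copy as a second instance next to the running one goes something like this (the snapshot name is a placeholder; socket and pid file match what I use below, and the port just has to be a free one):

btrfs sub snap /var/lib/mysql/Snapshots/data/data@auto-<timestamp> /var/lib/mysql/backup
chown -R mysql:mysql /var/lib/mysql/backup
mysqld --daemonize --user=mysql --pid-file=/tmp/mysqld2.pid --socket=/tmp/mysqld2.sock \
       --datadir=/var/lib/mysql/backup --port=3307 --skip-slave-start
mysql -S /tmp/mysqld2.sock -e 'SHOW MASTER STATUS\G'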

mysql> SHOW MASTER STATUS; 
+------------------+----------+--------------+------------------+----------------------------------------------------------------------------------------+ 
| File             | Position | Binlog_Do_DB | Binlog_Ignore_DB | Executed_Gtid_Set                                                                      | 
+------------------+----------+--------------+------------------+----------------------------------------------------------------------------------------+ 
| mysql-bin.000003 |      194 |              |                  | 1176771d-6c7e-11ea-a68b-1a389890e7c2:1-3, 
b6efad3f-6c71-11ea-9af3-eaa32fa23308:1-19756 | 
+------------------+----------+--------------+------------------+----------------------------------------------------------------------------------------+ 
1 row in set (0.00 sec)

All right, between 19756 and 32483 is the kosher sequence to replay:

root@mysqlrecov2:/var/lib/mysql# mysqlbinlog --skip-gtids --start-position=19756 --stop-position=32483 /var/lib/mysql/data/mysqlrecov2-relay-bin.000002 > replay.sql 
ERROR: Error in Log_event::read_log_event(): 'read error', data_len: 2054581866, event_type: 103
ERROR: Could not read entry at offset 19756: Error in log format or read error.

Huh? Oh for the love of… I have to use log_pos instead of the sequence number? All right, so 32483 has end_log_pos 15852288 and 19756 sits at “# at 10105794”:

# at 10105794 
#200323 21:50:01 server id 1  end_log_pos 10105646 CRC32 0x6ab793c0     GTID    last_committed=19751    sequence_number=19756   rbr_only=yes 
/*!50718 SET TRANSACTION ISOLATION LEVEL READ COMMITTED*//*!*/; 
SET @@SESSION.GTID_NEXT= 'b6efad3f-6c71-11ea-9af3-eaa32fa23308:19756'/*!*/;

Let’s try this again:

root@mysqlrecov2:/var/lib/mysql# mysqlbinlog --skip-gtids --start-position=10105794 --stop-position=15852288 /var/lib/mysql/data/mysqlrecov2-relay-bin.000002 > replay.sql

All right, let’s try it.

root@mysqlrecov2:/var/lib/mysql# mysql -S /tmp/mysqld2.sock < replay.sql  
ERROR 1062 (23000) at line 35: Duplicate entry '54389' for key 'PRIMARY'         

Okey, we must have brought along one transaction too many. Let’s see if I can find the next position:

root@mysqlrecov2:/var/lib/mysql# less replay.sql  
root@mysqlrecov2:/var/lib/mysql# mysqlbinlog --verbose /var/lib/mysql/data/mysqlrecov2-relay-bin.000002 | less  
root@mysqlrecov2:/var/lib/mysql# mysqlbinlog --skip-gtids --start-position=10106096 --stop-position=15852288 /var/lib/mysql/data/mysqlrecov2-relay-bin.000002 > replay2.sql      
root@mysqlrecov2:/var/lib/mysql# mysql -S /tmp/mysqld2.sock < replay2.sql 

Success! Maybe… Let’s try the sanity check:

cjp@kdemain:~/sysbench$ bin/python3.7 ownsysbench.py --hosts 192.168.1.155 --check true 
192.168.1.155 
Connection to 192.168.1.155: <mysql.connector.connection.MySQLConnection object at 0x7fb052407810> by X 
self.conn.in_transaction: False 
Checking correctness 
Numbers check out on a high level. Checking each customer.
cjp@kdemain:~/sysbench$ 

Aw, yeah! So let’s dump out what we’ve got here as a valid snapshot:

root@mysqlrecov2:/var/lib/mysql# mysqld --daemonize --log-error-verbosity=3 --pid-file=/tmp/mysqld2.pid --socket=/tmp/mysqld2.sock --datadir=/var/lib/mysql/backup --port=3306 
root@mysqlrecov2:/var/lib/mysql# mysqldump sysbench > after_bin1.sql 
mysqldump: Got error: 2002: Can't connect to local MySQL server through socket '/var/run/mysqld/mysqld.sock' (2) when trying to connect 
root@mysqlrecov2:/var/lib/mysql# mysqldump -S /tmp/mysqld2.sock sysbench > after_bin1.sql 
Warning: A partial dump from a server that has GTIDs will by default include the GTIDs of all transactions, even those that changed suppressed parts of the database. If you don't want to res
tore GTIDs, pass --set-gtid-purged=OFF. To make a complete dump, pass --all-databases --triggers --routines --events.  
root@mysqlrecov2:/var/lib/mysql# mysqldump --set-gtid-purged=OFF -S /tmp/mysqld2.sock sysbench > after_bin1_no_gtid.sql

Now let’s see if we can apply the events that followed after the “unfortunate” deletion of customers.

# at 16050579 
#200323 21:54:32 server id 1  end_log_pos 16050431 CRC32 0xd46e91ef     GTID    last_committed=32482    sequence_number=32484   rbr_only=yes 
/*!50718 SET TRANSACTION ISOLATION LEVEL READ COMMITTED*//*!*/; 
SET @@SESSION.GTID_NEXT= 'b6efad3f-6c71-11ea-9af3-eaa32fa23308:32484'/*!*/; 
# at 16050644 
#200323 21:54:32 server id 1  end_log_pos 16050509 CRC32 0x6f245997     Query   thread_id=66    exec_time=0     error_code=0 
SET TIMESTAMP=1585000472/*!*/; 
SET @@session.foreign_key_checks=1/*!*/;

Okey, so we only need start-position this time since we want everything from the end of the DELETE block:

root@mysqlrecov2:/var/lib/mysql# mysqlbinlog --skip-gtids --start-position=16050579 /var/lib/mysql/data/mysqlrecov2-relay-bin.000002 > replay_last.sql                                    
root@mysqlrecov2:/var/lib/mysql# mysql -S /tmp/mysqld2.sock < replay_last.sql  
root@mysqlrecov2:/var/lib/mysql#

And is this consistent?

cjp@kdemain:~/sysbench$ bin/python3.7 ownsysbench.py --hosts 192.168.1.155 --check true 
192.168.1.155 
Connection to 192.168.1.155: <mysql.connector.connection.MySQLConnection object at 0x7fb1bd586710> by X 
self.conn.in_transaction: False 
Checking correctness 
Numbers check out on a high level. Checking each customer. 
cjp@kdemain:~/sysbench$

The last legit operations (after I did the big “accidental” delete but just before I brought down the database master to contain the damage) were:

Purchase: Deducted 958 from customer 218820 landing at 1442 
Deleted (83708,) 
Credit id 270922: Added 6224 to 261395 
Deleted (98105,) 
Credit id 270952: Added 7045 to 261281 
Purchase: Deducted 630 from customer 91130 landing at 2144

So is that last purchase now in the restored copy?

mysql> SELECT * FROM Purchase WHERE customer_id = 91130; 
+-------+-------------+------+--------+
| id    | customer_id | cost | weight |
+-------+-------------+------+--------+
| 55193 |       91130 |  630 |    550 |
+-------+-------------+------+--------+
1 row in set (0.00 sec)

And do we have customers with an ID below 5000?

mysql> SELECT COUNT(*) FROM Customer WHERE id < 5000;  
+----------+
| COUNT(*) |
+----------+
|     4285 |
+----------+
1 row in set (0.00 sec)

Indeed we do! So we have successfully cut out a particularly disastrous sequence of operations, piecing together a coherent view of customers, purchases, credits and so on. And it only took me an hour and a half. Probably not the response time considered ideal by a trader seeing 100 or so purchases per second.

Note that there’s no guarantee that this method will work out this well in practice. Indeed I may well have lost purchases by doing this but in the interest of un-screwing things it would probably have been the best way forward. So don’t view snapshots and binlogs as a license to clown around with a production database.

On a personal note: would it not be beneficial to have less brittle data structures than this? In a relational database we might have the state of Texas as a row in the States table, and if it goes missing then every row representing customers from Texas will generate an error on read. A foreign key constraint error, presumably. Maybe MongoDB goes too far in the other direction, but DUPLICATE KEY and FOREIGN KEY CONSTRAINT errors make relational databases rather sensitive. I guess this is a struggle each IT department has to have with their respective accounting department.

MySQL crash recovery

Let’s crash some MySQL databases, shall we? Well, that’s harder than it seems, especially if you’ve had to recover crashed databases on production servers. If they (set up to be very reliable) can crash during a network failure or hard reset, how come I can’t replicate the same when I tweak all settings to be at their most unreliable? Maybe the uptime of the server, the size of the database on disk, the number of distinct databases and so on play a role? Well, let’s try to investigate MySQL crash recovery as best we can.

Hmm, maybe we should first establish some method of simulating a workload? Well, I have a script that simulates the sort of workload you would see in high-volume e-commerce. It makes full use of transactions to make multiple updates to multiple tables happen atomically (i.e. completely or not at all). Basically it pretends that there are customers who add credit to their account and then make purchases using that credit. Let’s try it out.
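
For context, each simulated purchase boils down to a transaction of roughly this shape (the Purchase columns are the real ones you’ll see dumped later; the balance column on Customer is a guess on my part, and the real script also touches Purchase_list, Product and Credit):

START TRANSACTION;
-- deduct the cost from the customer's balance...
UPDATE Customer SET balance = balance - 351 WHERE id = 8786;
-- ...and record the purchase; either both happen or neither does
INSERT INTO Purchase (customer_id, cost, weight) VALUES (8786, 351, 745);
COMMIT;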

The script ownsysbench.py starts one process per IP or hostname provided, and each process runs 12 threads that each simulate the workload. As we saw in mytop, the number of queries per second varies between 700 and 100. This is on a dual-core VM running on an Intel Atom server, so I’m kind of impressed. But in fairness, I ran this on an ext4 file system with nobarrier on and relaxed commit rules (https://dev.mysql.com/doc/refman/5.7/en/innodb-parameters.html#sysvar_innodb_flush_log_at_trx_commit). Let’s try it again with the standard barrier for ext4 and default commit rules. Oh, and maybe I should turn off the “Writeback (unsafe)” cache setting for the disk in Proxmox? That probably gave this run a performance advantage. Okey, so no extra disk caching, default flushing of transactions to disk, and ext4 has its barrier again.
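
For reference, the “relaxed” run amounted to roughly this, while the safe run is simply the defaults (barriers on, flush on every commit, no extra host-side caching); device and mount point are examples:

# /etc/fstab -- ext4 without write barriers (don't do this where you care about the data)
/dev/vda1  /var/lib/mysql  ext4  defaults,nobarrier  0  2

# my.cnf -- write out the redo log once a second instead of flushing on every commit
[mysqld]
innodb_flush_log_at_trx_commit = 2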

So 600 to 800 queries per second. That’s not so bad. It’s not like it’s a tiny database either:

Well, 90 MB is kind of tiny but 200 000 customers is on the big side. With a smaller database we get up to 1200 and 1300 queries per second. Now, the point here is actually just to get a feel for what kind of load we’re putting on the system when we tweak variables. Even with all safeties on we have several hundred queries per second, of which many are INSERT and UPDATE.

So one could be forgiven for thinking that rebooting the server would cause problems. Whenever you pull the plug there will probably be a couple of transactions running. Let’s try it with these safe settings. I’m using a slightly smaller database size but we can try the big one later on:

I have set MySQL to not start automatically, to make it easier for me to spot error states. But as you can see, MySQL starts up without issue. It’s only when we look in error.log (it’s a general log; not really sure why they named it that) that we see that some recovery work had to be done. But it was entirely automatic.

5 transaction(s) which must be rolled back or cleaned up in total 6 row operations to undo 
2020-03-21T21:19:59.885968Z 0 [Note] InnoDB: Trx id counter is 373760 
2020-03-21T21:19:59.885997Z 0 [Note] InnoDB: Starting an apply batch of log records to the database...
<...>
2020-03-21T21:20:00.715726Z 0 [Note] InnoDB: Rolling back trx with id 373313, 1 rows to undo 
2020-03-21T21:20:00.715914Z 0 [Note] InnoDB: Removed temporary tablespace data file: "ibtmp1" 
2020-03-21T21:20:00.715974Z 0 [Note] InnoDB: Creating shared tablespace for temporary tables 
2020-03-21T21:20:00.716079Z 0 [Note] InnoDB: Setting file './ibtmp1' size to 12 MB. Physically writing the file full; Please wait ... 
2020-03-21T21:20:00.724026Z 0 [Note] InnoDB: Rollback of trx with id 373313 completed 
2020-03-21T21:20:00.724074Z 0 [Note] InnoDB: Rolling back trx with id 373312, 1 rows to undo 
2020-03-21T21:20:00.731498Z 0 [Note] InnoDB: Rollback of trx with id 373312 completed 
2020-03-21T21:20:00.731537Z 0 [Note] InnoDB: Rolling back trx with id 373309, 1 rows to undo 
2020-03-21T21:20:00.734520Z 0 [Note] InnoDB: Rollback of trx with id 373309 completed 
2020-03-21T21:20:00.734557Z 0 [Note] InnoDB: Rolling back trx with id 373286, 1 rows to undo 
2020-03-21T21:20:00.737645Z 0 [Note] InnoDB: Rollback of trx with id 373286 completed 
2020-03-21T21:20:00.737681Z 0 [Note] InnoDB: Rolling back trx with id 373213, 2 rows to undo 
2020-03-21T21:20:00.890034Z 0 [Note] InnoDB: File './ibtmp1' size is now 12 MB. 
2020-03-21T21:20:00.891339Z 0 [Note] InnoDB: 96 redo rollback segment(s) found. 96 redo rollback segment(s) are active. 
2020-03-21T21:20:00.891411Z 0 [Note] InnoDB: 32 non-redo rollback segment(s) are active. 
2020-03-21T21:20:00.892242Z 0 [Note] InnoDB: Waiting for purge to start 
2020-03-21T21:20:00.892640Z 0 [Note] InnoDB: Rollback of trx with id 373213 completed
2020-03-21T21:20:00.892756Z 0 [Note] InnoDB: Rollback of non-prepared transactions completed

mysqlcheck gave the tables a clean bill of health and the internal consistency is correct according to the benchmark software itself:

cjp@kdemain:~/sysbench$ bin/python3.7 ownsysbench.py --hosts 192.168.1.154 --check true                 
192.168.1.154 
Connection to 192.168.1.154: <mysql.connector.connection.MySQLConnection object at 0x7fbbda456710> by X 
self.conn.in_transaction: False 
Checking correctness 
Numbers check out on a high level. Checking each customer.
cjp@kdemain:~/sysbench$ 

So I’ve tried turning off ext4’s barrier (letting writes be done a bit more sloppily), turning off the flush of InnoDB logs on transaction commit (they are then done once a second instead) and even that caching mode in Proxmox called “Writeback (unsafe)”. InnoDB hasn’t given me a single complaint… As mentioned earlier, I have seen InnoDB crash in the wild, so it’s not completely safe, but I can at least hit it with a shovel for a couple of days and fail to make a dent.

I was planning on provoking an error which could be solved by starting MySQL in InnoDB force recovery mode level 1 and demonstrating how that can fix things. This setting (https://dev.mysql.com/doc/refman/5.7/en/forcing-innodb-recovery.html) is something you hope never to have to fiddle with, but when InnoDB does crash in some way, innodb_force_recovery is the way forward. We would then move on to something that requires recovery level 2 or 3, all the way up past level 4. Levels 4 and up typically involve some data loss and usually require exporting and importing all the databases.
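
The general drill when you do end up there is to bump the level one step at a time and get a dump out as soon as the server stays up, roughly:

# my.cnf
[mysqld]
innodb_force_recovery = 1    # raise to 2, 3, 4... only as far as actually needed

systemctl start mysql
mysqldump --all-databases > rescue.sql    # at the higher levels, expect some tables to fail

# then: stop MySQL, move the broken datadir aside, reinitialize,
# remove innodb_force_recovery again and import rescue.sql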

As noted above however, no go on generating these kinds of errors as intended. But if we can’t get one of those elusive errors that can be solved with innodb_force_recovery=3 then I know a way to generate errors that can be “solved” with innodb_force_recovery=6. So let me reset the database server while it is running again, with 200 000 customers and maximum danger settings on. Then we can do some damage ourselves and see what happens.

Okey, that actually didn’t go to plan… I was expecting to have to go to level 6 and for mysqldump to fail at some stage. Maybe I deleted the wrong ib_logfile? I have a snapshot of the whole mysql partition after the reset. I can try again. No, now not even level 6 worked. All right, let’s start over. I have – as previously mentioned – been hitting this setup with a shovel for a couple of days.

So I switched to Btrfs instead of Ext4, for no other reason than feeling like it. Now let’s try that intentional break again:

Again we only had to go to level 5 of innodb_force_recovery but importantly we got stuck when trying to dump a specific table from the database. In a real life situation it’s easy to imagine how losing the list of items for all purchases would be a bad thing. I’m pretty sure audits by the IRS go better when you can show what you sold and to whom.

So while this doesn’t show tremendously well how to deal with more subtle forms of InnoDB crashes, you get the general idea. Notice how I created a Btrfs snapshot before starting to tinker with stuff. That can be a good idea if the option is available: here I intentionally broke something, but the process of fixing things could actually break things even more. And the most important take-away is that not all corruption is recoverable. Even if all the data in the sysbench folder is correct, a failure in one of the ib_logfiles can leave you with entire tables being unrecoverable. You’ve got to have backups!

So what about MyISAM? Well, it’s a lot easier to break MyISAM. In fact, since the benchmark tool is based on transactions – and MyISAM doesn’t support transactions – I only need to run it for a few seconds to get garbage data out. But let’s first dig ourselves out of the hole we’re in.

But MyISAM says this clearly on the box, so we can’t very well complain. Need transactions? Don’t use MyISAM. Need crash tolerance? Don’t use MyISAM. Have large tables? Don’t use MyISAM. It’s surprising how many MySQL databases in the world don’t conflict with those constraints and merrily run MyISAM without issue. But let’s break a MyISAM setup just for kicks!

This run was done with maximum safe settings. Btrfs doesn’t have a nobarrier equivalent as far as I know, but Proxmox added no caching and MySQL’s settings were conservative. The errors that arose were recoverable, but we can’t tell whether the contents of the database are consistent with the application, because as we saw before, MyISAM can’t guarantee that even when no one is doing a hard reset every couple of minutes.

So that was a slightly disjointed tour of MySQL failure and recovery. The next installment will be about how to cover our backs. Step 1 is clearly to use InnoDB as the underlying engine and I guess step 2 is not deleting ib_logfiles. Snapshots, binlogs and replication will follow.

Percona XtraDB Cluster again

So I got fed up with Percona XtraDB Cluster some time back. It was the backend for Zabbix and a bad crash left me with a cluster that just wouldn’t form a quorum and that was it for me. Other problems had included lots of errors on commits due to inconsistencies between nodes which I felt was difficult to justify since only one of the nodes actually received any writes…

Well, I have since written a Python script that can write to multiple MySQL-compatible master servers. It complies with Percona’s Strict Mode, and after some grinding I got it to maintain consistency of data. The code is absolutely awful but it works! One of the neat tricks is that it can at any point go through the database to determine that the “business logic” is maintained. That’s how I dug up tricky corner cases where a customer could have a purchase registered to him after he had been removed from the Customer table.

So what happens when I run the script at full tilt (it’s not fast, but 15-30 transactions per second isn’t bad given that only three queries can run in parallel due to the aforementioned awful code) and pull the plug on all three Percona XtraDB Cluster VMs? Well, the script stops printing out information about completed operations because it can’t continue. So I kill the script.

Next I reboot all three nodes and they just automatically get back into working order. No fiddling with extraneous tables or having to set innodb_force_recovery. I was surprised. I’ve tried it a few times and it just works. Running the consistency checker shows that atomicity has been maintained. I even took minute-by-minute snapshots of the mysql partition (running Btrfs) and forced a node back in time so it looked like it had lost five minutes of updates. Still no problem!

Now I find myself in the surprising position of having to roll up my sleeves and meticulously engineer a scenario to replicate what I once encountered in the wild. I will basically have to (the Btrfs juggling for each reset-and-snapshot step is sketched after the list):

  • Reset the database back to a shared state X
  • Advance the cluster by writing from my script
  • Snapshot node 1
  • Reset the database back to the previous state X
  • Write some more
  • Snapshot node 2
  • Repeat for node 3
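
The Btrfs part of each “reset” and “snapshot” step is the same juggling act on every node, roughly (snapshot names are placeholders, and the node that is started alone needs the bootstrap-pxc treatment described in the addendum below):

systemctl stop mysql
mv /var/lib/mysql/data /var/lib/mysql/data_old      # or btrfs subvolume delete
btrfs subvolume snapshot /var/lib/mysql/Snapshots/data/<state-X> /var/lib/mysql/data
chown -R mysql:mysql /var/lib/mysql/data
systemctl start mysql
# ...let the write script run for a bit, then freeze the result:
btrfs subvolume snapshot -r /var/lib/mysql/data /var/lib/mysql/Snapshots/data/<brain-N>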

Eventually I will have three snapshots – one on each node – which constitutes a complete split brain. No quorum and no automatic syncing possible. Then I can test how to recover manually. Hmm, I should probably snapshot all three nodes in each of the split-brain versions. Then I can try a 2-1 situation and see how Percona XtraDB Cluster handles it. It should let the two nodes that are in agreement start up, but the one should be kept out – not resynced or overwritten, just excluded.

Well, that’s what I think anyway.

Addendum 1: Now something interesting happened for some reason. I had shut down the cluster cleanly and yet standard start via systemctl didn’t work:

root@pcmk1:~# systemctl start mysql
Job for mysql.service failed because the control process exited with error code.
See "systemctl status mysql.service" and "journalctl -xe" for details.

I had to cat the contents of /var/lib/mysql/data/grastate.dat to find the node with either the highest sequence number or safe_to_bootstrap: 1. I found the latter:

root@pcmk1:~# cat /var/lib/mysql/data/grastate.dat
GALERA saved state
version: 2.1
uuid: ee445958-ed07-11e9-a650-a6dcefe074b9
seqno: 458661
safe_to_bootstrap: 1

The other two nodes had the same seqno. Anyway, with this information in hand, pcmk1 could be started with /etc/init.d/mysql bootstrap-pxc, then I could start the other two nodes with systemctl start mysql, and all was well again.

I’m getting the idea that the startup script for Percona XtraDB Cluster does some self-bootstrapping that you don’t get just by running systemctl start mysql. Or this is just one more mystery.

Notes to self:
Point X: Snapshots/data/data@auto-2019-10-29-2111_48h
Start of brain 1: Snapshots/data/data@auto-2019-10-29-2123_48h
End of brain 1: Snapshots/data/data@auto-2019-10-29-2126_48h
Start of brain 2: Snapshots/data/data@auto-2019-10-29-2134_48h
End of brain 2: Snapshots/data/data@auto-2019-10-29-2137_48h
Start of brain 3: Snapshots/data/data@auto-2019-10-29-2143_48h
End of brain 3: Snapshots/data/data@auto-2019-10-29-2146_48h

Okey, testing brain 2 on pcmk3 and pcmk2 while letting pcmk1 use brain 1. pcmk2 and pcmk3 had slightly different seqnos, which isn’t so weird since the snapshots can’t be initiated at the exact same microsecond and can’t really be expected to complete in exactly the same number of hard drive writes. So I had to edit the grastate.dat file for pcmk3 so that safe_to_bootstrap was set to 1 and then ran /etc/init.d/mysql bootstrap-pxc. I needed to run systemctl start mysql a couple of times to get pcmk2 to sync, but eventually it got into place.

As expected, pcmk1 couldn’t join and was left in its own partition. So far so good. Now to let each node use its own brain.

2019-10-30T21:21:22.923454Z 0 [Note] WSREP: Waiting for SST/IST to complete.
2019-10-30T21:21:22.923806Z 0 [Note] WSREP: STATE_EXCHANGE: sent state UUID: 390ca26a-fb5b-11e9-805a-9fd389d6c554
2019-10-30T21:21:22.924357Z 0 [Note] WSREP: STATE EXCHANGE: sent state msg: 390ca26a-fb5b-11e9-805a-9fd389d6c554
2019-10-30T21:21:22.924830Z 0 [Note] WSREP: STATE EXCHANGE: got state msg: 390ca26a-fb5b-11e9-805a-9fd389d6c554 from 0 (pxc-cluster-node-1)
2019-10-30T21:21:22.925159Z 0 [Note] WSREP: STATE EXCHANGE: got state msg: 390ca26a-fb5b-11e9-805a-9fd389d6c554 from 1 (pxc-cluster-node-2)
2019-10-30T21:21:22.925188Z 0 [ERROR] WSREP: gcs/src/gcs_group.cpp:group_post_state_exchange():324: Reversing history: 464082 -> 462077, this member has applied 2005 more events than the primary component.Data loss is possible. Aborting.
2019-10-30T21:21:22.925212Z 0 [Note] WSREP: /usr/sbin/mysqld: Terminated.

Seems reasonable. One node bailed out real quick but two kept trying.

| wsrep_cluster_size | 2 |
| wsrep_cluster_state_uuid | ee445958-ed07-11e9-a650-a6dcefe074b9 |
| wsrep_cluster_status | non-Primary

But the cluster didn’t process queries during this attempt to form a quorum. Just as intended. So let’s say we know for a fact that brain 3 is the “best”. Brain 1 and brain 2 are just wrong, not actually holding information that needs to be preserved. Percona XtraDB Cluster shouldn’t make this determination on its own, and it hasn’t. But we need to be able to make it ourselves once we’ve reviewed the situation (and need to get things back up and running). So let’s run this on pcmk2:

rm -rf /var/lib/mysql/data/*
systemctl start mysql

And a check of the STATUS:

wsrep_local_state 4
wsrep_local_state_comment Synced
wsrep_cert_index_size 0
wsrep_cert_bucket_count 22
wsrep_gcache_pool_size 1456
wsrep_causal_reads 0
wsrep_cert_interval 0.000000
wsrep_open_transactions 0
wsrep_open_connections 0
wsrep_ist_receive_status
wsrep_ist_receive_seqno_start 0
wsrep_ist_receive_seqno_current 0
wsrep_ist_receive_seqno_end 0
wsrep_incoming_addresses 192.168.1.65:3306,192.168.1.66:3306
wsrep_cluster_weight 2
wsrep_desync_count 0
wsrep_evs_delayed
wsrep_evs_evict_list
wsrep_evs_repl_latency 0/0/0/0/0
wsrep_evs_state OPERATIONAL
wsrep_gcomm_uuid 0a119cbd-fb5e-11e9-a854-a317f06d04a7
wsrep_cluster_conf_id 2
wsrep_cluster_size 2
wsrep_cluster_state_uuid ee445958-ed07-11e9-a650-a6dcefe074b9
wsrep_cluster_status Primary
wsrep_connected ON

And the logic check passes when running my Python script. Okey, so let’s join the last node… And yeah, it works the same. So if one is patient, all you need to do is completely blank the datadir (which in my case is /var/lib/mysql/data but for most people will be /var/lib/mysql) and start the node again. It will do a state transfer to initialize itself and then join.

wsrep_local_state 4
wsrep_local_state_comment Synced
wsrep_cert_index_size 0
wsrep_cert_bucket_count 22
wsrep_gcache_pool_size 1456
wsrep_causal_reads 0
wsrep_cert_interval 0.000000
wsrep_open_transactions 0
wsrep_open_connections 0
wsrep_ist_receive_status
wsrep_ist_receive_seqno_start 0
wsrep_ist_receive_seqno_current 0
wsrep_ist_receive_seqno_end 0
wsrep_incoming_addresses 192.168.1.65:3306,192.168.1.66:3306,192.168.1.64:3306
wsrep_cluster_weight 3
wsrep_desync_count 0
wsrep_evs_delayed
wsrep_evs_evict_list
wsrep_evs_repl_latency 0/0/0/0/0
wsrep_evs_state OPERATIONAL
wsrep_gcomm_uuid ce76d2c9-fb5e-11e9-a04e-926b86eb0f23
wsrep_cluster_conf_id 3
wsrep_cluster_size 3
wsrep_cluster_state_uuid ee445958-ed07-11e9-a650-a6dcefe074b9
wsrep_cluster_status Primary
wsrep_connected ON

What to do if you have a split-brain scenario where no node is actually “wrong” but there is a genuine discrepancy in valuable data? Well, then you’re boned! Databases do not have a convenient way of merging data like that. I’m of the opinion that any application that writes to a database should have some fallback mechanism whereby it can scan versions of the database that diverged away from the master and then inject the changes into the master.

In my dummy application this would involve retrieving all purchases and all deposits of credit from the diverged copy. Any purchase or deposit of credit unique to the diverged copy would then be processed as though it were a brand new transaction coming in from a customer, with the customer’s credit balance being changed accordingly. In some systems this may not be possible, but it’s really a desirable feature when you don’t want a split brain to be an insurmountable obstacle to getting your records straight.
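
Concretely, if the diverged copy is imported next to the live data under another schema name (sysbench_diverged here, a name I’m making up), finding the purchases that need to be re-processed could be as simple as:

SELECT d.*
FROM sysbench_diverged.Purchase AS d
LEFT JOIN sysbench.Purchase AS p ON p.id = d.id
WHERE p.id IS NULL;

-- and the same kind of query for the credit deposits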

MariaDB snapshots and recovery

Update 2019-11-02: Yep, I iz dumb! I was apparently working with a snapshot that itself contained a set of crashed tables. So yeah… Btrfs snapshots of MariaDB data directories are just fine as backups, provided you make sure the databases in them are themselves valid. You can check this for an arbitrary snapshot like this:

# cd /var/lib/mysql
# btrfs sub snap Snapshots/sveahadb/sveahadb@auto-2019-08-01-0745_6m/ copy
# /usr/bin/mysqld_safe --defaults-file=/etc/mysql/my.cnf --pid-file=/var/run/mysqld/mysqltestd.pid --socket=/var/run/mysqld/mysqldtest.sock --datadir=/var/lib/mysql/copy --log-error=/var/log/mysql/test.log --user=mysql --bind-address=127.0.0.1 --log-bin=mysqld-bin --port 3307 --skip-slave-start
# mysqlcheck -h localhost -P 3307 --all-databases

So the original mess is below:


I was thinking of renaming this article to “Wait, I iz dumb” but I’m not sure exactly where the boundaries of my dumbness lie, so I’ll have to postpone that. First things first: my colleagues at work showed me that it is indeed possible to just run a MySQL instance from a snapshot of the files in /var/lib/mysql. What I was missing was that the data needed for a specific database isn’t just kept in /var/lib/mysql/dbname but also in /var/lib/mysql/mysql, /var/lib/mysql/performance_schema and the /var/lib/mysql/ib* files.

“Oh, silly me!” I said, went home and copied the entire contents of the /var/lib/mysql snapshot folder (Snapshots/snapname) over to my dbvm testbed. And… same problem! Google provided me with some clues: the fact that the files come from a replicated setup could cause problems, and certain files should be deleted: https://dba.stackexchange.com/questions/111616/innodb-error-table-mysql-innodb-table-stats-not-found/111617

But wait, what about those innodb_* files? They seem like they would be important. What happens if I rsync a snapshot from the DB master? Hmm, same thing. Well, the replication files have different names, but the innodb_* tables still have to be deleted. Wait, they’re just stats? Well, nuts to them then!

But this led to a new and interesting problem, presumably because this snapshot was made from the master, which isn’t just committing transactions to disk based on a log – like a slave does – but also decides if and how a transaction can be committed:

InnoDB: Transaction 17653441 was in the XA prepared state.
InnoDB: 1 transaction(s) which must be rolled back or cleaned up
InnoDB: in total 0 row operations to undo
InnoDB: Trx id counter is 17654272
2019-09-06 21:29:19 140285946666112 [Note] InnoDB: 128 rollback segment(s) are active.
2019-09-06 21:29:19 140285380126464 [Note] InnoDB: Starting in background the rollback of recovered transactions
2019-09-06 21:29:19 140285380126464 [Note] InnoDB: Rollback of non-prepared transactions completed
2019-09-06 21:29:19 140285946666112 [Note] InnoDB: Waiting for purge to start
2019-09-06 21:29:19 140285946666112 [Note] InnoDB: Percona XtraDB (http://www.percona.com) 5.6.44-86.0 started; log sequence number 21897778133
2019-09-06 21:29:19 140285275272960 [Note] InnoDB: Dumping buffer pool(s) not yet started
2019-09-06 21:29:19 140285946666112 [Note] Plugin 'FEEDBACK' is disabled.
2019-09-06 21:29:19 7f96de049c80 InnoDB: Starting recovery for XA transactions…
2019-09-06 21:29:19 7f96de049c80 InnoDB: Transaction 17653441 in prepared state after recovery
2019-09-06 21:29:19 7f96de049c80 InnoDB: Transaction contains changes to 1 rows
2019-09-06 21:29:19 7f96de049c80 InnoDB: 1 transactions in prepared state after recovery
2019-09-06 21:29:19 140285946666112 [Note] Found 1 prepared transaction(s) in InnoDB
2019-09-06 21:29:19 140285946666112 [ERROR] Found 1 prepared transactions! It means that mysqld was not shut down properly last time and critical recovery information (
last binlog or tc.log file) was manually deleted after a crash. You have to start mysqld with --tc-heuristic-recover switch to commit or rollback pending transactions.
2019-09-06 21:29:19 140285946666112 [ERROR] Aborting

Fair enough… Though I’m curious why this never happens at work. I guess R1Soft might do something different from what a Btrfs snapshot does? Well, it’s kind of obvious that it’s a different technology than a CoW file system, but beyond that I think R1Soft might make sure it only stores logs that need no recovery/replay. Strange if that’s the case, given how they don’t mind saving completely broken tables in backups… Anyway, let’s just do this the prescribed way:

mysqld_safe --defaults-file=/etc/mysql/my.cnf --basedir=/usr --datadir=/var/lib/mysql/sveatest3/
--user=mysql --bind-address=127.0.0.1 --tc-heuristic-recover=COMMIT
190906 21:30:58 mysqld_safe Logging to syslog.
190906 21:30:58 mysqld_safe Starting mysqld daemon with databases from /var/lib/mysql/sveatest3/

I then stopped the instance and restarted the systemd service. mysqlcheck then verified that all tables were OK. Maybe this approach is more “lossy” than using innodb_force_recovery but it seems like I have asked MySQL to make fewer exceptions this way.

  • Ignore replication because this is recovery on a single node.
  • Ignore the innodb table statistics.
  • Commit any unfinished transactions.
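For completeness, the wrap-up after mysqld_safe had done its thing looked roughly like this, a sketch from memory rather than a transcript:

# Stop the manually started instance (or just stop the mysqld_safe process),
# hand control back to systemd...
mysqladmin shutdown
systemctl restart mysql
# ...and verify that everything checks out:
mysqlcheck -A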

What follows is the story of my first adventures in restoring data from a Btrfs snapshot, back when I didn’t know that ibdata and the ib_logfiles were kind of important:


I’ve been using my Btrfs snapshot script to take snapshots of my MariaDB databases. This serves merely as what I call oops-protection; “DROP DATABASE bacula;” kind of sucks, after all. But to my dismay, the data is not stored on disk in a consistent state. This turns out to be an error in my expectations rather than an error on MariaDB’s part. The tables are, in a sense, stored in a consistent state, but ongoing transactions are also stored on disk. So trying to start a database from files copied out of a Btrfs snapshot leads to MariaDB worrying about all the ongoing stuff that it was supposed to do:

2019-07-26 21:07:59 140629359316096 [Note] InnoDB: Starting crash recovery from checkpoint LSN=1795910239

2019-07-26 21:07:59 140629359316096 [ERROR] InnoDB: Attempted to open a previously opened tablespace. Previous tablespace mysql/innodb_table_stats uses space ID: 1 at filepath: ./mysql/innodb_table_stats.ibd. Cannot open tablespace zabbix/acknowledges which uses space ID: 1 at filepath: ./zabbix/acknowledges.ibd

2019-07-26 21:07:59 7fe6d301cc80  InnoDB: Operating system error number 2 in a file operation.

The error means the system cannot find the path specified.
If you are installing InnoDB, remember that you must create
directories yourself, InnoDB does not create them.
Error: could not open single-table tablespace file ./zabbix/acknowledges.ibd
We do not continue the crash recovery, because the table may become
corrupt if we cannot apply the log records in the InnoDB log to it.
To fix the problem and start mysqld:
1) If there is a permission problem in the file and mysqld cannot
open the file, you should modify the permissions.
2) If the table is not needed, or you can restore it from a backup,
then you can remove the .ibd file, and InnoDB will do a normal
crash recovery and ignore that table.
3) If the file system or the disk is broken, and you cannot remove
the .ibd file, you can set innodb_force_recovery > 0 in my.cnf
and force InnoDB to continue crash recovery here

So adding the following to MariaDB’s configuration allows us to start up in recovery mode:

[mysqld]
 innodb_force_recovery=1
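A restart picks up the setting; if level 1 hadn’t been enough, the value can be raised step by step up to 6, with each level skipping more of normal crash recovery at increasing risk to the data:

systemctl restart mysql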

Both the mysql and zabbix databases are seriously broken:

root@dbvm:/var/lib/mysql/sveatest# mysqlcheck -A
...
mysql.func                                         OK
 mysql.gtid_slave_pos
 Error    : Table 'mysql.gtid_slave_pos' doesn't exist in engine
 status   : Operation failed
 mysql.help_category                                OK
 mysql.help_keyword                                 OK
 mysql.help_relation                                OK
 mysql.help_topic                                   OK
 mysql.host                                         OK
 mysql.index_stats                                  OK
 mysql.innodb_index_stats
 Error    : Table 'mysql.innodb_index_stats' doesn't exist in engine
 status   : Operation failed
 mysql.innodb_table_stats
 Error    : Table 'mysql.innodb_table_stats' doesn't exist in engine
 status   : Operation failed
 mysql.plugin                                       OK
 mysql.proc                                         OK
...
zabbix.history
 Warning  : InnoDB: Index 'history_1' contains 3058508 entries, should be 3058449.
 error    : Corrupt
 zabbix.history_log                                 OK
 mysqlcheck: Got error: 2013: Lost connection to MySQL server during query when executing 'CHECK TABLE … '

But we can actually dump the data from zabbix, which is the database we’re interested in here.

mysqldump zabbix > zabbix.sql

Since the mysql database isn’t interesting here, I can just move it out of the MariaDB data directory and install a blank one.
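The “move it out” part isn’t in the transcript, but it was roughly this; the destination under /root is just where I happened to stash the broken copy:

systemctl stop mysql
mv /var/lib/mysql/sveatest/mysql /root/mysql-system-db.broken
# Then reinstall the system tables, which is the part shown below: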

root@dbvm:/var/lib/mysql/sveatest# mysql_install_db                  
Installing MariaDB/MySQL system tables in '/var/lib/mysql/sveatest' …
2019-07-26 21:33:16 140100875676800 [Note] /usr/sbin/mysqld (mysqld 10.1.40-MariaDB-0ubuntu0.18.04.1) starting as process 14528 …

OK                                                                                                                         

Filling help tables…                                                                
2019-07-26 21:33:20 139659493420160 [Note] /usr/sbin/mysqld (mysqld 10.1.40-MariaDB-0ubuntu0.18.04.1) starting as process 14558 …                                                       

OK                                                                                                                        

Creating OpenGIS required SP-s…                                                                                        
2019-07-26 21:33:23 140389584972928 [Note] /usr/sbin/mysqld (mysqld 10.1.40-MariaDB-0ubuntu0.18.04.1) starting as process 14587 …

OK

After dropping the broken zabbix database and disabling innodb_force_recovery, the raw SQL can be imported:

mysql -u root zabbix  < zabbix.sql

And we run mysqlcheck -A again:

 zabbix.history_log                                 OK
 zabbix.history_str                                 OK
 zabbix.history_text                                OK
 mysqlcheck: Got error: 2013: Lost connection to MySQL server during query when executing 'CHECK TABLE … '

Wait… What? So it starts without objection but can’t pass a mysqlcheck? I’ll drop the database and import again.

MariaDB [(none)]> DROP DATABASE zabbix;
 ERROR 2013 (HY000): Lost connection to MySQL server during query

Oh, for the love of… Fine. Let’s go nuclear then.

root@dbvm:/var/lib/mysql/sveatest# systemctl stop mysql
root@dbvm:/var/lib/mysql/sveatest# rm -rf zabbix
root@dbvm:/var/lib/mysql/sveatest# systemctl start mysql
# Log in and create empty database zabbix
root@dbvm:/var/lib/mysql/sveatest# mysql zabbix < /root/zabbix.sql

Now what? Table operations not LOCKED with…? What?! So the import won’t work now either. What if I drop the database, recreate it, and rerun the import?
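The drop and recreate itself is just plain SQL; one way to do it from the shell, and unlike the first attempt it went through this time:

mysql -e "DROP DATABASE IF EXISTS zabbix; CREATE DATABASE zabbix;"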

root@dbvm:/var/lib/mysql/sveatest# mysql zabbix < /root/zabbix.sql
root@dbvm:/var/lib/mysql/sveatest# mysqlcheck -A     
 bacula.BaseFiles                                   OK
 bacula.CDImages                                    OK
 bacula.Client                                      OK
...
 zabbix.users_groups                                OK
 zabbix.usrgrp                                      OK
 zabbix.valuemaps                                   OK
 zabbix.widget                                      OK
 zabbix.widget_field                                OK

Phew, now it worked. Well, this was just a dry run, but if importing an SQL dump didn’t get you back to where you need to be, that would be bad news.

So the points here are:

  • Filesystem snapshots of MariaDB/MySQL data directories do not give you a database snapshot that you can just run.
  • “Recovering” from what is essentially the state you would be left with if the server crashed can be done, but it’s not as easy as running mysqlrepair and having things just work.
  • Backups of databases should be made using engine-aware software like mysqldump, or fancier tools like xtrabackup or its MariaDB counterpart, mariabackup (see the sketch below).
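For the record, a minimal engine-aware backup along those lines could look something like this; the exact flags and paths are a matter of taste and setup:

# Logical backup: --single-transaction gives a consistent InnoDB snapshot
mysqldump --all-databases --single-transaction --routines --events > all-databases.sql

# Or a physical backup with mariabackup (the target dir is just an example):
mariabackup --backup --user=root --target-dir=/backup/mariadb-$(date +%F)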