MySQL replication notes

NOTE! These are mostly for my reference working on this post.

root@mysqlrecov2:~# mysqlbinlog /var/lib/mysql/data/mysqlrecov2-relay-bin.000002 | less
<jibberish>

Well that didn’t produce very readable information. This turned out to be better for viewing:

root@mysqlrecov2:~# mysqlbinlog --verbose /var/lib/mysql/data/mysqlrecov2-relay-bin.000002 | less
#200322 21:14:57 server id 1  end_log_pos 894 CRC32 0x8df90820  Query   thread_id=9     exec_time=0     error_code=0 
SET TIMESTAMP=1584911697/*!*/; 
BEGIN 
/*!*/; 
# at 1107 
#200322 21:14:57 server id 1  end_log_pos 956 CRC32 0x18b0e814  Table_map: `sysbench`.`Customer` mapped to number 116 
# at 1169 
#200322 21:14:57 server id 1  end_log_pos 1037 CRC32 0xa46a830a         Write_rows: table id 116 flags: STMT_END_F  
BINLOG ' 
UdV3XhMBAAAAPgAAALwDAAAAAHQAAAAAAAMACHN5c2JlbmNoAAhDdXN0b21lcgAEAw8DDwT/AP8A 
CBTosBg= 
UdV3Xh4BAAAAUQAAAA0EAAAAAHQAAAAAAAEAAgAE//B64AMAFHZ5YmdveWd0bWhwdHd6eGd3bWZy 
AAAAAA9maWhmeW9sYndha21wYm0Kg2qk 
'/*!*/; 
### INSERT INTO `sysbench`.`Customer` 
### SET 
###   @1=254074 
###   @2='vybgoygtmhptwzxgwmfr' 
###   @3=0 
###   @4='fihfyolbwakmpbm' 
# at 1250

How to find current log position? Or some way of figuring which position in the bin log a specific file system snapshot matches. This?

mysql> SHOW MASTER STATUS; 
+------------------+----------+--------------+------------------+---------------------------------------------------------------------------------------+ 
| File             | Position | Binlog_Do_DB | Binlog_Ignore_DB | Executed_Gtid_Set                                                                     | 
+------------------+----------+--------------+------------------+---------------------------------------------------------------------------------------+ 
| mysql-bin.000001 |      786 |              |                  | 1176771d-6c7e-11ea-a68b-1a389890e7c2:1-3, 
b6efad3f-6c71-11ea-9af3-eaa32fa23308:1-2769 | 
+------------------+----------+--------------+------------------+---------------------------------------------------------------------------------------+ 
1 row in set (0.00 sec) 

But it says this at the end of mysqlrecov2-relay-bin.000002…

#200322 21:23:07 server id 1  end_log_pos 1739591 CRC32 0xa5d8a74e      Xid = 82080

So what’s all this about mysqlrecov2-relay-bin.000001? And position 786 is suspect. Well, I guess File and Position are separate from GTID. So I shouldn’t be looking at File and Position because I use GTID? But I can’t very well make out anything sequential from a GTID. They’re UUIDs.

Okey, okey. Now I get it. So we have a GTID for the data synced from the master, that’s _b6efad3f-6c71-11ea-9af3-eaa32fa23308:1-2769_ and we can find matching information in mysqlrecov2-relay-bin.000002. Indeed the last segment in that file reads:

last_committed=2766     sequence_number=2769

So that checks out. The other GTID refers to the local machine and so we find that information in /var/log/mysql/mysql-bin.000001 . Notice that this is not a binlog we replicated in from the master. So nothing relay about this file. This GTID set only contains the part where I added the sysbench user so that I could run the script against the slave to make sure it was consistent. I’m not replicating the mysql-database since I don’t know if that’s safe.

All right, let’s break the database and try to recover using a combination of snapshots and binlog.

mysql> SET FOREIGN_KEY_CHECKS=0; 
Query OK, 0 rows affected (0.00 sec) 
 
mysql> DROP FROM Customer WHERE id < 5000;                 
ERROR 1064 (42000): You have an error in your SQL syntax; check the manual that corresponds to your MySQL server version for the right syntax to use near 'FROM Customer WHERE id < 5000' at l
ine 1 
mysql> DELETE FROM Customer WHERE id < 5000;     
Query OK, 4285 rows affected (4.87 sec)

Let’s see what the sanity check says:

192.168.1.154 
Connection to 192.168.1.154: <mysql.connector.connection.MySQLConnection object at 0x7fa1487cf8d0> by X 
self.conn.in_transaction: False 
Checking correctness 
Sums of credit(753818484), costs(-20278036 and deposits(787452535) don't check out. Something is wrong. 
Credit exists for 1 but no such customer exists. 
Credit exists for 2 but no such customer exists. 
Credit exists for 4 but no such customer exists. 
Credit exists for 5 but no such customer exists. 
Credit exists for 6 but no such customer exists. 
Credit exists for 9 but no such customer exists. 
Credit exists for 10 but no such customer exists. 
Credit exists for 11 but no such customer exists. 
Credit exists for 14 but no such customer exists. 
Credit exists for 15 but no such customer exists.
 <...>

And so on for thousands of lines. Found the last bit before the big delete block:

### INSERT INTO `sysbench`.`Product` 
### SET 
###   @1=57331 
###   @2='awcioutbwumewtopjnhx' 
###   @3=54 
###   @4=9 
# at 15852327 
#200323 21:54:32 server id 1  end_log_pos 15852145 CRC32 0x4918e08a     Xid = 731899 
COMMIT/*!*/; 
# at 15852358 
#200323 21:54:27 server id 1  end_log_pos 15852210 CRC32 0x2b6858df     GTID    last_committed=32482    sequence_number=32483   rbr_only=yes 
/*!50718 SET TRANSACTION ISOLATION LEVEL READ COMMITTED*//*!*/; 
SET @@SESSION.GTID_NEXT= 'b6efad3f-6c71-11ea-9af3-eaa32fa23308:32483'/*!*/; 
# at 15852423 
#200323 21:54:27 server id 1  end_log_pos 15852288 CRC32 0xe0637123     Query   thread_id=93    exec_time=0     error_code=0 
SET TIMESTAMP=1585000467/*!*/; 
SET @@session.foreign_key_checks=0/*!*/; 
BEGIN 
/*!*/; 
# at 15852501

So the sequence number is 32483. So let’s try to start up the latest snapshot and see what the last sequence number was then.

mysql> SHOW MASTER STATUS; 
+------------------+----------+--------------+------------------+----------------------------------------------------------------------------------------+ 
| File             | Position | Binlog_Do_DB | Binlog_Ignore_DB | Executed_Gtid_Set                                                                      | 
+------------------+----------+--------------+------------------+----------------------------------------------------------------------------------------+ 
| mysql-bin.000003 |      194 |              |                  | 1176771d-6c7e-11ea-a68b-1a389890e7c2:1-3, 
b6efad3f-6c71-11ea-9af3-eaa32fa23308:1-19756 | 
+------------------+----------+--------------+------------------+----------------------------------------------------------------------------------------+ 
1 row in set (0.00 sec)

All right, between 19756 and 32483 is the koscher sequence to replay:

root@mysqlrecov2:/var/lib/mysql# mysqlbinlog --skip-gtids --start-position=19756 --stop-position=32483 /var/lib/mysql/data/mysqlrecov2-relay-bin.000002 > replay.sql 
ERROR: Error in Log_event::read_log_event(): 'read error', data_len: 2054581866, event_type: 103
ERROR: Could not read entry at offset 19756: Error in log format or read error.

Huh? Oh for the love of… I have to use log_pos instead of sequence number? All right, so 32483 has end_log_pos 15852288 and 19756 has an “# at 10105794:

# at 10105794 
#200323 21:50:01 server id 1  end_log_pos 10105646 CRC32 0x6ab793c0     GTID    last_committed=19751    sequence_number=19756   rbr_only=yes 
/*!50718 SET TRANSACTION ISOLATION LEVEL READ COMMITTED*//*!*/; 
SET @@SESSION.GTID_NEXT= 'b6efad3f-6c71-11ea-9af3-eaa32fa23308:19756'/*!*/;

Let’s try this again:

root@mysqlrecov2:/var/lib/mysql# mysqlbinlog --skip-gtids --start-position=10105794 --stop-position=15852288 /var/lib/mysql/data/mysqlrecov2-relay-bin.000002 > replay.sql

All right, let’s try it.

root@mysqlrecov2:/var/lib/mysql# mysql -S /tmp/mysqld2.sock < replay.sql  
ERROR 1062 (23000) at line 35: Duplicate entry '54389' for key 'PRIMARY'         

Okey, must have brought with us too many. Let’s see if I can find the next position:

root@mysqlrecov2:/var/lib/mysql# less replay.sql  
root@mysqlrecov2:/var/lib/mysql# mysqlbinlog --verbose /var/lib/mysql/data/mysqlrecov2-relay-bin.000002 | less  
root@mysqlrecov2:/var/lib/mysql# mysqlbinlog --skip-gtids --start-position=10106096 --stop-position=15852288 /var/lib/mysql/data/mysqlrecov2-relay-bin.000002 > replay2.sql      
root@mysqlrecov2:/var/lib/mysql# mysql -S /tmp/mysqld2.sock < replay2.sql 

Success! Maybe… Let’s try the sanity check:

cjp@kdemain:~/sysbench$ bin/python3.7 ownsysbench.py --hosts 192.168.1.155 --check true 
192.168.1.155 
Connection to 192.168.1.155: <mysql.connector.connection.MySQLConnection object at 0x7fb052407810> by X 
self.conn.in_transaction: False 
Checking correctness 
Numbers check out on a high level. Checking each customer.
cjp@kdemain:~/sysbench$ 

Aw, yeah! So let’s dump out what we got here as valid snapshot:

root@mysqlrecov2:/var/lib/mysql# mysqld --daemonize --log-error-verbosity=3 --pid-file=/tmp/mysqld2.pid --socket=/tmp/mysqld2.sock --datadir=/var/lib/mysql/backup --port=3306 
root@mysqlrecov2:/var/lib/mysql# mysqldump sysbench > after_bin1.sql 
mysqldump: Got error: 2002: Can't connect to local MySQL server through socket '/var/run/mysqld/mysqld.sock' (2) when trying to connect 
root@mysqlrecov2:/var/lib/mysql# mysqldump -S /tmp/mysqld2.sock sysbench > after_bin1.sql 
Warning: A partial dump from a server that has GTIDs will by default include the GTIDs of all transactions, even those that changed suppressed parts of the database. If you don't want to res
tore GTIDs, pass --set-gtid-purged=OFF. To make a complete dump, pass --all-databases --triggers --routines --events.  
root@mysqlrecov2:/var/lib/mysql# mysqldump --set-gtid-purged=OFF -S /tmp/mysqld2.sock sysbench > after_bin1_no_gtid.sql

Now let’s see if we can apply the events that followed after the “unfortunate” deletion of customers.

# at 16050579 
#200323 21:54:32 server id 1  end_log_pos 16050431 CRC32 0xd46e91ef     GTID    last_committed=32482    sequence_number=32484   rbr_only=yes 
/*!50718 SET TRANSACTION ISOLATION LEVEL READ COMMITTED*//*!*/; 
SET @@SESSION.GTID_NEXT= 'b6efad3f-6c71-11ea-9af3-eaa32fa23308:32484'/*!*/; 
# at 16050644 
#200323 21:54:32 server id 1  end_log_pos 16050509 CRC32 0x6f245997     Query   thread_id=66    exec_time=0     error_code=0 
SET TIMESTAMP=1585000472/*!*/; 
SET @@session.foreign_key_checks=1/*!*/;

Okey, so we only need start-position this time since we want everything from the end of the DELETE block:

root@mysqlrecov2:/var/lib/mysql# mysqlbinlog --skip-gtids --start-position=16050579 /var/lib/mysql/data/mysqlrecov2-relay-bin.000002 > replay_last.sql                                    
root@mysqlrecov2:/var/lib/mysql# mysql -S /tmp/mysqld2.sock < replay_last.sql  
root@mysqlrecov2:/var/lib/mysql#

And is this consistent?

cjp@kdemain:~/sysbench$ bin/python3.7 ownsysbench.py --hosts 192.168.1.155 --check true 
192.168.1.155 
Connection to 192.168.1.155: <mysql.connector.connection.MySQLConnection object at 0x7fb1bd586710> by X 
self.conn.in_transaction: False 
Checking correctness 
Numbers check out on a high level. Checking each customer. 
cjp@kdemain:~/sysbench$

The last legit operations(after I did the big “accidental” delete but just before I brought down the database master to contain the damage) were:

Purchase: Deducted 958 from customer 218820 landing at 1442 
Deleted (83708,) 
Credit id 270922: Added 6224 to 261395 
Deleted (98105,) 
Credit id 270952: Added 7045 to 261281 
Purchase: Deducted 630 from customer 91130 landing at 2144

So is that last purchase now in the restored copy?

mysql> SELECT * FROM Purchase WHERE customer_id = 91130; 
+-------+-------------+------+--------+
| id    | customer_id | cost | weight |
+-------+-------------+------+--------+
| 55193 |       91130 |  630 |    550 |
+-------+-------------+------+--------+
1 row in set (0.00 sec)

And do we have customers with an ID below 5000?

mysql> SELECT COUNT(*) FROM Customer WHERE id < 5000;  
+----------+
| COUNT(*) |
+----------+
|     4285 |
+----------+
1 row in set (0.00 sec)

Indeed we do! So we have successfully cut out a particular disastrous sequence of operations, piecing together a coherent view of customers, purchases, credits etc. And it only took me an hour and a half. Probably not the response time considered ideal by a trader seeing 100 or so purchases per second.

Note that there’s no guarantee that this method will work out this well in practice. Indeed I may well have lost purchases by doing this but in the interest of un-screwing things it would probably have been the best way forward. So don’t view snapshots and binlogs as a license to clown around with a production database.

On a personal note: would it not be beneficial to have less brittle data structures than this? In a relational database we might have the state of Texas as a row in the table States and if it goes missing then every row representing customers from Texas will generate an error on read. Foreign key constraint error presumably. Maybe MongoDB is going too far in the other direction but DUPLICATE KEY and FOREIGN KEY CONSTRAINT errors make relational databases so sensitive. I guess this is a struggle each IT department has to have with their respective accounting departments.