VXLAN implemented with BGP EVPN

Note 1: This assumes you’re familiar with L2 and L3 networking and routing. Ideally you should know more about this stuff than I did when I began this project…

Note 2: Big shout out to Vincent Bernat and his article on this same topic: https://vincent.bernat.ch/en/blog/2017-vxlan-bgp-evpn

I’ve been trying to learn more about networking, and of course my focus is on availability. The standard approach has been to use STP (Spanning Tree Protocol) to keep things from going off the rails when switches are connected in a mesh. The mesh is meant to provide redundancy, but it also creates loops that frames can circulate in forever, and that causes disasters. STP figures out which ports to block so that no loops remain. If a switch dies, STP recalculates its solution and connectivity is maintained.

Except STP and Rapid STP are notoriously moody and go off the rails themselves at times. VLANs complicate things further: two links that belong to separate VLANs look equivalent to plain STP, and therefore one of them may be shut off. So if we want to combine STP and VLANs we need something like PVST (Per-VLAN Spanning Tree). Network admins don’t seem to like this approach very much. Just look at the effort put into finding replacements for STP and its variations, like TRILL and SPB.

VXLAN is a popular contribution to this mess, but on its own it does nothing about redundancy in the underlying network. What it does do is wrap L2 frames in UDP packets, so it is fully compatible with running on a routed IP network. So you could set up your network with a bunch of routers that run OSPF or BGP and then add VXLAN to make your network seem like a set of L2 networks.

Long story made slightly shorter, I thought I’d create a virtual network to test this stuff. Open vSwitch kicked me hard in the shins by locking up whenever I tried to simulate a link or node going down. Not ideal. I ended up creating lots of Linux bridges on a host to simulate the links, with the routers and servers running as LXC containers.

Setup

AS65128 is a gateway that does NAT, because this “data center” is just running inside a virtual machine on a server on my local network. So all these devices must appear to have an IP address in the range used on my local network if they’re going to download data from the internet, like apt install does for instance.

Here’s a definition of all the links used:

DC01GW01L01 # DC01GW01->DC01R01
DC01GW01L02 # DC01GW01->DC01R02

DC01DC02I1  # DC01R01->DC02R01
DC01DC02I2  # DC01R02->DC02R02
DC01R01R02  # DC01R01->DC01R02
DC01R01L01  # DC01R01->DC01S01
DC01R01L02  # DC01R01->DC01S02
DC01R02L01  # DC01R02->DC01S01
DC01R02L02  # DC01R02->DC01S02
DC01RR01R01 # DC01RR01->DC01R01
DC01RR01R02 # DC01RR01->DC01R02

DC01S01L01  # DC01S01->DC01Server01
DC01S01L02  # DC01S01->DC01Server02
DC01S02L01  # DC01S02->DC01Server01
DC01S02L02  # DC01S02->DC01Server02

DC02R01R02  # DC02R01->DC02R02
DC02R01L01  # DC02R01->DC02S01
DC02R01L02  # DC02R01->DC02S02
DC02R02L01  # DC02R02->DC02S01
DC02R02L02  # DC02R02->DC02S02
DC02RR01R01 # DC02RR01->DC02R01
DC02RR01R02 # DC02RR01->DC02R02

DC02S01L01  # DC02S01->DC02Server01
DC02S01L02  # DC02S01->DC02Server02
DC02S02L01  # DC02S02->DC02Server01
DC02S02L02  # DC02S02->DC02Server02

The following script creates and activates the links if need be:

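#!/bin/bash

# Create one bridge per link listed in links.txt (the name is everything before '#').
for LNK in $(cut -d '#' -f 1 links.txt); do
    # "ip link show NAME" fails if the device is missing, so this is an exact-name check
    ip link show "$LNK" > /dev/null 2>&1 || ip link add "$LNK" type bridge
    ip link set "$LNK" up
done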
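
Since every link becomes a Linux network device, its name has to fit the kernel's 15 character limit for interface names. Here is a minimal sketch of a sanity check that can run before any bridges are created. The function names link_names and check_link_names are my own, not part of the original setup, and the file format is assumed to be the one shown above.

```bash
#!/bin/bash

# Print the link names from a links file, one per line. The comment after
# '#' and surrounding whitespace are dropped, and blank lines are skipped.
link_names() {
    sed -e 's/#.*//' -e 's/^[[:space:]]*//' -e 's/[[:space:]]*$//' -e '/^$/d' "$1"
}

# Print every name longer than 15 characters and return non-zero if any exist.
check_link_names() {
    link_names "$1" | awk 'length($0) > 15 { print "too long: " $0; bad = 1 } END { exit bad }'
}
```

Running check_link_names links.txt before the loop would catch a mistyped or overlong name without touching the host's network configuration.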

Config files for all nodes – both which node uses what link and the interface/router config – can be found in the files in this archive: vxlan_conf_rc01.tar.gz

Walk-through

FRRouting (FRR) is used throughout to set things up. I had almost no idea what I was doing when setting things up initially, which might explain why this has taken a couple of months. Let’s first look at the routers, starting with DC01Router01. First we need to enable bgpd in /etc/frr/daemons:

# The watchfrr and zebra daemons are always started.
#                                                                                                 
bgpd=yes                                                                                          
ospfd=no
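
With this many containers it is easy to forget the daemons file on one of them. A small sketch of a check follows; frr_daemon_enabled is a hypothetical helper of mine, and it assumes the key=yes format shown above.

```bash
#!/bin/bash

# Succeed if the given daemon is set to "yes" in an FRR daemons file.
# The second argument is optional and defaults to /etc/frr/daemons.
frr_daemon_enabled() {
    local daemon file
    daemon=$1
    file=${2:-/etc/frr/daemons}
    grep -Eq "^${daemon}=yes[[:space:]]*$" "$file"
}
```

For example, frr_daemon_enabled bgpd || echo "bgpd is not enabled" could be run inside each router container.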

Then we set up the interfaces and establish that this is a BGP router in AS 65001:

interface lo                           
ip address 10.0.1.128/32

# DC01GW01
interface eth0
ip address 10.0.1.128 peer 10.0.128.128/32

# DC01Router02                                                 
interface eth1
ip address 10.0.1.128 peer 10.0.2.128/32

# DC02Router01
interface eth2              
ip address 10.0.1.128 peer 10.0.3.128/32

# DC01Switch01
interface eth3
ip address 10.0.1.128 peer 10.0.1.1/32

# DC01Switch02
interface eth4
ip address 10.0.1.128 peer 10.0.2.1/32

# DC01RR01
interface eth5                
ip address 10.0.1.128 peer 10.0.1.129/32

# Any routes not resolved by other means need to be handled by the gateway
ip route 0.0.0.0/0 10.0.128.128 eth0 

router bgp 65001 # This is a router in AS65001
  bgp router-id 10.0.1.128 # Its ID is its IP-address
  bgp default ipv4-unicast # Activate IPv4 BGP
  no bgp ebgp-requires-policy # No untrusted routers in use, so we can skip policy
  neighbor 10.0.128.128 remote-as 65128 # Connect to gateway AS65128
  neighbor 10.0.3.128 remote-as 65003 # Connect to DC02 AS65003

  neighbor fabric peer-group # This is a grouping of routers we want to talk to
  neighbor fabric remote-as 65001 # They all belong with AS65001
  neighbor fabric capability extended-nexthop # We might use IPv6 to talk to them
  neighbor fabric next-hop-self all # Offer to route traffic to those who ask
  neighbor 10.0.1.128 peer-group fabric # This node
  neighbor 10.0.2.128 peer-group fabric # DC01Router02
  bgp listen range 10.0.0.0/16 peer-group fabric # Let any 10.0.X.X router connect
  !
  address-family ipv4 unicast
    network 10.0.1.0/24 # We offer access to 10.0.1.X
    network 10.0.2.0/24 # We offer access to 10.0.2.X
    neighbor 10.0.128.128 activate # Connect to gateway
    neighbor fabric activate # Connect to the routers in our own AS
  exit-address-family
  !
exit

Some of this is left over from a different setup where I tried to let VXLAN BGP information be handled alongside regular IPv4 BGP routing. bgp listen range 10.0.0.0/16 peer-group fabric was meant to make it easy to add new switches that send VXLAN data to the routers, but I couldn’t make that work. bgp default ipv4-unicast is the default, but I include it for my own sanity.

I’ll let you look through the frr.conf files for the other routers to see the pattern. If I haven’t made it abundantly clear already: I don’t understand what I’m doing. This is me copy-pasting stuff from tutorials, reading manuals and then brute-forcing stuff until it works. I’m sure there’s more wrong than right with these configs. But it works in my lab environment.

So let’s talk VTEPs! That’s VXLAN Tunnel End Point. Basically a VXLAN “port”: the place where L2 frames get wrapped in VXLAN packets and unwrapped again. In this setup the switches contain the VTEPs. Currently I only have one VXLAN and I set it up through this script:

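# Remove leftovers from an earlier run (errors on a first run are harmless)
ip link del lan100 2> /dev/null
ip link del vxlan100 2> /dev/null
# VTEP for VNI 100; nolearning because BGP EVPN supplies the remote MAC addresses
ip link add vxlan100 type vxlan id 100 dstport 4789 local 10.0.1.1 nolearning
# Bridge that ties the VTEP to the server-facing ports
ip link add lan100 type bridge
ip link set vxlan100 master lan100
ip link set eth2 master lan100
ip link set eth3 master lan100
ip link set vxlan100 up
ip link set lan100 up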
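
The same steps will be needed for every VXLAN and on every switch, with only the VNI, the local address and the server ports changing. A sketch of a generator that prints the commands instead of running them, so they can be reviewed first. vtep_commands is a name I made up, and the vxlanN/lanN naming simply follows the script above.

```bash
#!/bin/bash

# Print the ip commands that set up a VTEP and its bridge.
# Usage: vtep_commands VNI LOCAL_IP PORT...
vtep_commands() {
    local vni local_ip port
    vni=$1
    local_ip=$2
    shift 2
    echo "ip link add vxlan${vni} type vxlan id ${vni} dstport 4789 local ${local_ip} nolearning"
    echo "ip link add lan${vni} type bridge"
    echo "ip link set vxlan${vni} master lan${vni}"
    # Attach every server-facing port to the bridge
    for port in "$@"; do
        echo "ip link set ${port} master lan${vni}"
    done
    echo "ip link set vxlan${vni} up"
    echo "ip link set lan${vni} up"
}
```

On DC01Switch01 the equivalent of the script above would be vtep_commands 100 10.0.1.1 eth2 eth3, piped to sh as root once the output looks right.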

eth2 and eth3 are the ports to which the servers connect. The following config on DC01Switch01 sends this information to the Route Reflectors that we will soon take a look at:

router bgp 65127 # Unique AS for VXLAN data
bgp router-id 10.0.1.1 # I call this node a switch but it's really kind of a router
no bgp default ipv4-unicast # Can't figure out how to get VXLAN+IPv4 in one node
neighbor central peer-group # Peer group for Route Reflectors
neighbor central remote-as 65127 # Same AS as this node
neighbor 10.0.3.129 peer-group central # The other Route Reflector
neighbor 10.0.1.129 peer-group central # DC01RR01
!
address-family l2vpn evpn # Send MAC information
  neighbor central activate # For my peer group
  advertise-all-vni # Send virtual network information
exit-address-family
!
exit

The line address-family l2vpn evpn needs some explanation as it goes to the heart of what this does. BGP can carry more than plain IP routes, and each kind of information gets its own address family. l2vpn is the family for L2 VPNs and evpn (Ethernet VPN) is the flavour within it that carries MAC addresses. Basically that stanza says “tell the route reflectors in the peer group central about any MAC addresses you see on a VXLAN interface”. That way other switches will know to send a VXLAN packet to your IP address whenever someone is trying to send an L2 frame to one of your interfaces.

Let’s look at the FRR conf for DC01RR01 before testing things out:

router bgp 65127 # Special L2VPN-info AS
  bgp router-id 10.0.1.129 # IP of this Route Reflector
  bgp cluster-id 10.0.1.129 # Same, used to manage the fact that there are 2 RR
  no bgp default ipv4-unicast
  neighbor central peer-group
  neighbor central remote-as 65127
  neighbor central capability extended-nexthop
  neighbor 10.0.3.129 peer-group central
  neighbor 10.0.1.129 peer-group central
  bgp listen range 10.0.0.0/16 peer-group central
  !
  address-family l2vpn evpn
   neighbor central activate
   neighbor central route-reflector-client
  exit-address-family
  !
exit
!

The big thing here is that we don’t advertise-all-vni but rather neighbor central route-reflector-client. So switches advertise VNI and L2 data and the Route Reflectors collect this data and provide it to all the switches.

Test

Let’s look at DC01Server02, which has a single IP address, 10.1.0.102, on eth1 (connected to DC01Switch02):

3: eth1@if179: mtu 1500 qdisc noqueue state UP group default qlen 1000
link/ether 8e:bf:63:02:53:9f brd ff:ff:ff:ff:ff:ff link-netnsid 0
inet 10.1.0.102/24 brd 10.1.0.255 scope global eth1
valid_lft forever preferred_lft forever
inet6 fe80::8cbf:63ff:fe02:539f/64 scope link
valid_lft forever preferred_lft forever

Note the hardware address above. Is it visible in DC01Switch01’s forwarding database (FDB)?

root@DC01Switch01:~# bridge fdb | grep "8e:bf"
root@DC01Switch01:~#

Nope. Well, no worries. Let’s ping 10.1.0.102 from DC01Server01, which has IP address 10.1.0.101 and is attached to DC01Switch01:

root@DC01Server01:~# ping 10.1.0.102
PING 10.1.0.102 (10.1.0.102) 56(84) bytes of data.
64 bytes from 10.1.0.102: icmp_seq=1 ttl=64 time=0.165 ms
64 bytes from 10.1.0.102: icmp_seq=2 ttl=64 time=0.181 ms
^C
--- 10.1.0.102 ping statistics ---
2 packets transmitted, 2 received, 0% packet loss, time 1005ms
rtt min/avg/max/mdev = 0.165/0.173/0.181/0.008 ms

That worked just fine. It’s almost as if DC01Switch01 learned the necessary information automatically from the Route Reflectors?

root@DC01Switch01:~# bridge fdb | grep "8e:bf"
8e:bf:63:02:53:9f dev vxlan100 vlan 1 extern_learn master lan100
8e:bf:63:02:53:9f dev vxlan100 extern_learn master lan100
8e:bf:63:02:53:9f dev vxlan100 dst 10.0.2.1 self extern_learn
root@DC01Switch01:~#

Indeed it did! But ICMP ping is a two-way thing. So did DC01Switch02 learn the MAC address of DC01Server01?

2: eth0@if176: mtu 1500 qdisc noqueue state UP group default qlen 1000
link/ether ce:d9:c4:af:e6:88 brd ff:ff:ff:ff:ff:ff link-netnsid 0
inet 10.1.0.101/24 brd 10.1.0.255 scope global eth0
valid_lft forever preferred_lft forever
inet6 fe80::ccd9:c4ff:feaf:e688/64 scope link
valid_lft forever preferred_lft forever

And the FDB on DC01Switch02:

root@DC01Switch02:~# bridge fdb | grep "ce:d9"
ce:d9:c4:af:e6:88 dev vxlan100 vlan 1 extern_learn master lan100
ce:d9:c4:af:e6:88 dev vxlan100 extern_learn master lan100
ce:d9:c4:af:e6:88 dev vxlan100 dst 10.0.1.1 self extern_learn

Success! Let’s try something fancier. Let’s see if we can get DC02Server01 to ping DC01Server02:

root@DC02Server01:~# ping 10.1.0.102
PING 10.1.0.102 (10.1.0.102) 56(84) bytes of data.
64 bytes from 10.1.0.102: icmp_seq=1 ttl=64 time=0.591 ms
64 bytes from 10.1.0.102: icmp_seq=2 ttl=64 time=0.210 ms
64 bytes from 10.1.0.102: icmp_seq=3 ttl=64 time=0.210 ms
64 bytes from 10.1.0.102: icmp_seq=4 ttl=64 time=0.200 ms
^C
--- 10.1.0.102 ping statistics ---
4 packets transmitted, 4 received, 0% packet loss, time 3052ms
rtt min/avg/max/mdev = 0.200/0.302/0.591/0.166 ms

Indeed. So is DC02Switch01 aware of the interfaces connected to the VXLAN interface on DC01Switch02?

root@DC02Switch01:~# bridge fdb | grep "8e:bf"
8e:bf:63:02:53:9f dev vxlan100 vlan 1 extern_learn master lan100
8e:bf:63:02:53:9f dev vxlan100 extern_learn master lan100
8e:bf:63:02:53:9f dev vxlan100 dst 10.0.2.1 self extern_learn

Nicely done. Note that none of these servers know about or have access to 10.0.1.128 or 10.0.1.1 or any of those devices. Not even the switches they are connected to “physically”:

root@DC02Server01:~# ping 10.0.1.2
PING 10.0.1.2 (10.0.1.2) 56(84) bytes of data.
From 10.1.0.103 icmp_seq=1 Destination Host Unreachable
From 10.1.0.103 icmp_seq=2 Destination Host Unreachable
From 10.1.0.103 icmp_seq=3 Destination Host Unreachable
^C
--- 10.0.1.2 ping statistics ---
4 packets transmitted, 0 received, +3 errors, 100% packet loss, time 3066ms
pipe 3

As far as the servers are concerned they are connected to the same switch. They have no way of knowing that the switch is actually four separate switches distributed over two datacenters. Pretty neat! But let’s try some availability-related stuff.
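
The FDB lookups above all follow the same pattern: grep for a MAC address and read the dst field of the vxlan entry. A small sketch that wraps this up; vtep_for_mac is my own helper name, and it assumes bridge fdb output in the format shown above.

```bash
#!/bin/bash

# Read "bridge fdb" output on stdin and print the remote VTEP address
# recorded for the given MAC address. Prints nothing if no entry for
# that MAC has a "dst" field.
vtep_for_mac() {
    awk -v mac="$1" '
        $1 == mac {
            for (i = 2; i < NF; i++)
                if ($i == "dst") print $(i + 1)
        }
    '
}
```

Given the FDB contents shown for DC01Switch01, bridge fdb | vtep_for_mac 8e:bf:63:02:53:9f prints 10.0.2.1, the address of DC01Switch02.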

Switching over

Let’s ping DC01Server01 from DC02Server02 like before, and then tell DC01Server01 to stop using eth0 (connected to DC01Switch01) and start using eth1, which is connected to DC01Switch02, instead. What does the switch used by DC02Server02 know about DC01Server01 before we start?

root@DC02Switch02:~# bridge fdb | grep "ce:d9"
ce:d9:c4:af:e6:88 dev vxlan100 vlan 1 extern_learn master lan100
ce:d9:c4:af:e6:88 dev vxlan100 extern_learn master lan100
ce:d9:c4:af:e6:88 dev vxlan100 dst 10.0.1.1 self extern_learn

Right, DC01Server01 is reached through DC01Switch01, which has IP address 10.0.1.1. Let’s start pinging:

root@DC02Server02:~# ping 10.1.0.101
PING 10.1.0.101 (10.1.0.101) 56(84) bytes of data.
64 bytes from 10.1.0.101: icmp_seq=1 ttl=64 time=0.197 ms
64 bytes from 10.1.0.101: icmp_seq=2 ttl=64 time=0.210 ms
64 bytes from 10.1.0.101: icmp_seq=3 ttl=64 time=0.205 ms
64 bytes from 10.1.0.101: icmp_seq=4 ttl=64 time=0.305 ms

Switching over to eth1 on DC01Server01… Let’s see what happens to the ping. It gets stuck after ping 26:

64 bytes from 10.1.0.101: icmp_seq=5 ttl=64 time=0.202 ms
64 bytes from 10.1.0.101: icmp_seq=6 ttl=64 time=0.196 ms
64 bytes from 10.1.0.101: icmp_seq=7 ttl=64 time=0.201 ms
64 bytes from 10.1.0.101: icmp_seq=8 ttl=64 time=0.181 ms
64 bytes from 10.1.0.101: icmp_seq=9 ttl=64 time=0.208 ms
64 bytes from 10.1.0.101: icmp_seq=10 ttl=64 time=0.203 ms
64 bytes from 10.1.0.101: icmp_seq=11 ttl=64 time=0.190 ms
64 bytes from 10.1.0.101: icmp_seq=12 ttl=64 time=0.200 ms
64 bytes from 10.1.0.101: icmp_seq=13 ttl=64 time=0.206 ms
64 bytes from 10.1.0.101: icmp_seq=14 ttl=64 time=0.227 ms
64 bytes from 10.1.0.101: icmp_seq=15 ttl=64 time=0.214 ms
64 bytes from 10.1.0.101: icmp_seq=16 ttl=64 time=0.199 ms
64 bytes from 10.1.0.101: icmp_seq=17 ttl=64 time=0.200 ms
64 bytes from 10.1.0.101: icmp_seq=18 ttl=64 time=0.256 ms
64 bytes from 10.1.0.101: icmp_seq=19 ttl=64 time=0.201 ms
64 bytes from 10.1.0.101: icmp_seq=20 ttl=64 time=0.200 ms
64 bytes from 10.1.0.101: icmp_seq=21 ttl=64 time=0.198 ms
64 bytes from 10.1.0.101: icmp_seq=22 ttl=64 time=0.198 ms
64 bytes from 10.1.0.101: icmp_seq=23 ttl=64 time=0.236 ms
64 bytes from 10.1.0.101: icmp_seq=24 ttl=64 time=0.201 ms
64 bytes from 10.1.0.101: icmp_seq=25 ttl=64 time=0.192 ms
64 bytes from 10.1.0.101: icmp_seq=26 ttl=64 time=0.207 ms
64 bytes from 10.1.0.101: icmp_seq=60 ttl=64 time=0.481 ms
64 bytes from 10.1.0.101: icmp_seq=61 ttl=64 time=0.212 ms
64 bytes from 10.1.0.101: icmp_seq=62 ttl=64 time=0.207 ms
64 bytes from 10.1.0.101: icmp_seq=63 ttl=64 time=0.216 ms
64 bytes from 10.1.0.101: icmp_seq=64 ttl=64 time=0.203 ms
64 bytes from 10.1.0.101: icmp_seq=65 ttl=64 time=0.205 ms
64 bytes from 10.1.0.101: icmp_seq=66 ttl=64 time=0.255 ms
64 bytes from 10.1.0.101: icmp_seq=67 ttl=64 time=0.217 ms
64 bytes from 10.1.0.101: icmp_seq=68 ttl=64 time=0.207 ms
64 bytes from 10.1.0.101: icmp_seq=69 ttl=64 time=0.241 ms
64 bytes from 10.1.0.101: icmp_seq=70 ttl=64 time=0.209 ms
64 bytes from 10.1.0.101: icmp_seq=71 ttl=64 time=0.215 ms
64 bytes from 10.1.0.101: icmp_seq=72 ttl=64 time=0.235 ms
64 bytes from 10.1.0.101: icmp_seq=73 ttl=64 time=0.228 ms
64 bytes from 10.1.0.101: icmp_seq=74 ttl=64 time=0.197 ms
64 bytes from 10.1.0.101: icmp_seq=75 ttl=64 time=0.201 ms
64 bytes from 10.1.0.101: icmp_seq=76 ttl=64 time=0.210 ms
64 bytes from 10.1.0.101: icmp_seq=77 ttl=64 time=0.249 ms
64 bytes from 10.1.0.101: icmp_seq=78 ttl=64 time=0.216 ms
64 bytes from 10.1.0.101: icmp_seq=79 ttl=64 time=0.205 ms
64 bytes from 10.1.0.101: icmp_seq=80 ttl=64 time=0.195 ms
64 bytes from 10.1.0.101: icmp_seq=81 ttl=64 time=0.205 ms
64 bytes from 10.1.0.101: icmp_seq=82 ttl=64 time=0.204 ms
64 bytes from 10.1.0.101: icmp_seq=83 ttl=64 time=0.204 ms
64 bytes from 10.1.0.101: icmp_seq=84 ttl=64 time=0.200 ms
^C
--- 10.1.0.101 ping statistics ---
84 packets transmitted, 51 received, 39.2857% packet loss, time 84979ms
rtt min/avg/max/mdev = 0.181/0.216/0.481/0.042 ms

You might say that this isn’t very good. We lost 33 pings in a row, 27 through 59, which at one ping per second is about half a minute without connectivity! But if we had lost DC01Switch01 and DC01Server01 had failed over like this, then a brief interruption is to be expected. At the end of the day the network reconfigured itself automatically to restore connectivity.

I suspect this kind of switch-over can be made to happen faster by configuring things differently but I’ll leave it here for now.
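
To compare future tweaks it helps to put a number on the interruption rather than eyeballing the ping output. A sketch that reads ping output and reports every run of missing sequence numbers; ping_gaps is a name of my own choosing, and it assumes the icmp_seq=N format that Linux ping prints.

```bash
#!/bin/bash

# Read ping output on stdin. For every run of missing icmp_seq values,
# print "FIRST-LAST COUNT" where COUNT is the number of lost replies.
ping_gaps() {
    awk '
        match($0, /icmp_seq=[0-9]+/) {
            # Skip the 9 characters of "icmp_seq=" to get the number
            seq = substr($0, RSTART + 9, RLENGTH - 9) + 0
            if (prev != "" && seq > prev + 1)
                printf "%d-%d %d\n", prev + 1, seq - 1, seq - prev - 1
            prev = seq
        }
    '
}
```

Fed the ping session above, it prints 27-59 33, and with the default interval of one ping per second that count is also a rough outage time in seconds.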

Caveats and things I can’t get to work

I tried using active-backup bonding on servers with this setup and it was a disaster with bridges on switches sending packets on the wrong port. Can’t figure out why but I’ll try it again at some point.

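It seems like you have to have VTEPs and Route Reflectors in the same AS. I couldn’t get it to work any other way. In hindsight that makes sense: route reflection is an iBGP feature, and iBGP means peers within one AS.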

FRR says you can have multiple ASes in a single bgpd process by using Virtual Routing and Forwarding (VRF), but I couldn’t get that to click. Ideally switches would be part of the routers’ AS to get routes easily, but whenever I try to run L2VPN in a non-default VRF nothing happens. Maybe the VXLAN interface must be assigned the same VRF?

FRR’s VRRP implementation is awkward, so I’d use Keepalived for that purpose to be honest.

Memory hotplug problem in Proxmox

Edit: Oh, I’m an idiot. Proxmox clearly states that this doesn’t work in Ubuntu out of the box:

For Linux memory unplug, you need to have movable zone enabled, in the kernel config (not enabled by default on Debian/Ubuntu):

CONFIG_MOVABLE_NODE=YES

https://pve.proxmox.com/wiki/Hotplug_(qemu_disk,nic,cpu,memory)

Original post

Tried to reduce the memory used by a VM on Proxmox:

Parameter verification failed. (400)

memory: hotplug problem – 400 Parameter verification failed. dimm4: error unplug memory module

And in the syslog:


Jun 19 01:15:46 samba01 kernel: [6657625.965645] Offlined Pages 32768
Jun 19 01:15:46 samba01 kernel: [6657625.975153] Offlined Pages 32768
Jun 19 01:15:46 samba01 kernel: [6657625.977703] page:ffffef54c64a1000 refcount:1 mapcount:0 mapping:ffff8c6fb1f8b880 index:0x0 compound_mapcount: 0
Jun 19 01:15:46 samba01 kernel: [6657625.977707] flags: 0x17ffffc0010200(slab|head)
Jun 19 01:15:46 samba01 kernel: [6657625.977711] raw: 0017ffffc0010200 dead000000000100 dead000000000122 ffff8c6fb1f8b880
Jun 19 01:15:46 samba01 kernel: [6657625.977713] raw: 0000000000000000 00000000001d001d 00000001ffffffff 0000000000000000
Jun 19 01:15:46 samba01 kernel: [6657625.977714] page dumped because: unmovable page
Jun 19 01:15:46 samba01 kernel: [6657625.982024] Offlined Pages 32768
Jun 19 01:15:46 samba01 kernel: [6657625.984611] page:ffffef54c64a1000 refcount:1 mapcount:0 mapping:ffff8c6fb1f8b880 index:0x0 compound_mapcount: 0
Jun 19 01:15:46 samba01 kernel: [6657625.984615] flags: 0x17ffffc0010200(slab|head)
Jun 19 01:15:46 samba01 kernel: [6657625.984619] raw: 0017ffffc0010200 dead000000000100 dead000000000122 ffff8c6fb1f8b880
Jun 19 01:15:46 samba01 kernel: [6657625.984622] raw: 0000000000000000 00000000001d001d 00000001ffffffff 0000000000000000
Jun 19 01:15:46 samba01 kernel: [6657625.984623] page dumped because: unmovable page
Jun 19 01:15:46 samba01 kernel: [6657625.984647] memory memory50: Offline failed.
Jun 19 01:15:49 samba01 kernel: [6657628.969161] Offlined Pages 32768
Jun 19 01:15:49 samba01 kernel: [6657628.975996] Offlined Pages 32768
Jun 19 01:15:49 samba01 kernel: [6657628.977797] page:ffffef54c64a1000 refcount:1 mapcount:0 mapping:ffff8c6fb1f8b880 index:0x0 compound_mapcount: 0
Jun 19 01:15:49 samba01 kernel: [6657628.977801] flags: 0x17ffffc0010200(slab|head)
Jun 19 01:15:49 samba01 kernel: [6657628.977806] raw: 0017ffffc0010200 dead000000000100 dead000000000122 ffff8c6fb1f8b880
Jun 19 01:15:49 samba01 kernel: [6657628.977808] raw: 0000000000000000 00000000001d001d 00000001ffffffff 0000000000000000
Jun 19 01:15:49 samba01 kernel: [6657628.977809] page dumped because: unmovable page
Jun 19 01:15:50 samba01 kernel: [6657628.982276] Offlined Pages 32768
Jun 19 01:15:50 samba01 kernel: [6657628.982783] page:ffffef54c64a1000 refcount:1 mapcount:0 mapping:ffff8c6fb1f8b880 index:0x0 compound_mapcount: 0
Jun 19 01:15:50 samba01 kernel: [6657628.982786] flags: 0x17ffffc0010200(slab|head)
Jun 19 01:15:50 samba01 kernel: [6657628.982790] raw: 0017ffffc0010200 dead000000000100 dead000000000122 ffff8c6fb1f8b880
Jun 19 01:15:50 samba01 kernel: [6657628.982792] raw: 0000000000000000 00000000001d001d 00000001ffffffff 0000000000000000
Jun 19 01:15:50 samba01 kernel: [6657628.982793] page dumped because: unmovable page
Jun 19 01:15:50 samba01 kernel: [6657628.982815] memory memory50: Offline failed.
Jun 19 01:15:52 samba01 kernel: [6657631.979709] Offlined Pages 32768
Jun 19 01:15:53 samba01 kernel: [6657631.987515] Offlined Pages 32768
Jun 19 01:15:53 samba01 kernel: [6657631.987911] page:ffffef54c64a1000 refcount:1 mapcount:0 mapping:ffff8c6fb1f8b880 index:0x0 compound_mapcount: 0
Jun 19 01:15:53 samba01 kernel: [6657631.987962] flags: 0x17ffffc0010200(slab|head)
Jun 19 01:15:53 samba01 kernel: [6657631.987968] raw: 0017ffffc0010200 dead000000000100 dead000000000122 ffff8c6fb1f8b880
Jun 19 01:15:53 samba01 kernel: [6657631.987970] raw: 0000000000000000 00000000001d001d 00000001ffffffff 0000000000000000
Jun 19 01:15:53 samba01 kernel: [6657631.987971] page dumped because: unmovable page
Jun 19 01:15:53 samba01 kernel: [6657631.991985] Offlined Pages 32768
Jun 19 01:15:53 samba01 kernel: [6657631.993757] page:ffffef54c64a1000 refcount:1 mapcount:0 mapping:ffff8c6fb1f8b880 index:0x0 compound_mapcount: 0
Jun 19 01:15:53 samba01 kernel: [6657631.993761] flags: 0x17ffffc0010200(slab|head)
Jun 19 01:15:53 samba01 kernel: [6657631.993765] raw: 0017ffffc0010200 dead000000000100 dead000000000122 ffff8c6fb1f8b880
Jun 19 01:15:53 samba01 kernel: [6657631.993768] raw: 0000000000000000 00000000001d001d 00000001ffffffff 0000000000000000
Jun 19 01:15:53 samba01 kernel: [6657631.993769] page dumped because: unmovable page
Jun 19 01:15:53 samba01 kernel: [6657631.993792] memory memory50: Offline failed.
Jun 19 01:15:56 samba01 kernel: [6657634.990796] Offlined Pages 32768
Jun 19 01:15:56 samba01 kernel: [6657634.998121] Offlined Pages 32768
Jun 19 01:15:56 samba01 kernel: [6657634.998571] page:ffffef54c64a1000 refcount:1 mapcount:0 mapping:ffff8c6fb1f8b880 index:0x0 compound_mapcount: 0
Jun 19 01:15:56 samba01 kernel: [6657634.998574] flags: 0x17ffffc0010200(slab|head)
Jun 19 01:15:56 samba01 kernel: [6657634.998578] raw: 0017ffffc0010200 dead000000000100 dead000000000122 ffff8c6fb1f8b880
Jun 19 01:15:56 samba01 kernel: [6657634.998580] raw: 0000000000000000 00000000001d001d 00000001ffffffff 0000000000000000
Jun 19 01:15:56 samba01 kernel: [6657634.998581] page dumped because: unmovable page
Jun 19 01:15:56 samba01 kernel: [6657635.002457] Offlined Pages 32768
Jun 19 01:15:56 samba01 kernel: [6657635.003626] page:ffffef54c64a1000 refcount:1 mapcount:0 mapping:ffff8c6fb1f8b880 index:0x0 compound_mapcount: 0
Jun 19 01:15:56 samba01 kernel: [6657635.003631] flags: 0x17ffffc0010200(slab|head)
Jun 19 01:15:56 samba01 kernel: [6657635.003635] raw: 0017ffffc0010200 dead000000000100 dead000000000122 ffff8c6fb1f8b880
Jun 19 01:15:56 samba01 kernel: [6657635.003637] raw: 0000000000000000 00000000001d001d 00000001ffffffff 0000000000000000
Jun 19 01:15:56 samba01 kernel: [6657635.003638] page dumped because: unmovable page
Jun 19 01:15:56 samba01 kernel: [6657635.003661] memory memory50: Offline failed.
Jun 19 01:15:59 samba01 kernel: [6657637.999099] Offlined Pages 32768
Jun 19 01:15:59 samba01 kernel: [6657638.007138] Offlined Pages 32768
Jun 19 01:15:59 samba01 kernel: [6657638.007587] page:ffffef54c64a1000 refcount:1 mapcount:0 mapping:ffff8c6fb1f8b880 index:0x0 compound_mapcount: 0
Jun 19 01:15:59 samba01 kernel: [6657638.007590] flags: 0x17ffffc0010200(slab|head)
Jun 19 01:15:59 samba01 kernel: [6657638.007594] raw: 0017ffffc0010200 dead000000000100 dead000000000122 ffff8c6fb1f8b880
Jun 19 01:15:59 samba01 kernel: [6657638.007597] raw: 0000000000000000 00000000001d001d 00000001ffffffff 0000000000000000
Jun 19 01:15:59 samba01 kernel: [6657638.007598] page dumped because: unmovable page
Jun 19 01:15:59 samba01 kernel: [6657638.012491] Offlined Pages 32768
Jun 19 01:15:59 samba01 kernel: [6657638.013243] page:ffffef54c64a1000 refcount:1 mapcount:0 mapping:ffff8c6fb1f8b880 index:0x0 compound_mapcount: 0
Jun 19 01:15:59 samba01 kernel: [6657638.013246] flags: 0x17ffffc0010200(slab|head)
Jun 19 01:15:59 samba01 kernel: [6657638.013254] raw: 0017ffffc0010200 dead000000000100 dead000000000122 ffff8c6fb1f8b880
Jun 19 01:15:59 samba01 kernel: [6657638.013256] raw: 0000000000000000 00000000001d001d 00000001ffffffff 0000000000000000
Jun 19 01:15:59 samba01 kernel: [6657638.013257] page dumped because: unmovable page
Jun 19 01:15:59 samba01 kernel: [6657638.013291] memory memory50: Offline failed.
Jun 19 01:16:02 samba01 kernel: [6657641.011053] Offlined Pages 32768
Jun 19 01:16:02 samba01 kernel: [6657641.015726] Offlined Pages 32768
Jun 19 01:16:02 samba01 kernel: [6657641.016167] page:ffffef54c64a1000 refcount:1 mapcount:0 mapping:ffff8c6fb1f8b880 index:0x0 compound_mapcount: 0
Jun 19 01:16:02 samba01 kernel: [6657641.016170] flags: 0x17ffffc0010200(slab|head)
Jun 19 01:16:02 samba01 kernel: [6657641.016175] raw: 0017ffffc0010200 dead000000000100 dead000000000122 ffff8c6fb1f8b880
Jun 19 01:16:02 samba01 kernel: [6657641.016177] raw: 0000000000000000 00000000001d001d 00000001ffffffff 0000000000000000
Jun 19 01:16:02 samba01 kernel: [6657641.016178] page dumped because: unmovable page
Jun 19 01:16:02 samba01 kernel: [6657641.020044] Offlined Pages 32768
Jun 19 01:16:02 samba01 kernel: [6657641.020495] page:ffffef54c64a1000 refcount:1 mapcount:0 mapping:ffff8c6fb1f8b880 index:0x0 compound_mapcount: 0
Jun 19 01:16:02 samba01 kernel: [6657641.020499] flags: 0x17ffffc0010200(slab|head)
Jun 19 01:16:02 samba01 kernel: [6657641.020503] raw: 0017ffffc0010200 dead000000000100 dead000000000122 ffff8c6fb1f8b880
Jun 19 01:16:02 samba01 kernel: [6657641.020505] raw: 0000000000000000 00000000001d001d 00000001ffffffff 0000000000000000
Jun 19 01:16:02 samba01 kernel: [6657641.020506] page dumped because: unmovable page
Jun 19 01:16:02 samba01 kernel: [6657641.020529] memory memory50: Offline failed.
Jun 19 01:16:05 samba01 kernel: [6657644.027669] Offlined Pages 32768
Jun 19 01:16:05 samba01 kernel: [6657644.032558] Offlined Pages 32768
Jun 19 01:16:05 samba01 kernel: [6657644.033056] page:ffffef54c64a1000 refcount:1 mapcount:0 mapping:ffff8c6fb1f8b880 index:0x0 compound_mapcount: 0
Jun 19 01:16:05 samba01 kernel: [6657644.033059] flags: 0x17ffffc0010200(slab|head)
Jun 19 01:16:05 samba01 kernel: [6657644.033063] raw: 0017ffffc0010200 dead000000000100 dead000000000122 ffff8c6fb1f8b880
Jun 19 01:16:05 samba01 kernel: [6657644.033120] raw: 0000000000000000 00000000001d001d 00000001ffffffff 0000000000000000
Jun 19 01:16:05 samba01 kernel: [6657644.033123] page dumped because: unmovable page
Jun 19 01:16:05 samba01 kernel: [6657644.037184] Offlined Pages 32768
Jun 19 01:16:05 samba01 kernel: [6657644.037639] page:ffffef54c64a1000 refcount:1 mapcount:0 mapping:ffff8c6fb1f8b880 index:0x0 compound_mapcount: 0
Jun 19 01:16:05 samba01 kernel: [6657644.037642] flags: 0x17ffffc0010200(slab|head)
Jun 19 01:16:05 samba01 kernel: [6657644.037647] raw: 0017ffffc0010200 dead000000000100 dead000000000122 ffff8c6fb1f8b880
Jun 19 01:16:05 samba01 kernel: [6657644.037649] raw: 0000000000000000 00000000001d001d 00000001ffffffff 0000000000000000
Jun 19 01:16:05 samba01 kernel: [6657644.037650] page dumped because: unmovable page
Jun 19 01:16:05 samba01 kernel: [6657644.037673] memory memory50: Offline failed.

Tried to offline memory-chunks manually:

root@samba01:~# cat /sys/devices/system/memory/memory50/removable
1
root@samba01:~# echo offline > /sys/devices/system/memory/memory50/state
-bash: echo: write error: Device or resource busy
root@samba01:~# echo offline > /sys/devices/system/memory/memory50/state
-bash: echo: write error: Device or resource busy
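
Before poking at individual blocks it can be handy to see all of them at once. A sketch that lists each memory block with its state and removable flag; memory_blocks is a hypothetical helper, and the optional directory argument exists only so the function can be pointed at a copy of the sysfs tree.

```bash
#!/bin/bash

# Print "NAME STATE REMOVABLE" for every memory block under the given
# directory (default: /sys/devices/system/memory).
memory_blocks() {
    local base dir
    base=${1:-/sys/devices/system/memory}
    for dir in "$base"/memory[0-9]*; do
        [ -d "$dir" ] || continue
        printf '%s %s %s\n' "${dir##*/}" "$(cat "$dir/state")" "$(cat "$dir/removable")"
    done
}
```

As the session above shows, a 1 in the removable column is no guarantee that offlining will succeed, so treat the output as a hint only.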

Tried to force memory compaction:

root@samba01:~# echo 1 > /proc/sys/vm/compact_memory

But no. In the end I had to shut down the VM and restart it to enforce the new memory limit. Not a big deal really, but interesting. Just one more example of why we should set things up in such a way as to tolerate reboots. In this particular case it was my local network file share, which is just fine to have down for two minutes, but I do also have a read-only backup to take over when the primary is down.

LV thin volume data usage

Noticed that a thin volume had weird data usage according to LVM: 94% of the data space was in use, but inside the VM the filesystem reported 25%. Learned of a command called fstrim, which walks a mounted filesystem and sends “discard” commands to the underlying storage to tell it which blocks aren’t actually in use. At first it didn’t work. Not sure if the remount was what did it, or that MariaDB was shut down, or maybe that I removed the noatime setting for the FS, but after some tinkering it trimmed something:

root@cluster1:~# fstrim -av
/var/lib/mysql: 0 B (0 bytes) trimmed on /dev/sdb
/boot: 0 B (0 bytes) trimmed on /dev/sda2
/: 74.8 MiB (78364672 bytes) trimmed on /dev/mapper/ubuntu--vg-ubuntu--lv
root@cluster1:~# fsck
fsck         fsck.btrfs   fsck.cramfs  fsck.ext2    fsck.ext3    fsck.ext4    fsck.fat     fsck.minix   fsck.msdos   fsck.vfat    fsck.xfs
root@cluster1:~# fsck.ext4 /var/lib/mysql/
e2fsck 1.45.5 (07-Jan-2020)
fsck.ext4: Is a directory while trying to open /var/lib/mysql/

The superblock could not be read or does not describe a valid ext2/ext3/ext4
filesystem.  If the device is valid and it really contains an ext2/ext3/ext4
filesystem (and not swap or ufs or something else), then the superblock
is corrupt, and you might try running e2fsck with an alternate superblock:
    e2fsck -b 8193 <device>
 or
    e2fsck -b 32768 <device>

root@cluster1:~# fsck.ext4 /dev/sdb
e2fsck 1.45.5 (07-Jan-2020)
/dev/sdb is mounted.
e2fsck: Cannot continue, aborting.

root@cluster1:~# systemctl stop mysql
root@cluster1:~# umount /var/lib/mysql
root@cluster1:~# fsck.ext4 /dev/sdb
e2fsck 1.45.5 (07-Jan-2020)
/dev/sdb: clean, 481/1966080 files, 1872010/7864320 blocks
root@cluster1:~# vim /etc/fstab
root@cluster1:~# mount /var/lib/mysql/
root@cluster1:~# df -h
Filesystem                         Size  Used Avail Use% Mounted on
udev                               415M     0  415M   0% /dev
tmpfs                              170M  1.1M  169M   1% /run
/dev/mapper/ubuntu--vg-ubuntu--lv   20G  7.4G   12G  40% /
tmpfs                              459M     0  459M   0% /dev/shm
tmpfs                              5.0M     0  5.0M   0% /run/lock
tmpfs                              459M     0  459M   0% /sys/fs/cgroup
/dev/loop0                          56M   56M     0 100% /snap/core18/1997
/dev/sda2                          976M  199M  711M  22% /boot
/dev/loop5                          71M   71M     0 100% /snap/lxd/19647
/dev/loop4                          68M   68M     0 100% /snap/lxd/20326
/dev/loop7                          56M   56M     0 100% /snap/core18/2066
/dev/loop1                          33M   33M     0 100% /snap/snapd/12057
/dev/loop6                          33M   33M     0 100% /snap/snapd/12159
tmpfs                              102M     0  102M   0% /run/user/1000
/dev/sdb                            30G  6.6G   22G  24% /var/lib/mysql
root@cluster1:~# fstrim -v /var/lib/mysql
/var/lib/mysql: 22.9 GiB (24544501760 bytes) trimmed
root@cluster1:~# systemctl restart mysql

But still no change in lvdisplay:

root@pve1:~# lvdisplay pve/vm-109-disk-1
  --- Logical volume ---
  LV Path                /dev/pve/vm-109-disk-1
  LV Name                vm-109-disk-1
  VG Name                pve
  LV UUID                ATDHHH-DXGG-KqtR-sLA6-XeUW-xa51-HLXxId
  LV Write Access        read/write
  LV Creation host, time pve1, 2020-09-10 20:25:44 +0200
  LV Pool name           data
  LV Status              available
  # open                 1
  LV Size                30.00 GiB
  Mapped size            93.87%
  Current LE             7680
  Segments               1
  Allocation             inherit
  Read ahead sectors     auto
  - currently set to     256
  Block device           253:11

Damn it… This took a while. Apparently the big issue was that the VM definition in Proxmox had “discard=0” configured for that hard drive, so the discard requests from fstrim never reached the thin pool. So I had to change that, and then fstrim -av worked a lot better. It ran way slower of course since it actually ended up doing work.

root@cluster1:~# fstrim -av
/var/lib/mysql: 22.9 GiB (24514723840 bytes) trimmed on /dev/sdb
/boot: 676.1 MiB (708972544 bytes) trimmed on /dev/sda2
/: 11.9 GiB (12730437632 bytes) trimmed on /dev/mapper/ubuntu--vg-ubuntu--lv

And result:

root@pve1:~# lvdisplay pve/vm-109-disk-1
  --- Logical volume ---
  LV Path                /dev/pve/vm-109-disk-1
  LV Name                vm-109-disk-1
  VG Name                pve
  LV UUID                ATDHHH-DXGG-KqtR-sLA6-XeUW-xa51-HLXxId
  LV Write Access        read/write
  LV Creation host, time pve1, 2020-09-10 20:25:44 +0200
  LV Pool name           data
  LV Status              available
  # open                 1
  LV Size                30.00 GiB
  Mapped size            23.40%
  Current LE             7680
  Segments               1
  Allocation             inherit
  Read ahead sectors     auto
  - currently set to     256
  Block device           253:11
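
To spot this before wasting an evening on it, here is a minimal sketch that scans a Proxmox VM config for disk lines without discard=on. It assumes the usual "bus: storage:volume,option=value" line format that `qm config <vmid>` prints; the function name is made up for illustration.

```shell
#!/bin/sh
# disks_without_discard: read a Proxmox VM config on stdin (for example the
# output of `qm config 109`) and print every disk line that lacks discard=on.
# Assumption: disk lines look like "scsi1: local-lvm:vm-109-disk-1,size=30G".
disks_without_discard() {
    grep -E '^(scsi|virtio|sata|ide)[0-9]+:' \
        | grep -v 'media=cdrom' \
        | grep -v 'discard=on' \
        || true
}

# Usage on the Proxmox host (not executed here):
#   qm config 109 | disks_without_discard
```

Any disk it prints is a candidate for the same problem: fstrim inside the guest reports success but the thin pool never gets the space back.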

Command line extravaganza

Some of us don’t particularly like using the computer mouse. It’s partly a matter of ergonomics – stretching out your arm to grab something wears thin after a couple of hours even if it’s just a few inches – and also a matter of keeping your fingers aligned on the keyboard so as to maintain touch typing. This has led to window managers like ratpoison which go in for completely removing the need for using a mouse.

I prefer i3 though it has a big problem with screen tearing so I only use it for non-graphical stuff. So video and games get to run in KDE and I have a separate VM where I do tech-stuff that’s heavy on text, terminals and so on. Some say the compositing… something-manager compton solves the issue but not for me.

We need more than just a keyboard-centric window manager to make it work. One of the most important missing pieces is browsing the web without a mouse. You could use something like w3m or lynx, which are necessarily keyboard-centric as they are terminal-based. w3m can actually show images but that is obviously heresy. My solution of choice is Google Chrome with vimium. You can use Firefox with vimperator if you are so inclined.

In vimium you can press v once to start moving a caret around and another press of v starts marking a continuous segment of text. Then it’s y to copy to clipboard. If you wonder why the characters vim appear so often in this context, it’s because the text editor vim is how hard core terminal junkies edit text files. It’s great but if you accidentally start vim you’ll feel like you’ve gone to crazy-town (:q and Enter will quit vim. You’re welcome). So maybe have a look at this crash course first: https://gist.github.com/dmsul/8bb08c686b70d5a68da0e2cb81cd857f

Terminal in itself

So what else do we need? Well, copy-pasting is the basis for all communication so it’s not sufficient to be able to copy paste from web browsers to text documents and terminals, but we must also be able to copy from text documents to terminals. Enter tmux! It creates virtual terminals in your terminal, sort of?

You get multiple screens and panes and can copy and paste within them, between them and also out to your system clipboard. Here’s the text copied from the screenshot:

root 5970 1 0 18:06 ? 00:00:00 /usr/lib/packagekit/pack
cjp 6017 4995 0 18:06 pts/0 00:00:00 tmux
cjp 6019 1318 0 18:06 ? 00:00:00 tmux
cjp 6020 6019 0 18:06 pts/1 00:00:00 -bash
cjp 6040 6019 0 18:06 pts/2 00:00:00 -bash
cjp 6052 6040 0 18:06 pts/2 00:00:00 ps -efd

My .tmux.conf file is pretty simple: .tmux.conf

fzf

This is a great fuzzy finder that helps you find text. Like a modern version of grep.

googler

Search Google via the terminal and open chosen result in a default or specified web browser.

ranger

A nice file browser in the terminal:

Ironic that I have to use my mouse to make screenshots of my completely keyboard-based VM.

bat

bat is like cat but for the modern era. The binary is called batcat on Ubuntu for some reason.

MySQL errors of my own making

Trying to break MySQL to give my colleagues some stuff to exercise their skills on. But it’s going… so so.

I tried to reduce the size of the file ibdata1:

innodb_data_file_path = ibdata1:1M

No go. Turns out the minimum size is 5M.

2021-03-26 22:49:37 0 [ERROR] InnoDB: Tablespace size must be at least 5 MB
innodb_data_file_path = ibdata1:5M

Nope:

2021-03-26 22:50:08 0 [ERROR] InnoDB: The innodb_system data file './ibdata1' is of a different size 768 pages than the 320 pages specified in the .cnf file!

Well yeah… I want it to be smaller. Seems I have to enable autoextend again: https://dev.mysql.com/doc/refman/5.7/en/innodb-system-tablespace.html. Sounds weird. Well, yeah. That only makes MariaDB ignore my 5MB limit and go with the old value of 12M. Nice.
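
The page counts in that error message are easier to read once converted to megabytes. A small sketch, assuming the default innodb_page_size of 16 KiB (the helper names are my own):

```shell
#!/bin/sh
# Convert between InnoDB pages and MiB, assuming the default 16 KiB page size.
PAGE_KIB=16

pages_to_mib() { echo $(( $1 * PAGE_KIB / 1024 )); }
mib_to_pages() { echo $(( $1 * 1024 / PAGE_KIB )); }

pages_to_mib 768   # size of the existing ibdata1 on disk
mib_to_pages 5     # what ibdata1:5M asks for
```

So the existing file is 768 pages, or 12 MiB, while the config asks for 320 pages, or 5 MiB. That is the 12M the server keeps falling back to.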

Oh, I also encountered a problem with MariaDB wanting to have 32000 open files with a small tablespace:

22:49:25 0 [Warning] Could not increase number of max_open_files to more than 16384 (request: 32184)

Sigh… Had to modify the systemd file /etc/systemd/system/mysql.service to say

LimitNOFILE=65536

And then reload systemd:

systemctl daemon-reload
systemctl restart mysql
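
An alternative sketch that leaves the packaged unit file alone is to put the limit in a drop-in directory. The helper name and the optional root-directory argument (only there so it can be tried in a scratch directory) are made up for illustration. The unit name mysql.service is taken from the path above; if that is only an alias on your system, the drop-in may need to go under the real unit name instead.

```shell
#!/bin/sh
# write_nofile_dropin UNIT LIMIT [ROOT]
# Creates ROOT/etc/systemd/system/UNIT.d/limits.conf with a LimitNOFILE override.
write_nofile_dropin() {
    unit=$1
    limit=$2
    root=${3:-}
    dir="$root/etc/systemd/system/$unit.d"
    mkdir -p "$dir" || return 1
    printf '[Service]\nLimitNOFILE=%s\n' "$limit" > "$dir/limits.conf"
}

# On the real system (as root), followed by the daemon-reload and restart:
#   write_nofile_dropin mysql.service 65536
```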

Then I realized it was innodb_log_file_size that I wanted to change. Good times.

Port mirroring

Core dump of my brain:

Need to find out what’s drawing 10 Mbit/s on my WAN. Thank God I figured out that port mirroring is a thing before constructing that 3 node keepalived cluster idea to make a redundant virtual router through which all traffic would have to go.

Source is port 1 which goes to the router and port 23 is eno4 on pve3. It may have been sufficient to run “ip link set ens19 promisc on” inside the VM that I connected to the corresponding bridge in Proxmox and turn off the firewall for the interface. That last bit was a tricky thing because I have no firewall rules in Proxmox. But apparently just having firewalling enabled kicks my plans of pushing all internet-related packets to my testmonitor right in the shins.

Along the way I switched standard Linux bridging for Open vSwitch. Not sure if that was necessary but this configuration worked:

auto lo
iface lo inet loopback

iface eno3 inet manual

iface eno1 inet manual

iface eno2 inet manual

allow-vmbr1 eno4
iface eno4 inet manual
        ovs_bridge vmbr1
        ovs_type OVSPort

auto bond0
iface bond0 inet manual
        bond-slaves eno2 eno3
        bond-miimon 100
        bond-mode balance-rr

auto vmbr0
iface vmbr0 inet static
        address 192.168.1.23/21
        gateway 192.168.0.1
        bridge-ports bond0
        bridge-stp off
        bridge-fd 0

#auto vmbr1
allow-ovs vmbr1
iface vmbr1 inet manual
        ovs_type OVSBridge
        ovs_ports eno4

Update 2021-09-29 00:12

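With Proxmox 7 it was sufficient to turn off the firewall and run this command, which sets the MAC ageing time to zero so the bridge floods every frame to all ports like a hub: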

brctl setageing vmbr1 0
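
Once the mirrored traffic shows up in the monitor VM, something still has to answer the original question of who is eating the bandwidth. A rough sketch that sums the lengths tcpdump reports per source address. It assumes IPv4 lines in the format `tcpdump -nn -q` prints when capturing on a single interface, and the function name is made up:

```shell
#!/bin/sh
# top_talkers: sum reported packet lengths per source address from
# `tcpdump -nn -q` output read on stdin. Assumption: IPv4 lines look like
#   12:00:01.000000 IP 192.168.1.50.51234 > 93.184.216.34.443: tcp 1448
# with the length as the last field.
top_talkers() {
    awk '$2 == "IP" && $NF ~ /^[0-9]+$/ {
        src = $3
        # Strip the trailing ".port" when present (address plus port has 4 dots).
        if (gsub(/\./, ".", src) == 4) sub(/\.[0-9]+$/, "", src)
        bytes[src] += $NF
    }
    END { for (s in bytes) print bytes[s], s }' | sort -rn
}

# Usage inside the monitor VM (interface name from the text above):
#   tcpdump -nn -q -i ens19 -c 10000 | top_talkers | head
```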

Some more notes

Linux bridges can have STP support:

root@pve3:~# brctl showstp vmbr0
vmbr0
 bridge id              8000.ac1f6bb1dd89
 designated root        8000.ac1f6bb1dd89
 root port                 0                    path cost                  0
 max age                  20.00                 bridge max age            20.00
 hello time                2.00                 bridge hello time          2.00
 forward delay             0.00                 bridge forward delay       0.00
 ageing time             300.00
 hello timer               0.00                 tcn timer                  0.00
 topology change timer     0.00                 gc timer                  83.72
 flags


bond0 (1)
 port id                8001                    state                forwarding
 designated root        8000.ac1f6bb1dd89       path cost                  4
 designated bridge      8000.ac1f6bb1dd89       message age timer          0.00
 designated port        8001                    forward delay timer        0.00
 designated cost           0                    hold timer                 0.00
 flags

fwpr103p0 (2)
 port id                8002                    state                forwarding
 designated root        8000.ac1f6bb1dd89       path cost                  2
 designated bridge      8000.ac1f6bb1dd89       message age timer          0.00
 designated port        8002                    forward delay timer        0.00
 designated cost           0                    hold timer                 0.00
 flags

I did not know that.

Here’s a really good idea if you have an internal authoritative DNS server for your domain and you use short TTL values so that changes will propagate quickly: DON’T SET THE PDNS SERVICE TO DISABLED. If you are an idiot like me, run this:

systemctl enable pdns
systemctl start pdns

I guess having your authoritative DNS server autostart is good no matter what your TTL values but it got real obvious real fast that something had gone to hell in a handbasket. At least I know now why things went all bananas the last time I rebooted the physical server where authdns01 runs…

I have a systemd service for a Docker-based PowerDNS GUI by the way:

root@authdns01:~# cat /etc/systemd/system/pdnsgui.service
[Unit]
Description=PowerDNS Admin Container
After=docker.service
Requires=docker.service

[Service]
TimeoutStartSec=45
Restart=always
ExecStartPre=-/usr/bin/docker stop pdnsgui
ExecStartPre=-/usr/bin/docker rm pdnsgui
ExecStart=/usr/bin/docker run --name pdnsgui -v pda-data:/data -p 9191:80 ngoduykhanh/powerdns-admin

[Install]
WantedBy=multi-user.target

Also not enabled…

systemctl enable pdnsgui
systemctl start pdnsgui

Some have “ExecStartPre=/usr/bin/docker pull ngoduykhanh/powerdns-admin” in their pdnsgui-analogue but I don’t like living dangerously.
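
Since both units turned out to be disabled, a check like the following could be run before the next planned reboot. It is only a sketch around `systemctl is-enabled`; the function name is made up.

```shell
#!/bin/sh
# check_enabled UNIT...: print every unit that `systemctl is-enabled` does not
# report as enabled, and return non-zero if there was at least one.
check_enabled() {
    rc=0
    for unit in "$@"; do
        state=$(systemctl is-enabled "$unit" 2>/dev/null) || true
        if [ "$state" != "enabled" ]; then
            echo "$unit: ${state:-unknown}"
            rc=1
        fi
    done
    return $rc
}

# Usage on authdns01:
#   check_enabled pdns pdnsgui
```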

UEFI for Qemu VMs

That was a properly gatekeeping title. If you’re not in the know:
UEFI = Latest system for getting computers to start
VM = Virtual Machine
Qemu = The basis for how Linux runs virtual machines, using its own VM-speed-up-thing (KVM)

So this is about how to boot virtual machines running on Qemu as though the virtual machine had UEFI support. Which is a bit wonky… I was trying to set up Arch Linux on a Proxmox VM and it took a bit more time than initially projected.

You need to add a special EFI disk to a virtual machine if you want to be able to save UEFI entries, but that only seems to work some of the time. It might have been that I added and erased a bunch of entries while troubleshooting that broke it. Anyway, now I can add EFI entries and they show up just fine in efibootmgr-printouts but when I reboot only the default boot entries are available.

This turned out to give me a basic course in UEFI stuff, and made me go a little bit unhinged. Kind of like what happens when you try to understand how a company like Juicero could come to exist. Anyway, UEFI works on the principle of there being an EFI storage space (the EFI System Partition) on a connected hard drive. Note that this is not the same as the Proxmox EFI disk. Yeah, kind of complicated huh?

EFI needs to store lots of data about the operating systems that are available to boot on a computer. For Linux this is just a Linux kernel image and maybe an initramfs(which I needed because I thought I’d make my Arch VM use LVM for its root partition, because of course I did) but it can be other stuff as well. So the EFI disk you always need – even on a physical computer – is for operating system stuff and should be a few hundred megabytes in size.

The extra EFI disk you need to add for Proxmox is just a place to store things like “which Linux kernel to boot for Arch Linux”. That’s the part that seems to be kind of wonky. Anyway, the “real” EFI disk needs to be FAT32 formatted and throughout UEFI folders are separated from each other using backslashes like in Windows, not with forward slashes like in Linux. Not two things I choose when I have a choice.

All in all UEFI isn’t a bad step forward from BIOS as it replaces boot loaders in most aspects. That’s what I learned during all of this. Every UEFI implementation should have some fallback environment that it can start if nothing else works and there you can do basic command line stuff. Here’s how I got Arch Linux booting on my Proxmox VM finally:

I just realized I can’t copy-paste from the Proxmox UEFI shell… Well then. Okay, I guess I remember most of it and the script used to boot is available in the file system now.

Shell> FS0:
FS0>arch.nsh

I don’t know why it’s not “cd FS0” but it isn’t… Anyway, UEFI lets you write and execute scripts, in this case arch.nsh:

\EFI\arch\vmlinuz-linux rw root=/dev/VG_root/root initrd=\EFI\arch\initramfs-linux.img

I made a proof of concept right in the UEFI shell:

Shell> FS0:
FS0>edit arch.nsh

There I could test different commands. Should it be \EFI\arch\vmlinuz-linux or EFI\arch\vmlinuz-linux(no leading backslash) ? Should it be initrd=\EFI\arch\initramfs-linux.img or initrd=”\EFI\arch\initramfs-linux.img” ? Using root=PARTUUID=LONGUUID was apparently not one of the right things but good old fashioned /dev/vgname/lvname worked fine.
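
The one-line script can also be generated from the Linux side, which avoids typing backslashes by hand in the UEFI editor. A minimal sketch with made-up helper names, fed with the values that ended up working above:

```shell
#!/bin/sh
# to_efi_path: turn a Linux-style path into the backslash form UEFI expects.
to_efi_path() {
    printf '%s\n' "$1" | sed 's#/#\\#g'
}

# nsh_boot_line KERNEL ROOT INITRD: print an EFISTUB boot line for a .nsh script.
# KERNEL and INITRD are paths relative to the root of the EFI file system.
nsh_boot_line() {
    printf '%s rw root=%s initrd=%s\n' \
        "$(to_efi_path "$1")" "$2" "$(to_efi_path "$3")"
}

nsh_boot_line /EFI/arch/vmlinuz-linux /dev/VG_root/root /EFI/arch/initramfs-linux.img
```

Redirect the output to arch.nsh on the EFI partition, wherever that happens to be mounted.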

I’m still not sure if scripts need to have the .nsh file ending but probably not. Anyway, now to figure out how to make Arch Linux autostart the network. And then another restart and booting using an EFI script!

Disaster recovery for small companies

Disaster recovery is a huge topic and if you’re General Electric or even a huge successful company then it’s really complicated. On the one hand you have so much data in so many places that needs to be preserved but just telling people to put it on their personal Dropbox is neither reliable nor secure. It’s a bit easier for smaller businesses since they have one web site and ten laptops to consider.

This is going to focus on the internet-related part, so web sites and email. For the love of God, please make backups of your laptops as well. It’s just not the point of this… well, whatever this is. Windows has good built-in backups nowadays so you only need a NAS and that’ll set you back like a few hundred Kajiggers even if you buy a pretty good one with multiple disks.

What I describe are tools which are readily available but not necessarily easy to use. One-click solutions to Disaster Recovery are expensive and often sketchy*. Unless you know how it works and why it will work even in a disaster, don’t trust it.

* Wow! I see Cloudflare has stopped offering a 1000% uptime guarantee for Enterprise Plans. Now it’s 100% with a 25x penalty-thing. I’m almost sad to see it go, it was such a great example of promising things that you know you can’t deliver but which you know you can afford to pay penalties for.

Web sites

So how to safeguard a web site? If you have your web site with a hosting company then they probably make backups but you had best check that. Don’t just ask them if they make backups. Test the backups. Download them and see if you can get your website up and running somewhere else. This serves multiple purposes:

  • Verifies that backups are in fact made.
  • Verifies that backups are available.
  • Makes sure you know what to do with them in an emergency.

Note that you shouldn’t rely entirely on the backups made by your hosting company even if everything checks out. Disaster recovery isn’t about surviving a broken hard drive somewhere, that’s just your average Monday. A company being unable to get their entire storage system to work properly for weeks is where we need DR. There are quite a few companies in Sweden as I’m writing this that have been waiting to access their email accounts for weeks. It took that provider a week just to get their customers back to the point where they could send and receive emails again.

My personal favorites are legal problems. Like a group of people not bothering to transfer assets of a bankrupt company (by buying it from the estate) but instead keep operating them saying simply “Hey, we got all the login credentials we need to run this stuff. Why would we go all out of our way to make sure it’s owned by the right legal entity?”. Yeah, that doesn’t work out too well for their customers in the end. If you are told that the next step for you to get your ecommerce site back online is to contact the retired lawyer who presided over a bankruptcy case several years ago, you’re looking at a serious piece of downtime.

So don’t just have backups with the same company that hosts your web site. Keep stuff off site as well. If you get FTP or SSH access to your hosting account then you could script it, as demonstrated in this post that shows how to get all the data necessary to run a web site over from a hosting account to a virtual server so that downtime can be minutes rather than 17 weeks of legal wrangling.
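
Roughly what such a script can look like, as a sketch rather than the script from that post. The host, paths and database name are placeholders, and it assumes SSH key login plus database credentials that mysqldump can pick up on the hosting account (for example from ~/.my.cnf).

```shell
#!/bin/sh
# backup_site SSH_TARGET REMOTE_DIR DB_NAME DEST
# Pulls the web root with rsync and a database dump over SSH into DEST.
backup_site() {
    target=$1
    remote_dir=$2
    db=$3
    dest=$4
    dump="$dest/db/$db-$(date +%Y-%m-%d).sql"

    mkdir -p "$dest/files" "$dest/db" || return 1
    # Mirror the site files; --delete keeps the copy identical to the source.
    rsync -az --delete "$target:$remote_dir/" "$dest/files/" || return 1
    # Dump the database on the remote side, then compress it locally.
    ssh "$target" "mysqldump --single-transaction $db" > "$dump" || return 1
    gzip -f "$dump"
}

# Usage (placeholder values):
#   backup_site user@hosting.example.com public_html shopdb /srv/backup/mysite
```

Because --delete mirrors deletions too, this only covers the "hosting company is gone" case. Keeping older versions needs something like snapshots on the receiving side.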

If you don’t want to spend money on a virtual server 365 days a year just to have a backup that you can switch over to super-fast then downloading the information locally and making sure you know roughly how to bring a site back online is a way to go. Disaster recovery can be allowed to take time because disasters are not expected to happen often. It’s about getting things back in less than weeks or months or about getting things back at all.

Email

The IMAP protocol is great as it allows us to access email from multiple devices simultaneously. If configured correctly you can even see which emails have been sent from an account on a smart phone even when accessed from a different device(folder mapping can be kind of wonky out of the box though). It is however very much an online system, IMAP. If the email server goes down you may not even be able to read the old emails in your inbox. That depends on how much your email client is storing locally.

What can really be a kick in the pants when there’s a big failure for an email provider is that if your email clients can connect to the server using the IMAP protocol and the server is empty, IMAP will dutifully empty your inbox. Yeah, that’s not great but would you rather have old copies of deleted emails lying around on various devices just so that you can avoid this corner-case in a disaster recovery situation?

Anyway, what I do is use Imapsync to keep a set of local copies of important email accounts. Note that I have Google Workspace as my email provider and that I don’t trust them enough with my completely unimportant emails to let them be the only safeguard against data loss. Of course Imapsync is kind of not-so-user-friendly. It’s meant for situations where you need to move thousands of email accounts between providers or when making automated backups.

I run this once a day:

/usr/bin/imapsync --host1 imap.gmail.com --user1 my-gmail-address --password1 SUPERSECRETAPPPASSWORD --host2 localhost --user2 myemailbackup --password2 SUPERSECRETLOCALPASSWORD --maxbytespersecond 300000 --gmail1

Since I have a virtual machine doing backup tasks generally I installed Dovecot on it so as to not have to pay someone to hold my email backups that are just there for disaster recovery situations. On Ubuntu it’s really easy:

apt install dovecot-imapd

Then create local users that will serve as destination email accounts:

useradd -m myemailbackup
useradd -m anotherbackup

I don’t like mbox or mdbox much, at least not for these applications and that was the one change I had to make to the default dovecot setup. Maildir is my kind of format:

mail_location = maildir:~/Maildir:LAYOUT=fs

Dovecot uses the local user’s home directory by default and accepts whatever password the user has for logging on to the computer.

Knocking it up a notch

I can’t get by without Btrfs snapshots. Like in the case of email backups for instance. The maildir folders mentioned above? Stored on a Btrfs partition on my backup server.

root@backup:~# ls -lh /srv/storage/Snapshots/myemailbackup/
drwxr-xr-x 1 myemailbackup myemailbackup 284 Dec 16 23:40 myemailbackup@auto-2020-12-16-2352_14d
drwxr-xr-x 1 myemailbackup myemailbackup 284 Dec 16 23:40 myemailbackup@auto-2020-12-17-0500_14d
drwxr-xr-x 1 myemailbackup myemailbackup 324 Dec 18 01:09 myemailbackup@auto-2020-12-18-0500_14d
drwxr-xr-x 1 myemailbackup myemailbackup 324 Dec 18 01:09 myemailbackup@auto-2020-12-19-0500_14d
drwxr-xr-x 1 myemailbackup myemailbackup 324 Dec 19 14:42 myemailbackup@auto-2020-12-20-0500_14d
drwxr-xr-x 1 myemailbackup myemailbackup 324 Dec 19 14:42 myemailbackup@auto-2020-12-20-0830_4w
drwxr-xr-x 1 myemailbackup myemailbackup 324 Dec 19 14:42 myemailbackup@auto-2020-12-21-0500_14d
drwxr-xr-x 1 myemailbackup myemailbackup 324 Dec 19 14:42 myemailbackup@auto-2020-12-22-0500_14d
drwxr-xr-x 1 myemailbackup myemailbackup 324 Dec 19 14:42 myemailbackup@auto-2020-12-23-0500_14d
drwxr-xr-x 1 myemailbackup myemailbackup 324 Dec 19 14:42 myemailbackup@auto-2020-12-24-0500_14d
drwxr-xr-x 1 myemailbackup myemailbackup 324 Dec 19 14:42 myemailbackup@auto-2020-12-25-0500_14d
drwxr-xr-x 1 myemailbackup myemailbackup 324 Dec 19 14:42 myemailbackup@auto-2020-12-26-0500_14d

This takes care of the scenario mentioned above, where you run IMAP against a server that thinks you have no email and all your local copies are deleted. Each snapshot shows the state of each account at that point but uses only the amount of storage necessary for the difference between other snapshots. To go completely off the rails I also send these snapshots to a secondary server. Mostly because I already coded that functionality into the script and because this is just a few gigabytes of data. My script is a horror that makes programmers intentionally blind themselves but there are better variants: https://ownyourbits.com/2017/12/27/schedule-btrfs-snapshots-with-btrfs-snp/
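
A minimal sketch of how snapshots with names like the ones in that listing could be created, and how long each should live. It is not the script mentioned above. It assumes the _14d and _4w suffixes are retention periods in days and weeks, and the function names and source path are placeholders.

```shell
#!/bin/sh
# retention_days TAG: convert a tag such as 14d or 4w into a number of days.
retention_days() {
    n=${1%[dw]}
    case "$1" in
        *d) echo "$n" ;;
        *w) echo $(( n * 7 )) ;;
        *)  return 1 ;;
    esac
}

# snapshot_name NAME TAG: e.g. myemailbackup@auto-2020-12-16-2352_14d
snapshot_name() {
    echo "$1@auto-$(date +%Y-%m-%d-%H%M)_$2"
}

# take_snapshot SUBVOLUME SNAPDIR NAME TAG: create a read-only snapshot.
take_snapshot() {
    btrfs subvolume snapshot -r "$1" "$2/$(snapshot_name "$3" "$4")"
}

# Usage (placeholder source path):
#   take_snapshot /srv/storage/myemailbackup /srv/storage/Snapshots/myemailbackup myemailbackup 14d
```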

In either case, you have all your emails stored locally in case of disaster striking your email provider. You can get a new email provider and Imapsync your data from your local server up to their servers. In a couple of hours you’ll be back up and running and not having to worry about how long it will take your previous provider to sort out their problems. As mentioned earlier I’ve seen that take weeks for companies that I thankfully don’t work for.

Large scale data storage

Some web sites have huge amounts of data even though the company running it is quite small. News and fashion sites seem to be keen on lots of high resolution pictures and video. It might not be viable to store that on your own network. Or maybe you simply want a safer storage solution. Amazon S3 is pretty great for these applications. While I use my own network as a sort of offsite backup location for email hosted externally, I can’t very well be the offsite backup location for my own house. Therefore I use Amazon S3 whenever that need arises.

You can get the aws command line tool for most operating systems to easily copy data to S3 (and access their other services):

curl "https://awscli.amazonaws.com/awscli-exe-linux-x86_64.zip" -o "awscliv2.zip"
unzip awscliv2.zip
cd aws/
less install  # <-- Read what it does
./install

You will need to add some credentials that have access to the right S3 buckets in the aws credentials file:

root@samba01:~# cat .aws/credentials
[default]
aws_access_key_id = SECRETID
aws_secret_access_key = SECRETKEY

Then it’s just a matter of running aws s3:

aws s3 cp localfile.tar.gz s3://mybackupbucket/Documents/

What I really like is that I can assign life-cycle rules that push data out to their Glacier storage tier after a while:
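
The same kind of rule can be set from the command line. A sketch, assuming one bucket-wide rule and a 7-day transition (both are placeholder choices, and the function names are made up); the bucket name is the one from the copy example above.

```shell
#!/bin/sh
# write_lifecycle_json DAYS FILE: write a rule moving all objects to Glacier
# after DAYS days.
write_lifecycle_json() {
    cat > "$2" <<EOF
{
  "Rules": [
    {
      "ID": "to-glacier-after-$1-days",
      "Status": "Enabled",
      "Filter": { "Prefix": "" },
      "Transitions": [
        { "Days": $1, "StorageClass": "GLACIER" }
      ]
    }
  ]
}
EOF
}

# apply_lifecycle BUCKET FILE: upload the rule (needs suitable credentials).
apply_lifecycle() {
    aws s3api put-bucket-lifecycle-configuration \
        --bucket "$1" --lifecycle-configuration "file://$2"
}

# Usage:
#   write_lifecycle_json 7 lifecycle.json
#   apply_lifecycle mybackupbucket lifecycle.json
```

Note that this call replaces whatever lifecycle configuration the bucket already has, so existing rules need to be included in the JSON file.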

I currently have some 1.4 TB of data in Amazon S3 with the vast majority being in Glacier. This costs me around $20 a month which is a very reasonable price for offsite backups. It’s easier than tapes and offsite is obviously way better than simple nearline (a term used for not immediately accessible storage). I really like tapes but they are way more of a hassle so now that I have a 100 Mbit fiber connection I mostly use that. I got the bulk of data on tape anyway which will speed up restores. The only data I have to get from S3 in case of all my servers shorting out at the same time is the last few months.

The downside of Glacier is that it takes time to restore but consider – if you will – that you put a backup of your 5 GB database into S3 every day. Which SQL-dump are you most likely to need? One from the 7 days that are immediately available or one from the X backups that take a few hours to get? Either way, you get your data back and disaster recovery is successful.

By the way, I don’t trust Amazon’s encryption. Not their server-side encryption, not their client-side encryption. I do the encryption using tools I have installed and that have nothing to do with Amazon:

gpg -v --batch --passphrase-file path_to_passphrase_file -c filename.tar.gz

This gives me a filename.tar.gz.gpg file that can be sent to S3 as is. Note that this isn’t some specific gripe I have against Amazon, it’s the same with all American companies. Well, it would apply to Russian and Chinese companies as well but for obvious reasons they aren’t on the table. My data is not super sensitive but I’d rather encrypt it than not.

WooCommerce monitoring

This is a follow-up to HA WooCommerce on a budget. Now we add monitoring using Zabbix so that we can keep track of services failing, load, queries per second. Installing Zabbix can be a bit of a hassle the first time around and I succeeded in having a big hassle the… What is this? Fourth time I install it? Using the RHEL repo for installing on Fedora 33 seems to not work great. On Ubuntu it’s easy:

wget https://repo.zabbix.com/zabbix/5.0/ubuntu/pool/main/z/zabbix-release/zabbix-release_5.0-1+focal_all.deb
dpkg -i zabbix-release_5.0-1+focal_all.deb
apt update
apt install zabbix-server-mysql zabbix-frontend-php zabbix-nginx-conf zabbix-agent

You’ll have to enter the right information in /etc/zabbix/zabbix_server.conf for how to connect to the database, as defined when you created the database:

MariaDB [(none)]> create database zabbix character set utf8 collate utf8_bin;
MariaDB [(none)]> create user zabbix@'%' IDENTIFIED BY 'SECRETPASSWORD';
MariaDB [(none)]> grant all privileges on zabbix.* to zabbix@'%';

That database also needs to be populated with the right tables:

zcat /usr/share/doc/zabbix-server-mysql/create.sql.gz | mysql -h 192.168.1.209 -u zabbix -p zabbix

192.168.1.209 is the virtual IP for the MariaDB master in my temporary Pacemaker cluster. Needed a way to write data to it continuously so I could test switchover and why not kill two birds with one stone?

VPN

To make my life easier I expanded the VPN for the primary/backup pair to include the monitor server. On primary the /etc/wireguard/wg0.conf looks like this:

[Interface]
PrivateKey =  PRIVKEY_FOR_PRIMARY
Address = 10.0.0.1/24
PostUp = iptables -A FORWARD -i wg0 -j ACCEPT; iptables -t nat -A POSTROUTING -o eth0 -j MASQUERADE
PostDown = iptables -D FORWARD -i wg0 -j ACCEPT; iptables -t nat -D POSTROUTING -o eth0 -j MASQUERADE
ListenPort = 51820

[Peer]
PublicKey = PUBKEY_FOR_BACKUP
AllowedIPs = 10.0.0.2/32

[Peer]
PublicKey = PUBKEY_FOR_MONITOR
AllowedIPs = 10.0.0.3/32

On backup:

[Interface]
Address = 10.0.0.2/32
PostUp = iptables -A FORWARD -i wg0 -j ACCEPT; iptables -t nat -A POSTROUTING -o eth0 -j MASQUERADE
PostDown = iptables -D FORWARD -i wg0 -j ACCEPT; iptables -t nat -D POSTROUTING -o eth0 -j MASQUERADE
PrivateKey = PRIVKEY_FOR_BACKUP
ListenPort = 51820

[Peer]
PublicKey = PUBKEY_FOR_PRIMARY
Endpoint = 13.49.145.244:51820
AllowedIPs = 10.0.0.1/24

[Peer]
PublicKey = PUBKEY_FOR_MONITOR
AllowedIPs = 10.0.0.3/32

And on the monitor:

[Interface]
Address = 10.0.0.3/32
PostUp = iptables -A FORWARD -i wg0 -j ACCEPT; iptables -t nat -A POSTROUTING -o eth0 -j MASQUERADE
PostDown = iptables -D FORWARD -i wg0 -j ACCEPT; iptables -t nat -D POSTROUTING -o eth0 -j MASQUERADE
PrivateKey = PRIVKEY_FOR_MONITOR

[Peer]
PublicKey = PUBKEY_FOR_PRIMARY
Endpoint = 13.49.145.244:51820
AllowedIPs = 10.0.0.1/24

[Peer]
PublicKey = PUBKEY_FOR_BACKUP
Endpoint = 185.20.12.24:51820
AllowedIPs = 10.0.0.2/32

User parameters

We need to be able to access MariaDB’s server variables for a thing later on so we need to add something to /etc/zabbix/zabbix_agentd.d/userparameter_mysql.conf:

UserParameter=mysql.variables[*], mysql -h"$1" -P"$2" -sNX -e "show variables"

Zabbix agents

The easiest way to monitor servers is to install Zabbix Agent on them. SNMP and other methods are available but when you can use the Agent-method then that’s typically easier. Install on RHEL:

rpm -Uvh https://repo.zabbix.com/zabbix/5.2/rhel/8/x86_64/zabbix-release-5.2-1.el8.noarch.rpm
dnf clean all
dnf install zabbix-agent
systemctl enable zabbix-agent

Install on Ubuntu:

wget https://repo.zabbix.com/zabbix/5.0/ubuntu/pool/main/z/zabbix-release/zabbix-release_5.0-1+focal_all.deb
dpkg -i zabbix-release_5.0-1+focal_all.deb
apt update
apt install zabbix-agent

Both need their /etc/zabbix/zabbix_agentd.conf edited so that Server, ServerActive and Hostname are set correctly. Server and ServerActive should be the IP address of the monitor, in this case 10.0.0.3. Hostname should reflect the node’s own name.
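
With more than a couple of nodes that edit gets tedious, so here is a small sketch that does it with sed. It only rewrites lines that are already present and uncommented in the file (an assumption about the packaged default config), and the function name is made up.

```shell
#!/bin/sh
# set_agent_conf FILE SERVER HOSTNAME
# Rewrites the Server, ServerActive and Hostname lines in an agent config.
set_agent_conf() {
    file=$1
    tmp=$(mktemp) || return 1
    sed -e "s|^Server=.*|Server=$2|" \
        -e "s|^ServerActive=.*|ServerActive=$2|" \
        -e "s|^Hostname=.*|Hostname=$3|" "$file" > "$tmp" || return 1
    # Write back through cat so the original file keeps its owner and mode.
    cat "$tmp" > "$file" && rm -f "$tmp"
}

# Usage on each node, followed by a restart of the agent:
#   set_agent_conf /etc/zabbix/zabbix_agentd.conf 10.0.0.3 "$(hostname)"
#   systemctl restart zabbix-agent
```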

Customization

Adding the nodes to Zabbix is easy enough so I won’t demonstrate that but adding templates isn’t necessarily all that obvious (like that you pretty much have to add them to make Zabbix do anything). Here I’m adding some standard templates for Linux servers and also MySQL. I added Nginx as well.

Now let’s create two items manually, one for primary and one for backup. They’ll do the same thing, get the server variables and extract the “read_only” variable that tells us if the node is accepting MariaDB writes:

We need to process the output to get a readable value using Preprocessing:
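
Outside of Zabbix, the same extraction can be sketched in shell, which is handy for checking what the item will see before fiddling with preprocessing steps. This assumes the XML layout the mysql client prints with -X, with each field element on its own line; the function name is made up.

```shell
#!/bin/sh
# read_only_value: extract the value of read_only from the output of
# `mysql -X -e "show variables"` read on stdin.
read_only_value() {
    awk '
        /name="Variable_name">read_only</ { found = 1; next }
        found && /name="Value">/ {
            sub(/.*name="Value">/, "")
            sub(/<\/field>.*/, "")
            print
            exit
        }'
}

# Usage on a database node:
#   mysql -sNX -e "show variables" | read_only_value
```

The leading ">" in the pattern keeps it from matching other variables whose names merely end in read_only.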

We can then add triggers. primary should normally be read-write, so if it is read-only that should trigger a warning. The opposite is true for backup:

It also became clear that I should have a check for backup server running the failover service:

And a trigger to go along with that of course:

Zero running failover daemons is bad you see. Don’t ask me what prompted me to realize the necessity of having an automated warning for this.

You’ll probably want to customize the MySQL by Zabbix Agent template in this sort of situation.

  • Set replication discovery to run every 5 minutes.
  • Make it possible to manually close the warning about a server not replicating from a master (since we will be switching master/slave roles these warnings can be spurious).
  • Disable warnings about InnoDB pool utilization:

Dashboards

It’s easy to create dashboards with information you find particularly useful:

Warnings

You’ll want to go to the Administration->Media Types section to enable ways for Zabbix to alert you to things going wrong. I use email only for my own network but for a production setup you’d probably want PushOver or OpsGenie to alert you more forcefully when things go south.

HA WooCommerce on a budget

There is a draft on this website called “HA WordPress on a slightly bigger budget” that details the futility of trying to use Keepalived and Cloudflare Load Balancer to make a working primary/backup high availability server setup. I do not consider this to be any failing on the part of Keepalived, as it quite clearly isn’t meant to be used for things like databases where split brain is a concern.

Cloudflare could have been a bit more helpful by providing a “no failback” option so that if traffic was ever directed to the backup, traffic wouldn’t go over to the primary automatically if that server came back online again. But I’ve implemented that using their REST API instead. So it’s still Cloudflare Load Balancer, MariaDB, Nginx, PHP-FPM, lsyncd and some janky bash scripts. It’s just that Keepalived isn’t being asked to do the work of Pacemaker/Corosync.

A planned switchover

Important numbers:

Thu 24 Dec 2020 11:37:11 PM CET Read worked. | Write worked. 185.20.12.24
Thu 24 Dec 2020 11:37:30 PM CET Read worked. | Write worked. 172.31.35.57

We had 19 seconds during which writes didn’t work due to the switchover process. Reads worked throughout. Full log:

cjp@util:~$ ./wootest.sh
Thu 24 Dec 2020 11:36:53 PM CET Read worked. | Write worked. 185.20.12.24

Thu 24 Dec 2020 11:36:54 PM CET Read worked. | Write worked. 185.20.12.24

Thu 24 Dec 2020 11:36:56 PM CET Read worked. | Write worked. 185.20.12.24

Thu 24 Dec 2020 11:36:58 PM CET Read worked. | Write worked. 185.20.12.24

Thu 24 Dec 2020 11:36:59 PM CET Read worked. | Write worked. 185.20.12.24

Thu 24 Dec 2020 11:37:01 PM CET Read worked. | Write worked. 185.20.12.24

Thu 24 Dec 2020 11:37:03 PM CET Read worked. | Write worked. 185.20.12.24

Thu 24 Dec 2020 11:37:04 PM CET Read worked. | Write worked. 185.20.12.24

Thu 24 Dec 2020 11:37:06 PM CET Read worked. | Write worked. 185.20.12.24

Thu 24 Dec 2020 11:37:07 PM CET Read worked. | Write worked. 185.20.12.24

Thu 24 Dec 2020 11:37:09 PM CET Read worked. | Write worked. 185.20.12.24

Thu 24 Dec 2020 11:37:11 PM CET Read worked. | Write worked. 185.20.12.24

Thu 24 Dec 2020 11:37:12 PM CET Read worked. | Write failed. 185.20.12.24

Thu 24 Dec 2020 11:37:14 PM CET Read worked. | Write failed. 185.20.12.24

Thu 24 Dec 2020 11:37:15 PM CET Read worked. | Write failed. 185.20.12.24

Thu 24 Dec 2020 11:37:17 PM CET Read worked. | Write failed. 185.20.12.24

Thu 24 Dec 2020 11:37:19 PM CET Read worked. | Write failed. 185.20.12.24

Thu 24 Dec 2020 11:37:20 PM CET Read worked. | Write failed. 185.20.12.24

Thu 24 Dec 2020 11:37:22 PM CET Read worked. | Write failed. 185.20.12.24

Thu 24 Dec 2020 11:37:23 PM CET Read worked. | Write failed. 172.31.35.57

Thu 24 Dec 2020 11:37:25 PM CET Read worked. | Write failed. 172.31.35.57

Thu 24 Dec 2020 11:37:27 PM CET Read worked. | Write failed. 172.31.35.57

Thu 24 Dec 2020 11:37:30 PM CET Read worked. | Write worked. 172.31.35.57

Thu 24 Dec 2020 11:37:32 PM CET Read worked. | Write worked. 172.31.35.57

Thu 24 Dec 2020 11:37:34 PM CET Read worked. | Write worked. 172.31.35.57

Thu 24 Dec 2020 11:37:36 PM CET Read worked. | Write worked. 172.31.35.57

Thu 24 Dec 2020 11:37:38 PM CET Read worked. | Write worked. 172.31.35.57

Thu 24 Dec 2020 11:37:40 PM CET Read worked. | Write worked. 172.31.35.57

Thu 24 Dec 2020 11:37:42 PM CET Read worked. | Write worked. 172.31.35.57

Thu 24 Dec 2020 11:37:44 PM CET Read worked. | Write worked. 172.31.35.57
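
The 19-second figure can be recomputed from a log like the one above. Here is a small sketch of how; write_outage_seconds is a name I invented, and it assumes 12-hour timestamps that all fall on the same day, which holds for these logs.

```shell
#!/bin/bash
# write_outage_seconds: read wootest-style log lines on stdin and print the
# seconds between the last good write before an outage and the first good
# write after it.
write_outage_seconds() {
    awk '
        # Field 5 is HH:MM:SS and field 6 is AM/PM; return seconds since midnight.
        function secs(hms, ampm,    t, h) {
            split(hms, t, ":")
            h = t[1] % 12
            if (ampm == "PM") h += 12
            return h * 3600 + t[2] * 60 + t[3]
        }
        /Write worked/ {
            if (failed) { print secs($5, $6) - last_ok; exit }
            last_ok = secs($5, $6)
        }
        /Write failed/ { failed = 1 }
    '
}

# Demo with a few lines lifted from the switchover log.
write_outage_seconds <<'EOF'
Thu 24 Dec 2020 11:37:11 PM CET Read worked. | Write worked. 185.20.12.24
Thu 24 Dec 2020 11:37:12 PM CET Read worked. | Write failed. 185.20.12.24
Thu 24 Dec 2020 11:37:27 PM CET Read worked. | Write failed. 172.31.35.57
Thu 24 Dec 2020 11:37:30 PM CET Read worked. | Write worked. 172.31.35.57
EOF
```

Piping the full output of ./wootest.sh through the function works the same way, since lines without a write result are simply ignored.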

Yes, I _am_ an idiot. But not for the reasons that I thought…

Installing Zabbix Agent on RHEL was difficult for me somehow

Failover test

Okay, so now for the (hopefully) rarer case of the primary suddenly going down:

Sat 26 Dec 2020 06:44:18 PM CET Read worked. | Write worked. 172.31.35.57

Sat 26 Dec 2020 06:44:21 PM CET Read worked. | Write worked. 172.31.35.57

Sat 26 Dec 2020 06:44:23 PM CET Read worked. | Write worked. 172.31.35.57

Sat 26 Dec 2020 06:44:26 PM CET Read worked. | Write worked. 172.31.35.57

Sat 26 Dec 2020 06:44:27 PM CET Read failed. | Write failed. <html>
<head><title>521 Origin Down</title></head>
<body bgcolor="white">
<center><h1>521 Origin Down</h1></center>
<hr><center>cloudflare-nginx</center>
</body>
</html>

Sat 26 Dec 2020 06:44:28 PM CET Read failed. | Write failed. <html>
<head><title>521 Origin Down</title></head>
<body bgcolor="white">
<center><h1>521 Origin Down</h1></center>
<hr><center>cloudflare-nginx</center>
</body>
</html>

Sat 26 Dec 2020 06:45:30 PM CET Read failed. | Write failed. 185.20.12.24

Sat 26 Dec 2020 06:45:32 PM CET Read worked. | Write failed. 185.20.12.24

Sat 26 Dec 2020 06:45:33 PM CET Read worked. | Write failed. 185.20.12.24

Sat 26 Dec 2020 06:45:35 PM CET Read worked. | Write failed. 185.20.12.24

Sat 26 Dec 2020 06:45:36 PM CET Read worked. | Write failed. 185.20.12.24

Sat 26 Dec 2020 06:45:38 PM CET Read worked. | Write failed. 185.20.12.24

Sat 26 Dec 2020 06:45:39 PM CET Read worked. | Write failed. 185.20.12.24

Sat 26 Dec 2020 06:45:40 PM CET Read worked. | Write failed. 185.20.12.24

Sat 26 Dec 2020 06:45:42 PM CET Read worked. | Write failed. 185.20.12.24

Sat 26 Dec 2020 06:45:43 PM CET Read worked. | Write failed. 185.20.12.24

Sat 26 Dec 2020 06:45:45 PM CET Read worked. | Write failed. 185.20.12.24

Sat 26 Dec 2020 06:45:46 PM CET Read worked. | Write failed. 185.20.12.24

Sat 26 Dec 2020 06:45:48 PM CET Read worked. | Write failed. 185.20.12.24

Sat 26 Dec 2020 06:45:49 PM CET Read worked. | Write failed. 185.20.12.24

Sat 26 Dec 2020 06:45:50 PM CET Read worked. | Write failed. 185.20.12.24

Sat 26 Dec 2020 06:45:52 PM CET Read worked. | Write failed. 185.20.12.24

Sat 26 Dec 2020 06:45:53 PM CET Read worked. | Write failed. 185.20.12.24

Sat 26 Dec 2020 06:45:55 PM CET Read worked. | Write failed. 185.20.12.24

Sat 26 Dec 2020 06:45:56 PM CET Read worked. | Write worked. 185.20.12.24

Sat 26 Dec 2020 06:45:58 PM CET Read worked. | Write worked. 185.20.12.24

Sat 26 Dec 2020 06:45:59 PM CET Read worked. | Write worked. 185.20.12.24

Sat 26 Dec 2020 06:46:00 PM CET Read worked. | Write worked. 185.20.12.24

Sat 26 Dec 2020 06:46:02 PM CET Read worked. | Write worked. 185.20.12.24

Sat 26 Dec 2020 06:46:03 PM CET Read worked. | Write worked. 185.20.12.24

Sat 26 Dec 2020 06:46:04 PM CET Read worked. | Write worked. 185.20.12.24

Sat 26 Dec 2020 06:46:06 PM CET Read worked. | Write worked. 185.20.12.24

Sat 26 Dec 2020 06:46:07 PM CET Read worked. | Write worked. 185.20.12.24

Important numbers:

Sat 26 Dec 2020 06:44:26 PM CET Read worked. | Write worked. 172.31.35.57
Sat 26 Dec 2020 06:45:32 PM CET Read worked. | Write failed. 185.20.12.24
Sat 26 Dec 2020 06:45:56 PM CET Read worked. | Write worked. 185.20.12.24

So about a minute (66 seconds) with the site down for both reads and writes, and another 24 seconds during which only writes didn’t work. Not too shabby. And Zabbix is suitably concerned:

Crashed server rehabilitation

Problems getting a crashed primary to sync with backup:

[root@ip-172-31-35-57 ~]# mysql -e "SHOW SLAVE STATUS\G;"
*************************** 1. row ***************************
                Slave_IO_State:
                   Master_Host: backup
                   Master_User: replication_user
                   Master_Port: 3306
                 Connect_Retry: 60
               Master_Log_File:
           Read_Master_Log_Pos: 4
                Relay_Log_File: woo1-relay-log.000001
                 Relay_Log_Pos: 4
         Relay_Master_Log_File:
              Slave_IO_Running: No
             Slave_SQL_Running: Yes
               Replicate_Do_DB:
           Replicate_Ignore_DB:
            Replicate_Do_Table:
        Replicate_Ignore_Table:
       Replicate_Wild_Do_Table:
   Replicate_Wild_Ignore_Table:
                    Last_Errno: 0
                    Last_Error:
                  Skip_Counter: 0
           Exec_Master_Log_Pos: 4
               Relay_Log_Space: 256
               Until_Condition: None
                Until_Log_File:
                 Until_Log_Pos: 0
            Master_SSL_Allowed: No
            Master_SSL_CA_File:
            Master_SSL_CA_Path:
               Master_SSL_Cert:
             Master_SSL_Cipher:
                Master_SSL_Key:
         Seconds_Behind_Master: NULL
 Master_SSL_Verify_Server_Cert: No
                 Last_IO_Errno: 1236
                 Last_IO_Error: Got fatal error 1236 from master when reading data from binary log: 'Error: connecting slave requested to start from GTID 0-1-58257, which is not in the master's binlog. Since the master's binlog contains GTIDs with higher sequence numbers, it probably means that the slave has diverged due to executing extra erroneous transactions'
                Last_SQL_Errno: 0
                Last_SQL_Error:
   Replicate_Ignore_Server_Ids:
              Master_Server_Id: 2
                Master_SSL_Crl:
            Master_SSL_Crlpath:
                    Using_Gtid: Current_Pos
                   Gtid_IO_Pos: 0-1-58257
       Replicate_Do_Domain_Ids:
   Replicate_Ignore_Domain_Ids:
                 Parallel_Mode: conservative
                     SQL_Delay: 0
           SQL_Remaining_Delay: NULL
       Slave_SQL_Running_State: Slave has read all relay log; waiting for the slave I/O thread to update it
              Slave_DDL_Groups: 2
Slave_Non_Transactional_Groups: 14585
    Slave_Transactional_Groups: 8746

Oh, dear! What have we missed? I.e. what happened on primary when it was disconnected from the backup and _should_ have been read-only?

[root@ip-172-31-35-57 ~]# mysqlbinlog /var/lib/mysql/woo1-bin.000054 | less
<snip>
#201225 17:32:10 server id 1  end_log_pos 8224049 CRC32 0x780272cf      GTID 0-1-58257

/*!100001 SET @@session.gtid_seq_no=58257*//*!*/;
START TRANSACTION
/*!*/;
# at 8224049
#201225 17:32:10 server id 1  end_log_pos 8224198 CRC32 0x07e3910e      Query   thread_id=3456  exec_time=0     error_code=0
SET TIMESTAMP=1608917530/*!*/;
DELETE FROM `wp9k_options` WHERE `option_name` = '_transient_doing_cron'
/*!*/;
# at 8224198
#201225 17:32:10 server id 1  end_log_pos 8224281 CRC32 0xb1ecec93      Query   thread_id=3456  exec_time=0     error_code=0
SET TIMESTAMP=1608917530/*!*/;
COMMIT
/*!*/;
DELIMITER ;
# End of log file
ROLLBACK /* added by mysqlbinlog */;
/*!50003 SET COMPLETION_TYPE=@OLD_COMPLETION_TYPE*/;
/*!50530 SET @@SESSION.PSEUDO_SLAVE_MODE=0*/;

Oh, DELETE _transient_doing_cron… Well, that’s no biggie. I was feigning concern that lots of orders to my entirely not-real web shop had somehow been written to the wrong database.
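
It helps to know that a MariaDB GTID has the form domain-server_id-sequence. A tiny sketch (describe_gtid is just a name I picked) that pulls the offending position apart:

```shell
#!/bin/bash
# describe_gtid: split a MariaDB GTID of the form domain-server_id-sequence.
describe_gtid() {
    local domain server_id seq
    IFS=- read -r domain server_id seq <<< "$1"
    echo "domain=$domain server_id=$server_id seq=$seq"
}

# The position the crashed primary asked to resume from.
describe_gtid 0-1-58257
```

The server_id part is 1, which matches the "server id 1" in the binlog entry, while SHOW SLAVE STATUS reports Master_Server_Id: 2. So the transaction was written locally on primary and never reached backup, which is why backup can't find it in its binlog.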

Well MariaDB is very stubborn about these sorts of things. The only way I know how to get primary back in sync with backup is:

root@woo2:~# mysqldump -A --master-data > backup_2020-12-25_1854.sql
root@woo2:~# scp backup_2020-12-25_1854.sql primary:/root
backup_2020-12-25_1854.sql

Back to woo1 (primary):

[root@ip-172-31-35-57 ~]# systemctl stop mariadb
[root@ip-172-31-35-57 ~]# rm -rf /var/lib/mysql/* # <--- EXTREMELY DANGEROUS IF YOU DON'T KNOW PRECISELY WHAT YOU ARE DOING. Lost hours of work doing this on the wrong server once...
[root@ip-172-31-35-57 ~]# mysql_install_db
Installing MariaDB/MySQL system tables in '/var/lib/mysql' ...
OK

To start mysqld at boot time you have to copy
support-files/mysql.server to the right place for your system


PLEASE REMEMBER TO SET A PASSWORD FOR THE MariaDB root USER !
To do so, start the server, then issue the following commands:

'/usr/bin/mysqladmin' -u root password 'new-password'
'/usr/bin/mysqladmin' -u root -h ip-172-31-35-57.eu-north-1.compute.internal password 'new-password'

Alternatively you can run:
'/usr/bin/mysql_secure_installation'

which will also give you the option of removing the test
databases and anonymous user created by default.  This is
strongly recommended for production servers.

See the MariaDB Knowledgebase at http://mariadb.com/kb or the
MySQL manual for more instructions.

You can start the MariaDB daemon with:
cd '/usr' ; /usr/bin/mysqld_safe --datadir='/var/lib/mysql'

You can test the MariaDB daemon with mysql-test-run.pl
cd '/usr/mysql-test' ; perl mysql-test-run.pl

Please report any problems at http://mariadb.org/jira

The latest information about MariaDB is available at http://mariadb.org/.
You can find additional information about the MySQL part at:
http://dev.mysql.com
Consider joining MariaDB's strong and vibrant community:
https://mariadb.org/get-involved/

[root@ip-172-31-35-57 ~]# systemctl restart mariadb
Job for mariadb.service failed because the control process exited with error code.
See "systemctl status mariadb.service" and "journalctl -xe" for details.
[root@ip-172-31-35-57 ~]# journalctl -u mariadb
-- Logs begin at Wed 2020-12-23 00:40:59 UTC, end at Fri 2020-12-25 17:55:55 UTC. --
Dec 23 00:44:22 ip-172-31-35-57.eu-north-1.compute.internal systemd[1]: Starting MariaDB 10.3 database server...
Dec 23 00:44:22 ip-172-31-35-57.eu-north-1.compute.internal mysql-check-socket[1506]: Socket file /var/lib/mysql/mysql.sock exists.
Dec 23 00:44:22 ip-172-31-35-57.eu-north-1.compute.internal mysql-check-socket[1506]: No process is using /var/lib/mysql/mysql.sock, which means it is a garbage, so it will be removed automatically.
Dec 23 00:44:22 ip-172-31-35-57.eu-north-1.compute.internal mysql-prepare-db-dir[1535]: Database MariaDB is probably initialized in /var/lib/mysql already, nothing is done.
Dec 23 00:44:22 ip-172-31-35-57.eu-north-1.compute.internal mysql-prepare-db-dir[1535]: If this is not the case, make sure the /var/lib/mysql is empty before running mysql-prepare-db-dir.
Dec 23 00:44:23 ip-172-31-35-57.eu-north-1.compute.internal mysqld[1573]: 2020-12-23  0:44:23 0 [Note] /usr/libexec/mysqld (mysqld 10.3.27-MariaDB-log) starting as process 1573 ...
Dec 23 00:44:23 ip-172-31-35-57.eu-north-1.compute.internal systemd[1]: Started MariaDB 10.3 database server.
Dec 25 17:55:26 ip-172-31-35-57.eu-north-1.compute.internal systemd[1]: Stopping MariaDB 10.3 database server...
Dec 25 17:55:28 ip-172-31-35-57.eu-north-1.compute.internal systemd[1]: mariadb.service: Succeeded.
Dec 25 17:55:28 ip-172-31-35-57.eu-north-1.compute.internal systemd[1]: Stopped MariaDB 10.3 database server.
Dec 25 17:55:55 ip-172-31-35-57.eu-north-1.compute.internal systemd[1]: Starting MariaDB 10.3 database server...
Dec 25 17:55:55 ip-172-31-35-57.eu-north-1.compute.internal mysql-prepare-db-dir[66558]: Database MariaDB is probably initialized in /var/lib/mysql already, nothing is done.
Dec 25 17:55:55 ip-172-31-35-57.eu-north-1.compute.internal mysql-prepare-db-dir[66558]: If this is not the case, make sure the /var/lib/mysql is empty before running mysql-prepare-db-dir.
Dec 25 17:55:55 ip-172-31-35-57.eu-north-1.compute.internal mysqld[66596]: 2020-12-25 17:55:55 0 [Note] /usr/libexec/mysqld (mysqld 10.3.27-MariaDB-log) starting as process 66596 ...
Dec 25 17:55:55 ip-172-31-35-57.eu-north-1.compute.internal mysqld[66596]: /usr/libexec/mysqld: Can't create file './woo1.err' (errno: 13 "Permission denied")
Dec 25 17:55:55 ip-172-31-35-57.eu-north-1.compute.internal mysqld[66596]: 2020-12-25 17:55:55 0 [ERROR] mysqld: File './woo1-bin.index' not found (Errcode: 13 "Permission denied")
Dec 25 17:55:55 ip-172-31-35-57.eu-north-1.compute.internal mysqld[66596]: 2020-12-25 17:55:55 0 [ERROR] Aborting
Dec 25 17:55:55 ip-172-31-35-57.eu-north-1.compute.internal systemd[1]: mariadb.service: Main process exited, code=exited, status=1/FAILURE
Dec 25 17:55:55 ip-172-31-35-57.eu-north-1.compute.internal systemd[1]: mariadb.service: Failed with result 'exit-code'.
Dec 25 17:55:55 ip-172-31-35-57.eu-north-1.compute.internal systemd[1]: Failed to start MariaDB 10.3 database server.
[root@ip-172-31-35-57 ~]# ls -lh /var/lib/
total 4.0K
drwxr-xr-x. 2 root    root      85 Dec 18 20:17 alternatives
drwxr-xr-x. 3 root    root     209 Oct 31 05:07 authselect
drwxr-xr-x. 2 chrony  chrony    19 Dec 25 17:49 chrony
drwxr-xr-x. 8 root    root     105 Dec 23 00:41 cloud
drwx------. 2 apache  apache     6 Jun 15  2020 dav
drwxr-xr-x. 2 root    root      24 Oct 31 05:02 dbus
drwxr-xr-x. 2 root    root       6 May  7  2020 dhclient
drwxr-xr-x. 4 root    root     115 Dec 25 16:32 dnf
drwxr-xr-x. 2 root    root       6 Apr 23  2020 games
drwx------. 2 apache  apache     6 Jun 15  2020 httpd
drwxr-xr-x. 2 root    root       6 Aug  4 12:03 initramfs
drwxr-xr-x. 2 root    root      30 Dec 25 03:51 logrotate
drwxr-xr-x. 2 root    root       6 Apr 23  2020 misc
drwxr-xr-x. 5 mysql   mysql    264 Dec 25 17:55 mysql
drwxr-xr-x. 4 root    root      45 Dec 18 22:13 net-snmp
drwx------. 2 root    root     169 Dec 25 17:51 NetworkManager
drwxrwx---. 3 nginx   root      33 Dec 18 23:20 nginx
drwxr-xr-x. 2 root    root       6 Aug 12  2018 os-prober
drwxr-xr-x. 6 root    root      68 Dec 18 21:37 php
drwxr-x---. 3 root    polkitd   28 Oct 31 05:02 polkit-1
drwx------. 2 root    root       6 Oct 31 05:02 portables
drwx------. 2 root    root       6 Oct 31 05:02 private
drwxr-x---. 6 root    root      71 Oct 31 05:03 rhsm
drwxr-xr-x. 2 root    root    4.0K Dec 23 00:51 rpm
drwxr-xr-x. 3 root    root      21 Dec 18 20:30 rpm-state
drwx------. 2 root    root      29 Dec 25 17:55 rsyslog
drwxr-xr-x. 5 root    root      46 Oct 31 05:03 selinux
drwxr-xr-x. 9 root    root     105 Oct 31 05:03 sss
drwxr-xr-x. 5 root    root      70 Nov  6 13:06 systemd
drwx------. 2 tss     tss        6 Jun 10  2019 tpm
drwxr-xr-x. 2 root    root       6 Oct  2 16:12 tuned
drwxr-xr-x. 2 unbound unbound   22 Dec 25 00:00 unbound
[root@ip-172-31-35-57 ~]# ls -lh /var/lib/mysql/
total 109M
-rw-rw---- 1 root root  16K Dec 25 17:55 aria_log.00000001
-rw-rw---- 1 root root   52 Dec 25 17:55 aria_log_control
-rw-rw---- 1 root root  972 Dec 25 17:55 ib_buffer_pool
-rw-rw---- 1 root root  12M Dec 25 17:55 ibdata1
-rw-rw---- 1 root root  48M Dec 25 17:55 ib_logfile0
-rw-rw---- 1 root root  48M Dec 25 17:55 ib_logfile1
drwx------ 2 root root 4.0K Dec 25 17:55 mysql
drwx------ 2 root root   20 Dec 25 17:55 performance_schema
drwx------ 2 root root   20 Dec 25 17:55 test
-rw-rw---- 1 root root  28K Dec 25 17:55 woo1-bin.000001
-rw-rw---- 1 root root   18 Dec 25 17:55 woo1-bin.index
-rw-rw---- 1 root root    7 Dec 25 17:55 woo1-bin.state
-rw-rw---- 1 root root  642 Dec 25 17:55 woo1.err
[root@ip-172-31-35-57 ~]# chown -R mysql:mysql /var/lib/mysql/
[root@ip-172-31-35-57 ~]# systemctl restart mariadb
[root@ip-172-31-35-57 ~]# journalctl -u mariadb
-- Logs begin at Wed 2020-12-23 00:40:59 UTC, end at Fri 2020-12-25 17:56:22 UTC. --
<snip>
Dec 25 17:56:22 ip-172-31-35-57.eu-north-1.compute.internal systemd[1]: Starting MariaDB 10.3 database server...
Dec 25 17:56:22 ip-172-31-35-57.eu-north-1.compute.internal mysql-prepare-db-dir[66636]: Database MariaDB is probably initialized in /var/lib/mysql already, nothing is done.
Dec 25 17:56:22 ip-172-31-35-57.eu-north-1.compute.internal mysql-prepare-db-dir[66636]: If this is not the case, make sure the /var/lib/mysql is empty before running mysql-prepare-db-dir.
Dec 25 17:56:22 ip-172-31-35-57.eu-north-1.compute.internal mysqld[66675]: 2020-12-25 17:56:22 0 [Note] /usr/libexec/mysqld (mysqld 10.3.27-MariaDB-log) starting as process 66675 ...
Dec 25 17:56:22 ip-172-31-35-57.eu-north-1.compute.internal mysql-check-upgrade[66707]: The datadir located at /var/lib/mysql needs to be upgraded using 'mysql_upgrade' tool. This can be done using the following steps:
Dec 25 17:56:22 ip-172-31-35-57.eu-north-1.compute.internal mysql-check-upgrade[66707]:   1. Back-up your data before with 'mysql_upgrade'
Dec 25 17:56:22 ip-172-31-35-57.eu-north-1.compute.internal mysql-check-upgrade[66707]:   2. Start the database daemon using 'service mariadb start'
Dec 25 17:56:22 ip-172-31-35-57.eu-north-1.compute.internal mysql-check-upgrade[66707]:   3. Run 'mysql_upgrade' with a database user that has sufficient privileges
Dec 25 17:56:22 ip-172-31-35-57.eu-north-1.compute.internal mysql-check-upgrade[66707]: Read more about 'mysql_upgrade' usage at:
Dec 25 17:56:22 ip-172-31-35-57.eu-north-1.compute.internal mysql-check-upgrade[66707]: https://mariadb.com/kb/en/mariadb/documentation/sql-commands/table-commands/mysql_upgrade/
Dec 25 17:56:22 ip-172-31-35-57.eu-north-1.compute.internal systemd[1]: Started MariaDB 10.3 database server.
[root@ip-172-31-35-57 ~]# mysql_upgrade
Phase 1/7: Checking and upgrading mysql database
Processing databases
mysql
mysql.column_stats                                 OK
mysql.columns_priv                                 OK
mysql.db                                           OK
mysql.event                                        OK
mysql.func                                         OK
mysql.gtid_slave_pos                               OK
mysql.help_category                                OK
mysql.help_keyword                                 OK
mysql.help_relation                                OK
mysql.help_topic                                   OK
mysql.host                                         OK
mysql.index_stats                                  OK
mysql.innodb_index_stats                           OK
mysql.innodb_table_stats                           OK
mysql.plugin                                       OK
mysql.proc                                         OK
mysql.procs_priv                                   OK
mysql.proxies_priv                                 OK
mysql.roles_mapping                                OK
mysql.servers                                      OK
mysql.table_stats                                  OK
mysql.tables_priv                                  OK
mysql.time_zone                                    OK
mysql.time_zone_leap_second                        OK
mysql.time_zone_name                               OK
mysql.time_zone_transition                         OK
mysql.time_zone_transition_type                    OK
mysql.transaction_registry                         OK
mysql.user                                         OK
Phase 2/7: Installing used storage engines... Skipped
Phase 3/7: Fixing views
Phase 4/7: Running 'mysql_fix_privilege_tables'
Phase 5/7: Fixing table and database names
Phase 6/7: Checking and upgrading tables
Processing databases
information_schema
performance_schema
test
Phase 7/7: Running 'FLUSH PRIVILEGES'
OK
[root@ip-172-31-35-57 ~]# mysql -e "SHOW SLAVE STATUS\G;"
[root@ip-172-31-35-57 ~]# mysql < backup_2020-12-25_1854.sql
[root@ip-172-31-35-57 ~]# grep gtid_slave_pos backup_2020-12-25_1854.sql  | head -1
-- SET GLOBAL gtid_slave_pos='0-2-61170';

Welcome to the MariaDB monitor.  Commands end with ; or \g.
Your MariaDB connection id is 32
Server version: 10.3.27-MariaDB-log MariaDB Server

Copyright (c) 2000, 2018, Oracle, MariaDB Corporation Ab and others.

Type 'help;' or '\h' for help. Type '\c' to clear the current input statement.

MariaDB [(none)]> GRANT ALL PRIVILEGES ON `johante2_wp508`.* TO `johante2_wp508`@`localhost`;
Query OK, 0 rows affected (0.001 sec)

MariaDB [(none)]> GRANT USAGE,REPLICATION CLIENT,PROCESS,SHOW DATABASES,SHOW VIEW ON *.* TO 'zbx_monitor'@'%';
Query OK, 0 rows affected (0.000 sec)


MariaDB [(none)]> SET GLOBAL gtid_slave_pos='0-2-61170';
Query OK, 0 rows affected (0.021 sec)

MariaDB [(none)]> Bye
[root@ip-172-31-35-57 ~]#
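
Copying the GTID position out of the dump by hand is an easy place to make a typo. A small helper along these lines could extract it instead; dump_gtid_pos is a made-up name, and the demo uses a one-line stand-in file rather than the real dump.

```shell
#!/bin/bash
# dump_gtid_pos: print the GTID position recorded in a dump file made with
# mysqldump --master-data (the first gtid_slave_pos line wins).
dump_gtid_pos() {
    sed -n "s/.*gtid_slave_pos='\([^']*\)'.*/\1/p" "$1" | head -n 1
}

# Demo against a scratch file containing the line grep found above.
demo_dump="$(mktemp)"
echo "-- SET GLOBAL gtid_slave_pos='0-2-61170';" > "$demo_dump"
pos="$(dump_gtid_pos "$demo_dump")"
echo "SET GLOBAL gtid_slave_pos='${pos}';"
```

The echoed statement is the same one typed into the MariaDB prompt in the transcript.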

OK, let’s try to get primary to replicate from backup now.



MariaDB [(none)]> CHANGE MASTER TO master_host="backup", master_port=3306, master_user="replication_user", master_password="REPLICATION_PASSWORD", master_use_gtid=slave_pos;
Query OK, 0 rows affected (0.005 sec)

MariaDB [(none)]> START SLAVE;
Query OK, 0 rows affected (0.009 sec)

MariaDB [(none)]> SHOW SLAVE STATUS;
+----------------------------------+-------------+------------------+-------------+---------------+-----------------+---------------------+-----------------------+---------------+-----------------------+------------------+-------------------+-----------------+---------------------+--------------------+------------------------+-------------------------+-----------------------------+------------+------------+--------------+---------------------+-----------------+-----------------+----------------+---------------+--------------------+--------------------+--------------------+-----------------+-------------------+----------------+-----------------------+-------------------------------+---------------+---------------+----------------+----------------+-----------------------------+------------------+----------------+--------------------+------------+-------------+-------------------------+-----------------------------+---------------+-----------+---------------------+-----------------------------------------------------------------------------+------------------+--------------------------------+----------------------------+
| Slave_IO_State                   | Master_Host | Master_User      | Master_Port | Connect_Retry | Master_Log_File | Read_Master_Log_Pos | Relay_Log_File        | Relay_Log_Pos | Relay_Master_Log_File | Slave_IO_Running | Slave_SQL_Running | Replicate_Do_DB | Replicate_Ignore_DB | Replicate_Do_Table | Replicate_Ignore_Table | Replicate_Wild_Do_Table | Replicate_Wild_Ignore_Table | Last_Errno | Last_Error | Skip_Counter | Exec_Master_Log_Pos | Relay_Log_Space | Until_Condition | Until_Log_File | Until_Log_Pos | Master_SSL_Allowed | Master_SSL_CA_File | Master_SSL_CA_Path | Master_SSL_Cert | Master_SSL_Cipher | Master_SSL_Key | Seconds_Behind_Master | Master_SSL_Verify_Server_Cert | Last_IO_Errno | Last_IO_Error | Last_SQL_Errno | Last_SQL_Error | Replicate_Ignore_Server_Ids | Master_Server_Id | Master_SSL_Crl | Master_SSL_Crlpath | Using_Gtid | Gtid_IO_Pos | Replicate_Do_Domain_Ids | Replicate_Ignore_Domain_Ids | Parallel_Mode | SQL_Delay | SQL_Remaining_Delay | Slave_SQL_Running_State                                                     | Slave_DDL_Groups | Slave_Non_Transactional_Groups | Slave_Transactional_Groups |
+----------------------------------+-------------+------------------+-------------+---------------+-----------------+---------------------+-----------------------+---------------+-----------------------+------------------+-------------------+-----------------+---------------------+--------------------+------------------------+-------------------------+-----------------------------+------------+------------+--------------+---------------------+-----------------+-----------------+----------------+---------------+--------------------+--------------------+--------------------+-----------------+-------------------+----------------+-----------------------+-------------------------------+---------------+---------------+----------------+----------------+-----------------------------+------------------+----------------+--------------------+------------+-------------+-------------------------+-----------------------------+---------------+-----------+---------------------+-----------------------------------------------------------------------------+------------------+--------------------------------+----------------------------+
| Waiting for master to send event | backup      | replication_user |        3306 |            60 | woo2-bin.000002 |             6356025 | woo1-relay-log.000002 |         95872 | woo2-bin.000002       | Yes              | Yes               |                 |                     |                    |                        |                         |                             |          0 |            |            0 |             6356025 |           96180 | None            |                |             0 | No                 |                    |                    |                 |                   |                |                     0 | No                            |             0 |               |              0 |                |                             |                2 |                |                    | Slave_Pos  | 0-2-61237   |                         |                             | conservative  |         0 |                NULL | Slave has read all relay log; waiting for the slave I/O thread to update it |                0 |                             40 |                         27 |
+----------------------------------+-------------+------------------+-------------+---------------+-----------------+---------------------+-----------------------+---------------+-----------------------+------------------+-------------------+-----------------+---------------------+--------------------+------------------------+-------------------------+-----------------------------+------------+------------+--------------+---------------------+-----------------+-----------------+----------------+---------------+--------------------+--------------------+--------------------+-----------------+-------------------+----------------+-----------------------+-------------------------------+---------------+---------------+----------------+----------------+-----------------------------+------------------+----------------+--------------------+------------+-------------+-------------------------+-----------------------------+---------------+-----------+---------------------+-----------------------------------------------------------------------------+------------------+--------------------------------+----------------------------+
1 row in set (0.000 sec)

MariaDB [(none)]> Bye

From now on we can start replication by saying master_use_gtid=current_pos; at the end but this time around we had to use slave_pos as the starting point. Because reasons.

Failover test

Whatever system you set up for high availability – even a less janky one – needs to be tested a lot before being used live. Example: did you know that Amazon EC2 will assign a VM a different IP address after it was Stopped? Well I sure didn’t! Nor did I realize that Cloudflare’s Origin Server editing UI can’t handle spaces:

Took me a while to figure out I had copied the string “13.49.145.244 “.

The main issue with this test was that I had forgotten to turn on the failover service on the backup… But that made it clear there should be a Zabbix item counting the number of failover processes running and a trigger to warn if it is zero.

Trigger only if 5 consecutive measurements indicate that the process is down (there can be some noise).
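The "five measurements in a row" rule can be sketched in shell like this. It only illustrates the logic and is not Zabbix configuration; the function name all_zero_last_n is made up for the example.

```bash
#!/bin/bash

# all_zero_last_n N VALUE...
# Succeeds (exit 0) only if there are at least N measurements and the
# last N of them are all zero, i.e. the process count has been zero
# for N polls in a row. Hypothetical helper, not part of Zabbix.
all_zero_last_n() {
    local n="$1"
    shift
    # Not enough history yet: don't alert.
    if [ "$#" -lt "$n" ]; then
        return 1
    fi
    # Drop everything but the last N values.
    shift $(( $# - n ))
    local v
    for v in "$@"; do
        if [ "$v" -ne 0 ]; then
            return 1
        fi
    done
    return 0
}

# On the backup node, one measurement could be taken like this:
#   pgrep -fc monitor_primary.sh
```

A single non-zero value among the last five is enough to hold the alert back, which is what filters out the noise.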

Implementation

So now to the details of how it works. Let’s start with the main script that runs on the backup monitoring the primary:

root@woo2:~/ha_scripts# cat monitor_primary.sh
#!/bin/bash

HTTP_FAIL="0"

while true; do
        mysql -e "SHOW VARIABLES LIKE 'read_only';" | grep -iq off
        if [ $? -eq 0 ]
        then
                # We're apparently accepting writes so we are in the
                # master already, little point in promoting this node.
                exit 1
        fi
        curl --connect-timeout 5 -fks -H "Host: woo.deref.se" https://primary -o /dev/null
        if [ $? -ne 0 ]
        then
                ((HTTP_FAIL=HTTP_FAIL+1))
                echo "Failures: $HTTP_FAIL."
        else
                ((HTTP_FAIL=0))
        fi
        if [ $HTTP_FAIL -gt 3 ]
        then
                ./get_primary_health.sh | grep "healthy" | grep -q "true"
                if [ $? -ne 0 ]
                then

                        # Not sure primary is reachable, but let's try getting it to
                        # stop sending changes to us
                        ssh primary "systemctl stop lsyncd"
                        # If the primary is still ticking along but not reachable
                        # we need to prevent it from overwriting our file system
                        unlink /root/.ssh/authorized_keys
                        ln -s /root/.ssh/authorized_keys_no_primary /root/.ssh/authorized_keys

                        # primary cut off from sending changes to us, we can push changes
                        # in the other direction then. Might work, might not.
                        systemctl start lsyncd

                        # Wait for replication to catch up
                        ./checkslave.sh
                        # We've caught up, stop replicating.
                        mysql -e "STOP SLAVE;"
                        sleep 1
                        mysql -e "RESET SLAVE ALL;"

                        sleep 5
                        # It's now safe to accept writes
                        mysql -e 'SET GLOBAL read_only = OFF;'

                        # Got to tell Cloudflare not to switch back to primary.
                        ./set_backup_alone_in_lb.sh

                        # Let's give it a shot to get the primary to replicate data from us
                        # Started as a background job since it's not critical
                        ./make_primary_slave.sh &
                        exit 0
                else
                        # Okay, if Cloudflare says the server is okay, let's give it another
                        # chance by starting the counting from zero
                        echo "Cloudflare says primary is healthy, even though we failed to reach it. Resetting fail-counter."
                        ((HTTP_FAIL=0))
                fi
        fi
        sleep 5
done

Obviously “curl --connect-timeout 5 -fks -H "Host: woo.deref.se" https://primary -o /dev/null” needs to be adapted to your environment.
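One way to avoid hard-coding that call is to build it from variables. This is a minimal sketch, assuming DOMAIN_NAME is set the way site_vars.sh sets it; PRIMARY_HOST and the function name primary_http_ok are made up for the example.

```bash
#!/bin/bash

# Assumed settings; DOMAIN_NAME matches site_vars.sh, PRIMARY_HOST is
# a hypothetical variable naming the /etc/hosts alias of the primary.
DOMAIN_NAME="${DOMAIN_NAME:-woo.deref.se}"
PRIMARY_HOST="${PRIMARY_HOST:-primary}"

# Same curl call as in monitor_primary.sh, with the host name and
# target taken from variables. Returns curl's exit status.
primary_http_ok() {
    curl --connect-timeout 5 -fks \
        -H "Host: $DOMAIN_NAME" \
        "https://$PRIMARY_HOST" -o /dev/null
}
```

In the monitoring loop the curl line and the following test of $? could then be collapsed into "if ! primary_http_ok".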

There are some other scripts being called from this one, checkslave.sh is a chunk of code I copied straight from ocf:heartbeat:mariadb:

#!/bin/bash

# Swiped from ocf:heartbeat:mariadb resource agent
SLAVESTATUS=$(mysql -e "SHOW SLAVE STATUS;" | wc -l)
if [ $SLAVESTATUS -gt 0 ]; then
        tmpfile="/tmp/processlist.txt"
        while true; do
                mysql -e 'SHOW PROCESSLIST\G' > $tmpfile
                if grep -i 'Has read all relay log' $tmpfile >/dev/null; then
                    echo "MariaDB slave has finished processing relay log"
                    break
                fi
                if ! grep -q 'system user' $tmpfile; then
                    echo "Slave not running - not waiting to finish"
                    break
                fi
                echo "Waiting for MariaDB slave to finish processing relay log"
                sleep 1
        done
fi

Testing whether a MariaDB server is still busy catching up to the master it was replicating from is tricky. If you can get your hands on code that has worked well for other people in a lot of scenarios then you should use that.
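One weakness of the loop in checkslave.sh is that it waits forever if neither condition ever shows up. Here is a sketch of putting an upper bound on the wait, reusing the same grep pattern. MAX_WAIT, wait_until and relay_log_drained are names invented for this example, not part of the resource agent.

```bash
#!/bin/bash

# wait_until CMD...
# Run CMD once a second until it succeeds, giving up after MAX_WAIT
# seconds (hypothetical variable, defaults to 60).
wait_until() {
    local max_wait="${MAX_WAIT:-60}"
    local waited=0
    until "$@"; do
        if [ "$waited" -ge "$max_wait" ]; then
            return 1
        fi
        sleep 1
        waited=$((waited + 1))
    done
    return 0
}

# Succeeds once the process list shows the SQL thread has
# finished the relay log (same pattern as checkslave.sh).
relay_log_drained() {
    mysql -e 'SHOW PROCESSLIST\G' | grep -qi 'Has read all relay log'
}

# Example: wait at most 30 seconds before carrying on.
#   MAX_WAIT=30 wait_until relay_log_drained || echo "Gave up waiting"
```

Whether giving up should abort the failover or let it proceed is a judgment call that depends on how much data loss is acceptable.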

set_backup_alone_in_lb.sh is an API-call to Cloudflare:

#!/bin/bash

source cloudflare_vars.sh
source site_vars.sh

# Make the backup pool the only pool in the Load Balancer
# It takes a few seconds

make_payload()
{
      cat <<EOF
      {
      "description": "",
      "id": "$LB_ID",
      "proxied": true,
      "enabled": true,
      "name": "$DOMAIN_NAME",
      "session_affinity": "none",
      "session_affinity_attributes": {
        "samesite": "Auto",
        "secure": "Auto",
        "drain_duration": 0
      },
      "steering_policy": "off",
      "fallback_pool": "$BACKUP_POOL_ID",
      "default_pools": [
        "$BACKUP_POOL_ID"
      ],
      "pop_pools": {},
      "region_pools": {}
      }
EOF
}

curl -fX PUT "https://api.cloudflare.com/client/v4/zones/$ZONE_ID/load_balancers/$LB_ID" -H "Authorization: Bearer $API_TOKEN" -H "Content-Type:application/json" --data "$(make_payload)"

cloudflare_vars.sh and site_vars.sh are just files assigning the right values to variables used throughout:

# Cloudflare vars

PRIMARY_POOL_ID="5a3ccSECRETSECRETSECRET0797e2d9ce"
BACKUP_POOL_ID="0ef15eSECRETSECRETSECRETf06686ac1e"
LB_ID="93ff3da14adSECRETSECRETSECRETe6ad1ee3d"
CLIENT_ID="a7e9f6SECRETSECRETSECRETb864cb4a5"
ZONE_ID="5e98d4dSECRETSECRETSECRET87cb6ad8"
API_TOKEN="noMamSECRETSECRETSECRETZaO1X"

And:

#!/bin/bash

DOMAIN_NAME="woo.deref.se"
REPLICATION_PASSWORD="SECRETSECRET"

It’s not guaranteed that the primary stopped working because it crashed and won’t restart. So we should at least try getting the primary to take on the role of slave, hence the invocation of make_primary_slave.sh:

#!/bin/bash

source site_vars.sh

# We want the primary to be read only for as much of this process
# as possible because we can't proceed until the backup has caught
# up with all the writes performed on the master
ssh primary "mysql -e 'SET GLOBAL read_only = ON;'"

# Let's try to get the primary to replicate from us
ssh primary "mysql -e 'CHANGE MASTER TO master_host=\"backup\", master_port=3306, master_user=\"replication_user\", master_password=\"$REPLICATION_PASSWORD\", master_use_gtid=current_pos;'"
ssh primary "mysql -e 'START SLAVE;'"

Scripts for switchover

It’s good to sanity check the database of the server we’re trying to promote to master:

#!/bin/bash

tmpfile="/tmp/remoteslavestatus"
ssh $1 "mysql -e 'SHOW SLAVE STATUS\G;'" > $tmpfile

SLAVE_IO_STATE=$(grep -i "slave_io_state" $tmpfile | wc -w)
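# grep exits 0 only if the thread is reported as running
grep -qi "slave_io_running: yes" $tmpfile
SLAVE_IO_RUNNING=$?
grep -qi "slave_sql_running: yes" $tmpfile
SLAVE_SQL_RUNNING=$?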
SLAVE_SECONDS_BEHIND=$(grep -i seconds_behind_master $tmpfile | awk '{print $2}')

if [ $SLAVE_IO_STATE -lt 2 ]
then
        echo "Slave IO State is empty. Is $1 properly synchronized? Bailing out."
        exit 1
else
        echo "Slave IO state is not null. Good."
fi

if [ $SLAVE_IO_RUNNING -ne 0 ]
then
        echo "Slave IO seems not to be running. Bailing out."
        exit 1
else
        echo "Slave SQL running. Good."
fi

if [ $SLAVE_SQL_RUNNING -ne 0 ]
then
        echo "Slave SQL seems not to be running. Bailing out."
        exit 1
else
        echo "Slave SQL running. Good."
fi

if [ "$SLAVE_SECONDS_BEHIND" = "NULL" ]
then
        echo "Slave is not processing relay log. Bailing out."
        exit 1
fi

if [ $SLAVE_SECONDS_BEHIND -gt 5 ]
then
        echo "Slave is more than 5 seconds behind master. Won't have time to catch up! Bailing out."
        exit 1
else
        echo "Slave is reasonably caught up, only $SLAVE_SECONDS_BEHIND s behind."
fi

For instance, let’s check if it’s okay to switch back to the primary after it rebooted and failover occurred:

root@woo2:~/ha_scripts# ./sanity_check.sh primary
Slave IO State is empty. Is primary properly synchronized? Bailing out.

Nope. Not at all. Have to make it replicate changes first:

root@woo2:~/ha_scripts# ./make_primary_slave.sh
root@woo2:~/ha_scripts# ./sanity_check.sh primary
 Slave IO state is not null. Good.
 Slave SQL running. Good.
 Slave SQL running. Good.
 Slave is reasonably caught up, only 0 s behind.
root@woo2:~/ha_scripts#

Let’s just check how things are going with lsyncd:

root@woo2:~/ha_scripts# systemctl status lsyncd
● lsyncd.service - LSB: lsyncd daemon init script
     Loaded: loaded (/etc/init.d/lsyncd; generated)
     Active: active (exited) since Sat 2020-12-26 18:45:50 CET; 28min ago
       Docs: man:systemd-sysv-generator(8)
    Process: 1081833 ExecStart=/etc/init.d/lsyncd start (code=exited, status=0/SUCCESS)

Dec 26 18:45:50 woo2 systemd[1]: Starting LSB: lsyncd daemon init script...
Dec 26 18:45:50 woo2 lsyncd[1081833]:  * Starting synchronization daemon lsyncd
Dec 26 18:45:50 woo2 lsyncd[1081838]: 18:45:50 Normal: --- Startup, daemonizing ---
Dec 26 18:45:50 woo2 lsyncd[1081833]:    ...done.
Dec 26 18:45:50 woo2 lsyncd[1081838]: Normal, --- Startup, daemonizing ---
Dec 26 18:45:50 woo2 systemd[1]: Started LSB: lsyncd daemon init script.
Dec 26 18:45:50 woo2 lsyncd[1081839]: Normal, recursive startup rsync: /var/www/html/ -> primary:/var/www/html/
Dec 26 18:45:52 woo2 lsyncd[1081839]: Error, Temporary or permanent failure on startup of "/var/www/html/". Terminating since "insist" is not set.
root@woo2:~/ha_scripts# sudo -u www-data date >> /var/www/html/testfile
root@woo2:~/ha_scripts# systemctl restart lsyncd

The test file didn’t show up on primary, so the error message was indeed right about lsyncd having given up on synchronizing to primary. A restart did the trick. So let’s switch over using this script called demote_backup.sh:

#!/bin/bash

source site_vars.sh

ssh primary "systemctl status mariadb"

# If either we couldn't reach primary or mariadb was down
# then we really shouldn't switch over
if [ $? -ne 0 ];
then
        echo "Error! Couldn't verify that primary has MariaDB running! Bailing out."
        exit 1
fi

curl -fks -H "Host: woo.deref.se" https://primary/

if [ $? -ne 0 ];
then
        echo "Error! Couldn't verify that primary has Nginx running! Bailing out."
        exit 1
fi

mysql -e "SET GLOBAL read_only = ON;"
systemctl stop lsyncd

# Wait for replication to catch up, if necessary
ssh primary "./checkslave.sh"

# Promote primary in the Load Balancer
# It takes a few seconds
./set_primary_first_in_lb.sh

# Just in case there was replication active on primary
ssh primary "mysql -e 'STOP SLAVE;'"
sleep 1
ssh primary "mysql -e 'RESET SLAVE ALL;'"

# Time for us to stop reading updates from primary
mysql -e "STOP SLAVE;"


# Let's allow primary's SSH key so lsyncd will work
unlink /root/.ssh/authorized_keys
ln -s /root/.ssh/authorized_keys_primary_allowed /root/.ssh/authorized_keys

# primary can start accepting writes now that it's no longer replicating
ssh primary "mysql -e 'SET GLOBAL read_only = OFF;'"
ssh primary "systemctl start lsyncd"

# Time for us to start replicating
mysql -e "CHANGE MASTER TO master_host='primary', master_port=3306, master_user='replication_user', master_password='$REPLICATION_PASSWORD', master_use_gtid=current_pos;"
mysql -e "START SLAVE;"

Again, “curl -fks -H “Host: woo.deref.se” https://primary/” isn’t neatly generated based on the contents of the _vars-files but whatever. Let’s see if it works out. I’ll start the same timing script as before on another server.

cjp@util:~$ ./wootest.sh | tee test_woo_scripted_switchover6.txt
Sat 26 Dec 2020 07:18:28 PM CET Read worked. | Write worked. 185.20.12.24

Sat 26 Dec 2020 07:18:30 PM CET Read worked. | Write worked. 185.20.12.24

Sat 26 Dec 2020 07:18:31 PM CET Read worked. | Write worked. 185.20.12.24

Sat 26 Dec 2020 07:18:33 PM CET Read worked. | Write worked. 185.20.12.24

Sat 26 Dec 2020 07:18:34 PM CET Read worked. | Write worked. 185.20.12.24

Sat 26 Dec 2020 07:18:36 PM CET Read worked. | Write worked. 185.20.12.24

Sat 26 Dec 2020 07:18:37 PM CET Read worked. | Write failed. 185.20.12.24

Sat 26 Dec 2020 07:18:39 PM CET Read worked. | Write failed. 185.20.12.24

Sat 26 Dec 2020 07:18:41 PM CET Read worked. | Write worked. 172.31.35.57

Sat 26 Dec 2020 07:18:43 PM CET Read worked. | Write worked. 172.31.35.57

Sat 26 Dec 2020 07:18:44 PM CET Read worked. | Write worked. 172.31.35.57

Sat 26 Dec 2020 07:18:46 PM CET Read worked. | Write worked. 172.31.35.57

Boy, that was fast.
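To put a number on "fast", the log can be summarized with a few lines of awk. This is a sketch that assumes the log format shown above, where the time of day is the fifth whitespace-separated field; the function name write_outage is made up.

```bash
#!/bin/bash

# write_outage: read wootest.sh-style log lines on stdin and print
# the number of failed writes plus the first and last failure time.
# Assumes the time of day is the fifth whitespace-separated field.
write_outage() {
    awk '/Write failed/ {
             if (count == 0) first = $5
             last = $5
             count++
         }
         END {
             if (count > 0)
                 printf "%d failed writes between %s and %s\n", count, first, last
             else
                 print "no failed writes"
         }'
}

# Example:
#   write_outage < test_woo_scripted_switchover6.txt
```

For the excerpt above this reports 2 failed writes between 07:18:37 and 07:18:39. The last good write before that was at 07:18:36 and the first good one after it at 07:18:41, so writes were unavailable for a few seconds at most.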

Output on backup shows that we get a good response from the primary regarding MariaDB (just another safety check), followed by the contents of a call to Nginx on the primary, and finally the new state of the load balancer at Cloudflare:

root@woo2:~/ha_scripts# ./demote_backup.sh
● mariadb.service - MariaDB 10.3 database server
   Loaded: loaded (/usr/lib/systemd/system/mariadb.service; enabled; vendor preset: disabled)
   Active: active (running) since Sat 2020-12-26 18:10:16 UTC; 8min ago
     Docs: man:mysqld(8)
           https://mariadb.com/kb/en/library/systemd/
  Process: 932 ExecStartPost=/usr/libexec/mysql-check-upgrade (code=exited, status=0/SUCCESS)
  Process: 828 ExecStartPre=/usr/libexec/mysql-prepare-db-dir mariadb.service (code=exited, status=0/SUCCESS)
  Process: 779 ExecStartPre=/usr/libexec/mysql-check-socket (code=exited, status=0/SUCCESS)
 Main PID: 870 (mysqld)
   Status: "Taking your SQL requests now..."
    Tasks: 36 (limit: 10984)
   Memory: 120.0M
   CGroup: /system.slice/mariadb.service
           └─870 /usr/libexec/mysqld --basedir=/usr

Dec 26 18:10:15 ip-172-31-35-57.eu-north-1.compute.internal systemd[1]: Starting MariaDB 10.3 database server...
Dec 26 18:10:15 ip-172-31-35-57.eu-north-1.compute.internal mysql-prepare-db-dir[828]: Database MariaDB is probably initialized in /var/lib/mysql already, nothing is done.
Dec 26 18:10:15 ip-172-31-35-57.eu-north-1.compute.internal mysql-prepare-db-dir[828]: If this is not the case, make sure the /var/lib/mysql is empty before running mysql-prepare-db-dir.
Dec 26 18:10:15 ip-172-31-35-57.eu-north-1.compute.internal mysqld[870]: 2020-12-26 18:10:15 0 [Note] /usr/libexec/mysqld (mysqld 10.3.27-MariaDB-log) starting as process 870 ...
Dec 26 18:10:16 ip-172-31-35-57.eu-north-1.compute.internal systemd[1]: Started MariaDB 10.3 database server.

<huge html blog deleted>

MariaDB slave has finished processing relay log
{
  "result": {
    "description": "",
    "created_on": "0001-01-01T00:00:00Z",
    "modified_on": "2020-12-26T18:18:38.282592Z",
    "id": "93ff3SECRETSECRETSECRET9e6ad1ee3d",
    "ttl": 30,
    "proxied": true,
    "enabled": true,
    "name": "woo.deref.se",
    "session_affinity": "none",
    "session_affinity_attributes": {
      "samesite": "Auto",
      "secure": "Auto",
      "drain_duration": 0
    },
    "steering_policy": "off",
    "fallback_pool": "0ef15SECRETSECRETSECRET6686ac1e",
    "default_pools": [
      "5a3cc32SECRETSECRETSECRET7e2d9ce",
      "0ef15SECRETSECRETSECRET6686ac1e"
    ],
    "pop_pools": {},
    "region_pools": {}
  },
  "success": true,
  "errors": [],
  "messages": []
}

promote_backup.sh is good for when primary needs to be rebooted or otherwise undergo maintenance:

#!/bin/bash

source site_vars.sh

# We want the primary to be read only for as much of this process
# as possible because we can't proceed until the backup has caught
# up with all the writes performed on the master
ssh primary "mysql -e 'SET GLOBAL read_only = ON;'"

# Wait for replication to catch up
./checkslave.sh

# We've caught up, stop replicating.
mysql -e "STOP SLAVE;"
sleep 1
mysql -e "RESET SLAVE ALL;"

sleep 5
# It's now safe to accept writes
mysql -e 'SET GLOBAL read_only = OFF;'

# Switch roles for who writes file system updates
# We're the only one expected to see any writes
# but no point in having lsyncd running on the primary
ssh primary "systemctl stop lsyncd"
systemctl start lsyncd


# Time for the primary to start replicating from us
ssh primary "mysql -e \"CHANGE MASTER TO master_host='backup', master_port=3306, master_user='replication_user', master_password='$REPLICATION_PASSWORD', master_use_gtid=current_pos;\""
ssh primary "mysql -e \"START SLAVE;\""

# Time to tell Cloudflare that traffic should be routed to this node.
# The primary is left as a fallback to this node just in case.
# We can do this during an orderly switch. If this was a failover
# we'd have to get the primary out of the load balancer entirely
# until we had time to inspect whether it could safely resume master
# status.
./set_backup_first_in_lb.sh

Settings

It’s going to be useful for the two servers to both believe themselves to be woo.deref.se so that they can make curl-calls to that address to check their own status. Master node /etc/hosts therefore gets two additional rows:

10.0.0.2 backup # VPN IP for backup node
13.53.118.162 woo.deref.se # Public IP of master node, which Amazon EC2 keeps changing on reboot...

The backup node gets something similar:

10.0.0.1 primary
185.20.12.24 woo.deref.se

You will also need passwordless SSH between the root accounts of the two servers.
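The scripts above also swap /root/.ssh/authorized_keys between two prepared files, authorized_keys_no_primary and authorized_keys_primary_allowed, to control whether the primary may log in. A sketch of that swap as a single function, assuming both files already exist; the function name use_keys_file and the SSH_DIR variable are made up for the example. It uses ln -sfn to replace the link in one step rather than unlink followed by ln -s.

```bash
#!/bin/bash

# Directory holding the key files; overridable for testing.
SSH_DIR="${SSH_DIR:-/root/.ssh}"

# use_keys_file NAME
# Point authorized_keys at the prepared file NAME
# (authorized_keys_no_primary or authorized_keys_primary_allowed).
use_keys_file() {
    local target="$SSH_DIR/$1"
    if [ ! -f "$target" ]; then
        echo "Missing $target, leaving authorized_keys untouched." >&2
        return 1
    fi
    # -f replaces an existing link, -n avoids following it if it
    # happens to point at a directory.
    ln -sfn "$target" "$SSH_DIR/authorized_keys"
}

# Example, as used during failover:
#   use_keys_file authorized_keys_no_primary
```

Refusing to touch the link when the target is missing avoids locking yourself out of the node halfway through a failover.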

MariaDB on second node:

user                    = mysql
pid-file                = /run/mysqld/mysqld.pid
socket                  = /run/mysqld/mysqld.sock
#port                   = 3306
basedir                 = /usr
datadir                 = /var/lib/mysql
tmpdir                  = /tmp
lc-messages-dir         = /usr/share/mysql
#skip-external-locking
server_id=2
log-basename=woo2
binlog-format=mixed
relay_log=woo2-relay-log
log_bin=woo2-bin
read-only=0
bind-address            = 0.0.0.0

Default site config file for Nginx on second node:


server {
        listen 80 default_server;
        listen [::]:80 default_server;


        server_name woo.deref.se;
        return 301 https://woo.deref.se$request_uri;
}

server {

        listen 443 ssl default_server;
        listen [::]:443 ssl default_server;

        root /var/www/html;

        # Add index.php to the list if you are using PHP
        index index.php index.html index.htm index.nginx-debian.html;

        server_name woo.deref.se;
        include snippets/cloudflaressl.conf;

        location / {
                # First attempt to serve request as file, then
                # as directory, then fall back to displaying a 404.
                #try_files $uri $uri/ =404;
                try_files $uri $uri/ /index.php?$args ;
        }


        location ~ \.php$ {
                #NOTE: You should have "cgi.fix_pathinfo = 0;" in php.ini
                include fastcgi_params;
                fastcgi_intercept_errors on;
                fastcgi_pass unix:/run/php/php7.4-fpm.sock;
                #The following parameter can be also included in fastcgi_params file
                fastcgi_param  SCRIPT_FILENAME $document_root$fastcgi_script_name;
        }

        location ~* \.(js|css|png|jpg|jpeg|gif|ico)$ {
                expires max;
                log_not_found off;
        }

        location ~ /\.ht {
                deny all;
        }
}

I wonder why I had to add this manually to the first node to get Zabbix calls to /basic_status to work? Maybe it’s enabled by default on Ubuntu?

server {
        listen 80 default_server;

        server_name _;
        location = /basic_status {
            stub_status;
        }
    }
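For reference, the stub_status page is a few lines of plain text starting with "Active connections:". Here is a sketch of pulling that number out by hand, which can help when checking that the endpoint answers before pointing Zabbix at it. The function name active_connections is made up, and the URL in the comment assumes the server block above.

```bash
#!/bin/bash

# active_connections: read stub_status output on stdin and print the
# number after "Active connections:". Hypothetical helper name.
active_connections() {
    awk '/^Active connections:/ { print $3 }'
}

# Against a live server this could be fed with, for example:
#   curl -s http://localhost/basic_status | active_connections
```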

lsyncd.conf is tricky because on Ubuntu it’s supposed to be installed as /etc/lsyncd/lsyncd.conf.lua while it’s just plain /etc/lsyncd.conf on RHEL.

 sync {
         default.rsyncssh,
         source="/var/www/html",
         host="primary",
         targetdir="/var/www/html/",
         rsync = {
             owner = true,
             group = true,
             chown = "nginx:nginx"
         }
 }

This was from Ubuntu, on RHEL it says host=”backup” and chown = “www-data:www-data” instead. This is confusing at first until you realize chown is executed on the other host. So it makes sense for RHEL to run “chown www-data:www-data” on the Ubuntu node, which is the one that uses the user/group www-data. Not sure that was made super clear but it’s easy enough to test out. Or you can just run a single distro and not have to deal with that.

Sensible questions

Why not Galera? Multi-master without any need for pacemaker and fencing! Galera doesn’t work too well when the response time between nodes is highly variable: https://www.percona.com/blog/2018/11/15/how-not-to-do-mysql-high-availability-geographic-node-distribution-with-galera-based-replication-misuse/

Why two different Linux distros? I want to minimize common points of failure. I could have tried this with a RHEL/FreeBSD combo but vulnerabilities that hit all Linux distros are rare. This is my balance between ease-of-use and availability.

Would I use this for real? With some more testing, sure. And keep in mind this isn’t intended to be the ultimate high availability solution for WooCommerce. That would be something like three datacenters with dedicated interconnect running three servers with the Pacemaker cluster manager, MariaDB with Galera for multi-master writes within each data center, and LiteSpeed or maybe Nginx+Varnish for high speed serving of data. Though I have a bit of a hard time understanding why someone with that kind of money would use WooCommerce. It has a low barrier to entry, sure, but it’s ecommerce functionality inelegantly taped onto a blogging engine.