GlusterFS

I first set up my new server arrangement with GlusterFS, which is easier to get up and running than Ceph and has great disaster recovery properties since it stores data in regular file systems on the hard drives. So if GlusterFS stops working with the error message PC LOAD LETTER, I can just copy all the files back from the file system on each drive.

Buuut… GlusterFS is slow as tar. I tried to put a MySQL database on a virtual disk stored on GlusterFS and it didn’t go well: IO wait was absurdly high. That made me set up a Percona XtraDB Cluster on each Proxmox node’s local storage, which worked well at first… well, that’s another issue. For metadata operations there’s a setting called readdir-ahead (https://staged-gluster-docs.readthedocs.io/en/release3.7.0beta1/Features/readdir-ahead/) which helps a bit, but it’s still slow.
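For reference, readdir-ahead is switched on per volume through the gluster command line. Here is a minimal Python sketch that only builds the command and does not run it. The volume name vmstore is a made-up placeholder, and the option key performance.readdir-ahead is assumed to match your GlusterFS version, so check it against `gluster volume set help` first.

```python
import shlex
from typing import List


def readdir_ahead_command(volume: str, enabled: bool = True) -> List[str]:
    """Build the argv for toggling readdir-ahead on one Gluster volume."""
    if not volume or any(ch.isspace() for ch in volume):
        raise ValueError("volume name must be non-empty and contain no whitespace")
    value = "on" if enabled else "off"
    # Assumed option key; verify with `gluster volume set help`.
    return ["gluster", "volume", "set", volume, "performance.readdir-ahead", value]


if __name__ == "__main__":
    # "vmstore" is a hypothetical volume name used only for illustration.
    print(shlex.join(readdir_ahead_command("vmstore")))
```

Returning an argument list rather than a string means it can be handed to subprocess.run as-is, without worrying about shell quoting.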

Apparently GlusterFS is okay for people streaming big files, like video editors. At the very least it should be good for archival purposes, though the need for clustered storage that can be expanded online is probably not super common for archives.

Percona XtraDB Cluster

Putting a MySQL database server on Ceph storage seems like a moderately good idea, at least with my hardware. With GlusterFS it was absolutely terrible, but even Ceph gives some worrying IO wait times on my MySQL host.

To remedy the situation when using GlusterFS I installed Percona XtraDB Cluster on three nodes, with the data stored on LVM-thin volumes on SSD. Great performance, but also problems. wsrep_commit errors were really common when one of the cluster nodes was made of bits of old twig and connected via a not entirely ideal LAN connection. They continued, much more rarely, in a tighter setup, so I didn’t dare put anything other than Zabbix monitoring data on it.

And a good thing too, because as of last week I can’t get it back online. Or to put it another way, I don’t have the time or the inclination to go back through Btrfs snapshots, trying to find a combination of perceived states that the cluster will accept. I’ve restored it from a cold start before, using the restart-bootstrap start sequence on one node and then letting the others join. But this time it won’t start, even after editing the grastate.dat files.
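For context, each Galera node keeps a small grastate.dat file recording the cluster UUID and the last committed sequence number, and the usual advice for a cold start is to bootstrap from the node with the highest seqno. The sketch below parses such files and picks that node. The sample contents, UUID and node names are invented for illustration. A seqno of -1 (what an unclean shutdown leaves behind) is treated as unknown, which in practice calls for a separate recovery step that this sketch does not attempt.

```python
from typing import Dict, Optional

# Invented sample contents; the UUID and seqno values are placeholders.
SAMPLE_TEMPLATE = """# GALERA saved state
version: 2.1
uuid:    00000000-0000-0000-0000-000000000001
seqno:   {seqno}
safe_to_bootstrap: 0
"""


def parse_grastate(text: str) -> Dict[str, str]:
    """Parse the 'key: value' lines of a grastate.dat file, skipping comments."""
    state: Dict[str, str] = {}
    for line in text.splitlines():
        line = line.strip()
        if not line or line.startswith("#"):
            continue
        key, _, value = line.partition(":")
        state[key.strip()] = value.strip()
    return state


def pick_bootstrap_node(states: Dict[str, Dict[str, str]]) -> Optional[str]:
    """Return the node with the highest known seqno, or None if none is known."""
    best_node, best_seqno = None, -1
    for node, state in sorted(states.items()):
        seqno = int(state.get("seqno", "-1"))
        if seqno > best_seqno:  # -1 never wins, so unknown states are skipped
            best_node, best_seqno = node, seqno
    return best_node


if __name__ == "__main__":
    # Hypothetical three-node cluster; node-c went down uncleanly.
    seqnos = {"node-a": 1040, "node-b": 1042, "node-c": -1}
    states = {
        name: parse_grastate(SAMPLE_TEMPLATE.format(seqno=seqno))
        for name, seqno in seqnos.items()
    }
    print("bootstrap candidate:", pick_bootstrap_node(states))
```

As far as I know, newer Galera versions also refuse to bootstrap a node unless its safe_to_bootstrap field is 1, which is typically the line that ends up being edited by hand.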

This is basically what I used to worry about when it came to Ceph. It basically goes like this:
– Can I have my data now?
– No, can’t give you any data because Partition.
– Oh, the cluster is partitioned?
– Yes.
– Can you discard nodes that are unable to join?
– No, because Partition.

And so on. It was easier to solve split-brain on GlusterFS, and that involved editing bitmasks. I understand the whole idea of data reliability and that clustering makes that harder, but forcing a node to become a master should always be available as a fallback.

My conclusion is this: if you want a scalable database then you can’t have a relational database. You’re going to have to use MongoDB, Cassandra or one of those graph databases. The exception might be if you can afford IBM DB2 or whatever Oracle sells. But by and large, don’t do multi-master RDBMS. The people who make PostgreSQL seem to agree: https://www.postgresql.org/docs/8.2/high-availability.html

Home Lab – v1

I like high availability, probably more than I should. Banking, phone systems and the electrical grid do a good job with this, but we have a lot of complex stuff at home nowadays, which makes it trickier to keep it all up and running reliably. It was more than ten years ago that I first tried to build a server setup that could keep things up and running even when individual servers failed or had to be brought down for maintenance. I didn’t have enough hardware with virtualization support, so I had to use Xen (32-bit edition) with its paravirtualization support.

I used DRBD for shared storage and, I think, OCFS2 on top of that. It did not work very well. For various reasons I ended up with a single server running Solaris 11 with RAID-Z storage, so at least I had data redundancy even if the system as a whole wasn’t replicated. I later ended up with a master-slave setup with two identical Core i5-based “servers” where the master node replicated data over to the slave using Btrfs snapshots. The filesystems used Btrfs RAID1 on both nodes, so there was a LOT of redundancy. There’s a story with RAID-Z, a weird hard drive failure and a couple of days of worrying behind this two-server setup (each server with internal two-way replication).

One of the servers in the master-slave setup gave up a few months ago so it was time to replace it with something new and now I think it’s pretty much complete.

Hardware

Node name: pve1
Microtower Atom C3558
16 GB RAM
3.5″ hot swap spaces + 2 internal SATA connections
1 × 250 GB NVMe
1 × 250 GB SATA
1 × 500 GB SATA
1 × 2 TB 5400 RPM SATA
1 × 4 TB 5400 RPM SATA
4 × Gbit LAN
IPMI

Node name: pve2
Microtower Xeon-D 1541
32 GB RAM
3.5″ hot swap spaces + 2 internal SATA connections
1 × 250 GB NVMe
2 × 250 GB SATA
1 × 500 GB SATA
1 × 2 TB 5400 RPM SATA
1 × 4 TB 5400 RPM SATA
Gbit LAN, 2 × 10GBASE-T LAN
IPMI

Node name: pve3
Microtower Atom C3558
32 GB RAM
3.5″ hot swap spaces + 2 internal SATA connections
1 × 250 GB NVMe
1 × 250 GB SATA
1 × 500 GB SATA
1 × 2 TB 5400 RPM SATA
1 × 4 TB 5400 RPM SATA
Gbit LAN
IPMI

Node name: nearline
Self-assembled Core i5-system
16 GB RAM
Tandberg LTO3 tape drive
Hodgepodge of hard drives

Networking equipment:
Cisco RV082 Router
Netgear LB2120 (for backup internet connection)
HPE 1920S switch

The tiny little screen to the top right is sort of a crude monitoring system. I don’t much feel like running a big LCD screen just to show the load on my servers. So this tiny screen tells me if any important hosts are down and what the load is on each Proxmox machine.

Structure

The nodes pve1-pve3 form a Proxmox 5.4 cluster with Ceph Luminous as shared storage. The journals for all Ceph OSDs are stored on NVMe partitions, which took a while to set up since Proxmox’s own Ceph tools don’t want to do that. They say they do, but they don’t.

Some storage is kept out of Ceph for reliability reasons. Basically I think of Ceph as a single point of (unlikely) failure. So the virtual machines I want to keep running while I try to figure out why Ceph refuses to work are stored on LVM-thin volumes.

The nearline machine (not shown below since it is mostly turned off, so it makes little sense to monitor it) is also a manifestation of my distrust in Ceph. I used to rsync data over once a day to Btrfs volumes that I then snapshotted. But the drives I put into the machine were junk, so that had to stop. Now I have Bacula up and running again, so backups go onto my cherished LTO tapes. *hugs*

The Proxmox nodes use bonded network ports to connect to the HP Enterprise switch, which serves mainly to connect the cluster together but is also the core switch of the network. The HP switch connects to my good old Cisco RV082 router, which in turn connects to the fiber modem that gives a nice 100 Mbit connection out. Now it also has a 4G modem connected to WAN2 as a fallback.

The nodes with green links to the cloud symbol are stored on Ceph, so they can be migrated between physical hosts while running. Some nodes are not shown in the graphic above. Mostly they’re testbeds like my CloudLinux install with WHM and cPanel, a copy of my pacemaker cluster and so on.

Software

Proxmox

A Debian-based virtualization platform with cluster and HA-support? Yes please… Has a great GUI and integration with Ceph. Kind of a pain to install new SSL certificates but it can be done. Basically Proxmox is an alternative to VMware vSphere and whatever Xen offers nowadays. Wish it had a way of configuring fencing for cluster nodes integrated with its own HA functionality. Has pretty good built-in performance monitoring as well.

Ceph

So I took the plunge and started using Ceph. It wasn’t entirely easy since I already had my cluster set up to use GlusterFS. It’s great to be able to move virtual machines from node to node using live migration, but you can’t do that between separate shared storage systems, now can you? I can handle some downtime, but I treated keeping it to a minimum as an interesting experiment. Since I had DNS servers and MySQL servers set up in a cluster of virtual machines, those virtual machines could be shut down one at a time and recreated on a new cluster with Ceph as a backend, and then the process could be repeated one physical node at a time.

I didn’t need to create a new cluster, but I figured I might as well go the entire way and upgrade to Proxmox 5.4. All in all there was like 5-10 minutes of downtime inherent in the move and 1-2 hours of downtime because I’m a klutz who configures two servers to use the same IP address and then wonders why things don’t work so well.

Many decisions made about this setup reflect my lack of trust in Ceph, but by now I’ve actually come to trust it quite well. It performed remarkably well when I screwed up the IP addresses and other things. I haven’t encountered a split-brain situation yet, which is more than I can say for GlusterFS (note: when increasing the node count in a GlusterFS setup, you have to change quorum levels manually…).
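The quorum remark comes down to simple arithmetic: a majority quorum needs more than half of the nodes, so the threshold has to move when nodes are added. A small Python sketch of that arithmetic follows. It is not tied to any actual GlusterFS option names, and the function names are made up for illustration.

```python
from typing import Optional


def majority_quorum(node_count: int) -> int:
    """Smallest number of nodes that is strictly more than half the cluster."""
    if node_count < 1:
        raise ValueError("node_count must be at least 1")
    return node_count // 2 + 1


def has_quorum(reachable: int, node_count: int, quorum: Optional[int] = None) -> bool:
    """Check a partition against a fixed quorum, or the majority if none is given."""
    needed = majority_quorum(node_count) if quorum is None else quorum
    return reachable >= needed


if __name__ == "__main__":
    # A quorum of 2 is a proper majority with 3 nodes...
    print(majority_quorum(3))
    # ...but if the cluster grows to 4 nodes and the setting stays at 2,
    # both halves of a 2/2 split believe they have quorum.
    print(has_quorum(2, 4, quorum=2))
    # With the recomputed majority, neither half does.
    print(has_quorum(2, 4))
```

That stale-threshold case is how a grown cluster can end up with split brain even though quorum was configured correctly at the original size.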

Ceph also has its own monitoring system which is nice.