*§#% Ceph

So I updated my Proxmox servers a few weeks ago. One worked fine, but on two of them Ceph couldn’t start the OSDs for two drives: the OSD processes segfaulted while reading through each disk and then started over. I had to blank the drives and resync them. Well, that’s not kosher! I was going to upgrade Proxmox anyway, so maybe things would be better with Ceph Nautilus, which comes with Proxmox 6.2. Does it bollocks!
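For reference, “blanking and resyncing” here was just the usual remove-and-recreate OSD dance. I didn’t save the exact commands, so treat this as a rough sketch with made-up IDs and device names rather than a transcript:

# take the broken OSD out and stop it (osd.2 and /dev/sdb are examples)
ceph osd out osd.2
systemctl stop ceph-osd@2

# remove it from the cluster and wipe the disk
ceph osd purge 2 --yes-i-really-mean-it
ceph-volume lvm zap /dev/sdb --destroy

# recreate the OSD and let Ceph backfill it from the remaining replicas
ceph-volume lvm create --data /dev/sdb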

I’m upgrading one server at a time. The first server is blanked, installed with Proxmox 6.2, and a new Ceph cluster is created on it with replica size 1. I then rsync the data over from the old cluster.
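The replica-1 part and the copy boil down to something like this; the pool names and mount points are illustrative, not necessarily the exact ones I used:

# run the CephFS pools with a single replica (one node, no redundancy yet)
ceph osd pool set cephfs_data size 1
ceph osd pool set cephfs_data min_size 1
ceph osd pool set cephfs_metadata size 1
ceph osd pool set cephfs_metadata min_size 1

# copy the files from the old cluster's CephFS mount to the new one
rsync -aHX --info=progress2 /mnt/old-cephfs/ /mnt/pve/cephfs/

I had run this setup for 48 hours when a sudden reboot of the server left Ceph unable to read from the two hard drives containing the CephFS. This time there was at least an error logged, not just a segfault: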

2020-09-08 21:38:00.252 7fd75494f700 0 log_channel(cluster) log [WRN] : slow request osd_op(mds.0.364:7 2.14 2.844f3494 (undecoded) ondisk+read+known_if_redirected+full_force e122) initiated 2020-09-08 21:07:02.911572 currently queued for pg
2020-09-08 21:38:00.252 7fd75494f700 0 log_channel(cluster) log [WRN] : slow request osd_op(mds.0.364:4 2.6 2.4b2c82a6 (undecoded) ondisk+retry+read+known_if_redirected+full_force e122) initiated 2020-09-08 21:21:50.726262 currently queued for pg
2020-09-08 21:38:00.252 7fd75494f700 0 log_channel(cluster) log [WRN] : slow request osd_op(mds.0.364:7 2.14 2.844f3494 (undecoded) ondisk+retry+read+known_if_redirected+full_force e122) initiated 2020-09-08 21:21:50.726295 currently queued for pg
2020-09-08 21:38:00.252 7fd75494f700 0 log_channel(cluster) log [WRN] : slow request osd_op(mds.0.364:4 2.6 2.4b2c82a6 (undecoded) ondisk+retry+read+known_if_redirected+full_force e122) initiated 2020-09-08 21:36:50.729917 currently queued for pg
2020-09-08 21:38:00.252 7fd75494f700 0 log_channel(cluster) log [WRN] : slow request osd_op(mds.0.364:7 2.14 2.844f3494 (undecoded) ondisk+retry+read+known_if_redirected+full_force e122) initiated 2020-09-08 21:36:50.729969 currently queued for pg
2020-09-08 21:38:00.252 7fd75494f700 -1 osd.2 122 get_health_metrics reporting 6 slow ops, oldest is osd_op(mds.0.364:4 2.6 2.4b2c82a6 (undecoded) ondisk+read+known_if_redirected+full_force e119)
2020-09-08 21:38:01.256 7fd75494f700 0 log_channel(cluster) log [WRN] : slow request osd_op(mds.0.364:4 2.6 2.4b2c82a6 (undecoded) ondisk+read+known_if_redirected+full_force e119) initiated 2020-09-08 21:06:50.724397 currently queued for pg
2020-09-08 21:38:01.256 7fd75494f700 0 log_channel(cluster) log [WRN] : slow request osd_op(mds.0.364:7 2.14 2.844f3494 (undecoded) ondisk+read+known_if_redirected+full_force e122) initiated 2020-09-08 21:07:02.911572 currently queued for pg
2020-09-08 21:38:01.256 7fd75494f700 0 log_channel(cluster) log [WRN] : slow request osd_op(mds.0.364:4 2.6 2.4b2c82a6 (undecoded) ondisk+retry+read+known_if_redirected+full_force e122) initiated 2020-09-08 21:21:50.726262 currently queued for pg
2020-09-08 21:38:01.256 7fd75494f700 0 log_channel(cluster) log [WRN] : slow request osd_op(mds.0.364:7 2.14 2.844f3494 (undecoded) ondisk+retry+read+known_if_redirected+full_force e122) initiated 2020-09-08 21:21:50.726295 currently queued for pg
2020-09-08 21:38:01.256 7fd75494f700 0 log_channel(cluster) log [WRN] : slow request osd_op(mds.0.364:4 2.6 2.4b2c82a6 (undecoded) ondisk+retry+read+known_if_redirected+full_force e122) initiated 2020-09-08 21:36:50.729917 currently queued for pg
2020-09-08 21:38:01.256 7fd75494f700 0 log_channel(cluster) log [WRN] : slow request osd_op(mds.0.364:7 2.14 2.844f3494 (undecoded) ondisk+retry+read+known_if_redirected+full_force e122) initiated 2020-09-08 21:36:50.729969 currently queued for pg
2020-09-08 21:38:01.256 7fd75494f700 -1 osd.2 122 get_health_metrics reporting 6 slow ops, oldest is osd_op(mds.0.364:4 2.6 2.4b2c82a6 (undecoded) ondisk+read+known_if_redirected+full_force e119)

Well, you get the idea. So that’s #¤§* it! It’s not that I can’t work around these issues, blank the drives and bring them back into the cluster, but I have zero tolerance for buggy file systems and database engines. I run the old Luminous release, which has had ample time to get its kinks ironed out, and OSDs can’t start. I switch over to Nautilus: same but different.

I’ll still run a toy Ceph cluster in a couple of VMs, but for actual workloads? How about no. That means standard logical volumes for the VMs and btrfs for my 2 TB of files. I’ll go back to my old btrfs script with the “send snapshot” functionality switched on. I’ve been running it without the send-snapshot part for a while as a DB backup solution (a plan that turned out less than ideal, which I should have predicted for such high-IOPS applications).
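The script itself isn’t much to look at, but the send-snapshot part amounts to roughly the following; subvolume paths, snapshot names and the backup host are made up for the example:

# take a read-only snapshot of the data subvolume
btrfs subvolume snapshot -r /data /data/.snapshots/data-2020-09-09

# first run: send the whole snapshot to the backup machine
btrfs send /data/.snapshots/data-2020-09-09 | ssh backuphost btrfs receive /backup/data

# later runs: send only the delta against the previous snapshot
btrfs send -p /data/.snapshots/data-2020-09-08 /data/.snapshots/data-2020-09-09 | ssh backuphost btrfs receive /backup/data

The incremental send only works while the parent snapshot still exists on both ends, which is the main bit of bookkeeping the script has to do.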

Memo to self: if running Ceph in production, set up a staging cluster that can be upgraded and beaten about before rolling out the same release to the real cluster.