Ceph-csi and Nomad

I have three manager nodes in a proof-of-concept cluster: manage01, manage02 and manage03. There are also three worker nodes: worker01, worker02 and worker03. Managers are identical to each other, as are workers. Manage nodes run Ceph mons, worker nodes run Ceph OSDs, and every node is configured with the same Ansible collection. There are of course some differences: different IP addresses, different hostnames, and they joined the cluster at different times, but that shouldn’t matter, and it turns out it doesn’t.

But what has happened is that only worker01 and worker02 have been able to mount Ceph RBD volumes via ceph-csi. Isn’t that weird? Surely three identical (with the caveats above) servers should behave the same? I’ll skip to the answer to the riddle: ceph-csi in node configuration, running as a Nomad system job, cannot connect to an OSD on the same host. worker01 always mounted an RBD volume by connecting to an OSD on worker02 or worker03. worker02 always connected to an OSD on worker01 or worker03. But worker03 always tried to connect to its own OSD, which just failed with a timeout. When I deleted the OSD on worker03, everything worked on worker03 too: it has no trouble mounting an RBD volume.

As we will soon see, this isn’t a Nomad- or ceph-csi-specific issue. Yes, I said “ceph-csi in node configuration, running as a Nomad system job, cannot connect to an OSD on the same host”, and that’s still true. It’s just not the only situation in which the rbd command fails when it has to talk to an OSD on the same host. It’s probably a very unusual configuration to have ceph-csi node instances running on the same host as OSDs anyway, so I just co-located the OSDs with the mons on the manage nodes and everything works fine.
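
If you want to check for yourself which OSDs, and thereby which hosts, an image lands on, the Ceph CLI can tell you. A sketch, assuming admin access and RBD’s rbd_header.<image-id> object naming; the image ID comes from ceph-csi’s “got omap values” log line further down:

# Which placement group and which OSDs serve the image's header object?
ceph osd map nomadstore rbd_header.d37e95b82f3a
# Which host holds the primary OSD? Replace 2 with the primary OSD id
# printed by the command above.
ceph osd find 2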

Now to my point: why does almost everything aimed at running as/with containers have such bad debugging support? Let’s look at what happens when ceph-csi can’t mount an RBD volume because it tries to connect to the OSD located on the same node:

I1005 22:26:42.857254       1 utils.go:212] ID: 115 GRPC response: {"capabilities":[{"Type":{"Rpc":{"type":1}}},{"Type":{"Rpc":{"type":2}}},{"Type":{"Rpc":{"type":3}}},{"Type":{"Rpc":{"type":5}}}]}
I1005 22:26:45.963342       1 utils.go:195] ID: 116 Req-ID: 0001-0024-c7a90a82-4109-11ed-9a14-7b981f44cc3b-0000000000000002-7403da80-fa1c-44e6-9794-2dca4ed3fbd7 GRPC call: /csi.v1.Node/NodeStageVolume
I1005 22:26:45.963873       1 utils.go:206] ID: 116 Req-ID: 0001-0024-c7a90a82-4109-11ed-9a14-7b981f44cc3b-0000000000000002-7403da80-fa1c-44e6-9794-2dca4ed3fbd7 GRPC request: {"secrets":"***stripped***","staging_target_path":"/local/csi/staging/ceph-mariadb-test06/rw-file-system-single-node-writer","volume_capability":{"AccessType":{"Mount":{}},"access_mode":{"mode":1}},"volume_context":{"clusterID":"c7a90a82-4109-11ed-9a14-7b981f44cc3b","imageFeatures":"layering","imageName":"csi-vol-7403da80-fa1c-44e6-9794-2dca4ed3fbd7","journalPool":"nomadstore","pool":"nomadstore"},"volume_id":"0001-0024-c7a90a82-4109-11ed-9a14-7b981f44cc3b-0000000000000002-7403da80-fa1c-44e6-9794-2dca4ed3fbd7"}
I1005 22:26:45.964612       1 rbd_util.go:1279] ID: 116 Req-ID: 0001-0024-c7a90a82-4109-11ed-9a14-7b981f44cc3b-0000000000000002-7403da80-fa1c-44e6-9794-2dca4ed3fbd7 setting disableInUseChecks: false image features: [layering] mounter: rbd
I1005 22:27:12.859386       1 utils.go:195] ID: 118 GRPC call: /csi.v1.Identity/Probe
I1005 22:27:12.859389       1 utils.go:195] ID: 118 GRPC call: /csi.v1.Identity/Probe
I1005 22:27:12.859995       1 utils.go:206] ID: 118 GRPC request: {}
I1005 22:27:12.859998       1 utils.go:206] ID: 118 GRPC request: {}
I1005 22:27:12.860186       1 utils.go:212] ID: 118 GRPC response: {}
I1005 22:27:12.860187       1 utils.go:212] ID: 118 GRPC response: {}
I1005 22:27:12.860962       1 utils.go:195] ID: 119 GRPC call: /csi.v1.Node/NodeGetCapabilities
I1005 22:27:12.861096       1 utils.go:206] ID: 119 GRPC request: {}
I1005 22:27:12.861634       1 utils.go:212] ID: 119 GRPC response: {"capabilities":[{"Type":{"Rpc":{"type":1}}},{"Type":{"Rpc":{"type":2}}},{"Type":{"Rpc":{"type":3}}},{"Type":{"Rpc":{"type":5}}}]}
I1005 22:27:42.862239       1 utils.go:195] ID: 120 GRPC call: /csi.v1.Identity/Probe
I1005 22:27:42.862341       1 utils.go:206] ID: 120 GRPC request: {}
I1005 22:27:42.862407       1 utils.go:212] ID: 120 GRPC response: {}
I1005 22:27:42.862550       1 utils.go:195] ID: 121 GRPC call: /csi.v1.Identity/Probe
I1005 22:27:42.862578       1 utils.go:206] ID: 121 GRPC request: {}
I1005 22:27:42.862598       1 utils.go:212] ID: 121 GRPC response: {}
I1005 22:27:42.863718       1 utils.go:195] ID: 122 GRPC call: /csi.v1.Node/NodeGetCapabilities
I1005 22:27:42.863834       1 utils.go:206] ID: 122 GRPC request: {}
I1005 22:27:42.864018       1 utils.go:212] ID: 122 GRPC response: {"capabilities":[{"Type":{"Rpc":{"type":1}}},{"Type":{"Rpc":{"type":2}}},{"Type":{"Rpc":{"type":3}}},{"Type":{"Rpc":{"type":5}}}]}

Mhm… So… It gets to setting disableInUseChecks: false image features: [layering] mounter: rbd and then just nothing? What happens when things work?

I1006 13:43:56.363432       1 utils.go:212] ID: 46 GRPC response: {"capabilities":[{"Type":{"Rpc":{"type":1}}},{"Type":{"Rpc":{"type":2}}},{"Type":{"Rpc":{"type":3}}},{"Type":{"Rpc":{"type":5}}}]}
I1006 13:43:56.984542       1 utils.go:195] ID: 47 Req-ID: 0001-0024-c7a90a82-4109-11ed-9a14-7b981f44cc3b-0000000000000002-7403da80-fa1c-44e6-9794-2dca4ed3fbd7 GRPC call: /csi.v1.Node/NodeUnpublishVolume
I1006 13:43:56.984866       1 utils.go:206] ID: 47 Req-ID: 0001-0024-c7a90a82-4109-11ed-9a14-7b981f44cc3b-0000000000000002-7403da80-fa1c-44e6-9794-2dca4ed3fbd7 GRPC request: {"target_path":"/local/csi/per-alloc/08a19d7d-fb2c-8f36-6a79-84acb6324c72/ceph-mariadb-test06/rw-file-system-single-node-writer","volume_id":"0001-0024-c7a90a82-4109-11ed-9a14-7b981f44cc3b-0000000000000002-7403da80-fa1c-44e6-9794-2dca4ed3fbd7"}
I1006 13:43:56.986804       1 utils.go:195] ID: 48 Req-ID: 0001-0024-c7a90a82-4109-11ed-9a14-7b981f44cc3b-0000000000000002-7403da80-fa1c-44e6-9794-2dca4ed3fbd7 GRPC call: /csi.v1.Node/NodeStageVolume
I1006 13:43:56.987341       1 utils.go:206] ID: 48 Req-ID: 0001-0024-c7a90a82-4109-11ed-9a14-7b981f44cc3b-0000000000000002-7403da80-fa1c-44e6-9794-2dca4ed3fbd7 GRPC request: {"secrets":"***stripped***","staging_target_path":"/local/csi/staging/ceph-mariadb-test06/rw-file-system-single-node-writer","volume_capability":{"AccessType":{"Mount":{}},"access_mode":{"mode":1}},"volume_context":{"clusterID":"c7a90a82-4109-11ed-9a14-7b981f44cc3b","imageFeatures":"layering","imageName":"csi-vol-7403da80-fa1c-44e6-9794-2dca4ed3fbd7","journalPool":"nomadstore","pool":"nomadstore"},"volume_id":"0001-0024-c7a90a82-4109-11ed-9a14-7b981f44cc3b-0000000000000002-7403da80-fa1c-44e6-9794-2dca4ed3fbd7"}
I1006 13:43:56.989185       1 rbd_util.go:1279] ID: 48 Req-ID: 0001-0024-c7a90a82-4109-11ed-9a14-7b981f44cc3b-0000000000000002-7403da80-fa1c-44e6-9794-2dca4ed3fbd7 setting disableInUseChecks: false image features: [layering] mounter: rbd
I1006 13:43:56.989833       1 utils.go:212] ID: 47 Req-ID: 0001-0024-c7a90a82-4109-11ed-9a14-7b981f44cc3b-0000000000000002-7403da80-fa1c-44e6-9794-2dca4ed3fbd7 GRPC response: {}
I1006 13:43:56.993063       1 utils.go:195] ID: 49 Req-ID: 0001-0024-c7a90a82-4109-11ed-9a14-7b981f44cc3b-0000000000000002-7403da80-fa1c-44e6-9794-2dca4ed3fbd7 GRPC call: /csi.v1.Node/NodeUnstageVolume
I1006 13:43:56.993150       1 utils.go:206] ID: 49 Req-ID: 0001-0024-c7a90a82-4109-11ed-9a14-7b981f44cc3b-0000000000000002-7403da80-fa1c-44e6-9794-2dca4ed3fbd7 GRPC request: {"staging_target_path":"/local/csi/staging/ceph-mariadb-test06/rw-file-system-single-node-writer","volume_id":"0001-0024-c7a90a82-4109-11ed-9a14-7b981f44cc3b-0000000000000002-7403da80-fa1c-44e6-9794-2dca4ed3fbd7"}
E1006 13:43:56.993231       1 nodeserver.go:919] ID: 49 Req-ID: 0001-0024-c7a90a82-4109-11ed-9a14-7b981f44cc3b-0000000000000002-7403da80-fa1c-44e6-9794-2dca4ed3fbd7 an operation with the given Volume ID 0001-0024-c7a90a82-4109-11ed-9a14-7b981f44cc3b-0000000000000002-7403da80-fa1c-44e6-9794-2dca4ed3fbd7 already exists
E1006 13:43:56.993268       1 utils.go:210] ID: 49 Req-ID: 0001-0024-c7a90a82-4109-11ed-9a14-7b981f44cc3b-0000000000000002-7403da80-fa1c-44e6-9794-2dca4ed3fbd7 GRPC error: rpc error: code = Aborted desc = an operation with the given Volume ID 0001-0024-c7a90a82-4109-11ed-9a14-7b981f44cc3b-0000000000000002-7403da80-fa1c-44e6-9794-2dca4ed3fbd7 already exists
I1006 13:43:57.059403       1 omap.go:88] ID: 48 Req-ID: 0001-0024-c7a90a82-4109-11ed-9a14-7b981f44cc3b-0000000000000002-7403da80-fa1c-44e6-9794-2dca4ed3fbd7 got omap values: (pool="nomadstore", namespace="", name="csi.volume.7403da80-fa1c-44e6-9794-2dca4ed3fbd7"): map[csi.imageid:d37e95b82f3a csi.imagename:csi-vol-7403da80-fa1c-44e6-9794-2dca4ed3fbd7 csi.volname:ceph-mariadb-test06]
I1006 13:43:57.100591       1 rbd_util.go:346] ID: 48 Req-ID: 0001-0024-c7a90a82-4109-11ed-9a14-7b981f44cc3b-0000000000000002-7403da80-fa1c-44e6-9794-2dca4ed3fbd7 checking for ImageFeatures: [layering]
I1006 13:43:57.183197       1 cephcmds.go:105] ID: 48 Req-ID: 0001-0024-c7a90a82-4109-11ed-9a14-7b981f44cc3b-0000000000000002-7403da80-fa1c-44e6-9794-2dca4ed3fbd7 command succeeded: rbd [device list --format=json --device-type krbd]
I1006 13:43:57.208985       1 rbd_attach.go:420] ID: 48 Req-ID: 0001-0024-c7a90a82-4109-11ed-9a14-7b981f44cc3b-0000000000000002-7403da80-fa1c-44e6-9794-2dca4ed3fbd7 rbd: map mon manage01,manage02,manage03
I1006 13:43:57.362451       1 cephcmds.go:105] ID: 48 Req-ID: 0001-0024-c7a90a82-4109-11ed-9a14-7b981f44cc3b-0000000000000002-7403da80-fa1c-44e6-9794-2dca4ed3fbd7 command succeeded: rbd [--id admin -m manage01,manage02,manage03 --keyfile=***stripped*** map nomadstore/csi-vol-7403da80-fa1c-44e6-9794-2dca4ed3fbd7 --device-type krbd --options noudev]
I1006 13:43:57.362632       1 nodeserver.go:414] ID: 48 Req-ID: 0001-0024-c7a90a82-4109-11ed-9a14-7b981f44cc3b-0000000000000002-7403da80-fa1c-44e6-9794-2dca4ed3fbd7 rbd image: nomadstore/csi-vol-7403da80-fa1c-44e6-9794-2dca4ed3fbd7 was successfully mapped at /dev/rbd0

Well, this was more informative, but I’m not helped all that much by being shown what works. It’s what doesn’t work that would be helpful to see… It wasn’t entirely useless though: the following commands let me attempt by hand what ceph-csi seemingly either never tried or tried and silently failed at:

rbd device list --format=json --device-type krbd
rbd --id admin -m manage01,manage02,manage03 --keyfile=***stripped*** map nomadstore/csi-vol-7403da80-fa1c-44e6-9794-2dca4ed3fbd7 --device-type krbd --options noudev
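
Running them inside the ceph-csi container is straightforward with nomad alloc exec (the allocation ID and task name below are placeholders from my setup):

nomad alloc exec -task ceph-csi-nodes 0a1b2c3d rbd device list --format=json --device-type krbd
# the map command above can be run the same way, keyfile and all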

They failed inside the ceph-csi containers on worker03 but not on worker01 and worker02. They also failed outside ceph-csi on worker03 but worked outside ceph-csi on worker01 and worker02. So it wasn’t specifically a ceph-csi thing! As explained earlier, placing ceph-csi on one set of nodes and the OSDs on another solved the issue, and I’m not going to dig into why Ceph takes exception to node X asking to map an RBD image by connecting to an OSD on node X.
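
The outside-the-container attempt looked roughly like this (the keyfile path is a placeholder, since the real one is stripped in the logs above, and the timeout wrapper is just so the hang doesn’t last forever):

# On worker03 itself, no containers involved. Keyfile path is a placeholder.
timeout 120 rbd --id admin -m manage01,manage02,manage03 \
  --keyfile=/path/to/admin.keyfile \
  map nomadstore/csi-vol-7403da80-fa1c-44e6-9794-2dca4ed3fbd7 \
  --device-type krbd --options noudev
echo "exit code: $?"  # timed out on worker03; printed /dev/rbd0 and exited 0 on worker01/02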

Now, this was not nice to debug. You’d think that raising the verbosity level for the cephcsi-container from 5 to 9 or 99 would help:

      config {
        image = "quay.io/cephcsi/cephcsi:v3.7.1"

        args = [
          "--drivername=rbd.csi.ceph.com",
          "--v=5",
          "--type=rbd",
          "--nodeserver=true",
          "--nodeid=${node.unique.id}",
          "--instanceid=${NOMAD_ALLOC_ID}",
          "--endpoint=${CSI_ENDPOINT}",
          "--metricsport=${NOMAD_PORT_prometheus}",
          "--logtostderr=true",
        ]
      }

Fun fact: no. It doesn’t do anything. Imagine if no commands had been logged and I couldn’t have run the tests myself inside and outside the container to debug the issue! Nomad and Consul aren’t 100% when it comes to log messages, but they are way better than most container-related things. Credit also to Docker and Podman, even though I’ve had to resort to debugging issues with them using strace at times. But things scheduled dynamically by something other than me aren’t easy to strace… So log messages aren’t just nice; in many environments they’re the only thing you’ve got to go on.
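
When you can get at the host, straceing a containerized process is at least doable; the dance looks something like this (the container name is a placeholder):

# Find the container's main PID on the host, then attach strace to it.
# "some-container" is a placeholder; under Nomad's docker driver the
# container name is the task name plus the allocation ID, which helps
# with the finding part.
PID=$(docker inspect --format '{{.State.Pid}}' some-container)
sudo strace -f -e trace=network -p "$PID"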

Anyway, now I have a PoC for this. I can demo that you can move workloads around with minimal downtime for maintenance or load-distribution purposes. But if a worker node suddenly fails, stateful workloads won’t fail over on their own. You have to explicitly tell Nomad that the allocation on the failed node is gone:

nomad volume status ceph-mariadb-test06
# read which allocation is using it from the output
nomad alloc stop d48619b7
nomad volume detach ceph-mariadb-test06 d48619b7
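
After the detach, the volume shows as free again and the rescheduled allocation can claim it. Something like this to double-check (the job name is a placeholder):

nomad volume status ceph-mariadb-test06
# the Allocations list should now be empty
nomad job status mariadb-test06
# placeholder job name; the replacement allocation should mount the volume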

I’ll try to turn this into a workable Kubernetes alternative to keep my colleagues from jumping onto the K8s bandwagon, because I don’t want to end up having to debug a failed Kubernetes cluster at 3 AM when I’m on call. Of course, the time of day matters less when resolving the issue might take days due to poor logging and debugging support. Look, if Kubernetes wants me to stop talking trash, it needs to stop kicking me in the nuts!