Kubernetes and clusters in general

I’ve hated Kubernetes for a long time. Must be nigh on seven years at this point. I got in pretty early when Traefik wasn’t around and it was generally not viable to run a Kubernetes cluster outside the cloud due to the lack of LoadBalancer implementations. StatefulSets weren’t a thing. But everyone else seems to have been crazy for it.

Maybe what made me hostile to Kubernetes was its way of pawning off all the difficult parts of clustering on someone else. When you don’t have to deal with state and mutual exclusion, clustering becomes way easier. But that’s not really solving the underlying problem, just saying “If you’ve figured out the hard parts, feel free to use Kubernetes to simplify the other parts”.

It also doesn’t work in its favor that it is so complex and obtuse. I’ve spent years tinkering with it (on and off, naturally; I don’t engage that frequently with things I hate) and over the past few weeks I’ve put together a working setup using microk8s and Rook that lets me create persistent volumes in my external Ceph cluster running on Proxmox. I now run the web UI for my pdns authoritative DNS servers in that cluster using a Deployment that can be scaled quite easily (rough sketches of the manifests follow after the console output):

[root@kube01 ~]# mkctl get pods -o wide
NAME                                    READY   STATUS    RESTARTS   AGE   IP              NODE                  NOMINATED NODE   READINESS GATES
traefik-ingress-controller-gmf7k        1/1     Running   1          20h   192.168.1.172   kube02.svealiden.se   <none>           <none>
traefik-ingress-controller-94clq        1/1     Running   1          20h   192.168.1.173   kube03.svealiden.se   <none>           <none>
traefik-ingress-controller-77fxr        1/1     Running   1          20h   192.168.1.171   kube01.svealiden.se   <none>           <none>
whoami-78447d957f-t82sd                 1/1     Running   1          20h   10.1.154.29     kube01.svealiden.se   <none>           <none>
whoami-78447d957f-bwg7p                 1/1     Running   1          20h   10.1.154.30     kube01.svealiden.se   <none>           <none>
pdnsadmin-deployment-856dcfdfd8-5d45p   1/1     Running   0          50s   10.1.173.137    kube02.svealiden.se   <none>           <none>
pdnsadmin-deployment-856dcfdfd8-rdv88   1/1     Running   0          50s   10.1.246.237    kube03.svealiden.se   <none>           <none>
pdnsadmin-deployment-856dcfdfd8-vmzws   1/1     Running   0          50s   10.1.154.44     kube01.svealiden.se   <none>           <none>

[root@kube01 ~]# mkctl scale deployment.v1.apps/pdnsgui-deployment --replicas=2
Error from server (NotFound): deployments.apps "pdnsgui-deployment" not found

[root@kube01 ~]# mkctl get deployments
NAME                   READY   UP-TO-DATE   AVAILABLE   AGE
whoami                 2/2     2            2           20h
pdnsadmin-deployment   3/3     3            3           2m23s

[root@kube01 ~]# mkctl scale deployment.v1.apps/pdnsadmin-deployment --replicas=2
deployment.apps/pdnsadmin-deployment scaled

[root@kube01 ~]# mkctl get pods -o wide
NAME                                    READY   STATUS        RESTARTS   AGE     IP              NODE                  NOMINATED NODE   READINESS GATES
traefik-ingress-controller-gmf7k        1/1     Running       1          20h     192.168.1.172   kube02.svealiden.se   <none>           <none>
traefik-ingress-controller-94clq        1/1     Running       1          20h     192.168.1.173   kube03.svealiden.se   <none>           <none>
traefik-ingress-controller-77fxr        1/1     Running       1          20h     192.168.1.171   kube01.svealiden.se   <none>           <none>
whoami-78447d957f-t82sd                 1/1     Running       1          20h     10.1.154.29     kube01.svealiden.se   <none>           <none>
whoami-78447d957f-bwg7p                 1/1     Running       1          20h     10.1.154.30     kube01.svealiden.se   <none>           <none>
pdnsadmin-deployment-856dcfdfd8-rdv88   1/1     Running       0          2m33s   10.1.246.237    kube03.svealiden.se   <none>           <none>
pdnsadmin-deployment-856dcfdfd8-vmzws   1/1     Running       0          2m33s   10.1.154.44     kube01.svealiden.se   <none>           <none>
pdnsadmin-deployment-856dcfdfd8-5d45p   0/1     Terminating   0          2m33s   10.1.173.137    kube02.svealiden.se   <none>           <none>

[root@kube01 ~]# mkctl get pods -o wide
NAME                                    READY   STATUS    RESTARTS   AGE     IP              NODE                  NOMINATED NODE   READINESS GATES
traefik-ingress-controller-gmf7k        1/1     Running   1          20h     192.168.1.172   kube02.svealiden.se   <none>           <none>
traefik-ingress-controller-94clq        1/1     Running   1          20h     192.168.1.173   kube03.svealiden.se   <none>           <none>
traefik-ingress-controller-77fxr        1/1     Running   1          20h     192.168.1.171   kube01.svealiden.se   <none>           <none>
whoami-78447d957f-t82sd                 1/1     Running   1          20h     10.1.154.29     kube01.svealiden.se   <none>           <none>
whoami-78447d957f-bwg7p                 1/1     Running   1          20h     10.1.154.30     kube01.svealiden.se   <none>           <none>
pdnsadmin-deployment-856dcfdfd8-rdv88   1/1     Running   0          2m39s   10.1.246.237    kube03.svealiden.se   <none>           <none>
pdnsadmin-deployment-856dcfdfd8-vmzws   1/1     Running   0          2m39s   10.1.154.44     kube01.svealiden.se   <none>           <none>
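
For reference, the manifests behind this are nothing exotic. Here’s a rough sketch of what I mean rather than the exact files: the storage class name, claim name and size are placeholders for whatever your Rook setup exposes, and the claim isn’t wired into the deployment here, it’s just to show what the Rook part buys you. The image and labels are the ones that show up elsewhere in this post.

# Sketch of a volume claim backed by the external Ceph cluster via Rook.
# The storage class name, claim name and size are placeholders.
apiVersion: v1
kind: PersistentVolumeClaim
metadata:
  name: pdnsadmin-data
spec:
  accessModes:
    - ReadWriteOnce
  storageClassName: rook-ceph-block
  resources:
    requests:
      storage: 5Gi
---
# Stripped-down version of the deployment that gets scaled above.
apiVersion: apps/v1
kind: Deployment
metadata:
  name: pdnsadmin-deployment
spec:
  replicas: 3
  selector:
    matchLabels:
      app: pdns-admin-gui
  template:
    metadata:
      labels:
        app: pdns-admin-gui
    spec:
      containers:
      - name: pdnsadmin
        image: cacher.svealiden.se:5000/pdnsadmin:20210925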

Yet I have only the flimsiest idea of how it works and how to fix it if something breaks. Maybe I’ll learn how Calico uses VXLAN to connect everything magically, and why I had to reset my entire cluster on Thursday to remove a custom resource definition.

By the way, try creating a custom resource definition without using the beta API! You’ll have to provide a structural schema: https://kubernetes.io/docs/tasks/extend-kubernetes/custom-resources/custom-resource-definitions/
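
To be concrete, this is roughly what the stable apiextensions.k8s.io/v1 API wants. The Backup resource below is made up purely for illustration; the point is the structural schema you now have to spell out per version:

# Minimal CustomResourceDefinition against the stable apiextensions.k8s.io/v1 API.
# The Backup resource and its fields are made up for illustration.
apiVersion: apiextensions.k8s.io/v1
kind: CustomResourceDefinition
metadata:
  # Must be <plural>.<group>
  name: backups.example.svealiden.se
spec:
  group: example.svealiden.se
  scope: Namespaced
  names:
    plural: backups
    singular: backup
    kind: Backup
  versions:
  - name: v1
    served: true
    storage: true
    schema:
      # v1 requires a structural OpenAPI v3 schema; v1beta1 let you skip this.
      openAPIV3Schema:
        type: object
        properties:
          spec:
            type: object
            properties:
              target:
                type: string
              retentionDays:
                type: integer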

I suspect many people will go mad trying to make heads or tails of those docs. Anyway, if I had to decide how to run a set of microservices in production I’d still go with something like Docker containers run as systemd services and HAProxy to load balance the traffic. Less automation for rolling upgrades, scaling and so on, but I wouldn’t have to worry about relying entirely on a system where it’s not even clear that I could describe the malfunction that keeps all my services from running! I mean, I added podAntiAffinity to my deployment before:

affinity:
  podAntiAffinity:
    requiredDuringSchedulingIgnoredDuringExecution:
    - labelSelector:
        matchExpressions:
        - key: app
          operator: In
          values:
          - pdns-admin-gui
      topologyKey: "kubernetes.io/hostname"

And these are the events all three pods reported when I started the deployment:

  Warning  FailedScheduling  39m   default-scheduler  0/3 nodes are available: 3 node(s) didn't match pod affinity/anti-affinity rules, 3 node(s) didn't match pod anti-affinity rules.
  Warning  FailedScheduling  39m   default-scheduler  0/3 nodes are available: 3 node(s) didn't match pod affinity/anti-affinity rules, 3 node(s) didn't match pod anti-affinity rules.
  Normal   Scheduled         39m   default-scheduler  Successfully assigned default/pdnsadmin-deployment-856dcfdfd8-vmzws to kube01.svealiden.se
  Normal   Pulled            39m   kubelet            Container image "cacher.svealiden.se:5000/pdnsadmin:20210925" already present on machine
  Normal   Created           39m   kubelet            Created container pdnsadmin
  Normal   Started           39m   kubelet            Started container pdnsadmin

Obviously “Successfully assigned default/pdnsadmin-deployment-856dcfdfd8-vmzws to kube01.svealiden.se” was a good thing but Kubernetes telling me that zero nodes were initially available? When the only requirement I set out was that pdns-admin-gui pods shouldn’t be run on the same node? And there are three nodes? And I asked for three pods? That’s the sort of stuff that would make me very nervous if this was used for production. What if Kubernetes gets stuck in the “All pods forbidden because reasons”-mode?
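
For what it’s worth, if that hard requirement ever does leave the scheduler refusing everything, the escape hatch seems to be the preferred variant, which only penalizes co-location instead of forbidding it. A minimal sketch, using the same label as above:

affinity:
  podAntiAffinity:
    # "preferred" instead of "required": the scheduler tries to spread the pods
    # but will still place them on the same node rather than not at all.
    preferredDuringSchedulingIgnoredDuringExecution:
    - weight: 100
      podAffinityTerm:
        labelSelector:
          matchExpressions:
          - key: app
            operator: In
            values:
            - pdns-admin-gui
        topologyKey: "kubernetes.io/hostname"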

This is also why I’m terrified of running Ceph for production applications. Multiple independent Ceph clusters? Okay, now we’re talking, but a single cluster? It’s just a matter of time before Ceph locks up on you and you have a week’s downtime while trying to figure out what the hell is going on.

The keen observer will ask “Aren’t you a massive fan of clusters?” and that’s entirely correct. I’ve run Ceph for my own applications for just over two years and have Elasticsearch, MongoDB and MariaDB clusters set up. But the key point is disaster recovery. Clusters can be great for high availability, which is what I’m really a fan of, but clusters without an override for when the complex logic of the distributed system goes on the fritz are a huge gamble. If MongoDB gets confused I can choose a node and force it to become the primary. If I can get the others to join afterwards that’s fine; otherwise I’ll just have to blank them and rebuild. Same with MariaDB: I can kill the other two nodes, make the remaining one the master and take it from there. I don’t need any distributed algorithm to give me permission to bring systems back in a diminished capacity.

By the way, nothing essential runs on my Ceph cluster. Recursive DNS servers, file shares, backups of the file shares and so on all run on local storage in a multi-master or master/slave configuration. Ceph going down will disable some convenience functions, my Zabbix monitoring server, Prometheus, Grafana and so on, but I can live without them for a couple of hours while I swear angrily at Ceph. In fairness, I haven’t had a serious Ceph issue for (checking…) about a year now!