Quick note on network debugging

Dropwatch to monitor packet drops in real time:

dnf install dropwatch
dropwatch --help
dropwatch -l kas
dropwatch> start

perf to figure out which software is making calls whose packets are dropped:

dnf install perf -y
perf record -g -a -e skb:kfree_skb
perf script
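
The raw perf script output gets long fast, so a tally per drop location helps. This is only a rough sketch: summarize_drops is a made-up helper name, and it assumes each event line contains skb:kfree_skb: followed by a location= field, which is how the tracepoint prints on the kernels I have seen. The field layout may differ on other kernel versions.

```shell
# Hypothetical helper: tally kfree_skb events by their location= field.
# Reads `perf script` output on stdin; call-stack lines are ignored
# because they do not contain the event name.
summarize_drops() {
    awk '/skb:kfree_skb:/ {
        for (i = 1; i <= NF; i++) {
            if ($i ~ /^location=/) {
                loc = $i
                sub(/^location=/, "", loc)
                count[loc]++
            }
        }
    }
    END { for (loc in count) print count[loc], loc }' | sort -rn
}

# Usage (assumes perf.data exists in the current directory):
#   perf script | summarize_drops | head
```

The location that dominates the tally is the place to start reading kernel source, though plenty of kfree_skb calls are normal frees rather than real drops.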

Log packets processed in iptables to dmesg:

iptables-legacy -t raw -A PREROUTING -p tcp --dport 9100 -j TRACE
dmesg
dmesg > dump.txt
ls -lh dump.txt
iptables-legacy -t raw -D PREROUTING -p tcp --dport 9100 -j TRACE
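
To see where traced packets stop showing up, it can help to reduce dump.txt to the last table:chain:type:rulenum each packet was logged at. A sketch under assumptions: last_hop_per_packet is a hypothetical helper, it assumes the usual "TRACE: table:chain:type:rulenum" log prefix, and it uses the ID= field (the IP ID) as a packet key, which can collide or repeat on a busy host.

```shell
# Hypothetical helper: for every packet (keyed by its ID= field) remember
# the last TRACE hop logged, then count how many packets ended at each hop.
last_hop_per_packet() {
    awk '/TRACE: / {
        hop = ""; id = ""
        for (i = 1; i <= NF; i++) {
            if ($i == "TRACE:") hop = $(i + 1)      # e.g. raw:PREROUTING:policy:2
            if ($i ~ /^ID=/)    id = substr($i, 4)  # IP ID, used as a rough key
        }
        if (id != "") last[id] = hop
    }
    END {
        for (id in last) ends[last[id]]++
        for (h in ends) print ends[h], h
    }' | sort -rn
}

# Usage:
#   last_hop_per_packet < dump.txt
```

Packets whose last hop is early in the path (raw or mangle PREROUTING, say) while others make it to FORWARD are the interesting ones.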

Still haven’t figured out why Kubernetes keeps dropping packets intermittently on one of three nodes (which one changes as workloads move around). It’s not conntrack being full, and it’s not the pod receiving the traffic that’s dropping. The traffic enters ens18 and never reaches the correct Calico virtual interface, so odds are the kernel drops ’em.
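
For the record, ruling out a full conntrack table can be done roughly like this. It is a sketch, not a definitive check: conntrack_usage is a hypothetical helper, and the /proc paths in the usage comment are the standard netfilter sysctls, which only exist once the nf_conntrack module is loaded.

```shell
# Hypothetical helper: print conntrack table usage from a current entry
# count and the configured maximum.
conntrack_usage() {
    count=$1
    max=$2
    echo "$count/$max entries ($((count * 100 / max))%)"
}

# Usage on a node:
#   conntrack_usage "$(cat /proc/sys/net/netfilter/nf_conntrack_count)" \
#                   "$(cat /proc/sys/net/netfilter/nf_conntrack_max)"
#
# A table that has actually overflowed also logs
# "nf_conntrack: table full, dropping packet" to the kernel log:
#   dmesg | grep 'table full'
```

A low percentage plus no "table full" message in dmesg is what points away from conntrack.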

I can’t say I’m saddened by this turn of events. This is precisely the sort of thing I’ve been ranting about with these “we handle it for you magically” systems. Great when it works, not so great when you have to trace intermittent packet loss in a patchwork of VXLAN and iptables entries managed by a shadowy puppetmaster who doesn’t explain himself.

calico-node’s log with logSeverityScreen set to debug and file logging active:

2021-10-16 22:56:43.035 [INFO][8] startup.go 215: Using node name: kube02.svealiden.se
2021-10-16 22:56:43.196 [INFO][17] allocateip.go 144: Current address is still valid, do nothing currentAddr="10.1.173.128" type="vxlanTunnelAddress"
CALICO_NETWORKING_BACKEND is vxlan - no need to run a BGP daemon
Calico node started successfully

That was almost 4 hours ago…

Had a look at Nomad but I’m a little bit skeptical of that too. You seem to need Nomad, Consul and some networking thing to get a useful stack.