Going Cilium-only: how devata routes a request without kube-proxy

Prerequisites: kubernetes, service, clusterip, coredns, kube-proxy, cilium, ebpf, iptables, endpointslice, cni

Your cluster is missing two things almost every other Kubernetes cluster is running right now. It has no kube-proxy and no Flannel (see cni). Both are so standard that most guides treat them as mandatory, so the honest reaction to a cluster without them is that it should not work at all. Yet devata routes traffic thousands of times a day. Grafana answers, Hubble answers, DNS works.

So something is doing kube-proxy’s job, and you have never seen it happen, because by the time you started poking at devata the old machinery was already gone. This walkthrough fixes that. You will follow one real request across your own cluster, find the exact spot where kube-proxy would sit, then build a throwaway cluster that still has kube-proxy so you can watch it work and then watch a new Service go unrouted when you take it away.

Open a terminal with kubectl pointed at devata. Everything in the first half is read-only.

The address that no machine owns

Start with the service you already know, Grafana.

kubectl get svc -n monitoring kps-grafana

Two addresses. A CLUSTER-IP in 10.96.0.0/12 and an EXTERNAL-IP in your home range, 192.168.1.24x, from metallb. The external one is for you. The interesting one is the clusterip, the address no machine on devata owns. No interface on any node carries it; it is a promise that a packet sent there will come out at a Grafana pod. The question of this walkthrough is who keeps that promise. On a normal cluster kube-proxy does it. On devata something else does, and we walk right up to the handoff.
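To make "in 10.96.0.0/12" concrete: a /12 starting at 10.96.0.0 runs from 10.96.0.0 through 10.111.255.255. The sketch below checks membership with plain shell arithmetic. The function names ip_to_int and in_cidr are made up for this illustration; only the range itself comes from the text above.

```shell
#!/bin/sh
# Convert a dotted-quad IPv4 address to a single integer.
ip_to_int() {
  old_ifs=$IFS
  IFS=.
  set -- $1            # split the address into its four octets
  IFS=$old_ifs
  echo $(( ($1 << 24) + ($2 << 16) + ($3 << 8) + $4 ))
}

# Succeed (exit 0) when address $1 lies inside CIDR block $2.
in_cidr() {
  addr=$(ip_to_int "$1")
  base=$(ip_to_int "${2%/*}")     # part before the slash
  bits=${2#*/}                    # prefix length after the slash
  mask=$(( (0xFFFFFFFF << (32 - bits)) & 0xFFFFFFFF ))
  [ $(( addr & mask )) -eq $(( base & mask )) ]
}

# Usage:
#   in_cidr 10.96.0.10 10.96.0.0/12 && echo "service range"
#   in_cidr 10.244.1.5 10.96.0.0/12 || echo "not a service address"
```

Feed it the CLUSTER-IP you just saw and it should succeed; feed it a pod address and it should not.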

Following one request, starting with the name

When something calls http://kps-grafana.monitoring, the first thing that must happen is turning that name into the ClusterIP. That is coredns:

kubectl run trace --image=busybox:1.36 --rm -it --restart=Never -- \
  sh -c 'cat /etc/resolv.conf; echo ---; nslookup kps-grafana.monitoring'

The nameserver is 10.96.0.10, CoreDNS’s own ClusterIP, and the lookup returns Grafana’s ClusterIP. So now your request is holding a ClusterIP. This is the moment kube-proxy would normally earn its keep.
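Why does the short name kps-grafana.monitoring resolve at all? The resolv.conf you just printed carries a search list and an ndots value, and the resolver appends each search domain in turn. The sketch below prints the order a glibc-style resolver would try. It assumes the trace pod ran in the default namespace, the cluster domain is cluster.local, and ndots is 5 (the usual Kubernetes values; check them against your own resolv.conf). The function name candidates is hypothetical, and busybox's own nslookup may not follow this order exactly.

```shell
#!/bin/sh
# Print, in order, the fully qualified names a resolver would try.
# Arguments: NAME NDOTS SEARCH_DOMAIN...
candidates() {
  name=$1
  ndots=$2
  shift 2
  # Count the dots in the name.
  only_dots=$(printf '%s' "$name" | tr -cd '.')
  dots=${#only_dots}
  # A name with at least ndots dots is tried as-is first.
  [ "$dots" -ge "$ndots" ] && printf '%s.\n' "$name"
  # Then each search domain is appended in turn.
  for domain in "$@"; do
    printf '%s.%s.\n' "$name" "$domain"
  done
  # A name with fewer dots is tried as-is last.
  [ "$dots" -lt "$ndots" ] && printf '%s.\n' "$name"
  return 0
}

# Usage (search list assumed, see above):
#   candidates kps-grafana.monitoring 5 \
#     default.svc.cluster.local svc.cluster.local cluster.local
```

Under those assumptions the second candidate, kps-grafana.monitoring.svc.cluster.local, is the one CoreDNS has a record for.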

The component your cluster does not run

kube-proxy would live as a per-node daemonset in the kube-system namespace. Look:

kubectl -n kube-system get ds
kubectl -n kube-system get ds kube-proxy

You see cilium and cilium-envoy, and kube-proxy comes back NotFound. There is no kube-proxy on devata at all, yet the Grafana ClusterIP works. The thing doing kube-proxy’s job is cilium, which does the rewriting in the kernel with ebpf. Confirm Cilium was told to take over:

kubectl -n kube-system exec ds/cilium -- cilium-dbg status | grep KubeProxyReplacement

It reports True. That one setting is the entire reason kube-proxy could be deleted without anything breaking. But you have still never seen kube-proxy actually do anything. To understand what was removed, you have to see the thing that is absent. So build a cluster that still has it.

Seeing kube-proxy do its job, in a cluster you can break

For this you need Docker and kind. A fresh kind cluster ships the way most clusters do, with kube-proxy running and writing rules. Nothing here touches devata.

kind create cluster --name kproxy
kubectl --context kind-kproxy -n kube-system get ds kube-proxy
kubectl --context kind-kproxy create deployment web --image=nginx
kubectl --context kind-kproxy expose deployment web --port=80
kubectl --context kind-kproxy get svc web

Note the ClusterIP for web. A kind node is just a Docker container named kproxy-control-plane, so read the rules kube-proxy programmed into iptables:

docker exec kproxy-control-plane iptables-save -t nat | grep "$(kubectl --context kind-kproxy get svc web -o jsonpath='{.spec.clusterIP}')"

You are looking at kube-proxy’s actual output. The grep matches the KUBE-SERVICES rule for the ClusterIP, which jumps to a KUBE-SVC-... chain. Drop the grep and follow that chain name and you reach a KUBE-SEP-... chain, ending in the DNAT rule that rewrites the ClusterIP to the nginx pod. kube-proxy is not a proxy that traffic flows through; it is a controller that watches Services and writes these rules. If the grep is empty, your kind runs nftables mode; docker exec kproxy-control-plane nft list table ip kube-proxy shows the same thing.
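Following the chains by eye gets tedious, so here is a minimal sketch that does the walk for you: KUBE-SERVICES to KUBE-SVC-... to KUBE-SEP-... to the DNAT target. It assumes iptables mode and the rule layout iptables-save normally prints for kube-proxy (a KUBE-SERVICES rule beginning with -d <ClusterIP>/32). The function name trace_clusterip is invented for this walkthrough.

```shell
#!/bin/sh
# Read `iptables-save -t nat` output on stdin and print the pod
# address(es) that ClusterIP $1 ends up DNATed to.
# Returns 1 when KUBE-SERVICES has no rule for that address.
trace_clusterip() {
  ip=$1
  rules=$(cat)
  # Step 1: the KUBE-SERVICES rule names the per-Service chain.
  svc_chain=$(printf '%s\n' "$rules" \
    | grep -F -- "-A KUBE-SERVICES -d $ip/32" \
    | grep -o 'KUBE-SVC-[A-Z0-9]*' | head -n 1)
  [ -n "$svc_chain" ] || return 1
  # Step 2: the per-Service chain jumps to one KUBE-SEP chain per backend.
  printf '%s\n' "$rules" \
    | grep -F -- "-A $svc_chain " \
    | grep -o 'KUBE-SEP-[A-Z0-9]*' | sort -u \
    | while read -r sep_chain; do
        # Step 3: each KUBE-SEP chain ends in the DNAT rule.
        printf '%s\n' "$rules" \
          | grep -F -- "-A $sep_chain " \
          | grep -o -- '--to-destination [0-9.:]*' \
          | cut -d ' ' -f 2
      done
}

# Usage on the kind node:
#   docker exec kproxy-control-plane iptables-save -t nat \
#     | trace_clusterip "$(kubectl --context kind-kproxy get svc web \
#         -o jsonpath='{.spec.clusterIP}')"
```

The address it prints should be the nginx pod's IP and port, which you can compare against kubectl --context kind-kproxy get pods -o wide.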

Watching it break

Do the thing you can never safely do on devata.

kubectl --context kind-kproxy -n kube-system delete daemonset kube-proxy
kubectl --context kind-kproxy create deployment web2 --image=nginx
kubectl --context kind-kproxy expose deployment web2 --port=80
kubectl --context kind-kproxy get svc web2
docker exec kproxy-control-plane iptables-save -t nat | grep "$(kubectl --context kind-kproxy get svc web2 -o jsonpath='{.spec.clusterIP}')"

Nothing. The web2 ClusterIP has no DNAT rule, because the only thing that would write one is gone. Packets to it go nowhere, even though every node still says Ready. The rules for web are still in the table, but they are frozen: nothing will update them when its pod moves. ClusterIP routing is not magic baked into Kubernetes; it is a service some component must provide. Delete the sandbox:

kind delete cluster --name kproxy
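On a sandbox that is still up, the same check can be run across every Service at once instead of one grep at a time. This is a rough sketch, assuming iptables mode; the function name unprogrammed_ips and the /tmp/nat.rules path are arbitrary choices for illustration.

```shell
#!/bin/sh
# Read ClusterIPs on stdin (one per line) and print those that have
# no matching "-d <ip>/32" rule in the saved NAT table file $1.
unprogrammed_ips() {
  rules_file=$1
  while read -r ip; do
    [ -z "$ip" ] && continue          # skip blank lines
    [ "$ip" = "None" ] && continue    # headless Services have no ClusterIP
    if ! grep -qF -- "-d $ip/32" "$rules_file"; then
      printf '%s\n' "$ip"
    fi
  done
}

# Usage (before `kind delete cluster`):
#   docker exec kproxy-control-plane iptables-save -t nat > /tmp/nat.rules
#   kubectl --context kind-kproxy get svc -A \
#     -o jsonpath='{range .items[*]}{.spec.clusterIP}{"\n"}{end}' \
#     | unprogrammed_ips /tmp/nat.rules
```

Run after deleting the kube-proxy daemonset, it should list the web2 address and leave out web, whose rules were written while kube-proxy was still alive.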

Why nothing broke on devata

On kind, removing kube-proxy broke routing because nothing else provided it. On devata, kube-proxy was removed and nothing broke, because cilium had already taken the job with ebpf. Instead of an iptables rule pile, Cilium loads eBPF programs into the kernel that look the Service address up in an eBPF map and rewrite it to a backend, without consulting iptables. See the eBPF view of what kube-proxy kept in iptables:

kubectl -n kube-system exec ds/cilium -- cilium-dbg service list | head

Each line maps a Service address to its backend pods, the same job, held in eBPF maps. That is what kube-proxy replacement means, and you now have both halves: the iptables version you built and broke, and the eBPF version your cluster runs.

The last two hops, quickly

How does anything know which pods are behind the Service? The endpointslice:

kubectl get endpointslices -n monitoring -l kubernetes.io/service-name=kps-grafana -o wide
kubectl get pods -n monitoring -o wide | grep -i grafana

The EndpointSlice addresses match the Grafana pod’s. The ClusterIP out front never changes while this list updates as pods reschedule.
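Rather than comparing the two outputs by eye, you can dump each side to a file and compare them as sets. A small sketch follows; the function name same_set and the /tmp paths are invented here, and the pod label selector app.kubernetes.io/name=grafana is an assumption about how the chart labels its pods, so adjust it to whatever kubectl get pods --show-labels reports.

```shell
#!/bin/sh
# Succeed when the two files contain the same set of lines,
# ignoring order and duplicates.
same_set() {
  a=$(sort -u "$1")
  b=$(sort -u "$2")
  [ "$a" = "$b" ]
}

# Usage (label selector is an assumption, see above):
#   kubectl get endpointslices -n monitoring \
#     -l kubernetes.io/service-name=kps-grafana \
#     -o jsonpath='{range .items[*].endpoints[*]}{.addresses[0]}{"\n"}{end}' \
#     > /tmp/slice.txt
#   kubectl get pods -n monitoring -l app.kubernetes.io/name=grafana \
#     -o jsonpath='{range .items[*]}{.status.podIP}{"\n"}{end}' \
#     > /tmp/pods.txt
#   same_set /tmp/slice.txt /tmp/pods.txt && echo "slice matches pods"
```

A mismatch usually just means a pod was rescheduled between the two kubectl calls; run both again.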

And the pod address itself, in 10.244.0.0/16, is handed out by Cilium as the cni. A pod on one node reaches a pod on another with no translation in between. devata removed Flannel to give that job cleanly to Cilium, for the same reason it removed kube-proxy: it had become dead weight behind the component that actually worked.

The full path

The complete trace: a name resolved by coredns to a clusterip, that virtual address rewritten by a cilium eBPF program to a real pod address from the cni range, with the endpointslice supplying the set of backend pods. On a kube-proxy cluster the rewrite step is an iptables DNAT rule instead, the version you built and then removed, watching a new Service get no route while every node still reported Ready.
