Deconstructing Kubernetes Networking

In the first post of this series, which I've decided to call "Deconstructing Kubernetes", we set up an extremely basic, more-or-less-functional Kubernetes cluster with one node. The logical next step is to go multi-node—how hard could it be?

Quite hard, it turns out! (So hard that we won't be able to do it in one blog post.) As soon as we move beyond one node, we have to deal with container networking across hosts, which involves a lot of intricacies. But it's an interesting exercise to dive into the mud and figure out how the various networking pieces fit together.

Failing to Go Multi-Node

Let's first try to set up a naïve two-node cluster to see what we're up against. We'll need to allow kubelet on the second node to talk to the API server on our existing node. In the spirit of completely ignoring security to keep blog post length manageable, we can just allow open access to the Kubernetes API. Edit pods/kube-apiserver.yaml to look like this:

apiVersion: v1
kind: Pod
metadata:
  name: kube-apiserver
  namespace: kube-system
spec:
  containers:
  - name: kube-apiserver
    command:
    - kube-apiserver
    - --etcd-servers=http://127.0.0.1:2379
    - --insecure-bind-address=0.0.0.0
    image: k8s.gcr.io/kube-apiserver:v1.18.5
  hostNetwork: true

(Notice the insecure-bind-address option. This setup is fantastically insecure, and you probably shouldn't follow along with this section if your VMs have public IP addresses.)
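
(If you want a quick sanity check that the API server really is listening on all interfaces, something like the following from another machine should do the trick; the /healthz endpoint on the insecure port just returns a plain "ok". Substitute your own node's IP, of course.)

# Run from any machine that can reach the node
$ curl http://10.70.10.228:8080/healthz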

Next, we'll set up a second VM that can communicate with the first node. The setup is very similar to what we did for the original node; let's just rush through it:

ubuntu@mink8s2:~$ sudo apt install docker.io
ubuntu@mink8s2:~$ sudo systemctl enable docker
ubuntu@mink8s2:~$ sudo systemctl start docker
ubuntu@mink8s2:~$ curl -L https://storage.googleapis.com/kubernetes-release/release/v1.18.5/bin/linux/amd64/kubelet > kubelet
ubuntu@mink8s2:~$ curl -L https://storage.googleapis.com/kubernetes-release/release/v1.18.5/bin/linux/amd64/kubectl > kubectl
ubuntu@mink8s2:~$ chmod +x kubelet
ubuntu@mink8s2:~$ chmod +x kubectl
ubuntu@mink8s2:~$ API_IP=10.70.10.228 # set to your original node's IP
ubuntu@mink8s2:~$ cat <<EOS > kubeconfig.yaml
apiVersion: v1
kind: Config
clusters:
- cluster:
    server: http://$API_IP:8080
  name: mink8s
contexts:
- context:
    cluster: mink8s
  name: mink8s
current-context: mink8s
EOS
ubuntu@mink8s2:~$ mkdir -p .kube
ubuntu@mink8s2:~$ cp kubeconfig.yaml .kube/config

So far the only difference between this and our original node is that we're pointing to our original node's IP instead of 127.0.0.1 in our kubeconfig files. (In general, there isn't much of a distinction between Kubernetes "control nodes" and "worker nodes"—basically the "control plane" just means whatever nodes are running the Kubernetes API server.) Let's fire up kubelet and see what happens:

ubuntu@mink8s2:~$ sudo ./kubelet --kubeconfig=kubeconfig.yaml

From another terminal, we can try running pods on our new node:

ubuntu@mink8s2:~$ cat <<EOS | ./kubectl apply -f -
apiVersion: v1
kind: Pod
metadata:
  name: nginx2
spec:
  containers:
  - image: nginx
    name: nginx
  nodeName: mink8s2
EOS

pod/nginx2 created
ubuntu@mink8s2:~$ ./kubectl get po nginx2 -owide
NAME     READY   STATUS    RESTARTS   AGE   IP           NODE      NOMINATED NODE   READINESS GATES
nginx2   1/1     Running   0          69s   172.17.0.3   mink8s2   <none>           <none>
ubuntu@mink8s2:~$ curl -s 172.17.0.3 | head -4
<!DOCTYPE html>
<html>
<head>
<title>Welcome to nginx!</title>

So it looks like our second node works. That was easy! End of blog post!

But Networking…

Not so fast! A quick check shows that pod-to-pod networking is not working across nodes:

ubuntu@mink8s2:~$ cat <<EOS | ./kubectl apply -f -
apiVersion: v1
kind: Pod
metadata:
  name: curlfail
spec:
  containers:
  - image: curlimages/curl
    name: curl
    command: ["curl", "172.17.0.3"]
  nodeName: mink8s
EOS
pod/curlfail created
ubuntu@mink8s2:~$ ./kubectl logs curlfail
curl: (7) Couldn't connect to server

What's going on here? An odd thing about Kubernetes is that it doesn't actually handle networking at all, but instead outsources the configuration to external plugins. Thus far in our journey, kubelet has been relying on Docker to set up networking, but Docker doesn't set up any routing between hosts. We'll need a different solution to set up a multi-node cluster.
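
(A quick way to convince yourself of this: look at the routes Docker sets up on each node. By default both hosts use the exact same 172.17.0.0/16 subnet on their local docker0 bridge, and neither kernel has any idea how to reach the other node's containers. Your exact output will vary, which is why I'm not showing it here.)

# Run this on each node; by default both will show the same
# 172.17.0.0/16 subnet hanging off the local docker0 bridge
$ ip route show dev docker0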

So how should we configure pod-to-pod networking? The Kubernetes docs have a description of the "networking model", i.e. the rules that all Kubernetes clusters are supposed to follow:

Kubernetes imposes the following fundamental requirements on any networking implementation (barring any intentional network segmentation policies):

  • pods on a node can communicate with all pods on all nodes without NAT
  • agents on a node (e.g. system daemons, kubelet) can communicate with all pods on that node

….

  • pods in the host network of a node can communicate with all pods on all nodes without NAT

So in order to have a "real Kubernetes cluster", "everything" needs to be able to communicate with "everything" (more or less). The "without NAT" requirement is a bit confusing (at least for me), since NAT ends up being important in some networking implementations, but the key point is that each pod gets its own IP address that's valid across the entire cluster (which implies that IP address overlaps are forbidden).

The standard way to set up networking in a Kubernetes cluster is to use a plugin like Calico or Flannel. But there's very little to learn from that! Instead, we'll dive into the dark arts of container networking and try to implement the networking model ourselves.

netns, bridges, and veths, Oh My!

Let's take a step back and look at how container networking actually works. (Warning: things are going to get a bit networky from here on out! I'll assume you know some networking basics like layer 2 vs layer 3 switching/routing and CIDR notation, but I'll try to keep things as simple as possible.)

The technology at the heart of containerization is Linux namespacing, which allows for isolation of various resources without full OS-level virtualization. The kind of namespace we care about here is a network namespace (aka "netns"), which provides a full copy of the Linux networking stack that's completely isolated from the "main" one. Every Kubernetes pod gets its own network namespace (if there are multiple containers in a pod, they share the same namespace).

A network namespace begins its life as a blank slate with no network devices. In order for anything useful to happen, the network has to be configured. There are many ways to configure pod/container networking, but a lot of them take advantage of a couple of powerful (and painfully underdocumented) Linux networking features:

  • bridges: bridges are like virtual network switches that live within the Linux kernel.
  • veths: veths are like virtual network cables that attach two network devices (physical or virtual). They always come in pairs, one for each end of the "cable".

The steps for setting up networking on a pod look something like:

  1. Add a bridge (typically there will be one per host).
  2. Create a netns for the pod (there will be one per pod).
  3. Add a veth pair with one end of the pair in the pod's netns and the other end connected to the bridge.
  4. Assign IP addresses and add routes as necessary.

Here's a mediocre diagram of the setup we're looking for, with a couple of pods attached to a bridge over veth pairs:

That's all a bit abstract; I find it easier to understand by actually setting everything up manually. You can use the ip command to configure a netns with the appropriate bridge/veths without having to actually make a container:

# Create a netns named "test"
$ sudo ip netns add test
# Create a bridge named "test0"
$ sudo ip link add name test0 type bridge
# Create a veth pair with testveth0<->eth0 as the endpoints
$ sudo ip link add testveth0 type veth peer name eth0
# Move the eth0 side of the veth pair to the "test" netns
$ sudo ip link set eth0 netns test
# "Plug in" the testveth0 side of the veth pair to the test0 bridge
$ sudo ip link set testveth0 master test0
# Bring up the testveth0 side of the veth pair
$ sudo ip link set testveth0 up
# Bring up the eth0 side of the veth pair
$ sudo ip -n test link set eth0 up
# List network devices in the "main" namespace
$ ip link
1: lo: <LOOPBACK,UP,LOWER_UP> mtu 65536 qdisc noqueue state UNKNOWN mode DEFAULT group default qlen 1000
    link/loopback 00:00:00:00:00:00 brd 00:00:00:00:00:00
2: ens3: <BROADCAST,MULTICAST,UP,LOWER_UP> mtu 1450 qdisc fq_codel state UP mode DEFAULT group default qlen 1000
    link/ether fa:16:3e:cf:81:3d brd ff:ff:ff:ff:ff:ff
11: test0: <BROADCAST,MULTICAST,UP,LOWER_UP> mtu 1500 qdisc noqueue state UP mode DEFAULT group default qlen 1000
    link/ether 4a:13:0d:fb:9f:a4 brd ff:ff:ff:ff:ff:ff
15: testveth0@if14: <BROADCAST,MULTICAST,UP,LOWER_UP> mtu 1500 qdisc noqueue master test0 state UP mode DEFAULT group default qlen 1000
    link/ether 4a:13:0d:fb:9f:a4 brd ff:ff:ff:ff:ff:ff link-netnsid 1
# List network devices in the "test" namespace
$ sudo ip -n test link
1: lo: <LOOPBACK> mtu 65536 qdisc noop state DOWN mode DEFAULT group default qlen 1000
    link/loopback 00:00:00:00:00:00 brd 00:00:00:00:00:00
14: eth0@if15: <BROADCAST,MULTICAST,UP,LOWER_UP> mtu 1500 qdisc noqueue state UP mode DEFAULT group default qlen 1000
    link/ether a6:4f:92:e2:4b:1e brd ff:ff:ff:ff:ff:ff link-netnsid 0

Don't worry if you don't understand all of that ip output (I'm personally at about 40% comprehension)—the important thing to note is that the "view" of the network looks entirely different from within the test namespace, but bridges and veths give us a way to communicate across namespaces.
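
(If you want to take the manual setup one step further, you can also do step 4 from the list above by hand. This is an optional aside using made-up example addresses; nothing later in the post depends on it. Give the bridge and the namespaced eth0 addresses on the same subnet, and you can ping across the namespace boundary:)

# Make sure the bridge itself is up
$ sudo ip link set test0 up
# Give the bridge an address on an arbitrary scratch subnet
$ sudo ip addr add 10.99.0.1/24 dev test0
# Give eth0 inside the "test" netns an address on the same subnet
$ sudo ip -n test addr add 10.99.0.2/24 dev eth0
# Ping from the main namespace into the "test" namespace
$ ping -c 1 10.99.0.2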

How can we get to this setup for our Kubernetes cluster? As I mentioned before, Kubernetes doesn't handle any networking itself—instead, it outsources the configuration to an external plugin. The standard that networking plugins generally use is called the Container Network Interface (CNI). A CNI plugin is basically just a binary that follows the CNI specification; it has the somewhat arbitrary job of setting up a container's network after the network namespace has been created but before the container starts.

Back to Our Node

Enough theory, let's actually set up CNI! For the rest of this post, we'll be working with our original node (mink8s in my case).

First we need to pick an IP range for our pods. I'm going to arbitrarily decide that my mink8s node is going to use the 10.12.1.0/24 range (i.e. 10.12.1.0 - 10.12.1.255). That gives us more than enough IPs to work with for our purposes. (When we go multi-node we can give the other nodes in our cluster similar ranges.)

The first thing we'll have to do (to save many hours of debugging woes) is to disable Docker's built-in networking entirely. For boring historical reasons, Docker does not use CNI, and its built-in solution interferes with the setup we're going for. Edit /etc/docker/daemon.json to look like this:

{
  "exec-opts": ["native.cgroupdriver=cgroupfs"],
  "log-driver": "json-file",
  "log-opts": {
    "max-size": "100m"
  },
  "bridge": "none",
  "iptables": false,
  "ip-masq": false,
  "storage-driver": "overlay2"
}

Most of these settings aren't important for our purposes, but the bridge, iptables, and ip-masq options are critical. Once you've edited that file, reboot the machine to clear out old network settings and iptables rules. (Trust me, this will make your life much easier! It's also probably a good idea to delete any existing pods you have running to avoid confusion.)
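
(Clearing out the old pods is just a matter of something like the following, assuming kubectl is pointed at the cluster as before:)

$ ./kubectl delete po --all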

Now we'll have to get CNI up and running. We're going to use the example plugins provided by the CNI project; by convention, the binaries live in /opt/cni/bin:

$ curl -L https://github.com/containernetworking/plugins/releases/download/v0.8.6/cni-plugins-linux-amd64-v0.8.6.tgz > cni.tgz
$ sudo mkdir -p /opt/cni/bin
$ sudo tar xzvf cni.tgz -C /opt/cni/bin

Now we'll make a CNI network configuration file that will use the bridge CNI plugin, which sets up networking according to the basic scheme outlined earlier. Confusingly, to use CNI we actually need to configure two plugins: a "main" plugin and an "IPAM" plugin (IPAM stands for IP Address Management). The IPAM plugin is responsible for allocating IPs for pods while the main plugin does most of the rest of the configuration. We'll be using the host-local IPAM plugin, which just allocates IPs from a range and makes sure there are no overlaps on the host.

OK enough theory—let's take a first crack at a minimal CNI configuration. Kubelet will look for CNI configuration files in the /etc/cni/net.d directory by default. Put the following in /etc/cni/net.d/mink8s.conf:

{
    "cniVersion": "0.3.1",
    "name": "mink8s",
    "type": "bridge",
    "bridge": "mink8s0",
    "ipam": {
        "type": "host-local",
        "ranges": [
          [{"subnet": "10.12.1.0/24"}]
        ]
    }
}

To dissect that configuration a bit:

  • type and ipam.type specify the actual plugin binary names (so it will look for /opt/cni/bin/bridge and /opt/cni/bin/host-local for the plugins we're using).
  • bridge specifies the name of the network bridge that the bridge plugin will create.
  • ipam.ranges specifies the IP ranges to allocate to pods. In our case, we're going to allocate IPs in the 10.12.1.0/24 range.
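
(One aside that helped CNI click for me, which you can safely skip: since a CNI plugin is just a binary that reads this JSON config on stdin and takes its instructions from CNI_* environment variables, you can drive it by hand, outside of Kubernetes entirely. Here's a rough sketch against a hypothetical scratch namespace called cnitest; a successful ADD should print a JSON result describing the interface and IP it set up, and running the same thing with CNI_COMMAND=DEL tears it back down.)

$ sudo ip netns add cnitest
$ sudo env CNI_COMMAND=ADD CNI_CONTAINERID=cnitest CNI_NETNS=/var/run/netns/cnitest \
    CNI_IFNAME=eth0 CNI_PATH=/opt/cni/bin \
    /opt/cni/bin/bridge < /etc/cni/net.d/mink8s.conf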

Now we'll restart kubelet, this time passing the --network-plugin=cni option:

$ sudo ./kubelet --network-plugin=cni --pod-manifest-path=pods --kubeconfig=kubeconfig.yaml

And then we'll create two "sleeping" pods to see if networking actually works:

$ for i in 1 2; do cat <<EOS | ./kubectl apply -f - ; done
---
apiVersion: v1
kind: Pod
metadata:
  name: sleep${i}
spec:
  containers:
  - image: alpine
    name: alpine
    command: ["sleep", "5000000"]
  nodeName: mink8s
EOS

Some poking around shows that both pods get IP addresses and can ping each other, which is a great first step!

$ ./kubectl get po -owide
NAME     READY   STATUS    RESTARTS   AGE   IP          NODE     NOMINATED NODE   READINESS GATES
sleep1   1/1     Running   0          7s    10.12.1.4   mink8s   <none>           <none>
sleep2   1/1     Running   0          6s    10.12.1.5   mink8s   <none>           <none>
$ ./kubectl exec sleep1 -- ping 10.12.1.5
PING 10.12.1.5 (10.12.1.5): 56 data bytes
64 bytes from 10.12.1.5: seq=0 ttl=64 time=0.627 ms
64 bytes from 10.12.1.5: seq=1 ttl=64 time=0.075 ms
64 bytes from 10.12.1.5: seq=2 ttl=64 time=0.116 ms
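
(Another fun place to poke around: the host-local plugin tracks its allocations as plain files on disk, by default under /var/lib/cni/networks/<network name>, so you can see exactly which IPs it has handed out:)

# Each allocated IP shows up as a file named after the address
$ ls /var/lib/cni/networks/mink8s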

Some more poking around shows that the bridge plugin has indeed created a bridge named mink8s0 as well as a veth pair for each pod:

$ ip link
1: lo: <LOOPBACK,UP,LOWER_UP> mtu 65536 qdisc noqueue state UNKNOWN mode DEFAULT group default qlen 1000
    link/loopback 00:00:00:00:00:00 brd 00:00:00:00:00:00
2: ens3: <BROADCAST,MULTICAST,UP,LOWER_UP> mtu 1450 qdisc fq_codel state UP mode DEFAULT group default qlen 1000
    link/ether fa:16:3e:cf:81:3d brd ff:ff:ff:ff:ff:ff
3: mink8s0: <BROADCAST,MULTICAST,UP,LOWER_UP> mtu 1500 qdisc noqueue state UP mode DEFAULT group default qlen 1000
    link/ether 46:ee:b5:e0:67:a4 brd ff:ff:ff:ff:ff:ff
6: veth19e99be3@if3: <BROADCAST,MULTICAST,UP,LOWER_UP> mtu 1500 qdisc noqueue master mink8s0 state UP mode DEFAULT group default
    link/ether 46:ee:b5:e0:67:a4 brd ff:ff:ff:ff:ff:ff link-netnsid 0
7: veth5947e6fb@if3: <BROADCAST,MULTICAST,UP,LOWER_UP> mtu 1500 qdisc noqueue master mink8s0 state UP mode DEFAULT group default
    link/ether b2:b6:d4:49:fb:b9 brd ff:ff:ff:ff:ff:ff link-netnsid 1

(Annoyingly, the pod network namespaces get created in such a way that they don't show up in ip netns. But the link-netnsid attribute hints that each of these veths is indeed connected to a peer interface in another namespace.)
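
(If you really want to poke around inside a pod's network namespace from the host, one workaround is to go through the container's PID: ask Docker for the PID of the pod's container and use nsenter to run commands in its network namespace. CONTAINER_ID below is a placeholder for whatever docker ps shows for the pod's container.)

# Find the container's PID, then run "ip addr" inside its network namespace
$ PID=$(sudo docker inspect -f '{{.State.Pid}}' CONTAINER_ID)
$ sudo nsenter -t $PID -n ip addr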

We're still a ways off from implementing the full Kubernetes networking model, however. Pinging the pods from the host doesn't work (which you may remember is a requirement of the model), and neither does pinging the host from the pods (which I don't think is a strict requirement in theory but is going to be essential in practice):

$ HOST_IP=10.70.10.228 # set to whatever your host's internal IP address is
$ ping 10.12.1.4
PING 10.12.1.4 (10.12.1.4) 56(84) bytes of data.
^C
--- 10.12.1.4 ping statistics ---
3 packets transmitted, 0 received, 100% packet loss, time 2047ms

$ ./kubectl exec sleep1 -- ping $HOST_IP
PING 10.70.10.228 (10.70.10.228): 56 data bytes
ping: sendto: Network unreachable
command terminated with exit code 1

The reason the host and pods can't communicate with each other is that they're on different network subnets (in my case, 10.12.1.0/24 for the pods and 10.70.0.0/16 for the VM), which means they can't communicate directly over Ethernet and will need to use IP routing to find each other (for the networking-jargon-inclined: we need to go from layer 2 to layer 3). Linux bridges work on layer 2 by default, but can actually handle layer 3 routing just fine if you assign IP addresses to them. (You can confirm that the bridge doesn't currently have an IP address with ip addr show dev mink8s0.)

To configure the bridge to use layer 3 routing, we'll set the isGateway option in our CNI config file. Here's our next attempt at the configuration:

{
    "cniVersion": "0.3.1",
    "name": "mink8s",
    "type": "bridge",
    "bridge": "mink8s0",
    "isGateway": true,
    "ipam": {
        "type": "host-local",
        "ranges": [
          [{"subnet": "10.12.1.0/24"}]
        ]
    }
}

Whenever we change the CNI configuration, we'll want to delete and recreate all our pods, since the networking configuration is only used on pod creation/deletion. Once we do that, we find that the bridge has been given an IP address and we can ping the pods from the host, but pinging the host from the pods still doesn't work:

$ ip addr show dev mink8s0 | grep 10.12
    inet 10.12.1.1/24 brd 10.12.1.255 scope global mink8s0
$ ./kubectl get po -owide
NAME     READY   STATUS    RESTARTS   AGE     IP          NODE     NOMINATED NODE   READINESS GATES
sleep1   1/1     Running   0          5m56s   10.12.1.8   mink8s   <none>           <none>
sleep2   1/1     Running   0          5m55s   10.12.1.7   mink8s   <none>           <none>
$ ping -c 3 10.12.1.8
PING 10.12.1.8 (10.12.1.8) 56(84) bytes of data.
64 bytes from 10.12.1.8: icmp_seq=1 ttl=64 time=0.063 ms
64 bytes from 10.12.1.8: icmp_seq=2 ttl=64 time=0.087 ms
64 bytes from 10.12.1.8: icmp_seq=3 ttl=64 time=0.099 ms
$ ./kubectl exec sleep1 -- ping $HOST_IP
PING 10.70.10.228 (10.70.10.228): 56 data bytes
ping: sendto: Network unreachable
command terminated with exit code 1

The reason it's still not working is that the pod doesn't have a default route set up (you can confirm this with ./kubectl exec sleep1 -- ip route). We can solve this by adding a default route to 0.0.0.0/0 (i.e. everywhere) in our CNI config:

{
    "cniVersion": "0.3.1",
    "name": "mink8s",
    "type": "bridge",
    "bridge": "mink8s0",
    "isGateway": true,
    "ipam": {
        "type": "host-local",
        "ranges": [
          [{"subnet": "10.12.1.0/24"}]
        ],
        "routes": [{"dst": "0.0.0.0/0"}]
    }
}

(For reasons I don't entirely understand, setting up routes is the responsibility of the IPAM plugin instead of the bridge plugin.) Once that's saved and our pods have been killed and recreated, we see the default route is set up and pinging the host works fine:

$ ./kubectl exec sleep1 -- ip route
default via 10.12.1.1 dev eth0
10.12.1.0/24 dev eth0 scope link  src 10.12.1.9
$ ./kubectl exec sleep1 -- ping -c3 $HOST_IP
PING 10.70.10.228 (10.70.10.228): 56 data bytes
64 bytes from 10.70.10.228: seq=0 ttl=64 time=0.110 ms
64 bytes from 10.70.10.228: seq=1 ttl=64 time=0.269 ms
64 bytes from 10.70.10.228: seq=2 ttl=64 time=0.233 ms

Our pods can now talk to each other (on the same node) and the host and pods can also talk to each other. So technically you could say we've implemented the Kubernetes networking model for one node. But there's still a glaring omission, which we'll see if we try to ping an address outside of our network:

$ ./kubectl exec sleep1 -- ping -c3 1.1.1.1
PING 1.1.1.1 (1.1.1.1): 56 data bytes

--- 1.1.1.1 ping statistics ---
3 packets transmitted, 0 packets received, 100% packet loss
command terminated with exit code 1

Our pods can't reach the Internet! This isn't particularly surprising: the pod IPs live on a private bridge network that only our host knows about, so even if packets make it out through the host's Ethernet adapter, nothing upstream knows how to route the replies back.

To get outgoing Internet connectivity working, we'll need to set up NAT using the IP masquerade feature of iptables. (NAT is necessary in this case because all of our pods are going to share the external IP address of our host.) The bridge plugin has us covered with the ipMasq option. Let's save our final (for this blog post) CNI configuration:

{
    "cniVersion": "0.3.1",
    "name": "mink8s",
    "type": "bridge",
    "bridge": "mink8s0",
    "isGateway": true,
    "ipMasq": true,
    "ipam": {
        "type": "host-local",
        "ranges": [
          [{"subnet": "10.12.1.0/24"}]
        ],
        "routes": [{"dst": "0.0.0.0/0"}]
    }
}

Once that's applied, our pods can reach the Internet:

$ ./kubectl exec sleep1 -- ping -c3 1.1.1.1
PING 1.1.1.1 (1.1.1.1): 56 data bytes
64 bytes from 1.1.1.1: seq=0 ttl=51 time=4.343 ms
64 bytes from 1.1.1.1: seq=1 ttl=51 time=4.189 ms
64 bytes from 1.1.1.1: seq=2 ttl=51 time=4.285 ms

We can see the IP masquerade rules created by the plugin by poking around with iptables:

$ sudo iptables --list POSTROUTING --numeric --table nat
Chain POSTROUTING (policy ACCEPT)
target     prot opt source               destination
KUBE-POSTROUTING  all  --  0.0.0.0/0            0.0.0.0/0            /* kubernetes postrouting rules */
CNI-c07db3c8c34133af9e525bf4  all  --  10.12.1.11           0.0.0.0/0            /* name: "mink8s" id: "1793c831da4a054beebc6c4dad02088bb7bd4553d435972b7581cb135d349113" */
CNI-a874815f36fc490c823cf894  all  --  10.12.1.12           0.0.0.0/0            /* name: "mink8s" id: "f98855905b1b070f7aa7387c844308d53fbeeeba65a23a075cfe6f12ea516005" */
$ sudo iptables -L CNI-c07db3c8c34133af9e525bf4 -n -t nat
Chain CNI-c07db3c8c34133af9e525bf4 (1 references)
target     prot opt source               destination
ACCEPT     all  --  0.0.0.0/0            10.12.1.0/24         /* name: "mink8s" id: "1793c831da4a054beebc6c4dad02088bb7bd4553d435972b7581cb135d349113" */
MASQUERADE  all  --  0.0.0.0/0           !224.0.0.0/4          /* name: "mink8s" id: "1793c831da4a054beebc6c4dad02088bb7bd4553d435972b7581cb135d349113" */

For those not fluent in iptables-ese, here's a rough translation of these rules:

  • If a packet comes from a pod IP address, use a special iptables chain for that pod (e.g. in this example, 10.12.1.11 uses the CNI-c07db3c8c34133af9e525bf4 chain).
  • In that chain, if the packet isn't going to the pod's local network or a special multicast address (the 224.0.0.0/4 business), masquerade it.
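
(If you were wiring this up by hand instead of letting the bridge plugin do it, the rough equivalent would be a single masquerade rule along these lines. This is a simplified sketch, not exactly what the plugin installs:)

# Masquerade traffic from the pod subnet unless it's headed back into that subnet
$ sudo iptables -t nat -A POSTROUTING -s 10.12.1.0/24 ! -d 10.12.1.0/24 -j MASQUERADE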

Phew!

OK that got pretty long and complicated, so I think I'm going to call it a blog post. We more or less ended up where we started (networking working on a single Kubernetes node), but by switching from Docker to CNI-based networking we're in a good place to get multi-node networking working. And hopefully we learned something along the way!

Next time, we'll try to get a multi-node cluster up and running!