The Simplest Multi-Node Kubernetes Cluster

August 10, 2020

In the last post of this series, we got networking working on a single Kubernetes node using the bridge CNI plugin. (If you haven't read that post, you might want to take a quick look—I'm going to assume you've read and maybe even understood it 😉.) So let's get our cluster set up with multiple nodes! Thankfully it won't be too difficult now that we have a good idea of how container networking and CNI work.

Prerequisites for Following Along

The bad news about Kubernetes networking is that it's hard to set up in a truly generic way. Each cloud provider handles networking slightly differently, and cloud networking is quite a bit different from what you'd typically see in a "traditional" data center. I set this up in my homelab, which runs OpenStack (a mildly insane choice—I'll probably give more details in a future blog post). Things will probably look a bit different if you're trying to follow along in AWS or GCP.

With that caveat, here's the setup I used:

One VM named mink8s (the one we've been working with so far), with internal IP 10.70.10.228.
Another VM named mink8s2, with internal IP 10.70.10.248.

Both VMs are attached to the same network (this would be a VPC in AWS or GCP), with port security disabled.

Yes that's right—I completely disabled network-level security and firewalls to get this working (more on that later). In the spirit of this whole blog series so far, the setup is spectacularly insecure. Not only should you not use it in production, but please don't try this on anything with a public IP. A good option is to set this up locally with something like VirtualBox, but it should also work with your cloud provider of choice with some tweaking (I haven't tested it yet on anything other than OpenStack).

In keeping with the Kubernetes network model that we discussed last time, you'll need to make sure that your two VMs can talk to each other directly over all TCP and UDP ports.

The High-Level Approach

There are two basic approaches to multi-node Kubernetes networking:

1. Use an overlay network, which is a virtual network that sits on top of your "real" (aka "underlay") network. In this setup, packets that move between hosts are encapsulated somehow (VXLAN seems to be a popular choice), and the network that your pods "see" will be different from your host's network. This is the approach that flannel uses by default.
2. Use the native network (i.e. whatever network your Kubernetes hosts are attached to) and route traffic using standard IP routing protocols. This means that additional work has to be done to set up routes and non-conflicting IP addresses (one popular way to set up routes is to use BGP). This is the approach that Calico uses by default.

Both approaches have advantages and disadvantages, and in the real world the split isn't always clean—for instance, if you're running in a cloud environment you're probably already using an overlay network for your VMs. But we're going with approach 2 for the boring reason that it will be easier to set up. We'll sidestep the tricky issues of IP overlaps and routing by doing some good old-fashioned hardcoding and manual setup.

Getting the Second Node Up and Running

If you followed along with the last post, our second Kubernetes node should be almost ready. For the lazy reader, here are the instructions for getting it up and running:

ubuntu@mink8s2:~$ sudo apt install docker.io
ubuntu@mink8s2:~$ sudo systemctl enable docker
ubuntu@mink8s2:~$ sudo systemctl start docker
ubuntu@mink8s2:~$ curl -L https://storage.googleapis.com/kubernetes-release/release/v1.18.5/bin/linux/amd64/kubelet > kubelet
ubuntu@mink8s2:~$ curl -L https://storage.googleapis.com/kubernetes-release/release/v1.18.5/bin/linux/amd64/kubectl > kubectl
ubuntu@mink8s2:~$ chmod +x kubelet
ubuntu@mink8s2:~$ chmod +x kubectl
ubuntu@mink8s2:~$ API_IP=10.70.10.228 # set to your original node's IP
ubuntu@mink8s2:~$ cat <<EOS > kubeconfig.yaml
apiVersion: v1
kind: Config
clusters:
- cluster:
    server: http://$API_IP:8080
  name: mink8s
contexts:
- context:
    cluster: mink8s
  name: mink8s
current-context: mink8s
EOS
ubuntu@mink8s2:~$ mkdir -p .kube
ubuntu@mink8s2:~$ cp kubeconfig.yaml .kube/config

We'll also have to install the official CNI plugins and adjust Docker settings, just like we did on the first node:

$ curl -L https://github.com/containernetworking/plugins/releases/download/v0.8.6/cni-plugins-linux-amd64-v0.8.6.tgz > cni.tgz
$ sudo mkdir -p /opt/cni/bin
$ sudo tar xzvf cni.tgz -C /opt/cni/bin
$ cat <<EOS | sudo tee /etc/docker/daemon.json
{
  "exec-opts": ["native.cgroupdriver=cgroupfs"],
  "log-driver": "json-file",
  "log-opts": {
    "max-size": "100m"
  },
  "bridge": "none",
  "iptables": false,
  "ip-masq": false,
  "storage-driver": "overlay2"
}
EOS

(And I can't stress enough: after you change that Docker config file, reboot! I don't want to disclose how many hours of hair-pulling I went through debugging CNI<>Docker networking issues.)

Finally, we'll make our CNI configuration in /etc/cni/net.d/bridge.conf:

{
    "cniVersion": "0.3.1",
    "name": "mink8s",
    "type": "bridge",
    "bridge": "mink8s0",
    "isGateway": true,
    "ipMasq": true,
    "mtu": 1450,
    "ipam": {
        "type": "host-local",
        "ranges": [
          [{"subnet": "10.12.2.0/24"}]
        ],
        "routes": [{"dst": "0.0.0.0/0"}]
    }
}

This looks almost identical to the config on our original node with one important difference: we're using the 10.12.2.0/24 subnet for our pods instead of 10.12.1.0/24. Since we're using host-local IPAM, we'll have to manually make sure that each node gets a non-overlapping set of IPs to use for pods. (It doesn't really matter what the ranges are as long as they don't overlap with anything else on the network.)
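Since nothing enforces the no-overlap rule for us, a quick sanity check before adding a node can save some pain. What follows is a minimal sketch in plain shell, not part of the setup above: the function names ip_to_int, cidr_range and cidrs_overlap are made up for illustration, and it assumes IPv4 CIDRs and a shell with 64-bit arithmetic (bash and dash should both qualify).

```sh
# Hypothetical helpers for checking that two pod subnets don't overlap.

# Convert a dotted-quad IPv4 address to an integer.
ip_to_int() (
  IFS=.
  set -- $1      # split the address on dots into $1..$4
  echo $(( ($1 << 24) | ($2 << 16) | ($3 << 8) | $4 ))
)

# Print "<first> <last>" (as integers) for a CIDR such as 10.12.1.0/24.
cidr_range() (
  addr=$(ip_to_int "${1%/*}")
  prefix=${1#*/}
  mask=$(( (0xFFFFFFFF << (32 - prefix)) & 0xFFFFFFFF ))
  first=$(( addr & mask ))
  last=$(( first | (~mask & 0xFFFFFFFF) ))
  echo "$first $last"
)

# Succeeds (exit status 0) if the two CIDRs share at least one address.
cidrs_overlap() (
  set -- $(cidr_range "$1") $(cidr_range "$2")
  # $1,$2 = first/last of the first CIDR; $3,$4 = first/last of the second
  [ "$1" -le "$4" ] && [ "$3" -le "$2" ]
)

# The two pod subnets from this series are disjoint:
cidrs_overlap 10.12.1.0/24 10.12.2.0/24 || echo "pod subnets do not overlap"
```

The same check could be run against the host network (10.70.0.0/16 in my setup) to confirm that the pod ranges stay clear of it.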

The sharp-eyed reader will also notice the new mtu option. That just sets the MTU on our pods' and bridge's virtual network devices. I'm including it here because OpenStack virtual network devices have an MTU of 1450 (instead of the standard value of 1500). You should just set it to whatever your host's Ethernet adapter's value is (you can find it pretty easily with ip link; it differs across cloud providers). If your pods' MTU is greater than your host's link's MTU, you might run into IP fragmentation issues when communicating across nodes, which are a nightmare to debug. (You should set this option on your original mink8s node as well.)
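If you'd rather not eyeball the ip link output, here's a rough sketch of pulling the MTU out of it. The helper names link_mtus and mtu_fits are invented for this example, and the parsing assumes the one-line-per-device format that ip -o link show prints (device name in the second field, followed somewhere by "mtu <value>").

```sh
# Hypothetical helper: reads `ip -o link show` output on stdin and prints
# "<device> <mtu>" for each link it finds.
link_mtus() {
  awk '{
    dev = $2
    sub(/:$/, "", dev)     # strip the trailing colon
    sub(/@.*/, "", dev)    # strip "@ifN" suffixes on veth devices
    for (i = 3; i < NF; i++)
      if ($i == "mtu") print dev, $(i + 1)
  }'
}

# Succeeds if the CNI MTU ($2) does not exceed the host link MTU ($1).
mtu_fits() { [ "$2" -le "$1" ]; }

# Usage on a real host (not run here):
#   ip -o link show | link_mtus
#   mtu_fits 1450 1450 && echo "CNI mtu is safe for this link"
```

On Linux the same number is also available in /sys/class/net/<device>/mtu, which avoids the parsing altogether.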

Anyways, let's fire up kubelet to get our node up and running:

sudo ./kubelet --network-plugin=cni --kubeconfig=kubeconfig.yaml

And we'll also run a sleep pod on the new node:

$ cat <<EOS | ./kubectl apply -f -
apiVersion: v1
kind: Pod
metadata:
  name: sleep3
spec:
  containers:
  - image: alpine
    name: alpine
    command: ["sleep", "5000000"]
  nodeName: mink8s2
EOS

It looks like things are going OK so far; our node has been registered and our pod gets an IP in the right range at least:

$ ./kubectl get no
NAME      STATUS   ROLES    AGE     VERSION
mink8s    Ready    <none>   33d     v1.18.5
mink8s2   Ready    <none>   5d21h   v1.18.5
$ ./kubectl get po -owide
NAME     READY   STATUS    RESTARTS   AGE    IP           NODE      NOMINATED NODE   READINESS GATES
sleep1   1/1     Running   0          43m    10.12.1.15   mink8s    <none>           <none>
sleep2   1/1     Running   0          104m   10.12.1.14   mink8s    <none>           <none>
sleep3   1/1     Running   0          25m    10.12.2.6    mink8s2   <none>           <none>

But pods still can't ping each other across nodes, so we have some work to do:

$ ./kubectl exec sleep1 -- ping -c 3 10.12.2.6
PING 10.12.2.6 (10.12.2.6): 56 data bytes

--- 10.12.2.6 ping statistics ---
3 packets transmitted, 0 packets received, 100% packet loss
command terminated with exit code 1

A Packet's Incredible Journey

At this point it's worth zooming out a bit to understand how we're expecting our network packets to get to pods in other nodes. Here's a rough diagram of what we're looking for (with IPs from my setup):

So any packet that goes across nodes will have to deal with three hops:

1. First, it will have to get from the pod to the relevant bridge (via the veth pair we discussed in the last post).
2. Next, it will have to get to the destination node over the "real" network adapter, ens3 in my case. (Implicitly, the packet has to get from the bridge to the network adapter, but that's handled internally by the kernel.)
3. Once the packet arrives at its destination host, it has to be routed through the relevant bridge to the destination pod.

It turns out that we have very little work to do to get this routing setup working. In fact, hops 1 and 3 have already been set up by the CNI bridge plugin.

First let's check hop 1 by examining the pod's routing table:

$ ./kubectl exec sleep1 -- ip route
default via 10.12.1.1 dev eth0
10.12.1.0/24 dev eth0 scope link  src 10.12.1.15

The default route goes to 10.12.1.1, which is the bridge's IP address—exactly what we want. (If you remember, this was specified in our CNI network config under ipam.routes.)

For hop 3, we'll check the routing table on the mink8s2 node:

ubuntu@mink8s2:~$ ip route
default via 10.70.0.1 dev ens3 proto dhcp src 10.70.10.248 metric 100
10.12.2.0/24 dev mink8s0 proto kernel scope link src 10.12.2.1
10.70.0.0/16 dev ens3 proto kernel scope link src 10.70.10.248
169.254.169.254 via 10.70.10.1 dev ens3 proto dhcp src 10.70.10.248 metric 100

Packets destined for 10.12.2.0/24 will be routed through our mink8s0 bridge, which again is exactly what we want. This route was automatically set up for us by the bridge plugin.
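To make the kernel's choice a bit less magical: when several routes match a destination, the most specific one (the longest prefix) wins. Here's a toy sketch of that rule in plain shell, fed with the three CIDR routes from the table above. The function names are made up, "default" is written as 0.0.0.0/0, and the real kernel lookup also considers things like metrics and policy rules that this ignores.

```sh
# Convert a dotted-quad IPv4 address to an integer.
ip_to_int() (
  IFS=.
  set -- $1
  echo $(( ($1 << 24) | ($2 << 16) | ($3 << 8) | $4 ))
)

# Succeeds if address $1 falls inside CIDR $2.
in_cidr() (
  addr=$(ip_to_int "$1")
  net=$(ip_to_int "${2%/*}")
  prefix=${2#*/}
  mask=$(( (0xFFFFFFFF << (32 - prefix)) & 0xFFFFFFFF ))
  [ $(( addr & mask )) -eq $(( net & mask )) ]
)

# Reads "<cidr> <description>" lines on stdin and prints the description of
# the longest-prefix route that contains address $1.
best_route() (
  dest=$1
  best_len=-1
  best=
  while read -r cidr label; do
    len=${cidr#*/}
    if in_cidr "$dest" "$cidr" && [ "$len" -gt "$best_len" ]; then
      best_len=$len
      best=$label
    fi
  done
  echo "$best"
)

# mink8s2's table, looking up a local pod and then a pod on the other node:
for dest in 10.12.2.6 10.12.1.15; do
  best_route "$dest" <<EOF
0.0.0.0/0 via 10.70.0.1 dev ens3
10.12.2.0/24 dev mink8s0
10.70.0.0/16 dev ens3
EOF
done
```

The first lookup lands on the mink8s0 bridge, while the second falls all the way through to the default route, which is the gap the rest of this section deals with. On a real host, ip route get <address> asks the kernel the same question directly.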

So hop 2 is the only one we have to worry about. Since 10.12.2.0/24 isn't in the routing table on our original mink8s node, the kernel there will try to route those packets over the default route, which happens to be 10.70.0.1 (the Internet gateway). That obviously won't work (unless the gateway itself has some fancy routing configuration; hold onto that thought), but we can just add a route manually using the ip command:

ubuntu@mink8s:~$ sudo ip route add 10.12.2.0/24 via 10.70.10.248 dev ens3

That translates to "route any packet destined for 10.12.2.0/24 through 10.70.10.248 (our mink8s2 node) over the ens3 link." Of course we'll also want response packets to be able to get to our mink8s node, so we have to make a corresponding route on the mink8s2 node:

ubuntu@mink8s2:~$ sudo ip route add 10.12.1.0/24 via 10.70.10.228 dev ens3

Once that's done, it looks like the routing works just like we expected (which we can verify with traceroute):

ubuntu@mink8s:~$ ./kubectl exec sleep1 -- ping -c 3 10.12.2.6
PING 10.12.2.6 (10.12.2.6): 56 data bytes
64 bytes from 10.12.2.6: seq=0 ttl=62 time=0.639 ms
64 bytes from 10.12.2.6: seq=1 ttl=62 time=0.586 ms
64 bytes from 10.12.2.6: seq=2 ttl=62 time=0.506 ms
ubuntu@mink8s:~$ ./kubectl exec sleep1 -- traceroute 10.12.2.6
traceroute to 10.12.2.6 (10.12.2.6), 30 hops max, 46 byte packets
 1  10.12.1.1 (10.12.1.1)  0.016 ms  0.065 ms  0.013 ms
 2  10.70.10.248 (10.70.10.248)  0.768 ms  0.491 ms  0.264 ms
 3  10.12.2.6 (10.12.2.6)  0.434 ms  1.041 ms  0.510 ms

Success! (?)

So we've successfully implemented the Kubernetes network model for two nodes, and all it took was a couple of ip route commands.¹ You can try playing around with nginx pods on both nodes to confirm that everything looks sane.

To be more specific: "it works on my machine(s)" 😛. There are many reasons this might not work for you, since as I mentioned earlier networking setups tend to vary wildly between environments. Cloud environments usually prefer handling routing tables at the network level instead of within VMs (in our diagram above, that would mean that the blue switch in the middle would be where the routes are configured).

Even in a more "traditional" networking setup, anti-MAC-spoofing protections and other security measures can get in the way of this sort of routing (that's why I had to disable port security in OpenStack). If you try this out and run into issues, drop me a line!

But a more serious problem is that this setup is not very scalable. We had to execute one ip route command on each of our two nodes, but with n total nodes we'll have to add n-1 routes per node, or n(n-1) routes across the whole cluster. That will get tedious very quickly! In the next post, I'll try to show how we can automate the tedium away.
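To get a feel for how the manual work grows, here's a small dry-run sketch that only prints the ip route commands a given node would need, without running any of them. The function name routes_for_node and the "<node IP> <pod subnet>" input format are assumptions made up for this example; the first two entries match my nodes.

```sh
# Hypothetical helper: reads "<node-ip> <pod-cidr>" lines on stdin and, for
# the node whose IP is $1, prints one `ip route add` command per *other* node.
# $2 is the host's network device (ens3 in my case).
routes_for_node() {
  awk -v self="$1" -v dev="$2" '
    NF >= 2 && $1 != self {
      printf "ip route add %s via %s dev %s\n", $2, $1, dev
    }
  '
}

# Dry run for the original mink8s node:
routes_for_node 10.70.10.228 ens3 <<EOF
10.70.10.228 10.12.1.0/24
10.70.10.248 10.12.2.0/24
EOF
# prints: ip route add 10.12.2.0/24 via 10.70.10.248 dev ens3
```

Every extra line in the node list adds one more command to every other node's output, which is where the n-1 routes per node come from.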

¹ It's a little unclear to me whether we're conforming to the "pods on a node can communicate with all pods on all nodes without NAT" requirement of the networking model, since IP masquerading will apply to pod-to-pod traffic. But this whole setup is very similar to the one in Kubernetes The Hard Way and I trust Kelsey Hightower to have gotten it right. See this blog for some more discussion.