Kubernetes internals for developers

This post is aimed at folks who are familiar with the basics of Kubernetes. If you have deployed services into Kubernetes but want to know what's happening under the hood, this post is for you.

At its heart, Kubernetes is a collection of services, interfaces and standards that make it possible to operate and scale computing clusters. These tools are an amalgam of years of best practices from manually managing distributed Linux systems.


Control loop

A fundamental concept of Kubernetes is that it is not imperative; it is a declarative system. Developers submit manifests to the admin node and internal programs react to them, trying to match the desired state that the developers specify. It can be thought of like placing an order at a restaurant: the food doesn't come instantaneously; the kitchen staff figure out what needs to be done and then do it.
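
As a concrete illustration, you can watch this reconciliation happen by changing the desired replica count on a Deployment and letting the cluster converge on it. This is a sketch assuming a hypothetical Deployment named my-app already exists and kubectl is pointed at your cluster:

$ kubectl scale deployment my-app --replicas=6   # only the desired state changes here
$ kubectl get deployment my-app --watch          # watch the READY column converge on 6 as the control loop does its work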

The types of nodes

There are two types of nodes in a Kubernetes cluster: worker (or compute) nodes and admin (or master) nodes. The admin nodes house the Kubernetes control plane, the set of components that must run so that Kubernetes can manage and distribute jobs within the cluster. Compute nodes are where our pods live; they are the fleet of resources we have to run our jobs on. Nodes can join and leave the cluster, and Kubernetes will dynamically react to this, distributing jobs based on the resources available in the cluster.
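
If you have kubectl access to a cluster, you can see both kinds of nodes and what they have to offer (node names are whatever your cluster reports):

$ kubectl get nodes                   # the ROLES column marks admin nodes as control-plane (or master on older clusters)
$ kubectl describe node <node-name>   # capacity, allocatable resources and the pods currently scheduled on it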


Control Plane

At a minimum, the control plane consists of four services: the kube-apiserver, etcd, the kube-scheduler and the kube-controller-manager.

There are add-ons like DNS which are essentially mandatory for running a cluster in production, but I won't talk about them here today.

The kube-apiserver is the heart and soul of the system. All communications between control plane components and worker nodes are mediated by this service.
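
Every kubectl command is just an HTTP request to the kube-apiserver. You can see this for yourself by turning up kubectl's verbosity or hitting an API endpoint directly:

$ kubectl get pods -v=8        # logs the underlying REST calls made to the kube-apiserver
$ kubectl get --raw /healthz   # query an apiserver endpoint directly through kubectl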

etcd is a key-value store, and it is here that Kubernetes stores all cluster data. It is also where the objects we want to provision are recorded - e.g. when deploying a new pod.
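
If you are curious, on a kubeadm-style cluster you can peek at these keys directly. This is a sketch that assumes the stock etcd static pod (which ships etcdctl) and the default kubeadm certificate paths; it is for poking around, not something to script against:

$ kubectl -n kube-system exec etcd-<control-plane-node> -- etcdctl \
    --cacert=/etc/kubernetes/pki/etcd/ca.crt \
    --cert=/etc/kubernetes/pki/etcd/server.crt \
    --key=/etc/kubernetes/pki/etcd/server.key \
    get /registry/deployments --prefix --keys-only   # lists the keys under which Deployment objects are stored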

The kube-scheduler is the component that decides which node a pod should be assigned to, taking into account resource constraints and requirements.
One of the major philosophies of Kubernetes is to define implementable interfaces so that external solutions can replace the default services and end users can customise their deployment to meet their requirements. kube-scheduler is the default scheduler, but there are alternatives such as Yunikorn and Volcano which are geared towards scheduling big data workloads.
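
Opting into a different scheduler is just a field on the pod spec. A minimal sketch, assuming a scheduler registered under the name volcano has been installed in the cluster:

apiVersion: v1
kind: Pod
metadata:
  name: batch-job
spec:
  schedulerName: volcano   # falls back to default-scheduler if omitted
  containers:
  - name: main
    image: busybox
    command: ["sleep", "3600"]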

The kube-controller-manager is the component that runs the controllers, which react to changes and adjust the current state of the cluster towards the desired state. You will see how this works in detail in the example below.
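
On clusters bootstrapped with kubeadm, these components run as static pods in the kube-system namespace, so you can simply list them (managed offerings like GKE or EKS hide the control plane from you):

$ kubectl get pods -n kube-system   # expect kube-apiserver-*, etcd-*, kube-scheduler-* and kube-controller-manager-* pods, one per admin node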

Worker Nodes
Worker nodes are fairly simple; they just contain a container runtime, the kubelet and kube-proxy.

The nodes need to be able to run containers, which means that a container runtime such as the Docker Engine needs to be present on the machine. Container runtimes follow the Container Runtime Interface (CRI) specification, so any runtime that complies with it is compatible.
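
You can check which runtime a cluster is using from the node listing; inspecting containers on the node itself goes through the CRI as well (the crictl line assumes you have shell access to a node):

$ kubectl get nodes -o wide   # the CONTAINER-RUNTIME column shows e.g. containerd://... or cri-o://...
$ crictl ps                   # on the node: list running containers via the CRI, whatever the runtime is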

kubelet is an agent that runs on the worker machines and ensures that the pods that the node is supposed to be running are indeed running.
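
The kubelet itself is not a pod; it usually runs as a systemd service on the node. A quick way to look at it, assuming a systemd-based distro and a kubeadm-style install:

$ systemctl status kubelet      # the node agent running outside the cluster
$ ls /etc/kubernetes/manifests  # static pod manifests the kubelet runs directly (the control plane components on admin nodes)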

kube-proxy is, again, a reference implementation of a network proxy that maintains the network rules on the nodes. This component generally needs root access
and will automatically manipulate iptables rules to reflect the Service declarations passed down to it. This is what lets traffic for a Service reach the right pods,
wherever in the cluster they happen to be running.
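
In its default iptables mode you can see exactly what kube-proxy has written by inspecting the nat table on a node; a sketch assuming root access on a worker node:

$ sudo iptables -t nat -L KUBE-SERVICES -n | head   # one entry per Service
# each Service jumps to a KUBE-SVC-* chain, which load balances across KUBE-SEP-* chains (one per backing pod)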


The journey of a Kubernetes manifest

When we interact with the kubectl command line, we might fire off a command like the one below to tell Kubernetes to go and start our web service:

$ kubectl run my-website --image=nginx --replicas 4

This feels like it will be added to some sort of queue and executed imperatively, but to reiterate, Kubernetes is a declarative system. This means that there is a central control loop that constantly tries to reconcile our specifications with the state of the world.

Say we have the following deployment.yaml file.

apiVersion: apps/v1
kind: Deployment
metadata:
  name: nginx-deployment
  labels:
    app: nginx
spec:
  replicas: 3
  selector:
    matchLabels:
      app: nginx
  template:
    metadata:
      labels:
        app: nginx
    spec:
      containers:
      - name: nginx
        image: nginx:1.14.2
        ports:
        - containerPort: 80

When we execute the command

$ kubectl apply -f deployment.yaml

This manifest will be sent off to the kube-apiserver that lives on the control plane.
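
As an aside, because the cluster simply stores desired state, you can ask it to show the difference between the manifest on disk and what is currently recorded before changing anything:

$ kubectl diff -f deployment.yaml   # compares the file against the live object; exits non-zero if they differ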

The kube-apiserver will validate this, convert it into its internal representation, and store the associated objects in etcd.
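
Everything we later read back with kubectl comes out of etcd via this same kube-apiserver. Each stored object has two halves: the spec we asked for and the status the control loops have observed so far:

$ kubectl get deployment nginx-deployment -o yaml
# .spec.replicas is the desired state; .status.readyReplicas is how far the cluster has got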

There are several controllers, managed by the kube-controller-manager, watching this data (via the kube-apiserver). Each resource type has a controller which makes changes attempting to move the current state towards the desired state.

In our hypothetical deployment we have specified 3 replicas. The deployment controller, having seen a new Deployment object appear in etcd (added by the kube-apiserver), will see that we want 3 replicas and will write a ReplicaSet object to etcd.

After this happens, the ReplicaSet controller will wake up, see that 3 pods need to be created, and write 3 Pod objects (in the Pending state) into etcd.
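
Once the Deployment above has been applied, you can see this chain of ownership for yourself:

$ kubectl get deployments,replicasets,pods -l app=nginx   # the Deployment, its ReplicaSet (nginx-deployment-<hash>) and the 3 pods
$ kubectl get pods -l app=nginx -o jsonpath='{.items[*].metadata.ownerReferences[*].kind}'   # should print ReplicaSet for each pod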

The kube-scheduler, which is also watching, will now see that there are 3 unassigned, pending pods, figure out where they should run, and write that information back to etcd.
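
The outcome of scheduling is simply a node name being filled in on each Pod object, which you can see with the wide pod listing:

$ kubectl get pods -l app=nginx -o wide   # the NODE column shows which worker the scheduler picked for each pod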

Once the kube-scheduler figures out where to send the pods, it writes this assignment (each pod's node name) back through the kube-apiserver, where the kubelets on the chosen worker nodes are watching for pods assigned to them.

kubelet on the worker nodes will begin to pull images and start up the pods.

When the pods are up and running, this is communicated from the kubelet back to the kube-apiserver. Finally the kube-apiserver writes back to etcd to reflect the started state of the pods. (kube-proxy only gets involved once we expose these pods behind a Service, at which point it updates the network rules on every node.)
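
The whole sequence, from scheduling through image pull to container start, is recorded as events on each pod, and you can block until the rollout has converged (the pod name below is illustrative):

$ kubectl rollout status deployment/nginx-deployment   # waits until all 3 replicas are ready
$ kubectl describe pod nginx-deployment-<hash>         # the Events section shows Scheduled, Pulling, Pulled, Created and Started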


Conclusion

We have gone into how Kubernetes works at a fairly granular level. This should give you a good intuition for what is going on when you deploy.
We have also touched on one of the key tenets of Kubernetes' design philosophy, which is the idea of pluggable interfaces. Whilst we have talked
about a vanilla deployment of Kubernetes, most large companies will be using additional plugins and alternative networking and service-mesh components. Some examples of this
are Istio and Cilium. These additional or replacement components mean that there is considerably more depth to the semantics of how these tools operate. To give you some flavour, Cilium is an eBPF-based networking solution, which means that critical networking operations can be executed in the Linux kernel itself, and that in turn means you need to reason about your system in a completely different way when debugging.
In future posts I hope to go a little bit deeper into these things.


Additional Resources

https://github.com/jamiehannaford/what-happens-when-k8s/blob/master/README.md