How to isolate critical system pods from application pods on Azure Kubernetes Service

Introduction

When you host multiple mission-critical line-of-business applications on a single AKS cluster, you want the cluster to run as stably as possible.

We can and should apply multiple best practices, such as limiting the resource usage of pods and setting resource quotas on namespaces.

Another best practice I'd recommend is separating critical system pods from application pods with dedicated node pools. This separation helps to protect system components from rogue application workloads that might negatively affect the stability of the cluster.

This article will demonstrate how to create dedicated node pools and prevent any user workload from being scheduled on the critical system node pool.

How can we achieve that goal?

This can be achieved with a feature called taints and tolerations. Taints are the opposite of node affinity: instead of attracting pods to nodes, they repel a set of pods. Tolerations, conversely, work together with taints and allow a pod to be scheduled onto a tainted node.
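
To make this concrete, here is a generic (non-AKS-specific) sketch; the node name node1 and the key example-key are placeholders:

kubectl taint nodes node1 example-key=value1:NoSchedule

A pod that should still be allowed onto that node declares a matching toleration in its spec:

tolerations:
  - key: "example-key"
    operator: "Equal"
    value: "value1"
    effect: "NoSchedule"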

Taints and Tolerations (Kubernetes documentation):
"Node affinity is a property of Pods that attracts them to a set of nodes (either as a preference or a hard requirement). Taints are the opposite -- they allow a node to repel a set of pods. Tolerations are applied to pods. Tolerations allow the scheduler to schedule pods with matching taints."

So what we're going to do is configure a taint on an existing (or a new) system node pool that keeps away every pod that doesn't carry a matching toleration.

Let's walk through the process step by step.

Step by step

Create a user node pool for the application workload

It's important to have a user node pool in place before we start setting taints on the system node pool, so we don't prematurely stop AKS from scheduling new pods.

az aks nodepool add \
  --nodepool-name app \
  --cluster-name aks-azureblue \
  --resource-group rg-kubernetes \
  --mode User \
  --node-count 1 \
  --os-type Linux \
  --node-vm-size Standard_B4ms

Adding a user node pool to an existing AKS cluster

💡 Note that the command above creates a deliberately minimal user node pool, for demonstration purposes only!
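
For something closer to production, you would typically let the cluster autoscaler manage the node count instead of pinning it to a single node. A sketch, with placeholder values for the counts and the VM size:

az aks nodepool add \
  --nodepool-name app \
  --cluster-name aks-azureblue \
  --resource-group rg-kubernetes \
  --mode User \
  --enable-cluster-autoscaler \
  --min-count 2 \
  --max-count 5 \
  --os-type Linux \
  --node-vm-size Standard_D4s_v3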

Now that the user node pool is up and running, we can add a taint to the system node pool.

Updating an existing system node pool

az aks nodepool update \
  --cluster-name aks-azureblue \
  --nodepool-name system \
  --resource-group rg-kubernetes \
  --node-taints "CriticalAddonsOnly=true:NoSchedule"

💡 It is important to note that the taint key, CriticalAddonsOnly, cannot be chosen freely. The system pods managed by Microsoft ship with a default toleration that matches exactly this taint. With any other key, system pods such as CoreDNS wouldn't get scheduled anymore!
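
You can inspect this default toleration yourself on the CoreDNS deployment (the exact list may vary between AKS versions):

kubectl get deployment coredns -n kube-system \
  -o jsonpath='{.spec.template.spec.tolerations}'

The output should contain an entry along the lines of {"key":"CriticalAddonsOnly","operator":"Exists"}.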

Let's double-check that the correct taint got created.

az aks nodepool show \
  --cluster-name aks-azureblue \
  --resource-group rg-kubernetes \
  --nodepool-name system \
  --query nodeTaints

This should return the following:

[
  "CriticalAddonsOnly=true:NoSchedule"
]

Alternatively, you can use some kubectl voodoo (or dig through kubectl describe/get nodes ...)

kubectl get nodes -o custom-columns=NAME:.metadata.name,TAINTS:.spec.taints --no-headers
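
On our cluster, the output should look roughly like this (the numeric suffix of the system node name is made up for illustration; the app node is the one created above):

aks-app-95917181-vmss000000      <none>
aks-system-41318399-vmss000000   [map[effect:NoSchedule key:CriticalAddonsOnly value:true]]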

Verify the setup

Let's schedule a very basic pod in the default namespace by running kubectl apply -f pod.yaml.

kind: Pod
apiVersion: v1
metadata:
  name: pod-a
spec:
  containers:
    - name: pod-a
      image: mcr.microsoft.com/oss/nginx/nginx:1.15.5-alpine
      resources:
        requests:
          cpu: 100m
          memory: 128Mi
        limits:
          cpu: 250m
          memory: 256Mi

pod.yaml

By running kubectl get pod -o wide we can verify that it got scheduled on the user node pool. Note that the node name contains the name of the node pool (app).

$ kubectl get pod -o wide
NAME    READY   STATUS    RESTARTS   AGE   IP           NODE                          NOMINATED NODE   READINESS GATES
pod-a   1/1     Running   0          5s    10.244.1.4   aks-app-95917181-vmss000000   <none>           <none>
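
The inverse also works: a pod that both tolerates the taint and selects the system pool via a node selector gets scheduled onto a system node. A minimal sketch (pod-b is a hypothetical name; the kubernetes.azure.com/mode=system label is the same one used in the namespace example below):

kind: Pod
apiVersion: v1
metadata:
  name: pod-b
spec:
  nodeSelector:
    kubernetes.azure.com/mode: system
  tolerations:
    - key: "CriticalAddonsOnly"
      operator: "Equal"
      value: "true"
      effect: "NoSchedule"
  containers:
    - name: pod-b
      image: mcr.microsoft.com/oss/nginx/nginx:1.15.5-alpine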

Taking it one step further

There might be situations where you want to host other critical pods on the system node pool as well, even though they aren't exactly application workloads. An example could be ingress-nginx.

This is again where tolerations come into play. But instead of adjusting each and every Helm chart you might have in use, we can apply default tolerations at the namespace scope. The relevant annotation is called scheduler.alpha.kubernetes.io/defaultTolerations.

Quoting the Kubernetes documentation (Well-Known Labels, Annotations and Taints): "[...] This annotation key allows assigning tolerations to a namespace and any new pods created in this namespace would get these tolerations added."

Further, we need to explicitly pin every pod that belongs on the system node pool to it, using the scheduler.alpha.kubernetes.io/node-selector annotation key.

This is what the final Namespace object would look like. Every pod created in this namespace gets the default toleration and is scheduled on the system node pool.

apiVersion: v1
kind: Namespace
metadata:
  name: ingress-nginx 
  annotations:
    scheduler.alpha.kubernetes.io/defaultTolerations: '[{"Key": "CriticalAddonsOnly", "Operator": "Equal", "Value": "true", "Effect": "NoSchedule"}]'
    scheduler.alpha.kubernetes.io/node-selector: "kubernetes.azure.com/mode=system"

Configuring tolerations and node-pool affinity on the namespace
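
To verify the namespace defaults, you could start a throwaway pod in the namespace and check where it lands and which tolerations were injected (nginx-test is a hypothetical name; this assumes the Namespace object above has been applied):

kubectl run nginx-test -n ingress-nginx \
  --image=mcr.microsoft.com/oss/nginx/nginx:1.15.5-alpine
kubectl get pod nginx-test -n ingress-nginx -o wide
kubectl get pod nginx-test -n ingress-nginx -o jsonpath='{.spec.tolerations}'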

Conclusion

Let me summarize the key takeaways.

  • Separating application workloads from system pods helps to increase AKS cluster stability
  • We can use taints on nodes to repel a set of pods, and tolerations to add exclusions
  • On AKS, we are required to use the taint key CriticalAddonsOnly; it cannot be chosen freely
  • We can inherit tolerations at the namespace level by using the scheduler.alpha.kubernetes.io/defaultTolerations annotation

That's it for today. I hope you enjoyed reading it! 😎

Further reading

  • Well-Known Labels, Annotations and Taints (Kubernetes documentation)
  • Taints and Tolerations (Kubernetes documentation)
  • Use system node pools in Azure Kubernetes Service (AKS) (Microsoft Learn)