How to configure node (pool) affinity for pods with AKS

Introduction

When dealing with multiple node pools, you usually want to configure node affinity so that pods stick to nodes with a specific characteristic.

The reasons for this can be manifold. For example, you may want to take advantage of specialized hardware or resources on specific nodes, such as GPUs or high-memory nodes. It can also improve security by running sensitive workloads on separate nodes. Another use case is saving money, e.g., by separating application environments (DEV, QA, UAT, PROD) onto different types of node pools.

Whatever your reasons are, this post will show you two methods to bind pods to node pools. The first is to use nodeSelector and the second is called node affinity, which is conceptually similar but more expressive and allows specifying soft rules. Let's dive in!

Option 1: Using the nodeSelector

This is the simplest way to bind pods to nodes. All you need to do is optionally label your nodes and then add the nodeSelector field to your pod specification.

💡 Depending on your use-case, you might want to consider using the labels auto-created by Azure, for example agentpool=foobarpool!
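
For instance, a quick way to display the auto-created agentpool label on every node is the -L flag of kubectl get nodes (foobarpool is just the example pool name used throughout this post):

kubectl get nodes -L agentpool
Show the agentpool label as an extra column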

Labeling the nodes (optional)

To label an existing node pool, issue the Azure CLI command below.

az aks nodepool update \
    --resource-group rg-demo \
    --cluster-name aks-azureblue \
    --name foobarpool \
    --labels tier=memory-optimized
Updating labels on existing node pools

All nodes in the pool will inherit this label. You can verify the result with kubectl get nodes --show-labels or alternatively use the Azure CLI.

az aks nodepool show \
   --resource-group rg-demo \
   --cluster-name aks-azureblue \
   --name foobarpool \
   --query nodeLabels
Check the labels
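
If you prefer kubectl, you can also list only the nodes that carry the new label (assuming the tier=memory-optimized label set above):

kubectl get nodes -l tier=memory-optimized
Filter the nodes by label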

Set the nodeSelector

The nodeSelector field belongs to the PodSpec and is the simplest recommended form of node selection constraint. It expects a map, which is a collection of key-value pairs, that you'd usually set in the PodTemplateSpec of your deployment manifest.

nodeSelector (map[string]string)

According to the syntax, both of the manifests below are valid.

apiVersion: v1
kind: Pod
metadata:
  namespace: demo
  name: myapp
  labels:
    name: myapp
spec:
  containers:
  - name: myapp
    image: nginx:latest
    resources:
      limits:
        memory: "128Mi"
        cpu: "500m"
    ports:
      - containerPort: 80
  nodeSelector:
    tier: memory-optimized
Single label

However, Kubernetes only schedules a pod onto nodes that have every label specified in its nodeSelector (an AND condition). So in our example, the pod below won't be scheduled on the foobarpool node pool, because its nodes carry tier=memory-optimized but not foo=bar.

apiVersion: v1
kind: Pod
metadata:
  namespace: demo
  name: myapp
  labels:
    name: myapp
spec:
  containers:
  - name: myapp
    image: nginx:latest
    resources:
      limits:
        memory: "128Mi"
        cpu: "500m"
    ports:
      - containerPort: 80
  nodeSelector:
    tier: memory-optimized
    foo: bar
Multiple labels

If the condition defined by the nodeSelector cannot be fulfilled, the pod won't get scheduled and will be stuck in the Pending state.

0/2 nodes are available: 2 node(s) didn't match Pod's node affinity/selector. preemption: 0/2 nodes are available: 2 Preemption is not helpful for scheduling.
Event
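
The easiest way to surface this event is kubectl describe, which prints the scheduling events at the bottom of its output (using the pod and namespace from this post's examples):

kubectl describe pod myapp -n demo
Inspect the Events section for the FailedScheduling reason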

Verify the result

Lastly, let's verify that the pods end up on the expected nodes/node pool.

kubectl get pods -n demo -o wide

NAME    READY   STATUS    RESTARTS   AGE   IP             NODE                                 NOMINATED NODE   READINESS GATES
myapp   1/1     Running   0          31s   10.224.0.236   aks-foobarpool-37985905-vmss000000   <none>           <none>
Pod placement

As depicted in the output, the pod runs on a node belonging to the foobarpool node pool.
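
If you want to double-check which pool a node belongs to, you can also read its agentpool label directly (using the node name from the output above):

kubectl get node aks-foobarpool-37985905-vmss000000 -o jsonpath='{.metadata.labels.agentpool}'
Print the agentpool label of a single node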

Option 2: Using affinity

As mentioned in the introduction, nodeSelector is a simple and quick way to configure node affinity. However, a second option provides more granular control: the affinity object. Let's have a look at it.

According to the Kubernetes API, the affinity object can take three different types of constraints, which are nodeAffinity, podAffinity and podAntiAffinity.

nodeAffinity allows binding pods to nodes, whereas podAffinity and podAntiAffinity allow grouping multiple pods together on a node or keeping them apart from each other, respectively (this blog post will only deal with nodeAffinity).
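
As a rough structural sketch, all three constraint types live under spec.affinity of the pod specification:

spec:
  affinity:
    nodeAffinity:      # bind pods to nodes (covered in this post)
      ...
    podAffinity:       # co-locate pods with other pods
      ...
    podAntiAffinity:   # keep pods away from other pods
      ...
Where the three constraint types live in the PodSpec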

Skimming further through the API documentation, we can see that nodeAffinity can take two types of affinity scheduling rules, which are:

  • preferredDuringSchedulingIgnoredDuringExecution
  • requiredDuringSchedulingIgnoredDuringExecution

The first rule, preferredDuringSchedulingIgnoredDuringExecution, is a soft rule. It indicates a preferred node with the specified label values for the pod to be scheduled on. But the scheduler may also choose a node that violates one or more of the expressions if no node matches.

The second rule, requiredDuringSchedulingIgnoredDuringExecution, is a hard rule: if the node selection expression doesn't resolve to a node, the pod won't get scheduled. This is the behavior we've already seen with option 1 when using the simple nodeSelector.

The hard rule

Let's start with requiredDuringSchedulingIgnoredDuringExecution and mimic the nodeSelector behavior from the first option. The pod defined below will only get scheduled if a node with the label agentpool=foobarpool is available.

apiVersion: v1
kind: Pod
metadata:
  namespace: demo
  name: myapp
  labels:
    name: myapp
spec:
  containers:
  - name: myapp
    image: nginx:latest
    resources:
      limits:
        memory: "128Mi"
        cpu: "500m"
    ports:
      - containerPort: 80
  affinity:
    nodeAffinity:
      requiredDuringSchedulingIgnoredDuringExecution:
        nodeSelectorTerms:
          - matchExpressions:
              - key: agentpool
                operator: In 
                values: 
                - foobarpool

This type of node selector rule provides a lot of additional flexibility. For example, you can add multiple matchExpressions blocks to form OR conditions. The pod manifest below will get scheduled on nodes having one label OR the other.

apiVersion: v1
kind: Pod
metadata:
  ...
spec:
  ...    
  affinity:
    nodeAffinity:
      requiredDuringSchedulingIgnoredDuringExecution:
        nodeSelectorTerms:
          - matchExpressions:
              - key: agentpool
                operator: In 
                values: 
                - foobarpool
          - matchExpressions:
              - key: tier
                operator: In
                values: 
                - memory-optimized

To form AND conditions, you add multiple keys within the same matchExpressions block, e.g., like so. Here, only nodes that have both labels set will be selected.

apiVersion: v1
kind: Pod
metadata:
  ...
spec:
  ...
  affinity:
    nodeAffinity:
      requiredDuringSchedulingIgnoredDuringExecution:
        nodeSelectorTerms:
          - matchExpressions:
              - key: agentpool
                operator: In 
                values: 
                - foobarpool
              - key: tier
                operator: In
                values: 
                - memory-optimized
🔎 It's worth noting that the operator allows for additional flexibility and can take the following arguments: DoesNotExist, Exists, Gt, Lt, In, NotIn.
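
For illustration, the term below would select nodes where the tier label merely exists and where a hypothetical numeric label, node-generation, has a value greater than 1 (Gt and Lt compare integer values supplied as strings; Exists and DoesNotExist take no values):

apiVersion: v1
kind: Pod
metadata:
  ...
spec:
  ...
  affinity:
    nodeAffinity:
      requiredDuringSchedulingIgnoredDuringExecution:
        nodeSelectorTerms:
          - matchExpressions:
              - key: tier               # the label only needs to be present
                operator: Exists
              - key: node-generation    # hypothetical label, for illustration only
                operator: Gt
                values:
                - "1"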

The soft rule

As already mentioned, the soft rule defines a preference. The affinity definition below will give priority to a node with the label tier=general-purpose; if that preference can't be fulfilled, a different node in state Ready will be selected.

apiVersion: v1
kind: Pod
metadata:
  ...
spec:
  ...
  affinity:  
    nodeAffinity:
      preferredDuringSchedulingIgnoredDuringExecution:          
      - preference: 
          matchExpressions:
            - key: tier
              operator: In 
              values:
                - general-purpose
        weight: 100

You can even define weights, which act as a tiebreaker in case multiple conditions are fulfilled. The definition below will first select a node with the label tier=general-purpose; if such a node is not available, it will look for one with agentpool=foobarpool; and if such a node is also not available, it will choose whichever node is in state Ready.

apiVersion: v1
kind: Pod
metadata:
  ...
spec:
  ...
  affinity:  
    nodeAffinity:
      preferredDuringSchedulingIgnoredDuringExecution:          
      - preference: 
          matchExpressions:
            - key: tier
              operator: In 
              values:
                - general-purpose
        weight: 100
      - preference:
          matchExpressions:
            - key: agentpool
              operator: In 
              values:
                - foobarpool
        weight: 90

The preferences above are evaluated as an OR condition. But nothing stops us from combining them. Below, the evaluation term becomes something like

1 * (tier=general-purpose && agentpool=memory-optimized) || 0.9 * agentpool=foobarpool || any node in state ready

apiVersion: v1
kind: Pod
metadata:
  ...
spec:
  ...
  affinity:  
    nodeAffinity:
      preferredDuringSchedulingIgnoredDuringExecution:          
      - preference: 
          matchExpressions:
            - key: tier
              operator: In 
              values:
                - general-purpose
            - key: agentpool
              operator: In 
              values: 
              - memory-optimized
        weight: 100
      - preference:
          matchExpressions:
            - key: agentpool
              operator: In 
              values:
                - foobarpool
        weight: 90

So far, we have only matched against labels by using matchExpressions. But there is another node selector term that can be used for selecting by fields, called, well, matchFields. Here is an example:

apiVersion: v1
kind: Pod
metadata:
  ...
spec:
  ...
  affinity:  
    nodeAffinity:
      requiredDuringSchedulingIgnoredDuringExecution:
        nodeSelectorTerms:
          - matchFields:
              - key: metadata.name 
                operator: In 
                values: 
                - aks-foobarpool-37985905-vmss000000

That's it! I hope you enjoyed reading this article. In case of any questions or comments, please leave a message! Happy scheduling! 🤓

Summary

  • A label added to an AKS node pool will get inherited by all nodes
  • Azure auto creates a label for each node pool, e.g., agentpool=foobarpool
  • nodeSelector is a hard constraint that, if unfulfillable, can lead to unscheduled pods
  • nodeSelector can take multiple labels, which all need to match
  • affinity.nodeAffinity allows binding pods to nodes, whereas affinity.podAffinity and affinity.podAntiAffinity allow grouping multiple pods together on a node or keeping them apart from each other, respectively
  • requiredDuringSchedulingIgnoredDuringExecution is a hard affinity rule similar to nodeSelector. If the conditions are not fulfilled the pod won't get scheduled
  • preferredDuringSchedulingIgnoredDuringExecution is a soft rule defining preferences. If the expressions don't match, the pod can still get scheduled on another node
  • To match against labels, use matchExpressions, to match against fields use matchFields

Further reading

  • Assigning Pods to Nodes (Kubernetes documentation)
  • Use multiple node pools in Azure Kubernetes Service (AKS) (Microsoft Learn)