
Wednesday, 7 January 2026

Kubernetes Scheduling

 


Pod scheduling is controlled by the scheduling constraints in the pod template spec, which appears in the Kubernetes manifest (YAML) of resources such as:
  • Deployment
  • StatefulSet
  • Pod
  • DaemonSet
  • Job/CronJob

Kubernetes scheduling mechanisms:
  • Tolerations
  • Node Selectors
  • Node Affinity
  • Pod Affinity/Anti-Affinity
  • Taints (node-side)
  • Priority and Preemption
  • Topology Spread Constraints
  • Resource Requests/Limits
  • Custom Schedulers
  • Runtime Class


Example:

    tolerations:
      - key: "karpenter/elastic"
        operator: "Exists"
        effect: "NoSchedule"
    nodeSelector:
      karpenter-node-pool: elastic
      node.kubernetes.io/instance-type: m7g.large
      karpenter.sh/capacity-type: "on-demand"


Tolerations


Tolerations specify which node taints a pod can tolerate.

tolerations:
  - key: "karpenter/elastic"
    operator: "Exists"
    effect: "NoSchedule"

Allows the pod to be scheduled on nodes with the taint karpenter/elastic:NoSchedule.
Without this toleration, the pod would be repelled from those nodes.
operator: "Exists" means it tolerates the taint regardless of its value.
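Tolerations also support operator: "Equal", which matches a taint only when its value matches exactly. A sketch (the key and value below are illustrative, not from the cluster above):

```yaml
tolerations:
  - key: "workload-tier"   # illustrative taint key
    operator: "Equal"
    value: "batch"         # taint value must match exactly
    effect: "NoSchedule"
```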

Karpenter applies the taint karpenter/elastic:NoSchedule to nodes in the "elastic" pool. This taint acts as a gatekeeping mechanism - it says: "Only pods that explicitly tolerate this taint can schedule here". By default, most pods CANNOT schedule on these nodes (they lack the toleration). Our pod explicitly opts in with the toleration, saying "I'm allowed on elastic nodes".

Why This Pattern?

This is actually a common workload isolation strategy:

Regular pods (no toleration) 
  ↓
  ❌ BLOCKED from elastic nodes
  ✅ Schedule on general-purpose nodes

Elastic workload pods (with toleration)
  ↓  
  ✅ CAN schedule on elastic nodes
  ✅ Can also schedule elsewhere (unless nodeSelector restricts)

Real-World Use Case:

# Elastic nodes are tainted to reserve them for specific workloads
# General traffic shouldn't land here accidentally

# Your pod says: "I'm an elastic workload, let me in"
tolerations:
  - key: "karpenter/elastic"
    operator: "Exists"
    effect: "NoSchedule"

# PLUS you add nodeSelector to say: "And I ONLY want elastic nodes"
nodeSelector:
  karpenter-node-pool: elastic


The Karpenter Perspective

Karpenter knows the node state perfectly. The taint isn't about node health—it's about reserving capacity for specific workloads. This prevents:
  • Accidental scheduling of non-elastic workloads
  • Resource contention
  • Cost inefficiency (elastic nodes might be expensive/specialized)

Think of it like a VIP section: the velvet rope (taint) keeps everyone out except those with a pass (toleration).
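In Karpenter, such a taint is typically declared on the NodePool itself, so every node the pool provisions carries it automatically. A minimal sketch (the pool name, label, and taint key mirror the examples above; other fields, and the required nodeClassRef, are omitted or assumed):

```yaml
apiVersion: karpenter.sh/v1
kind: NodePool
metadata:
  name: elastic
spec:
  template:
    metadata:
      labels:
        karpenter-node-pool: elastic     # label targeted by the nodeSelector above
    spec:
      taints:
        - key: karpenter/elastic         # taint tolerated by the pods above
          effect: NoSchedule
      requirements:
        - key: karpenter.sh/capacity-type
          operator: In
          values: ["on-demand"]
```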


Node Selector


nodeSelector:
  karpenter-node-pool: elastic
  node.kubernetes.io/instance-type: m7g.large
  karpenter.sh/capacity-type: "on-demand"

Requires the pod to run only on nodes matching ALL these labels:
  • Must be in the "elastic" Karpenter node pool
  • Must be an AWS m7g.large instance (ARM-based Graviton3)
  • Must be on-demand (not spot instances; karpenter.sh/capacity-type can also have value "spot")

What This Means


This pod is configured to run on dedicated elastic infrastructure managed by Karpenter (a Kubernetes node autoscaler), specifically targeting:
  • ARM-based instances (m7g = Graviton)
  • On-demand capacity (predictable, no interruptions)
  • A specific node pool for workload isolation

This is common for workloads that need consistent performance or have specific architecture requirements.

Node Affinity


More flexible than nodeSelector, with support for both hard requirements and soft preferences:

affinity:
  nodeAffinity:
    requiredDuringSchedulingIgnoredDuringExecution:  # Hard requirement
      nodeSelectorTerms:
      - matchExpressions:
        - key: node.kubernetes.io/instance-type
          operator: In
          values: ["m7g.large", "m7g.xlarge"]
    preferredDuringSchedulingIgnoredDuringExecution:  # Soft preference
    - weight: 100
      preference:
        matchExpressions:
        - key: topology.kubernetes.io/zone
          operator: In
          values: ["us-east-1a"]


Pod Affinity/Anti-Affinity


Schedule pods based on what other pods are running:

affinity:
  podAffinity:  # Schedule NEAR certain pods
    requiredDuringSchedulingIgnoredDuringExecution:
    - labelSelector:
        matchLabels:
          app: cache
      topologyKey: kubernetes.io/hostname
      
  podAntiAffinity:  # Schedule AWAY from certain pods
    preferredDuringSchedulingIgnoredDuringExecution:
    - weight: 100
      podAffinityTerm:
        labelSelector:
          matchLabels:
            app: my-app
        topologyKey: topology.kubernetes.io/zone


It is possible to patch a deployment manually, as a temporary measure if necessary, during cluster management:

% kubectl patch deployment coredns -n kube-system -p '{"spec":{"template":{"spec":{"affinity":{"podAntiAffinity":{"requiredDuringSchedulingIgnoredDuringExecution":[{"labelSelector":{"matchExpressions":[{"key":"k8s-app","operator":"In","values":["kube-dns"]}]},"topologyKey":"kubernetes.io/hostname"}]}}}}}}'
deployment.apps/coredns patched


In Kubernetes pod affinity and anti-affinity, topologyKey defines the scope or boundary of the scheduling rule. It refers to a node label key that the scheduler uses to group nodes into "topology domains". 

How it Works


The scheduler looks at all nodes that share the same value for the specified topologyKey. Nodes with identical values for that label are treated as part of the same domain (e.g., the same rack, node, or availability zone). 

Pod Affinity: The scheduler will only place a new pod in a domain if that domain already contains a pod matching your labelSelector.
Pod Anti-Affinity: The scheduler will avoid placing a new pod in any domain that already contains a pod matching your labelSelector.


Common Examples


topologyKey Value: kubernetes.io/hostname
Domain Scope: Individual Node
Typical Use Case: Ensure two pods never run on the same physical machine.

topologyKey Value: topology.kubernetes.io/zone
Domain Scope: Availability Zone
Typical Use Case: Spread replicas across different data centres for fault tolerance.

topologyKey Value: topology.kubernetes.io/region
Domain Scope: Geographic Region
Typical Use Case: Ensure workloads are distributed across broad regions (e.g., us-east-1 vs us-west-2).

topologyKey Value: Custom Labels (e.g., rack)
Domain Scope: Physical Rack
Typical Use Case: Group nodes by their specific server rack in a private data centre.


Important Constraints

  • Performance: For required pod anti-affinity, some clusters restrict topologyKey to kubernetes.io/hostname (via the LimitPodHardAntiAffinityTopology admission plugin), and required anti-affinity is comparatively expensive for the scheduler to evaluate on large clusters.
  • Consistency: Every node in your cluster should be consistently labeled with the topologyKey you choose. If labels are missing, the scheduler may exhibit unexpected behaviour.
  • Usage: You cannot leave topologyKey empty; it is a required field for both affinity and anti-affinity rules.

To ensure high availability, you can use Pod Anti-Affinity with topology.kubernetes.io/zone as the topologyKey. This prevents the scheduler from placing multiple replicas of the same application in the same availability zone.
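A sketch of such a rule (the app label is illustrative):

```yaml
affinity:
  podAntiAffinity:
    requiredDuringSchedulingIgnoredDuringExecution:
      - labelSelector:
          matchLabels:
            app: my-app                          # illustrative label
        topologyKey: topology.kubernetes.io/zone # domain = availability zone
```

Note that with requiredDuringScheduling, once every zone holds a replica, additional replicas become unschedulable; preferredDuringSchedulingIgnoredDuringExecution is the softer alternative.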


Taints (node-side)


The complement to tolerations; taints are applied to nodes rather than pods:

kubectl taint nodes node1 dedicated=gpu:NoSchedule
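Only pods that tolerate this taint can then schedule on node1. The matching toleration would be:

```yaml
tolerations:
  - key: "dedicated"
    operator: "Equal"
    value: "gpu"
    effect: "NoSchedule"
```

The taint can be removed again with kubectl taint nodes node1 dedicated=gpu:NoSchedule- (note the trailing dash).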


Priority and Preemption


Control which pods get scheduled first and can evict lower-priority pods:

priorityClassName: high-priority
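The priority class itself is a cluster-scoped resource that must exist first; a sketch (the name and value are illustrative):

```yaml
apiVersion: scheduling.k8s.io/v1
kind: PriorityClass
metadata:
  name: high-priority
value: 1000000          # higher values are scheduled first and can preempt lower ones
globalDefault: false
description: "For critical workloads that may preempt lower-priority pods."
```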


Topology Spread Constraints


Distribute pods evenly across zones, nodes, or other topology domains:

topologySpreadConstraints:
- maxSkew: 1
  topologyKey: topology.kubernetes.io/zone
  whenUnsatisfiable: DoNotSchedule
  labelSelector:
    matchLabels:
      app: my-app


Resource Requests/Limits


Influence scheduling based on available resources. The scheduler considers only requests when placing a pod; limits are enforced at runtime, not at scheduling time:

resources:
  requests:
    memory: "64Mi"
    cpu: "250m"
  limits:
    memory: "128Mi"
    cpu: "500m"


Custom Schedulers


You can even specify a completely different scheduler:

schedulerName: my-custom-scheduler


Runtime Class


For specialized container runtimes (like gVisor, Kata Containers):

runtimeClassName: gvisor
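The referenced class must exist as a cluster resource; a sketch for gVisor (the handler name depends on how the runtime is configured in containerd/CRI-O):

```yaml
apiVersion: node.k8s.io/v1
kind: RuntimeClass
metadata:
  name: gvisor
handler: runsc    # CRI handler name; runsc is gVisor's runtime
```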

Each mechanism serves different use cases—nodeSelector is simple but rigid, while affinity rules and topology constraints offer much more flexibility for complex scheduling requirements.

