
Wednesday, 7 January 2026

Kubernetes Scheduling

 


Pod scheduling is controlled by the scheduling-related fields of the pod spec, which appear in the Kubernetes manifest (YAML) of any resource that creates pods, such as:
  • Deployment
  • StatefulSet
  • Pod
  • DaemonSet
  • Job/CronJob

Kubernetes scheduling mechanisms:
  • Tolerations
  • Node Selectors
  • Node Affinity
  • Pod Affinity/Anti-Affinity
  • Taints (node-side)
  • Priority and Preemption
  • Topology Spread Constraints
  • Resource Requests/Limits
  • Custom Schedulers
  • Runtime Class


Example:

    tolerations:
      - key: "karpenter/elastic"
        operator: "Exists"
        effect: "NoSchedule"
    nodeSelector:
      karpenter-node-pool: elastic
      node.kubernetes.io/instance-type: m7g.large
      karpenter.sh/capacity-type: "on-demand"


Tolerations


Tolerations specify which node taints the pod can tolerate.

tolerations:
  - key: "karpenter/elastic"
    operator: "Exists"
    effect: "NoSchedule"

Allows the pod to be scheduled on nodes with the taint karpenter/elastic:NoSchedule.
Without this toleration, the pod would be repelled from those nodes.
operator: "Exists" means it tolerates the taint regardless of its value.

Karpenter applies the taint karpenter/elastic:NoSchedule to nodes in the "elastic" pool. This taint acts as a gatekeeping mechanism - it says: "Only pods that explicitly tolerate this taint can schedule here". By default, most pods CANNOT schedule on these nodes (they lack the toleration). Our pod explicitly opts in with the toleration, saying "I'm allowed on elastic nodes".
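In practice the taint comes from the NodePool definition rather than being applied by hand. A minimal sketch, assuming Karpenter's karpenter.sh/v1 NodePool API and a pool named "elastic" (the pool name and custom label are illustrative; nodeClassRef and requirements are omitted for brevity):

apiVersion: karpenter.sh/v1
kind: NodePool
metadata:
  name: elastic
spec:
  template:
    metadata:
      labels:
        karpenter-node-pool: elastic      # custom label matched by the nodeSelector below
    spec:
      # nodeClassRef and requirements omitted for brevity
      taints:
        - key: karpenter/elastic          # every node in this pool carries this taint
          effect: NoSchedule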

Why This Pattern?

This is actually a common workload isolation strategy:

Regular pods (no toleration) 
  ↓
  ❌ BLOCKED from elastic nodes
  ✅ Schedule on general-purpose nodes

Elastic workload pods (with toleration)
  ↓  
  ✅ CAN schedule on elastic nodes
  ✅ Can also schedule elsewhere (unless nodeSelector restricts)

Real-World Use Case:

# Elastic nodes are tainted to reserve them for specific workloads
# General traffic shouldn't land here accidentally

# Your pod says: "I'm an elastic workload, let me in"
tolerations:
  - key: "karpenter/elastic"
    operator: "Exists"
    effect: "NoSchedule"

# PLUS you add nodeSelector to say: "And I ONLY want elastic nodes"
nodeSelector:
  karpenter-node-pool: elastic


The Karpenter Perspective

Karpenter applies this taint deliberately when it provisions nodes in the pool; it isn't about node health, it's about reserving capacity for specific workloads. This prevents:
  • Accidental scheduling of non-elastic workloads
  • Resource contention
  • Cost inefficiency (elastic nodes might be expensive/specialized)

Think of it like a VIP section: the velvet rope (taint) keeps everyone out except those with a pass (toleration).


Node Selector


nodeSelector:
  karpenter-node-pool: elastic
  node.kubernetes.io/instance-type: m7g.large
  karpenter.sh/capacity-type: "on-demand"

Requires the pod to run only on nodes matching ALL these labels:
  • Must be in the "elastic" Karpenter node pool
  • Must be an AWS m7g.large instance (ARM-based Graviton3)
  • Must be on-demand (not spot instances; karpenter.sh/capacity-type can also have value "spot")

What This Means

This pod is configured to run on dedicated elastic infrastructure managed by Karpenter (a Kubernetes node autoscaler), specifically targeting:
  • ARM-based instances (m7g = Graviton)
  • On-demand capacity (predictable, no interruptions)
  • A specific node pool for workload isolation

This is common for workloads that need consistent performance or have specific architecture requirements.
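To check which nodes in the cluster satisfy such a selector, you can filter nodes by label (a sketch; the karpenter-node-pool label key is specific to this cluster's NodePool setup):

kubectl get nodes \
  -l karpenter-node-pool=elastic,karpenter.sh/capacity-type=on-demand \
  -L node.kubernetes.io/instance-type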

Node Affinity


Node affinity is more flexible than nodeSelector, supporting both hard requirements and soft preferences:

affinity:
  nodeAffinity:
    requiredDuringSchedulingIgnoredDuringExecution:  # Hard requirement
      nodeSelectorTerms:
      - matchExpressions:
        - key: node.kubernetes.io/instance-type
          operator: In
          values: ["m7g.large", "m7g.xlarge"]
    preferredDuringSchedulingIgnoredDuringExecution:  # Soft preference
    - weight: 100
      preference:
        matchExpressions:
        - key: topology.kubernetes.io/zone
          operator: In
          values: ["us-east-1a"]


Pod Affinity/Anti-Affinity


Schedule pods based on where other pods are already running:

affinity:
  podAffinity:  # Schedule NEAR certain pods
    requiredDuringSchedulingIgnoredDuringExecution:
    - labelSelector:
        matchLabels:
          app: cache
      topologyKey: kubernetes.io/hostname
      
  podAntiAffinity:  # Schedule AWAY from certain pods
    preferredDuringSchedulingIgnoredDuringExecution:
    - weight: 100
      podAffinityTerm:
        labelSelector:
          matchLabels:
            app: my-app
        topologyKey: topology.kubernetes.io/zone


Taints (node-side)


Taints are the node-side complement to tolerations; they are applied to nodes rather than pods:

kubectl taint nodes node1 dedicated=gpu:NoSchedule
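A pod that should be allowed onto those GPU nodes then carries the matching toleration, here with operator: "Equal" because the taint has a value:

tolerations:
  - key: "dedicated"
    operator: "Equal"
    value: "gpu"
    effect: "NoSchedule"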


Priority and Preemption


Priority controls which pods get scheduled first; higher-priority pods can preempt (evict) lower-priority pods when resources are scarce:

priorityClassName: high-priority
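The name refers to a PriorityClass object that must already exist in the cluster. A minimal sketch (the name and value are illustrative):

apiVersion: scheduling.k8s.io/v1
kind: PriorityClass
metadata:
  name: high-priority
value: 1000000                          # higher value = higher scheduling priority
globalDefault: false                    # only pods that reference this class get it
preemptionPolicy: PreemptLowerPriority  # may evict lower-priority pods if needed
description: "For workloads that may preempt lower-priority pods."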


Topology Spread Constraints


Distribute pods evenly across zones, nodes, or other topology domains:

topologySpreadConstraints:
- maxSkew: 1
  topologyKey: topology.kubernetes.io/zone
  whenUnsatisfiable: DoNotSchedule
  labelSelector:
    matchLabels:
      app: my-app


Resource Requests/Limits


Requests influence scheduling: the scheduler only places a pod on a node with enough unreserved capacity to satisfy its requests. Limits cap usage at runtime but do not affect placement:

resources:
  requests:
    memory: "64Mi"
    cpu: "250m"
  limits:
    memory: "128Mi"
    cpu: "500m"


Custom Schedulers


You can even specify a completely different scheduler:

schedulerName: my-custom-scheduler
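If the named scheduler isn't running, pods that reference it simply stay Pending. To see which scheduler a pod was assigned to (the pod name here is hypothetical):

kubectl get pod my-app-pod -o jsonpath='{.spec.schedulerName}'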


Runtime Class


For specialized container runtimes (like gVisor, Kata Containers):

runtimeClassName: gvisor
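The name refers to a RuntimeClass object. For gVisor the handler is typically runsc, though the exact handler name depends on how the container runtime is configured on the nodes. A sketch:

apiVersion: node.k8s.io/v1
kind: RuntimeClass
metadata:
  name: gvisor
handler: runsc    # must match a runtime handler configured in containerd/CRI-O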

Each mechanism serves different use cases—nodeSelector is simple but rigid, while affinity rules and topology constraints offer much more flexibility for complex scheduling requirements.

