Pod scheduling is controlled by the scheduling constraints in a pod's spec (or pod template spec), which live in the Kubernetes manifest (YAML) of resources such as the following (see the Deployment sketch after this list):
- Deployment
- StatefulSet
- Pod
- DaemonSet
- Job/CronJob
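For a Deployment (and the other workload kinds above), these fields sit in the pod template at spec.template.spec; in a bare Pod they sit directly under spec. A minimal sketch, with placeholder names and image:
apiVersion: apps/v1
kind: Deployment
metadata:
  name: elastic-worker           # placeholder name
spec:
  replicas: 2
  selector:
    matchLabels:
      app: elastic-worker
  template:
    metadata:
      labels:
        app: elastic-worker
    spec:                        # scheduling constraints go here
      nodeSelector:
        karpenter-node-pool: elastic
      tolerations:
        - key: "karpenter/elastic"
          operator: "Exists"
          effect: "NoSchedule"
      containers:
        - name: app
          image: nginx           # placeholder image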
Kubernetes scheduling mechanisms:
- Tolerations
- Node Selectors
- Node Affinity
- Pod Affinity/Anti-Affinity
- Taints (node-side)
- Priority and Preemption
- Topology Spread Constraints
- Resource Requests/Limits
- Custom Schedulers
- Runtime Class
Example:
tolerations:
  - key: "karpenter/elastic"
    operator: "Exists"
    effect: "NoSchedule"
nodeSelector:
  karpenter-node-pool: elastic
  node.kubernetes.io/instance-type: m7g.large
  karpenter.sh/capacity-type: "on-demand"
Tolerations
Specify which node taints the pod can tolerate.
tolerations:
  - key: "karpenter/elastic"
    operator: "Exists"
    effect: "NoSchedule"
Allows the pod to be scheduled on nodes with the taint karpenter/elastic:NoSchedule.
Without this toleration, the pod would be repelled from those nodes.
operator: "Exists" means it tolerates the taint regardless of its value.
Karpenter applies the taint karpenter/elastic:NoSchedule to nodes in the "elastic" pool. This taint acts as a gatekeeping mechanism - it says: "Only pods that explicitly tolerate this taint can schedule here". By default, most pods CANNOT schedule on these nodes (they lack the toleration). Our pod explicitly opts in with the toleration, saying "I'm allowed on elastic nodes".
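On the Karpenter side, the taint is typically declared in the NodePool itself so that every node provisioned for the pool comes up tainted. A minimal sketch, assuming a Karpenter v1 NodePool named "elastic" (a real NodePool also needs a nodeClassRef and other settings; the label and requirement shown are illustrative):
apiVersion: karpenter.sh/v1
kind: NodePool
metadata:
  name: elastic
spec:
  template:
    metadata:
      labels:
        karpenter-node-pool: elastic      # matches the pod's nodeSelector
    spec:
      taints:
        - key: karpenter/elastic
          effect: NoSchedule              # only pods with the toleration may schedule here
      requirements:
        - key: karpenter.sh/capacity-type
          operator: In
          values: ["on-demand"]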
Why This Pattern?
This is actually a common workload isolation strategy:
Regular pods (no toleration)
↓
❌ BLOCKED from elastic nodes
✅ Schedule on general-purpose nodes
Elastic workload pods (with toleration)
↓
✅ CAN schedule on elastic nodes
✅ Can also schedule elsewhere (unless nodeSelector restricts)
Real-World Use Case:
# Elastic nodes are tainted to reserve them for specific workloads
# General traffic shouldn't land here accidentally
# Your pod says: "I'm an elastic workload, let me in"
tolerations:
  - key: "karpenter/elastic"
    operator: "Exists"
    effect: "NoSchedule"
# PLUS you add nodeSelector to say: "And I ONLY want elastic nodes"
nodeSelector:
  karpenter-node-pool: elastic
The Karpenter Perspective
Karpenter itself applies this taint when it provisions the node, so it isn't a signal of node health; it's a way of reserving the pool's capacity for specific workloads. This prevents:
- Accidental scheduling of non-elastic workloads
- Resource contention
- Cost inefficiency (elastic nodes might be expensive/specialized)
Think of it like a VIP section: the velvet rope (taint) keeps everyone out except those with a pass (toleration).
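To see which taints a node in the pool actually carries, you can inspect it with kubectl (the node name is a placeholder, and the label selector assumes the karpenter-node-pool label from the examples above):
kubectl get nodes -l karpenter-node-pool=elastic
kubectl describe node <elastic-node-name> | grep -A2 Taints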
Node Selector
nodeSelector:
  karpenter-node-pool: elastic
  node.kubernetes.io/instance-type: m7g.large
  karpenter.sh/capacity-type: "on-demand"
Requires the pod to run only on nodes matching ALL these labels:
- Must be in the "elastic" Karpenter node pool
- Must be an AWS m7g.large instance (ARM-based Graviton3)
- Must be on-demand (not spot instances; karpenter.sh/capacity-type can also have value "spot")
What This Means
This pod is configured to run on dedicated elastic infrastructure managed by Karpenter (a Kubernetes node autoscaler), specifically targeting:
- ARM-based instances (m7g = Graviton)
- On-demand capacity (predictable, no interruptions)
- A specific node pool for workload isolation
This is common for workloads that need consistent performance or have specific architecture requirements.
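To verify that candidate nodes carry these labels, you can list them with the relevant label columns. Note that with Karpenter such nodes may not exist until a pending pod requests them:
kubectl get nodes -L karpenter-node-pool -L node.kubernetes.io/instance-type -L karpenter.sh/capacity-type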
Node Affinity
More flexible than nodeSelector, with support for both hard requirements and soft preferences:
affinity:
  nodeAffinity:
    requiredDuringSchedulingIgnoredDuringExecution:   # Hard requirement
      nodeSelectorTerms:
        - matchExpressions:
            - key: node.kubernetes.io/instance-type
              operator: In
              values: ["m7g.large", "m7g.xlarge"]
    preferredDuringSchedulingIgnoredDuringExecution:  # Soft preference
      - weight: 100
        preference:
          matchExpressions:
            - key: topology.kubernetes.io/zone
              operator: In
              values: ["us-east-1a"]
Pod Affinity/Anti-Affinity
Schedule pods based on what other pods are running:
affinity:
  podAffinity:        # Schedule NEAR certain pods
    requiredDuringSchedulingIgnoredDuringExecution:
      - labelSelector:
          matchLabels:
            app: cache
        topologyKey: kubernetes.io/hostname
  podAntiAffinity:    # Schedule AWAY from certain pods
    preferredDuringSchedulingIgnoredDuringExecution:
      - weight: 100
        podAffinityTerm:
          labelSelector:
            matchLabels:
              app: my-app
          topologyKey: topology.kubernetes.io/zone
Taints (node-side)
The node-side complement to tolerations; taints are applied to nodes rather than pods:
kubectl taint nodes node1 dedicated=gpu:NoSchedule
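A pod that should land on those GPU nodes then carries the matching toleration, for example:
tolerations:
  - key: "dedicated"
    operator: "Equal"
    value: "gpu"
    effect: "NoSchedule"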
Priority and Preemption
Controls which pods get scheduled first and allows higher-priority pods to preempt (evict) lower-priority ones when the cluster is full:
priorityClassName: high-priority
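The referenced PriorityClass is a cluster-scoped resource you define separately; a minimal sketch (the value is arbitrary, higher numbers mean higher priority):
apiVersion: scheduling.k8s.io/v1
kind: PriorityClass
metadata:
  name: high-priority
value: 1000000                  # illustrative value
globalDefault: false
description: "For workloads that may preempt lower-priority pods."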
Topology Spread Constraints
Distribute pods evenly across zones, nodes, or other topology domains:
topologySpreadConstraints:
  - maxSkew: 1
    topologyKey: topology.kubernetes.io/zone
    whenUnsatisfiable: DoNotSchedule
    labelSelector:
      matchLabels:
        app: my-app
Resource Requests/Limits
Requests influence scheduling: the scheduler only places a pod on a node with enough unreserved capacity to satisfy its requests (limits are enforced at runtime, not at scheduling time):
resources:
  requests:
    memory: "64Mi"
    cpu: "250m"
  limits:
    memory: "128Mi"
    cpu: "500m"
Custom Schedulers
You can even specify a completely different scheduler:
schedulerName: my-custom-scheduler
Runtime Class
For specialized container runtimes (like gVisor, Kata Containers):
runtimeClassName: gvisor
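The name refers to a RuntimeClass resource that a cluster administrator must have created, pointing at a handler configured on the nodes. A sketch, assuming gVisor's runsc handler is installed in the container runtime:
apiVersion: node.k8s.io/v1
kind: RuntimeClass
metadata:
  name: gvisor
handler: runsc                  # must match a handler configured on the node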
Each mechanism serves different use cases—nodeSelector is simple but rigid, while affinity rules and topology constraints offer much more flexibility for complex scheduling requirements.
