---
My Public Notepad
Bits and bobs about computers and programming
Wednesday, 28 January 2026
How to test Redis connectivity
---
Sunday, 25 January 2026
Strategies for AWS EKS Cluster Kubernetes Version Upgrade
General Upgrade Order
- Non-prod clusters first
- Control plane
- Managed add-ons
  - VPC CNI
  - CoreDNS
  - kube-proxy
- Worker nodes
- Platform controllers
  - Ingress
  - Autoscalers
  - Observability
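A rough sketch of that order with the AWS CLI; the cluster name, Kubernetes version, node group name and add-on versions below are placeholders:

```
# 1. Control plane (one minor version at a time)
aws eks update-cluster-version --name my-cluster --kubernetes-version 1.33

# 2. Managed add-ons: check compatible versions, then update each add-on
aws eks describe-addon-versions --addon-name vpc-cni --kubernetes-version 1.33
aws eks update-addon --cluster-name my-cluster --addon-name vpc-cni    --addon-version v1.20.0-eksbuild.1
aws eks update-addon --cluster-name my-cluster --addon-name coredns    --addon-version v1.11.4-eksbuild.2
aws eks update-addon --cluster-name my-cluster --addon-name kube-proxy --addon-version v1.33.0-eksbuild.2

# 3. Worker nodes (managed node group example)
aws eks update-nodegroup-version --cluster-name my-cluster --nodegroup-name ng-default
```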
Wednesday, 14 January 2026
Elasticsearch Data Streams
Key Characteristics
- Time-Series Focus: Every document indexed into a data stream must contain a @timestamp field, which is used to organize and query the data.
- Append-Only Design: Data streams are optimized for use cases where data is rarely updated or deleted. We cannot send standard update or delete requests directly to the stream; these must go through the _update_by_query / _delete_by_query APIs or be directed at specific backing indices.
- Unified Interface: Users interact with a single named resource (the data stream name) for both indexing and searching, even though the data is physically spread across multiple underlying indices.
Architecture: Backing Indices
- Write Index: The most recently created backing index. All new documents are automatically routed here.
- Rollover: When the write index reaches a specific size or age, Elasticsearch automatically creates a new backing index (rollover) and sets it as the new write index.
- Search: Search requests sent to the data stream are automatically routed to all of its backing indices to return a complete result set.
Automated Management
- Index Templates: These define the stream's structure, including field mappings and settings, and must include a data_stream object to enable the feature.
- Lifecycle Management (ILM/DSL): Tools like Index Lifecycle Management (ILM) or the newer Data Stream Lifecycle automate tasks like moving old indices to cheaper hardware (hot/warm/cold tiers) and eventually deleting them based on retention policies.
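A minimal sketch of setting one up (the pattern, field names and retention below are made up): an index template with a data_stream object and a DSL retention, then a document indexed directly against the stream name, which creates the stream on first write.

```
PUT _index_template/logs-myapp-template
{
  "index_patterns": ["logs-myapp-*"],
  "data_stream": {},
  "template": {
    "lifecycle": { "data_retention": "30d" },
    "mappings": {
      "properties": {
        "@timestamp": { "type": "date" },
        "message":    { "type": "text" }
      }
    }
  }
}

POST logs-myapp-default/_doc
{
  "@timestamp": "2026-01-14T10:15:00Z",
  "message": "user logged in"
}
```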
When to Use
- Ideal for: Logs, events, performance metrics, and security traces.
- Avoid for: Use cases requiring frequent updates to existing records (like a product catalog) or data that lacks a timestamp.
How does a data stream know when to roll over?
- Index Lifecycle Management (ILM)
- Data Stream Lifecycle (DSL) - newer concept
Data Stream Lifecycle (DSL)
How to find out whether a data stream is managed by Index Lifecycle Management (ILM) or by Data Stream Lifecycle (DSL)?
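One way to check (the stream name is a placeholder): the get data stream API reports the attached ILM policy, the DSL lifecycle block, and which of the two will manage the next backing index. The response fragments shown in the comments are illustrative.

```
GET _data_stream/logs-myapp-default

# Illustrative fragments of the response:
#   "ilm_policy": "logs-myapp-ilm-policy"                        <- ILM attached via index.lifecycle.name
#   "lifecycle": { "enabled": true, "data_retention": "90d" }    <- DSL configuration
#   "next_generation_managed_by": "Data stream lifecycle"        <- who manages the next write index
```

The lifecycle-related attributes described below are the ones we care about in that response.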
- enabled (Boolean):
  - Interpretation: Determines if Elasticsearch should actively manage this data stream using DSL.
  - Behavior: When set to true, Elasticsearch automatically handles rollover (based on cluster defaults) and deletion (based on our retention settings). If this is missing but other attributes are present, it often defaults to true.
- data_retention (String):
  - Interpretation: The minimum amount of time Elasticsearch is guaranteed to store our data.
  - Format: Uses time units like 90d (90 days), 30m (30 minutes), or 1h (1 hour).
  - Behavior: This period is calculated starting from the moment a backing index is rolled over (it becomes "read-only"), not from its creation date.
- effective_retention:
  - This is the final calculated value that Elasticsearch actually uses to delete data.
  - What it represents: It is the minimum amount of time our data is guaranteed to stay in the cluster after an index has rolled over.
  - Why it might differ from our setting: We might set data_retention: "90d", but the cluster might have a global "max retention" or "default retention" policy that overrides our specific request.
- retention_determined_by:
  - This attribute identifies the source of the effective_retention value. Common values include:
    - data_stream_configuration: The retention is coming directly from the data_retention we set in our index template or data stream.
    - default_retention: We didn't specify a retention period, so Elasticsearch is using the cluster-wide default (e.g., data_streams.lifecycle.retention.default).
    - max_retention: We tried to set a very long retention (e.g., 1 year), but a cluster admin has capped all streams at a lower value (e.g., 90 days) using data_streams.lifecycle.retention.max.
- downsampling (Object/Array):
  - Interpretation: Configures the automatic reduction of time-series data resolution over time.
  - Behavior: It defines when (e.g., after 7 days) and how (e.g., aggregate 1-minute metrics into 1-hour blocks) data should be condensed to save storage space while keeping historical trends searchable.
How the effective retention is resolved:
1. If a max retention is set on the cluster and our setting exceeds it, the max retention wins.
2. Otherwise, if we have configured data_retention on the stream, it is used (as long as it's under the max).
3. If we have not configured anything, the default retention for the cluster is used.
4. If no default or max exists and we haven't set a value, retention is infinite.
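To set or change the DSL retention on an existing stream and then see what Elasticsearch actually applies (the stream name is a placeholder):

```
PUT _data_stream/logs-myapp-default/_lifecycle
{
  "data_retention": "90d"
}

GET _data_stream/logs-myapp-default/_lifecycle
# The response contains the configured data_retention together with the
# effective_retention and retention_determined_by values described above.
```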
- prefer_ilm (Boolean, index setting index.lifecycle.prefer_ilm):
  - true: Elasticsearch will use the ILM policy defined in the index template setting index.lifecycle.name. It will ignore the lifecycle block (DSL). Use this if you need granular control over shard allocation, force merging, or specific rollover ages that DSL doesn't offer.
  - false (or unset): Elasticsearch will prioritize the Data Stream Lifecycle (DSL) block. It will ignore the ILM policy for rollover and retention. This is the default behavior in modern 2026 clusters, encouraging the use of the simpler DSL.
What if data retention is set both in the lifecycle block (DSL) and in an ILM policy associated with the index template used by the data stream?
- The lifecycle block wins: Elasticsearch will prioritize the Data Stream Lifecycle (DSL) for retention and rollover.
- The ILM policy is ignored: We will often see a warning in the logs or the "Explain" API indicating that the ILM policy is being bypassed because DSL is active.
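The "Explain" API mentioned above is run against the backing indices; the index pattern below is a placeholder:

```
GET .ds-logs-myapp-default-*/_lifecycle/explain
GET .ds-logs-myapp-default-*/_ilm/explain
```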
How do we associate an ILM policy with a data stream?
What if the index template has a lifecycle attribute but no index.lifecycle.name?
1. DSL vs. ILM Precedence
2. How to associate the ILM Policy
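A sketch of an index template that carries both blocks (the pattern, policy name and the 90d value are assumptions): the lifecycle block at the root of template configures DSL, while the settings.index block links the ILM policy and prefer_ilm decides which one wins.

```
PUT _index_template/logs-myapp-template
{
  "index_patterns": ["logs-myapp-*"],
  "data_stream": {},
  "template": {
    "lifecycle": { "data_retention": "90d" },
    "settings": {
      "index": {
        "lifecycle": {
          "name": "logs-myapp-ilm-policy",
          "prefer_ilm": true
        }
      }
    }
  }
}
```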
Key Interpretations
- The lifecycle block at the root: This governs retention (90 days) and rollover (defaulted to auto, which usually means 7 days or 30 days depending on retention).
- The settings.index block: This is where you define the ILM link.
- Conflict Resolution: If you don't add prefer_ilm: true, Elasticsearch 2026 defaults to using the lifecycle block. Your data stream will continue rolling over every 7–30 days based on the auto logic, even if you put an ILM policy name in the settings.
Recommendation
How to fix a Shard Explosion?
What Force Merge Does
- Merges Segments: It reduces the number of Lucene segments in each shard, ideally to one, which improves search performance.
- Cleans Deleted Docs: It permanently removes (expunges) documents that were soft-deleted, freeing up disk space.
- Targets Shards: The operation is performed on the shards of one or more indices (or data stream backing indices).
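A minimal force merge sketch against a stream's read-only backing indices (names are placeholders); avoid force merging the current write index:

```
# Merge each shard's segments down to one
POST .ds-logs-myapp-default-*/_forcemerge?max_num_segments=1

# Or just expunge soft-deleted documents
POST .ds-logs-myapp-default-*/_forcemerge?only_expunge_deletes=true
```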
What Reduces Backing Indices
- Rollover (ILM): Index Lifecycle Management (ILM) creates new backing indices and can automatically delete old ones based on age or size.
- Data Stream Lifecycle: This automates the deletion of backing indices that are older than the defined retention period.
- Shrink API: While not typically used to combine multiple daily indices into one, it can be used to reduce the primary shard count of a specific, read-only backing index.
- Delete Index API: You can manually delete older backing indices (except the current write index).
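For example (the index name is a placeholder), an old backing index can be removed directly, as long as it is not the current write index:

```
DELETE .ds-logs-myapp-default-2026.01.01-000001
```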
Summary:
- Force Merge = Reduces segments inside shards (better search, less disk space).
- Rollover/Delete = Reduces the total number of backing indices (fewer indices overall).
Wednesday, 7 January 2026
Kubernetes Scheduling
Workload resources that create pods:
- Deployment
- StatefulSet
- Pod
- DaemonSet
- Job/CronJob
Mechanisms that influence where those pods are scheduled:
- Tolerations
- Node Selectors
- Node Affinity
- Pod Affinity/Anti-Affinity
- Taints (node-side)
- Priority and Preemption
- Topology Spread Constraints
- Resource Requests/Limits
- Custom Schedulers
- Runtime Class
Tolerations
Tainting the dedicated "elastic" nodes, and adding a matching toleration only to the workloads meant for them, protects against (see the taint/toleration sketch below):
- Accidental scheduling of non-elastic workloads
- Resource contention
- Cost inefficiency (elastic nodes might be expensive/specialized)
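A sketch assuming the elastic nodes carry a dedicated taint (the key/value pair is an assumption): taint the nodes in the pool, and add a matching toleration only to the workloads that belong there.

```
# Taint applied by the Karpenter NodePool (or manually per node):
#   kubectl taint nodes <node-name> dedicated=elastic:NoSchedule

# Toleration in the pod template of the allowed workload:
spec:
  template:
    spec:
      tolerations:
        - key: "dedicated"
          operator: "Equal"
          value: "elastic"
          effect: "NoSchedule"
```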
Node Selector
Suppose a workload has these placement requirements:
- Must be in the "elastic" Karpenter node pool
- Must be an AWS m7g.large instance (ARM-based Graviton3)
- Must be on-demand (not a spot instance; karpenter.sh/capacity-type can also have the value "spot")
A node selector matching those labels gives us (see the sketch below):
- ARM-based instances (m7g = Graviton)
- On-demand capacity (predictable, no interruptions)
- A specific node pool for workload isolation
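A nodeSelector expressing those three constraints might look like this (a sketch; karpenter.sh/nodepool and karpenter.sh/capacity-type are Karpenter's standard labels, and the pool name "elastic" is taken from the notes above):

```
spec:
  template:
    spec:
      nodeSelector:
        karpenter.sh/nodepool: elastic
        node.kubernetes.io/instance-type: m7g.large
        karpenter.sh/capacity-type: on-demand
```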
Node Affinity
Pod Affinity/Anti-Affinity
Taints (node-side)
Priority and Preemption
Topology Spread Constraints
Resource Requests/Limits
Custom Schedulers
Runtime Class
Useful kubectl commands
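kubectl get nodes prints the columns listed below by default:

```
kubectl get nodes
```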
- NAME e.g. ip-10-2-12-73.us-east-1.compute.internal
- STATUS e.g. Ready
- ROLES <none>
- AGE e.g. 1d
- VERSION e.g. v1.32.9-eks-ecaa3a6
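The extra INSTANCE-TYPE, ZONE and CAPACITY-TYPE columns below can be produced with label columns; the label keys here are the standard well-known node labels plus Karpenter's capacity-type label (one possible command, not necessarily the original one):

```
kubectl get nodes \
  -L node.kubernetes.io/instance-type \
  -L topology.kubernetes.io/zone \
  -L karpenter.sh/capacity-type
```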
- NAME e.g. ip-10-2-12-73.us-east-1.compute.internal
- STATUS e.g. Ready
- ROLES <none>
- AGE e.g. 1d
- VERSION e.g. v1.32.9-eks-ecaa3a6
- INSTANCE-TYPE e.g. m7g.2xlarge
- ZONE e.g. us-east-2a
- CAPACITY-TYPE e.g. on-demand
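kubectl get nodes -o wide adds the IP, OS image, kernel and container runtime columns listed below:

```
kubectl get nodes -o wide
```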
- NAME e.g. ip-10-2-12-73.us-east-1.compute.internal
- STATUS e.g. Ready
- ROLES <none>
- AGE e.g. 1d
- VERSION e.g. v1.32.9-eks-ecaa3a6
- INTERNAL-IP e.g. 10.2.12.73
- EXTERNAL-IP e.g. <none>
- OS-IMAGE e.g. Amazon Linux 2023.9.20251208
- KERNEL-VERSION e.g. 6.1.158-180.294.amzn2023.aarch64 or 6.1.132-147.221.amzn2023.x86_64
- CONTAINER-RUNTIME e.g. containerd://2.1.5
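Listing pods across all namespaces with the wide output (which shows node placement) produces the columns listed below:

```
kubectl get pods -A -o wide
```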
- NAMESPACE
- NAME
- READY (X/Y means X out of Y containers are ready)
- STATUS
- RESTARTS
- AGE
- IP
- NODE
- NOMINATED NODE
- READINESS GATES
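kubectl get namespaces prints the columns below:

```
kubectl get namespaces
```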
- NAME e.g. default
- STATUS e.g. Active
- AGE e.g. 244d
Kubernetes DaemonSet
Kubernetes DaemonSet is a workload resource that ensures a specific pod runs on all (or selected) nodes in a cluster. It's commonly used for deploying node-level services like log collectors, monitoring agents, or network plugins.
Example:
Elastic Agent is Elastic's unified data shipper, typically used in a k8s cluster to collect container logs, Kubernetes metrics and node-level metrics, and to ship all of that data to Elasticsearch. It is deployed in the cluster as a DaemonSet.
We can use a DaemonSet to run a copy of a pod on every node, or we can use node affinity or selector rules to run it on only certain nodes.
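A minimal DaemonSet sketch (names and image are placeholders) for a node-level log collector, with an optional nodeSelector to limit it to a subset of nodes:

```
apiVersion: apps/v1
kind: DaemonSet
metadata:
  name: log-collector
  namespace: logging
spec:
  selector:
    matchLabels:
      app: log-collector
  template:
    metadata:
      labels:
        app: log-collector
    spec:
      # Optional: restrict the DaemonSet to certain nodes
      nodeSelector:
        kubernetes.io/os: linux
      containers:
        - name: agent
          image: example.com/log-collector:1.0   # placeholder image
          volumeMounts:
            - name: varlog
              mountPath: /var/log
              readOnly: true
      volumes:
        - name: varlog
          hostPath:
            path: /var/log
```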
What is the difference between ReplicaSet and DaemonSet?
ReplicaSets ensure a specific number of identical pods run for scaling stateless apps (e.g., web servers), while DaemonSets guarantee one pod runs on every (or a subset of) node(s) for node-specific tasks like logging or monitoring. The key difference is quantity versus location: ReplicaSets focus on maintaining pod count for availability, whereas DaemonSets ensure pod presence on each node for system-level services.
ReplicaSet
- Purpose: Maintain a stable set of replica pods for stateless applications, ensuring high availability and scalability.
- Scaling: Scales pods up or down based on the replicas field you define in the manifest.
- Use Case: Running web frontends, APIs, or any application needing multiple identical instances.
- Behavior: If a pod dies, it creates a new one to meet the replica count; if a node fails, it tries to reschedule elsewhere.
DaemonSet
- Purpose: Run a single copy of a pod on every (or specific) node in the cluster for node-specific tasks.
- Scaling: Automatically adds a pod when a new node joins the cluster and removes it when a node leaves.
- Use Case: Logging agents (Fluentd, Elastic Agent), monitoring agents (Prometheus node-exporter), or storage daemons.
- Behavior: Ensures that a particular service runs locally on each machine for local data collection or management.
References:
DevOps Interview: Replica sets vs Daemon sets - DEV Community
Monday, 5 January 2026
Kubernetes ReplicaSets
Key Functions
- If pods crash or are deleted, it creates new ones to replace them
- If there are too many pods, it terminates the excess ones
- This self-healing behavior keeps your application running reliably
How It Works
- Selector: Labels used to identify which pods belong to this ReplicaSet
- Replicas: The desired number of pod copies
- Pod template: The specification for creating new pods when needed
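A minimal ReplicaSet sketch (names and image are placeholders) showing those three parts: selector, replicas and pod template.

```
apiVersion: apps/v1
kind: ReplicaSet
metadata:
  name: web-frontend
spec:
  replicas: 3                # desired number of pod copies
  selector:
    matchLabels:
      app: web-frontend      # pods with this label belong to the ReplicaSet
  template:                  # pod template used when new pods are needed
    metadata:
      labels:
        app: web-frontend
    spec:
      containers:
        - name: web
          image: nginx:1.27
```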
What is the relation between a ReplicaSet and a Deployment?
A Deployment is a higher-level controller that manages ReplicaSets for you. When you update a Deployment's pod template, it performs a rolling update by:
- Creating a new ReplicaSet with the updated pod specification
- Gradually scaling up the new ReplicaSet while scaling down the old one
- Keeping both ReplicaSets around for rollback capability
What Deployment Adds
- Rolling updates: Gradual replacement of old pods with new ones
- Rollback: Easy reversion to previous versions
- Update strategies: Control how updates happen (RollingUpdate, Recreate)
- Revision history: Track changes over time
- Pause/resume: Control when updates are applied
In Practice
Why are there many old ReplicaSets in my cluster?
Every time the Deployment's pod template changes, the Deployment:
- Creates a new ReplicaSet
- Scales the old one down to 0
- Keeps the old ReplicaSet for rollback purposes
Any of these counts as a pod template change:
- Image tag change
- Env var change
- ConfigMap checksum change
- Resource requests/limits change
- Annotation change on the pod template
- Helm re-deploy with different values
Old ReplicaSets accumulate when:
- revisionHistoryLimit is unset (the default keeps the last 10)
- or it is explicitly set high
- or the Helm chart doesn't define it
A typical history looks like this:
- Multiple deployments over months
- Each one created a new ReplicaSet
- All old ones were scaled to 0
- Only the newest one is active
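An illustrative kubectl get rs listing (names, hashes and ages are made up):

```
$ kubectl get rs
NAME                DESIRED   CURRENT   READY   AGE
my-app-5f7c9d8b6d   3         3         3       2d
my-app-7b4f6c5d9f   0         0         0       41d
my-app-86d4c7f5b4   0         0         0       97d
my-app-6b9d8c7e4a   0         0         0       158d
```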
These old, scaled-to-zero ReplicaSets:
- Consume almost no resources
- Do not schedule pods
- Exist mainly as metadata
- Performance impact → negligible
- Scheduler impact → none
They only become a problem when:
- You have hundreds or thousands of old ReplicaSets
- kubectl get rs becomes noisy
- GitOps / audits become painful
- You accidentally roll back to a very old revision
- etcd size is a concern (rare, but real at scale)
The fix is to set revisionHistoryLimit on the Deployment. With revisionHistoryLimit: 3, for example:
- Kubernetes keeps only the last 3 old ReplicaSets
- Older ones are automatically deleted
Reasonable values:
- 2–3 for staging
- 5–10 for prod (depending on rollback needs)
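Setting it on the Deployment (the value 3 is just an example):

```
# One-off patch on an existing Deployment:
kubectl patch deployment my-app -p '{"spec":{"revisionHistoryLimit":3}}'

# Or permanently in the manifest / Helm chart:
#   spec:
#     revisionHistoryLimit: 3
```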
Manually deleting old ReplicaSets is also safe, but:
- This is a one-off cleanup
- Without fixing revisionHistoryLimit, they'll come back
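A one-off cleanup sketch for the current namespace (review the list before deleting; assumes xargs supports the -r flag):

```
# Delete ReplicaSets whose desired replica count is 0
kubectl get rs --no-headers | awk '$2 == 0 {print $1}' | xargs -r kubectl delete rs
```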
TL;DR
- Many old ReplicaSets is normal
- They exist for rollback history
- They’re old because Kubernetes keeps them indefinitely
- Fix it with revisionHistoryLimit
- Manual deletion is safe but not a long-term solution


