Wednesday, 1 July 2026

Introduction to Velero

 


Velero is an open-source, Kubernetes-native tool for backup, restore, disaster recovery and migration. It lets you back up and restore both your Kubernetes cluster resources and, optionally, the persistent volumes (PVs) that hold application data.

It was originally created by Heptio (the company founded by two of Kubernetes' creators) and is now maintained by VMware and the open-source community.

What does Velero back up?

Velero can back up:

  • Kubernetes resources (Kubernetes API objects):
    • Deployments
    • StatefulSets
    • Services
    • Ingresses
    • ConfigMaps
    • Secrets
    • CRDs
    • Namespaces
    • RBAC resources
    • Custom Resources
These are stored as compressed tarballs in object storage (for example, an S3 bucket).

Optionally, it can also back up:

  • Persistent Volumes (application data)
    • via storage snapshots (AWS EBS, Azure Disk, GCP Persistent Disk, etc.)
      • for example, EBS snapshots taken through the CSI driver
    • or via File System Backup, run by the node agent (the DaemonSet formerly named restic)
      • file-level backup with the built-in Kopia or Restic uploader, for volumes without snapshot support (EFS, NFS, etc.); hostPath volumes are not supported

How it works

A typical Velero deployment consists of:


                    +----------------+
                    | Kubernetes API |
                    +--------+-------+
                             |
                       Velero Server
                             |
         +-------------------+-------------------+
         |                                       |
   Metadata Backup                         Volume Backup
         |                                       |
         v                                       v
   Object Storage                     Snapshot or File Backup
  (S3, Azure Blob,                     (EBS, CSI Snapshot,
   GCS, MinIO...)                      Node Agent/Restic)


For example:

  • Cluster metadata → stored in an S3 bucket
  • PV data → stored as EBS snapshots or uploaded to object storage

Typical use cases

1. Disaster recovery

Your EKS cluster is accidentally deleted.

With Velero you can:

  • recreate the cluster
  • install Velero
  • restore all workloads
  • restore persistent data

2. Accidental deletion

Someone runs:

kubectl delete namespace production

Instead of recreating everything manually:

velero restore create \
--from-backup production-backup
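
The restore above brings back everything contained in the backup. When only part of it is needed, the restore can be narrowed. The sketch below assumes the velero CLI is installed and pointed at the right cluster; the helper names restore_namespace and restore_by_label are made up for illustration and are not part of Velero.

```shell
#!/bin/sh
# Sketch: restore only one namespace from an existing backup.
# restore_namespace / restore_by_label are hypothetical helper names.
restore_namespace() {
  backup="$1"; namespace="$2"
  velero restore create --from-backup "$backup" \
    --include-namespaces "$namespace"
}

# Restore only the objects carrying a given label, e.g. app=mongodb.
restore_by_label() {
  backup="$1"; selector="$2"
  velero restore create --from-backup "$backup" --selector "$selector"
}
```

For example, restore_namespace production-backup production would bring back just the production namespace, leaving the rest of the cluster untouched.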

3. Cluster migration

Move workloads from:

  • EKS → EKS
  • EKS → AKS
  • EKS → GKE
  • On-prem → cloud

Velero restores Kubernetes objects into the new cluster.


4. Scheduled backups

Example:

Every night at 2 AM



Backup namespaces:
- production
- monitoring
- logging

Retention can be configured, for example:

Keep 30 daily backups
Delete older ones automatically
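
The schedule and retention described above could be expressed roughly as follows. This is a sketch only: the helper name create_nightly_schedule, the schedule name nightly and the 720h (30-day) TTL are illustrative choices, and it assumes the velero CLI is installed and configured.

```shell
#!/bin/sh
# Sketch: nightly 02:00 backup of selected namespaces, kept for 30 days.
# create_nightly_schedule is a hypothetical helper name, not part of Velero.
create_nightly_schedule() {
  namespaces="$1"          # comma-separated, e.g. production,monitoring,logging
  ttl="${2:-720h0m0s}"     # 30 days by default; backups are removed after the TTL
  velero schedule create nightly \
    --schedule="0 2 * * *" \
    --include-namespaces "$namespaces" \
    --ttl "$ttl"
}
```

Calling create_nightly_schedule production,monitoring,logging would create the schedule; each backup it produces carries its own TTL, so "keep 30 daily backups" falls out of a nightly schedule combined with a 30-day TTL.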

What it does NOT back up

Velero does not automatically back up:

  • etcd directly
  • cloud infrastructure (VPCs, Load Balancers, IAM, Security Groups)
  • managed databases like RDS
  • container images (they remain in your registry)
  • external services

Those require separate backup strategies.


Storage providers

Velero supports many object storage backends:

  • Amazon S3
  • MinIO
  • Azure Blob Storage
  • Google Cloud Storage
  • OCI Object Storage
  • many S3-compatible systems

Persistent Volume backup methods

There are two main approaches.

1. CSI snapshots (preferred)

If your storage class supports the Container Storage Interface (CSI) snapshot API:

PVC → VolumeSnapshot → Cloud snapshot

Advantages:

  • very fast
  • incremental (depending on storage backend)
  • cloud-native
  • recommended

2. File System Backup via the node agent (formerly the Restic integration)

If snapshots aren't available:

PVC → Read filesystem → Compress → Upload to object storage

Advantages:

  • works almost everywhere
  • storage-independent

Disadvantages:

  • slower
  • consumes CPU and network bandwidth
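
As a rough illustration of the file-level approach, the sketch below shows two ways of opting volumes in. It assumes Velero was installed with the node agent enabled (velero install --use-node-agent); the helper names and the "-fsb" backup name suffix are made up for this example.

```shell
#!/bin/sh
# Sketch: back up one namespace, using File System Backup for all pod volumes
# instead of snapshots. Requires the node agent to be running.
# backup_with_fs_backup is a hypothetical helper name.
backup_with_fs_backup() {
  namespace="$1"
  velero backup create "${namespace}-fsb" \
    --include-namespaces "$namespace" \
    --default-volumes-to-fs-backup
}

# Alternative: opt in a single volume by annotating its pod.
# The volume argument is the volume name from the pod spec.
opt_in_volume() {
  namespace="$1"; pod="$2"; volume="$3"
  kubectl -n "$namespace" annotate "pod/$pod" \
    "backup.velero.io/backup-volumes=$volume"
}
```

The per-pod annotation is the narrower choice when only a few volumes lack snapshot support and the rest should keep using snapshots.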

Example architecture in AWS

                 Amazon EKS
                     |
         +-----------+-----------+
         |                       |
  Kubernetes API         Persistent Volumes
         |                       |
   Velero Server           EBS Snapshots
         |
         |
     S3 Bucket
     backups/

Example installation

Install the CLI:

brew install velero

Deploy into an EKS cluster:

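# Pin the plugin tag that matches your Velero version (v1.10.0 is only an example).
# On EKS with IRSA, replace --secret-file with --no-secret.
velero install \
  --provider aws \
  --plugins velero/velero-plugin-for-aws:v1.10.0 \
  --bucket my-backups \
  --secret-file ./credentials-velero \
  --backup-location-config region=eu-west-2 \
  --snapshot-location-config region=eu-west-2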

Example backup

Back up an entire cluster:

velero backup create full-cluster

Back up a namespace:

velero backup create production \
--include-namespaces production

Example restore

velero restore create \
--from-backup full-cluster

Scheduling

Create a nightly backup:

velero schedule create nightly \
--schedule="0 2 * * *"

When should you use Velero?

Velero is a good fit if you want to:

  • Recover Kubernetes workloads after accidental deletion or cluster failure.
  • Back up application manifests and, optionally, persistent data.
  • Migrate workloads between Kubernetes clusters or cloud providers.
  • Automate recurring backups with retention policies.
  • Protect stateful applications running on Kubernetes.

If your applications also depend on external systems (for example, managed databases, message brokers, or cloud resources), Velero should be part of a broader disaster recovery strategy rather than the only backup solution.

Velero vs. etcd backup

Feature                              Velero   etcd backup
-----------------------------------  -------  -----------
Kubernetes resources                 Yes      Yes
Persistent volume data               Yes      No
Restore individual namespaces        Yes      No
Restore individual applications      Yes      No
Migrate between clusters             Yes      Limited
Cloud agnostic                       Yes      Mostly
Disaster recovery for applications   Yes      Partial

For managed Kubernetes services such as Amazon Elastic Kubernetes Service (EKS), Velero is generally the preferred backup tool because it focuses on application-level recovery rather than restoring the control plane itself. In contrast, direct etcd backups are more common in self-managed Kubernetes clusters where you control the control plane.


Would we benefit from Velero if we already have everything in Terraform?


Yes, because Terraform and Velero cover different layers. Terraform captures the desired infrastructure (the cluster, node groups, add-ons) and our GitOps/Helm setup captures workload definitions. Neither captures the runtime state that lives inside the cluster:

  • Secrets created in-cluster rather than from Git
    • If the source of truth is AWS Secrets Manager and the definitions are in Git, we recover every secret by redeploying external-secrets and waiting ~1 minute for re-sync. Velero adds little here.
  • ConfigMaps created in-cluster rather than from Git
    • If they are 100% in Git, they can be recovered either via kubernetes_config_map Terraform resources (e.g. the kube-prometheus-stack one with computed IAM ARNs) or via static YAML applied through kubectl_manifest (e.g. MSK scrape targets), with no imperative creation involved.
  • Dynamically provisioned PVCs and the data on their volumes
  • Operator-managed CRs and other objects mutated after (terraform) apply

So Terraform lets us rebuild an empty, correctly-shaped cluster; it does not restore what was running and stored in it at a point in time. Velero fills exactly that gap — plus it gives granular (namespace/PVC-level) and fast restores that a Terraform rebuild can't.


Do we actually need cluster (k8s object definitions + storage) backups?


Yes, most acutely on the stateful clusters (e.g. those running MongoDB and ClickHouse), where large EBS-backed PVs hold real data. In non-stateful clusters the main value is namespace/object-level disaster recovery (DR) and fast granular restore (e.g. of an accidentally deleted namespace or PVC). The EBS CSI driver is already installed on all clusters, so PV snapshots are straightforward to enable.


How does Velero's PVC backup differ from taking snapshots of all volumes associated with all PVCs in the cluster? What value does Velero bring here?


The short version: snapshotting every volume gives you a pile of disk images; Velero gives you disk images that are still wired into the Kubernetes objects that make them usable. The bytes are the easy part — the value is everything around them.
  
Concretely, here's what a "loop aws ec2 create-snapshot over all PVC-backed volumes" script leaves you to build yourself, and what Velero does for you:

1. The snapshot↔object binding (the big one)

  
A raw EBS snapshot is an orphan — a snap-0abc… with no memory of which PVC, PV, namespace, StorageClass, or pod it belonged to. To restore it you must, per volume, by hand:

  1. create-volume from the snapshot (right AZ, right type),
  2. hand-craft a PV pointing at the new vol-… ID,
  3. craft a matching PVC with the correct size/StorageClass/claimRef,
  4. fix the binding so they actually pair up,
  5. restart the workload so it mounts it.

For one volume that's annoying; for a whole cluster it's an error-prone afternoon during an incident. Velero records the full PVC→PV→StorageClass→namespace mapping in its backup manifest and reconstructs all of it on restore — you get a bound PVC attached to a running pod, not a disk image.
  

2. Consistency & point-in-time grouping


A create-snapshot loop is staggered: volume 1 is snapped at T, volume 50 at T+30s, with no coordination. For a multi-volume app (say data + WAL on separate PVs) that's an inconsistent set. Velero treats a backup as one logical operation and supports pre/post hooks (fsfreeze, db.fsyncLock()) to quiesce the app first, i.e. to pause it briefly so that its on-disk state is consistent. That way you can get application-consistent, not just crash-consistent, snapshots. This is the nuance behind the question "do we still need the mongodump if we snapshotted the PVCs?": a bare snapshot is crash-consistent; Velero + hooks tightens that.
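
A minimal sketch of the hook mechanism, using Velero's pod annotations for pre- and post-backup commands. It assumes the named container ships /sbin/fsfreeze and is permitted to run it (which usually means a privileged container); the helper name add_fsfreeze_hooks is invented for this example.

```shell
#!/bin/sh
# Sketch: add Velero backup hooks to a pod so the filesystem is frozen before
# the volume is backed up and thawed again afterwards.
# add_fsfreeze_hooks is a hypothetical helper name.
add_fsfreeze_hooks() {
  namespace="$1"; pod="$2"; container="$3"; mount_path="$4"
  kubectl -n "$namespace" annotate "pod/$pod" \
    "pre.hook.backup.velero.io/container=$container" \
    "pre.hook.backup.velero.io/command=[\"/sbin/fsfreeze\", \"--freeze\", \"$mount_path\"]" \
    "post.hook.backup.velero.io/container=$container" \
    "post.hook.backup.velero.io/command=[\"/sbin/fsfreeze\", \"--unfreeze\", \"$mount_path\"]"
}
```

Annotations applied to a live pod disappear when the pod is recreated, so in practice the same annotations would more likely be set in the pod template of the Deployment or StatefulSet.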
  

3. Discovery & scoping via the K8s API


Velero enumerates PVCs through the Kubernetes API, so it knows which volumes are PVCs (vs node root disks vs unrelated EBS) and can scope by namespace or label — "back up just the mongodb namespace." A raw script has to reconstruct the volume→PVC mapping from EBS tags and risks grabbing node OS disks or missing dynamically-provisioned ones. There's no namespace concept at the EBS layer at all.

4. Selective & cross-cluster restore


Because Velero holds the object graph, a restore can remap — into a different namespace, a different StorageClass (gp2→gp3, or a renamed SC [Storage Class] in the target cluster), or an entirely different cluster/region. This is what makes the blue/green K8s-upgrade path work. Raw snapshots have none of this; you'd script every remap yourself.
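
The two remaps mentioned above can be sketched as follows. The StorageClass swap is driven by a ConfigMap carrying two Velero-specific labels; the namespace remap is a restore flag. The helper names, the ConfigMap name and the assumption that Velero runs in the velero namespace are illustrative.

```shell
#!/bin/sh
# Sketch: print the ConfigMap Velero reads to swap StorageClasses on restore.
# storage_class_mapping is a hypothetical helper; the ConfigMap name is
# arbitrary, the two labels are what Velero looks for.
storage_class_mapping() {
  old_sc="$1"; new_sc="$2"
  cat <<EOF
apiVersion: v1
kind: ConfigMap
metadata:
  name: change-storage-class-config
  namespace: velero
  labels:
    velero.io/plugin-config: ""
    velero.io/change-storage-class: RestoreItemAction
data:
  ${old_sc}: ${new_sc}
EOF
}

# Restore a backup into a differently named namespace.
restore_into_namespace() {
  backup="$1"; from_ns="$2"; to_ns="$3"
  velero restore create --from-backup "$backup" \
    --namespace-mappings "${from_ns}:${to_ns}"
}
```

A possible sequence would be storage_class_mapping gp2 gp3 | kubectl apply -f - in the target cluster, followed by restore_into_namespace with the backup name and the old and new namespace names.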

5. Lifecycle, catalog, portability


- Retention/GC: Velero backups carry a TTL, and expired backups and their snapshots are garbage-collected. A snapshot script means you own tagging, retention, and GC, and you pay for snapshot sprawl if you get it wrong.
- Catalog: velero backup get / describe and velero restore describe give you a queryable history. Raw snapshots are just IDs in EC2 that you have to correlate yourself.
- Portability (data mover): Velero's CSI snapshot data mover can push the snapshot data into S3 via Kopia, so backups aren't locked to regional, account-bound EBS snapshots, which enables cheaper long retention and cross-region DR. Separately, File System Backup covers non-EBS volumes (EFS, etc.) that create-snapshot can't touch.
  
 
In short: raw volume snapshots back up the data; Velero backs up the data plus everything needed to turn it back into a running, bound, correctly-classed volume in the right namespace, as one logical backup with consistency hooks, retention, and cross-cluster remapping. The reattachment/orchestration layer is the value, not the snapshot itself.

Can Velero be used in cluster K8s version upgrades?


Yes, in two distinct ways:
  1. Pre-upgrade safety net (in place): take a full Velero backup immediately before an EKS control-plane/version bump. If the upgrade goes wrong, we have an instant restore target for workloads + PV data, independent of the control-plane change.
  2. Blue/green upgrade (migration): stand up a new cluster on the target version and velero restore the workloads into it, then cut traffic over — instead of an in-place upgrade. Velero remaps storage classes and strips node-specific fields on restore, which makes this cross-cluster move practical. This is the cleaner pattern for risky major jumps.

Caveat: on version upgrades, watch for deprecated/removed APIs — objects backed up under an old apiVersion may need conversion on restore into a newer cluster; test-restore before relying on it.
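
The pre-upgrade safety net could look roughly like this. The helper name pre_upgrade_backup, the naming pattern and the 7-day TTL are assumptions made for the example; --wait makes the CLI block until the backup has finished.

```shell
#!/bin/sh
# Sketch: take a full, timestamped backup and block until it finishes, so the
# upgrade only starts once a restore point exists.
# pre_upgrade_backup is a hypothetical helper; the 168h (7-day) TTL is an assumption.
pre_upgrade_backup() {
  cluster="$1"
  name="pre-upgrade-${cluster}-$(date +%Y%m%d-%H%M)"
  # Create the backup, then print its details so the final phase can be checked.
  velero backup create "$name" --ttl 168h0m0s --wait &&
    velero backup describe "$name"
}
```

It is worth reading the describe output before starting the upgrade, since a backup can finish with warnings or partial failures.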



Showcase: 4 Kubernetes Clusters in AWS EKS


For an estate of, say, four EKS clusters on AWS, the concrete wins are:

1. Disaster recovery


If a cluster gets corrupted, mis-applied, or accidentally deleted, you can stand up a fresh EKS cluster and Velero restore the whole thing — workloads, configs, PVs — instead of hoping every namespace is reproducible from Git. Even with GitOps, Velero captures the runtime state (secrets created in-cluster, dynamically provisioned PVCs, operator-managed CRs) that Git doesn't.

2. Namespace / PVC-level granular restore


"Someone deleted the prod PVC / dropped a namespace" — restore just that namespace or volume from last night's backup. Much faster than rebuilding.

3. Cluster migration & cloning


Lift workloads from one cluster to another — e.g. clone prod into a staging cluster for testing, or migrate to a new cluster during an EKS version upgrade or AZ/region move. Velero remaps storage classes and strips node-specific fields on restore.

4. Scheduled backups with retention


Cron-style Schedule CRs (e.g. nightly full, hourly for critical namespaces) with TTL-based expiry. One bucket per cluster (or prefixes within one), lifecycle-policied in S3.

5. Pre-upgrade safety net


Take a backup right before an EKS control-plane bump, Helm release, or risky terraform apply that touches cluster resources — instant rollback target.


How it fits this AWS setup

  • Plugin: velero-plugin-for-aws:
    • object store = S3
    • volume snapshotter = EBS
  • Auth: IRSA (IAM Roles for Service Accounts) per cluster — no static keys. This should ideally match how the rest of the stack is wired.
  • CSI snapshots: since you're on EBS, enable Velero's CSI snapshot support so PV backups are real EBS snapshots, not just file copies. The CSI snapshot data mover is an optional extra that copies the snapshot data into S3.
  • One S3 bucket, per-cluster prefix (or one bucket each) with versioning + a lifecycle rule for expiry.
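
Put together, a per-cluster install along these lines might look like the sketch below. It assumes an IAM role for Velero already exists and trusts the cluster's OIDC provider; the helper name install_velero_irsa is invented, and the plugin tag is only an example that should be matched to the Velero version in use.

```shell
#!/bin/sh
# Sketch: install Velero on one EKS cluster using IRSA instead of static keys.
# install_velero_irsa is a hypothetical helper; the plugin tag is an assumption.
install_velero_irsa() {
  cluster="$1"; bucket="$2"; region="$3"; role_arn="$4"
  velero install \
    --provider aws \
    --plugins velero/velero-plugin-for-aws:v1.10.0 \
    --bucket "$bucket" \
    --prefix "$cluster" \
    --backup-location-config "region=$region" \
    --snapshot-location-config "region=$region" \
    --no-secret \
    --sa-annotations "eks.amazonaws.com/role-arn=$role_arn" \
    --features=EnableCSI
}
```

Using the cluster name as the bucket prefix gives the "one bucket, per-cluster prefix" layout; passing a different bucket per cluster gives the alternative. Older Velero releases may additionally need the separate CSI plugin for the EnableCSI feature to work.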

Honest caveats

  • It is not an etcd backup — EKS already manages the control plane / etcd for you, so that's fine, but don't think of Velero as "backing up the cluster brain."
  • It is not a substitute for application-level DB backups (RDS snapshots, Mongo dumps). Treat Velero as the Kubernetes object + PV layer; keep your existing DB backup story separate.
  • Restores of stateful apps need testing — ordering, hooks, and re-attach behaviour matter. A backup you've never test-restored isn't a backup.

A sensible rollout for 4 clusters: install via the Velero Helm chart through your existing Terraform/Helm pipeline, IRSA per cluster, nightly scheduled backups to S3 with a 30-day TTL, CSI snapshots enabled (with the data mover where copies in S3 are wanted), and one documented test-restore into a throwaway cluster to prove the loop.


How does Velero's k8s object backup differ from kubectl get all --all-namespaces -o yaml > "$backup_file_name"?

A kubectl get all dump is a far weaker thing than it looks, and the gap is exactly where Velero earns its keep. Breaking it down:
  

1. kubectl get all doesn't actually get "all"


all is a small, fixed category: roughly pods, services, replicationcontrollers, deployments, replicasets, daemonsets, statefulsets, horizontalpodautoscalers, jobs, cronjobs. It silently omits most of what you'd need to reconstruct a namespace:

  - Secrets and ConfigMaps ← this alone makes the dump useless for DR
  - PVCs, Ingresses, NetworkPolicies, ResourceQuotas, LimitRanges
  - ServiceAccounts, Roles, RoleBindings (all RBAC)
  - Namespaces themselves (you're dumping resources inside namespaces)
  - Cluster-scoped objects entirely — CRDs, ClusterRoles, ClusterRoleBindings, PVs, StorageClasses, webhooks
  - CRD instances (your custom resources — cert-manager Certificates, external-secrets, Karpenter NodePools, etc.)

So --all-namespaces is a misnomer twice over. Velero, by contrast, enumerates the API discovery list and backs up every resource type (namespaced + cluster-scoped), with include/exclude filters if you want to trim.
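
To see the gap for yourself, the namespaced resource types the API server actually exposes can be enumerated and listed one by one. This is a read-only sketch and still not a backup; the helper name list_everything is made up for this example.

```shell
#!/bin/sh
# Sketch: list every listable namespaced resource type in a namespace, which is
# far more than "kubectl get all" returns.
# list_everything is a hypothetical helper name.
list_everything() {
  namespace="$1"
  # Ask the API server which namespaced types support "list", then get each one.
  kubectl api-resources --verbs=list --namespaced -o name |
    while read -r resource; do
      kubectl get "$resource" -n "$namespace" --ignore-not-found --show-kind
    done
}
```

Comparing the output of list_everything production with kubectl get all -n production shows the Secrets, ConfigMaps, PVCs, RBAC objects and custom resources that the "all" category leaves out. Cluster-scoped objects would need a second pass with --namespaced=false.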

2. The biggest gap: no volume data


A YAML dump captures the PVC object: a few hundred bytes of metadata saying "I want 300Gi". It does not capture the 300Gi of bytes on the EBS volume. Velero snapshots the actual disk (EBS snapshot via CSI, or file-level via Kopia). For clusters such as the MongoDB/ClickHouse ones this is the entire point: the dump backs up the description of your data, Velero backs up your data.

3. The dump isn't restorable as-is


kubectl get -o yaml includes live runtime cruft that a blind kubectl apply chokes on:

  - status: blocks, resourceVersion, uid, creationTimestamp, managedFields
  - Cluster-assigned values: Service clusterIP/nodePort, PVC volumeName bindings, node-specific nodeName on pods
  - Objects that shouldn't be recreated at all (running pods owned by a ReplicaSet, endpoint objects)

Re-applying that fails on immutable fields and conflicts. Velero strips and regenerates these on restore (drops status/managedFields, re-allocates ClusterIPs, rebinds PVCs to freshly-restored PVs, lets controllers re-adopt pods).

4. Restore ordering & dependencies


A single giant file applied top-to-bottom breaks: you can't create a CR before its CRD is Established, a PVC before its PV, a RoleBinding before its namespace/ServiceAccount. Velero restores in a priority order (CRDs and namespaces first, then StorageClasses, PVs before PVCs, then the rest) and waits for prerequisites.

5. Selective / cross-cluster restore

  
Velero restores are first-class operations: 
  • restore one namespace
  • remap namespace names
  • filter by label
  • restore into a different cluster (this is what makes the blue/green K8s-upgrade story work)
  • skip or update resources that already exist (--existing-resource-policy)
  • run restore hooks

The dump is all-or-nothing text with none of that.

6. Lifecycle, not a loose file

  
Velero stores backups to S3 as tracked objects with TTL/retention, checksums, and a queryable history (velero backup get), plus scheduling, metrics, and consistency hooks (fsfreeze / db.fsyncLock() pre-snapshot). The kubectl dump is one file you now have to name, rotate, secure (it contains plaintext secrets!), and hope is internally consistent since it was assembled by looping over the API.

  ---

Where the dump is fine: a quick "let me eyeball/grep the current state of a namespace" or copying one manifest between environments. As a disaster-recovery mechanism it fails on the two things that matter most — it doesn't capture secrets or volume data, and what it does capture isn't cleanly restorable.


