Velero is an open-source, Kubernetes-native tool for backup, restore, disaster recovery and migration of Kubernetes clusters. It backs up and restores your cluster resources and, optionally, the persistent volumes (PVs) that hold application data.
It was originally created by Heptio (the company founded by two of Kubernetes' creators) and is now maintained by VMware and the open-source community.
What does Velero back up?
Velero can back up:
- Kubernetes resources (Kubernetes API objects):
  - Deployments
  - StatefulSets
  - Services
  - Ingresses
  - ConfigMaps
  - Secrets
  - CRDs
  - Namespaces
  - RBAC resources
  - Custom Resources
Optionally, it can also back up:
- Persistent Volumes (application data)
  - via native storage snapshots (AWS EBS, Azure Disk, GCP Persistent Disk, etc.)
  - via CSI snapshots (for example, EBS snapshots taken through the EBS CSI driver)
  - via file-level backup performed by the node agent (the DaemonSet formerly named after Restic), using the built-in Kopia or Restic uploader, for volumes that cannot be snapshotted (EFS, NFS, etc.). hostPath volumes are not supported.
How it works
A typical Velero deployment consists of:
            +----------------+
            | Kubernetes API |
            +--------+-------+
                     |
               Velero Server
                     |
      +--------------+--------------+
      |                             |
Metadata Backup               Volume Backup
      |                             |
      v                             v
Object Storage            Snapshot or File Backup
(S3, Azure Blob,          (EBS, CSI Snapshot,
 GCS, MinIO...)            Node Agent/Restic)
For example:
- Cluster metadata → stored in an S3 bucket
- PV data → stored as EBS snapshots or uploaded to object storage
Typical use cases
1. Disaster recovery
Your EKS cluster is accidentally deleted.
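With Velero, recovery looks like this:
- recreate the cluster (with your provisioning tooling; Velero does not create clusters)
- install Velero, pointing it at the same backup storage location
- restore all workloads
- restore persistent data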
2. Accidental deletion
Someone runs:
kubectl delete namespace production
Instead of recreating everything manually:
velero restore create \
--from-backup production-backup
3. Cluster migration
Move workloads from:
- EKS → EKS
- EKS → AKS
- EKS → GKE
- On-prem → cloud
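Velero restores Kubernetes objects into the new cluster. Native volume snapshots do not move between cloud providers, so cross-provider migrations (for example, EKS → AKS) need file-level volume backups.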
4. Scheduled backups
Example:
Every night at 2 AM
↓
Backup namespaces:
- production
- monitoring
- logging
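Retention is configured as a time-to-live (TTL) on each backup, for example:
Keep each daily backup for 30 days (a TTL of 720h, which is also Velero's default)
Expired backups are deleted automatically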
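The nightly schedule and 30-day retention described above could be expressed roughly as follows. This is a minimal sketch: the schedule name nightly-apps and the helper functions run and ttl_for_days are made up for illustration, and the script only prints the command unless DRY_RUN=0 is set.

```shell
# Print the command by default; set DRY_RUN=0 to run it against the current cluster.
run() {
  if [ "${DRY_RUN:-1}" = "1" ]; then
    printf '%s\n' "$*"
  else
    "$@"
  fi
}

# Convert a retention period in days into the hours-based duration that --ttl takes.
# Example: 30 days -> 720h0m0s
ttl_for_days() {
  printf '%sh0m0s' "$(( $1 * 24 ))"
}

SCHEDULE_NAME="nightly-apps"                  # hypothetical schedule name
NAMESPACES="production,monitoring,logging"    # namespaces from the example above
RETENTION_DAYS=30

# Cron expression "0 2 * * *" means every day at 02:00.
run velero schedule create "$SCHEDULE_NAME" \
  --schedule "0 2 * * *" \
  --include-namespaces "$NAMESPACES" \
  --ttl "$(ttl_for_days "$RETENTION_DAYS")"
```

Retention in Velero is time-based rather than count-based, so "keep 30 daily backups" is approximated here by a 30-day TTL on a once-a-day schedule.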
What it does NOT back up
Velero does not automatically back up:
- etcd directly
- cloud infrastructure (VPCs, Load Balancers, IAM, Security Groups)
- managed databases like RDS
- container images (they remain in your registry)
- external services
Those require separate backup strategies.
Storage providers
Velero supports many object storage backends:
- Amazon S3
- MinIO
- Azure Blob Storage
- Google Cloud Storage
- OCI Object Storage
- many S3-compatible systems
Persistent Volume backup methods
There are two main approaches.
1. CSI snapshots (preferred)
If your storage class supports the Container Storage Interface (CSI) snapshot API:
PVC
↓
VolumeSnapshot
↓
Cloud snapshot
Advantages:
- very fast
- incremental (depending on storage backend)
- cloud-native
- recommended
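For the CSI path, Velero looks for a VolumeSnapshotClass that matches the volume's CSI driver and carries the label velero.io/csi-volumesnapshot-class. A minimal sketch for the EBS CSI driver follows. It assumes the snapshot CRDs and snapshot controller are already installed; the class name ebs-velero-snapclass and the function name are hypothetical. By default the script only prints the manifest.

```shell
# Emit a VolumeSnapshotClass that Velero can select for EBS-backed PVCs.
snapshot_class_manifest() {
  cat <<'EOF'
apiVersion: snapshot.storage.k8s.io/v1
kind: VolumeSnapshotClass
metadata:
  name: ebs-velero-snapclass   # hypothetical name
  labels:
    velero.io/csi-volumesnapshot-class: "true"   # marks the class for Velero
driver: ebs.csi.aws.com
deletionPolicy: Retain   # keep the EBS snapshot if the VolumeSnapshot object is deleted
EOF
}

# Print the manifest by default; set DRY_RUN=0 to apply it to the current cluster.
if [ "${DRY_RUN:-1}" = "1" ]; then
  snapshot_class_manifest
else
  snapshot_class_manifest | kubectl apply -f -
fi
```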
2. Node Agent (formerly Restic)
If snapshots aren't available:
PVC
↓
Read filesystem
↓
Compress
↓
Upload to object storage
Advantages:
- works almost everywhere
- storage-independent
Disadvantages:
- slower
- consumes CPU and network bandwidth
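File-level backup has to be switched on for the volumes it should cover. The sketch below shows the two usual ways, assuming Velero was installed with the node agent enabled. The pod name app-0, the volume name data and the backup name production-fs are hypothetical, and the script only prints the commands unless DRY_RUN=0 is set.

```shell
# Print the command by default; set DRY_RUN=0 to run it against the current cluster.
run() {
  if [ "${DRY_RUN:-1}" = "1" ]; then
    printf '%s\n' "$*"
  else
    "$@"
  fi
}

NAMESPACE="production"   # namespace from the examples in this post
POD="app-0"              # hypothetical pod name
VOLUMES="data"           # hypothetical pod volume name(s), comma-separated

# Option 1 (opt-in): annotate a pod so only the named volumes get a file-level backup.
run kubectl -n "$NAMESPACE" annotate "pod/$POD" \
  "backup.velero.io/backup-volumes=$VOLUMES"

# Option 2: use file-level backup for every pod volume included in this backup.
run velero backup create production-fs \
  --include-namespaces "$NAMESPACE" \
  --default-volumes-to-fs-backup
```

Option 1 keeps the slower file-level path limited to the volumes that really need it, which matters given the CPU and bandwidth cost noted above.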
Example architecture in AWS
                 Amazon EKS
                     |
         +-----------+-----------+
         |                       |
  Kubernetes API        Persistent Volumes
         |                       |
   Velero Server           EBS Snapshots
         |
         |
     S3 Bucket
      backups/
Example installation
Install the CLI:
brew install velero
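Deploy into an EKS cluster. velero install needs either --secret-file or --no-secret; the credentials file and plugin tag below are examples, so pin the plugin to a release that matches your Velero version:
velero install \
--provider aws \
--plugins velero/velero-plugin-for-aws:v1.10.0 \
--bucket my-backups \
--secret-file ./credentials-velero \
--backup-location-config region=eu-west-2 \
--snapshot-location-config region=eu-west-2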
Example backup
Back up an entire cluster:
velero backup create full-cluster
Back up a namespace:
velero backup create production \
--include-namespaces production
Example restore
velero restore create \
--from-backup full-cluster
Scheduling
Create a nightly backup:
velero schedule create nightly \
--schedule="0 2 * * *"
When should you use Velero?
Velero is a good fit if you want to:
- Recover Kubernetes workloads after accidental deletion or cluster failure.
- Back up application manifests and, optionally, persistent data.
- Migrate workloads between Kubernetes clusters or cloud providers.
- Automate recurring backups with retention policies.
- Protect stateful applications running on Kubernetes.
If your applications also depend on external systems (for example, managed databases, message brokers, or cloud resources), Velero should be part of a broader disaster recovery strategy rather than the only backup solution.
Velero vs. etcd backup
| Feature | Velero | etcd backup |
|---|---|---|
| Kubernetes resources | ✅ | ✅ |
| Persistent volume data | ✅ | ❌ |
| Restore individual namespaces | ✅ | ❌ |
| Restore individual applications | ✅ | ❌ |
| Migrate between clusters | ✅ | Limited |
| Cloud agnostic | ✅ | Mostly |
| Disaster recovery for applications | ✅ | Partial |
For managed Kubernetes services such as Amazon Elastic Kubernetes Service (EKS), Velero is generally the preferred backup tool because it focuses on application-level recovery rather than restoring the control plane itself. In contrast, direct etcd backups are more common in self-managed Kubernetes clusters where you control the control plane.
Would we benefit from Velero if we already have everything in Terraform?
Only for state that lives in the cluster rather than in Terraform or Git:
- Secrets created in-cluster rather than from Git
  - If the source of truth is AWS Secrets Manager and the definitions are in Git, we recover every secret by redeploying external-secrets and waiting ~1 minute for re-sync. Velero adds little here.
- ConfigMaps created in-cluster rather than from Git
  - If they are 100% in Git, they can be recovered either via kubernetes_config_map Terraform resources (e.g. the kube-prometheus-stack one with computed IAM ARNs) or as static YAML applied via kubectl_manifest (e.g. MSK scrape targets). Nothing is created imperatively.
- Dynamically provisioned PVCs and the data on their volumes
- Operator-managed CRs and other objects mutated after (terraform) apply
Do we actually need cluster (k8s object definitions + storage) backups?
How does Velero's PVC backup differ from taking snapshots of all volumes associated with all PVCs in the cluster? What value does Velero bring here?
1. The snapshot↔object binding (the big one)
2. Consistency & point-in-time grouping
3. Discovery & scoping via the K8s API
4. Selective & cross-cluster restore
5. Lifecycle, catalog, portability
Can Velero be used for Kubernetes version upgrades?
- Pre-upgrade safety net (in place): take a full Velero backup immediately before an EKS control-plane version bump. If the upgrade goes wrong, we have a known-good restore point for workloads and PV data that is independent of the control-plane change. EKS does not support downgrading a control plane, so rolling back means restoring into a cluster that runs the previous version.
- Blue/green upgrade (migration): stand up a new cluster on the target version and restore the workloads into it with velero restore, then cut traffic over, instead of upgrading in place. Velero can remap storage classes on restore (configured through a ConfigMap in the Velero namespace) and drops node-specific fields such as a pod's node assignment, which makes this cross-cluster move practical. This is the cleaner pattern for risky major jumps.
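The pre-upgrade safety net can be scripted in a few lines. This is a sketch under assumptions: the pre-upgrade- naming scheme and the helper functions run and backup_name are invented here, and the script only prints the commands unless DRY_RUN=0 is set.

```shell
# Print the command by default; set DRY_RUN=0 to run it against the current cluster.
run() {
  if [ "${DRY_RUN:-1}" = "1" ]; then
    printf '%s\n' "$*"
  else
    "$@"
  fi
}

# Build a timestamped backup name (UTC), e.g. pre-upgrade-YYYYMMDD-HHMM.
backup_name() {
  printf 'pre-upgrade-%s' "$(date -u +%Y%m%d-%H%M)"
}

NAME="$(backup_name)"

# --wait blocks until the backup is no longer in progress,
# so the upgrade is not started while the backup is still running.
run velero backup create "$NAME" --wait

# Check the phase and any errors or warnings before touching the control plane.
run velero backup describe "$NAME" --details
```

Reading the describe output matters here: a backup that ended as PartiallyFailed or Failed is not a usable safety net, so the upgrade should wait until a backup shows Completed.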
Showcase: 4 Kubernetes clusters in AWS EKS
1. Disaster recovery
2. Namespace / PVC-level granular restore
3. Cluster migration & cloning
4. Scheduled backups with retention
5. Pre-upgrade safety net
How it fits this AWS setup
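- Plugin: velero-plugin-for-aws:
  - object store = S3
  - volume snapshotter = EBS
- Auth: IRSA (IAM Roles for Service Accounts) per cluster, with no static keys. This should ideally match how the rest of the stack is wired.
- CSI snapshots: since you're on EBS, enable Velero's CSI snapshot support so PV backups are real EBS snapshots, not just file copies. The separate CSI snapshot data mover does a different job: it copies snapshot data into object storage, which is useful when backups must survive outside the EBS snapshot store (for example, in another region or account).
- One S3 bucket with a per-cluster prefix (or one bucket per cluster), with versioning enabled. Let Velero's TTL expire backups, and limit any S3 lifecycle rule to noncurrent object versions so it does not delete data Velero still references.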
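Put together, a per-cluster install along these lines might look like the sketch below. The cluster name, account ID and IAM role ARN are hypothetical, the plugin tag is only an example, and velero_install_irsa and run are helper names invented here. The script prints the command unless DRY_RUN=0 is set.

```shell
# Print the command by default; set DRY_RUN=0 to run it against the current cluster.
run() {
  if [ "${DRY_RUN:-1}" = "1" ]; then
    printf '%s\n' "$*"
  else
    "$@"
  fi
}

REGION="eu-west-2"     # region used in the install example
BUCKET="my-backups"    # bucket used in the install example
PLUGIN_IMAGE="velero/velero-plugin-for-aws:v1.10.0"   # example tag; match it to your Velero version

# Install Velero for one cluster: $1 = cluster name (used as bucket prefix), $2 = IAM role ARN.
velero_install_irsa() {
  cluster="$1"
  role_arn="$2"
  # --no-secret: no static keys; the service account annotation lets IRSA supply credentials.
  run velero install \
    --provider aws \
    --plugins "$PLUGIN_IMAGE" \
    --bucket "$BUCKET" \
    --prefix "$cluster" \
    --backup-location-config "region=$REGION" \
    --snapshot-location-config "region=$REGION" \
    --no-secret \
    --sa-annotations "eks.amazonaws.com/role-arn=$role_arn" \
    --features=EnableCSI
}

# Hypothetical cluster name and role ARN.
velero_install_irsa "cluster-a" "arn:aws:iam::111122223333:role/velero-cluster-a"
```

The IAM role itself, with its trust policy for the cluster's OIDC provider and its S3 and EBS permissions, is assumed to exist already, for example created by the same Terraform that builds the cluster.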
Honest caveats
- It is not an etcd backup — EKS already manages the control plane / etcd for you, so that's fine, but don't think of Velero as "backing up the cluster brain."
- It is not a substitute for application-level DB backups (RDS snapshots, Mongo dumps). Treat Velero as the Kubernetes object + PV layer; keep your existing DB backup story separate.
- Restores of stateful apps need testing — ordering, hooks, and re-attach behaviour matter. A backup you've never test-restored isn't a backup.
How does Velero's backup of k8s objects differ from kubectl get all --all-namespaces -o yaml > "$backup_file_name"?
1. kubectl get all doesn't actually get "all"
2. The biggest gap: no volume data
3. The dump isn't restorable as-is
4. Restore ordering & dependencies
5. Selective / cross-cluster restore
  - restore one namespace
  - remap namespace names
  - filter by label
  - restore into a different cluster (this is what makes the blue/green K8s-upgrade story work)
  - skip resources that already exist in the target cluster (the default) or update them
  - run restore hooks
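Several of these options can be combined in a single restore. A minimal sketch: the restore name, target namespace and label app=web are hypothetical, restore_copy and run are helper names invented here, and the script only prints the command unless DRY_RUN=0 is set.

```shell
# Print the command by default; set DRY_RUN=0 to run it against the current cluster.
run() {
  if [ "${DRY_RUN:-1}" = "1" ]; then
    printf '%s\n' "$*"
  else
    "$@"
  fi
}

# Restore one namespace from a backup under a different name, limited to one label.
# $1 = backup name, $2 = source namespace, $3 = target namespace, $4 = label selector
restore_copy() {
  run velero restore create "$3-from-$1" \
    --from-backup "$1" \
    --include-namespaces "$2" \
    --namespace-mappings "$2:$3" \
    --selector "$4" \
    --existing-resource-policy none
}

# Hypothetical values: copy the app=web objects from production into production-restored.
restore_copy "production-backup" "production" "production-restored" "app=web"
```

With --existing-resource-policy none, objects already present in the target are left untouched; update asks Velero to patch them to match the backup instead. Restoring into a scratch namespace like this is also a cheap way to test that a backup actually restores.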
