Tips: How to Resolve Most “PartiallyFailed” Errors in Velero Volume Backups
In some environments, Velero backups frequently fail to capturing some pods peristent volumes, resulting in a “PartiallyFailed” status. This issue can be critical when volume-level backups are required on mission critical enviroments. It tends to occurs most frequently during full cluster backups that demand reliable volume coverage and consistency.
After testing with Tanzu Kubernetes Grid (TKG) 2.5, I’ve found a reliable workaround that ensures successful full-cluster backups, even including persistent volumes and works every time.
This solution is not compatible with backups managed via Tanzu Mission Control (TMC). Applying this fix will break TMC integration.
Solution: Patch Velero Node Agents
To use this solution, Velero must be installed with the --use-node-agent flag. This allows direct control over the node-agent DaemonSet.
Edit the Node Agent DaemonSet
kubectl -n velero edit ds node-agent
Add Tolerations to Enable Scheduling on Control Plane Nodes
Insert the following under .spec.template.spec:
tolerations:
- effect: NoSchedule
key: node-role.kubernetes.io/control-plane
operator: Exists
- effect: NoSchedule
key: node-role.kubernetes.io/master
operator: Exists
Now run a full cluster backup to test that all components, including peristent volumes are backuped successfully.
Member discussion