Maintenance

Regular Maintenance Tasks

This guide covers scheduled maintenance tasks to keep the homelab running smoothly, securely, and reliably.

Maintenance Approach

The homelab is designed for minimal maintenance through:

Automated updates where safe
Self-healing components (ArgoCD, Kubernetes, Ceph)
Infrastructure as code (reproducible)
GitOps (declarative, auditable)

However, manual oversight ensures:

Security updates are applied
Capacity is managed proactively
Configuration drift is detected
Backups are verified

Daily Tasks (Automated)

These tasks run automatically and require no intervention:

Task	Automation	Verification
etcd snapshots	K3s built-in (daily 12:00 AM)	Check `/var/lib/rancher/k3s/server/db/snapshots/`
Certificate renewal	cert-manager (30 days before expiry)	Check `kubectl get certificates -A`
DNS synchronization	external-dns (on ingress change)	Check MikroTik DNS records
ArgoCD sync	ArgoCD (every 3 minutes)	Check `argocd app list`
Metrics collection	Victoria Metrics (every 30s)	Check Grafana dashboards
Ceph rebalancing	Rook operator (on OSD/node changes)	Check `kubectl -n rook-ceph get cephcluster`

Weekly Tasks (Manual)

Monday: Review & Triage

Goal: Identify issues before they become problems


# 1. Check cluster health
kubectl get nodes
kubectl get pods --all-namespaces | grep -Ev 'Running|Completed'
 
# 2. Review ArgoCD application status
argocd app list
# All apps should be: Healthy, Synced
 
# 3. Check Grafana dashboards
open https://grafana.homelab.int.zengarden.space
# Review:
# - Node resource utilization (CPU, RAM, storage)
# - Pod restart rates
# - Ingress error rates
 
# 4. Review AlertManager
open https://alerts.homelab.int.zengarden.space
# Any firing alerts?
 
# 5. Check Gitea sync job logs
kubectl -n gitea logs job/gitea-sync --tail=100
# Verify GitHub<->Gitea sync is healthy

Expected Duration: 15-20 minutes
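
A plain grep on the STATUS text in step 1 cannot see pods that are Running but not fully ready (for example READY 1/2). The following is a minimal sketch of a stricter filter, assuming the default column layout of `kubectl get pods --all-namespaces` (NAMESPACE, NAME, READY, STATUS). The function name `unhealthy_pods` is illustrative and not part of the homelab repository.

```shell
# Print pods that are not fully healthy.
# Reads `kubectl get pods --all-namespaces` output on stdin.
unhealthy_pods() {
  awk '
    NR == 1 { next }                 # skip the header row
    $4 == "Completed" { next }       # finished Job pods are expected
    {
      split($3, ready, "/")          # READY column, e.g. "1/2"
      if ($4 != "Running" || ready[1] != ready[2])
        print $1 "/" $2, $3, $4
    }'
}

# Usage:
#   kubectl get pods --all-namespaces | unhealthy_pods
```

An empty result means every pod is either Completed or Running with all containers ready.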

Wednesday: Security Review

Goal: Ensure security posture remains strong


# 1. Check certificate expiration
kubectl get certificates --all-namespaces -o json | \
  jq -r '.items[] | select(.status.notAfter != null) | "\(.metadata.namespace)/\(.metadata.name) \(.status.notAfter)"' | \
  while read -r name expiry; do
    # notAfter is an RFC 3339 timestamp (it contains colons), so the fields are space-separated
    days=$(( ( $(date -d "$expiry" +%s) - $(date +%s) ) / 86400 ))
    if [ "$days" -lt 30 ]; then
      echo "⚠️  $name: $expiry ($days days remaining)"
    else
      echo "✅ $name: $expiry ($days days remaining)"
    fi
  done
 
# 2. Review Kubernetes audit logs (recent suspicious activity)
ssh <user>@<node>
sudo cat /var/log/kubernetes/audit.log | jq 'select(.responseStatus.code >= 400)' | tail -20
 
# 3. Check for failed authentication attempts
sudo journalctl -u k3s | grep "authentication failed" | tail -20
 
# 4. Review RBAC denials
kubectl get events --all-namespaces | grep "Forbidden"
 
# 5. Check for Pod Security violations
kubectl get events --all-namespaces | grep "PodSecurity"

Expected Duration: 10-15 minutes

Friday: Capacity Planning

Goal: Ensure resources are not nearing limits


# 1. Node resource utilization
kubectl top nodes
# Warning if any node >70% CPU or >80% RAM sustained
 
# 2. Storage utilization
kubectl -n rook-ceph exec -it $(kubectl -n rook-ceph get pod -l app=rook-ceph-tools -o jsonpath='{.items[0].metadata.name}') -- ceph df
 
# 3. PVC usage
kubectl get pvc --all-namespaces -o json | \
  jq -r '.items[] | "\(.metadata.namespace)/\(.metadata.name): \(.status.capacity.storage)"'
 
# 4. Check for evicted pods (indicates resource pressure)
kubectl get pods --all-namespaces --field-selector status.phase=Failed | grep Evicted
 
# 5. Review resource requests/limits
kubectl get pods --all-namespaces -o json | \
  jq -r '.items[] | "\(.metadata.namespace)/\(.metadata.name): CPU request=\(.spec.containers[0].resources.requests.cpu // "none"), RAM request=\(.spec.containers[0].resources.requests.memory // "none")"'

Expected Duration: 15-20 minutes
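
The thresholds in step 1 (70% CPU, 80% RAM) can be checked mechanically. The sketch below assumes the usual five-column output of `kubectl top nodes`, with the CPU percentage in column 3 and the memory percentage in column 5. The function name `flag_hot_nodes` is illustrative. It only sees a point-in-time sample, so "sustained" still has to be judged from Grafana.

```shell
# Flag nodes above the CPU/RAM thresholds.
# Reads `kubectl top nodes` output on stdin.
# Arguments (optional): max CPU percent, max memory percent.
flag_hot_nodes() {
  awk -v cpu_max="${1:-70}" -v mem_max="${2:-80}" '
    NR == 1 { next }                       # skip the header row
    {
      cpu = $3; mem = $5
      sub(/%/, "", cpu); sub(/%/, "", mem) # strip the percent sign
      if (cpu + 0 > cpu_max || mem + 0 > mem_max)
        printf "WARN %s cpu=%s%% mem=%s%%\n", $1, cpu, mem
    }'
}

# Usage:
#   kubectl top nodes | flag_hot_nodes
#   kubectl top nodes | flag_hot_nodes 60 75   # stricter thresholds
```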

Monthly Tasks (Manual)

First Monday: Software Updates

Goal: Keep software current with security patches

1. Update K3s


# Check current version
kubectl version
 
# Check latest K3s version
curl -s https://api.github.com/repos/k3s-io/k3s/releases/latest | grep tag_name
 
# Update Ansible playbook
cd ../ansible/install-k3s
nano install.yaml
# Edit: k3s_version: v1.32.5+k3s1 (example)
 
# Update cluster (one node at a time for HA)
ansible-playbook -i hosts.yaml install.yaml --limit blade001
# Wait for blade001 to rejoin
ansible-playbook -i hosts.yaml install.yaml --limit blade002
# ... continue for all nodes
 
# Verify all nodes updated
kubectl get nodes -o wide

Expected Duration: 1-2 hours
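
The one-node-at-a-time sequence above can be wrapped in a loop that waits for each node to report Ready before moving on. This is a sketch under assumptions: the function name `rolling_update` and the `DRY_RUN` variable are illustrative, and the node names follow the blade00N pattern used in this guide. Note that `kubectl wait` returns immediately if the node still reports Ready from before the restart, so a short pause or a version check may be needed as well.

```shell
# Run the K3s playbook node by node, waiting for Ready in between.
# Set DRY_RUN=1 to print the commands instead of running them.
rolling_update() {
  local run="" node
  if [ "${DRY_RUN:-0}" = "1" ]; then
    run="echo"
  fi
  for node in "$@"; do
    $run ansible-playbook -i hosts.yaml install.yaml --limit "$node" || return 1
    $run kubectl wait --for=condition=Ready "node/$node" --timeout=300s || return 1
  done
}

# Usage:
#   DRY_RUN=1 rolling_update blade001 blade002   # preview
#   rolling_update blade001 blade002 blade003    # real run
```

The loop stops at the first failure, so a node that does not come back leaves the remaining nodes untouched.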

2. Update Helm Charts


cd ../helmfile
 
# Update chart versions in helmfile.yaml.gotmpl files
# Example: cert-manager
cd cert-manager
nano helmfile.yaml.gotmpl
# Edit: version: 1.18.2 → 1.19.0
 
# Check diff
helmfile diff
 
# Apply update
helmfile apply
 
# Verify
kubectl -n cert-manager get pods
 
# Repeat for other components:
# - argocd
# - gitea
# - victoria-metrics
# - metabase
# - external-dns
# - ingress-nginx

Expected Duration: 2-3 hours

3. Update OS Packages


# Update all nodes
for i in {11..15}; do
  echo "Updating blade$(printf %03d $(($i - 10)))..."
  ssh <user>@<node-subnet>.$i 'sudo apt update && sudo apt upgrade -y && sudo apt autoremove -y'
done
 
# Reboot nodes if kernel updated (one at a time for HA)
ssh <user>@<node> 'sudo reboot'
# Wait 5 minutes, verify node back
kubectl get nodes
# Continue for other nodes

Expected Duration: 1-2 hours
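
Whether a reboot is needed can usually be read from a marker file rather than guessed. On Debian and Ubuntu systems, package scripts typically create `/var/run/reboot-required` after a kernel or core library update; this depends on the installed tooling, so treat it as an assumption to verify on your nodes. The function name `needs_reboot` is illustrative.

```shell
# Succeeds when the reboot marker file exists.
# The optional path argument makes the check easy to try locally.
needs_reboot() {
  [ -e "${1:-/var/run/reboot-required}" ]
}

# Usage on a node:
#   needs_reboot && echo "reboot needed"
# Usage over SSH (<user>@<node> is a placeholder):
#   ssh <user>@<node> 'test -e /var/run/reboot-required' && echo "reboot needed"
```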

Second Monday: Backup Verification

Goal: Ensure backups are restorable

1. Verify etcd Snapshots


# SSH to master node
ssh <user>@<master-node>
 
# List snapshots
sudo k3s etcd-snapshot ls
 
# Verify recent snapshots exist (daily)
# Expected: Snapshots from last 7+ days
 
# Test restore to temp cluster (optional, advanced)
# DO NOT run on production cluster!
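
The "recent snapshots exist" check can also be done by counting files in the snapshot directory listed under Daily Tasks. A minimal sketch, assuming snapshots are stored as plain files in that directory and that GNU or BSD `find` is available; the function name `recent_snapshot_count` is illustrative. The directory is normally readable only by root, so run it with sudo on the node.

```shell
# Count snapshot files modified within the last N days (default 7).
# Arguments (optional): snapshot directory, number of days.
recent_snapshot_count() {
  local dir="${1:-/var/lib/rancher/k3s/server/db/snapshots}" days="${2:-7}"
  find "$dir" -maxdepth 1 -type f -mtime "-$days" | wc -l
}

# Usage:
#   recent_snapshot_count                       # snapshots from the last 7 days
#   [ "$(recent_snapshot_count "$dir" 1)" -ge 1 ] || echo "no snapshot in the last 24h"
```

How many snapshots to expect depends on the configured schedule and retention, so compare the count against your own K3s settings.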

2. Test Application Backup/Restore


# Create test namespace
kubectl create namespace backup-test
 
# Deploy test application
kubectl -n backup-test create deployment nginx --image=nginx
kubectl -n backup-test expose deployment nginx --port=80
 
# Verify running
kubectl -n backup-test get pods
 
# Delete namespace
kubectl delete namespace backup-test
 
# Recreate from scratch (simulates restore)
kubectl create namespace backup-test
kubectl -n backup-test create deployment nginx --image=nginx
kubectl -n backup-test expose deployment nginx --port=80
 
# Verify working
kubectl -n backup-test get pods

3. Backup Configuration Files


# Back up env.yaml files (configuration only, no secrets)
cd ../helmfile
tar -czf ~/homelab-env-$(date +%Y%m%d).tar.gz */env.yaml
 
# Backup integrations.yaml (CONTAINS SECRETS - encrypt!)
kubectl -n integrations get secret integrations -o yaml > ~/integrations-backup-$(date +%Y%m%d).yaml
# IMPORTANT: Store securely (password manager, encrypted drive)
 
# Backup master password (for DerivedSecrets)
kubectl -n derived-secret-operator get secret master-password -o jsonpath="{.data.password}" | base64 -d
# IMPORTANT: Store in password manager
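
The secret export above writes plaintext YAML to disk. One way to follow the "encrypt!" advice is to pipe the output through a symmetric cipher so the plaintext never lands in a file. The sketch below assumes OpenSSL 1.1.1 or newer (for `-pbkdf2`); the function names and the `BACKUP_PASSPHRASE` variable are illustrative, and other tools (gpg, age) would work just as well.

```shell
# Encrypt stdin to the file named in $1.
# The passphrase is read from the BACKUP_PASSPHRASE environment variable.
encrypt_backup() {
  openssl enc -aes-256-cbc -pbkdf2 -salt -pass env:BACKUP_PASSPHRASE -out "$1"
}

# Decrypt the file named in $1 to stdout.
decrypt_backup() {
  openssl enc -d -aes-256-cbc -pbkdf2 -pass env:BACKUP_PASSPHRASE -in "$1"
}

# Usage:
#   export BACKUP_PASSPHRASE='...'   # take this from your password manager
#   kubectl -n integrations get secret integrations -o yaml | \
#     encrypt_backup ~/integrations-backup-$(date +%Y%m%d).yaml.enc
#   decrypt_backup ~/integrations-backup-<date>.yaml.enc | kubectl apply -f -
```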

Third Monday: Security Audit

Goal: Review and update security policies

1. Review RBAC Permissions


# List all ClusterRoleBindings
kubectl get clusterrolebindings
 
# Review admin access (should be limited)
kubectl get clusterrolebindings -o json | \
  jq -r '.items[] | select(.roleRef.name=="cluster-admin") | "\(.metadata.name): \(.subjects)"'
 
# Expected: Only your OIDC user should have cluster-admin
 
# Review service account permissions
kubectl get clusterrolebindings -o json | \
  jq -r '.items[] | select(any(.subjects[]?; .kind=="ServiceAccount")) | "\(.metadata.name): \(.subjects)"'

2. Review Firewall Rules


# SSH to MikroTik
ssh <user>@<mikrotik-host>
 
# List firewall rules
/ip firewall filter print
 
# Verify firewall rules are properly configured for network segmentation

3. Review Network Policies


# List all NetworkPolicies
kubectl get networkpolicies --all-namespaces
 
# Verify isolation
kubectl -n production get networkpolicy
 
# Test connectivity (should be blocked)
kubectl run test -it --rm --image=alpine/curl -- \
  curl -m 5 http://service-in-another-namespace.another-namespace.svc.cluster.local
# Expected: Timeout (blocked by NetworkPolicy)

4. Check for Security Updates


# Check for CVEs in Kubernetes
curl -s https://kubernetes.io/docs/reference/issues-security/official-cve-feed/index.json | jq
 
# Check for CVEs in container images (using Trivy)
# Install Trivy
wget https://github.com/aquasecurity/trivy/releases/download/v0.50.0/trivy_0.50.0_Linux-ARM64.deb
sudo dpkg -i trivy_0.50.0_Linux-ARM64.deb
 
# Scan images
trivy image gitea.homelab.int.zengarden.space/zengarden-space/my-app:latest

Fourth Monday: Cleanup & Optimization

Goal: Remove unused resources, optimize performance

1. Clean Up Unused Resources


# Remove unused container images on nodes
for i in {11..15}; do
  echo "Cleaning blade$(printf %03d $(($i - 10)))..."
  ssh <user>@<node-subnet>.$i 'sudo k3s crictl rmi --prune'
done
 
# Remove unused PVCs (manual review)
kubectl get pvc --all-namespaces
# Delete unused:
# kubectl delete pvc <pvc-name> -n <namespace>
 
# Remove completed jobs older than 7 days
kubectl get jobs --all-namespaces -o json | \
  jq -r '.items[] | select(.status.succeeded==1 and .status.completionTime != null and (.status.completionTime | fromdateiso8601) < (now - 604800)) | "\(.metadata.namespace) \(.metadata.name)"' | \
  while read -r ns name; do
    kubectl delete job "$name" -n "$ns"
  done

2. Optimize Resource Requests/Limits


# Review actual vs requested resources
kubectl top pods --all-namespaces
 
# Compare to requests
kubectl get pods --all-namespaces -o json | \
  jq -r '.items[] | "\(.metadata.namespace)/\(.metadata.name): request=\(.spec.containers[0].resources.requests.cpu // "none")/\(.spec.containers[0].resources.requests.memory // "none")"'
 
# Adjust requests/limits in manifests if needed
# Rule of thumb:
# - Request = 75% of observed usage
# - Limit = 150% of observed usage
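
The rule of thumb above is simple integer arithmetic. As a worked example, the sketch below turns an observed usage figure into suggested values; the function name `suggest_resources` is illustrative, and the unit suffix is passed through unchanged.

```shell
# Apply the rule of thumb: request = 75%, limit = 150% of observed usage.
# Arguments: observed usage (integer), unit suffix (default "m" for millicores).
suggest_resources() {
  local observed="$1" unit="${2:-m}"
  echo "request=$(( observed * 75 / 100 ))${unit} limit=$(( observed * 150 / 100 ))${unit}"
}

# Usage:
#   suggest_resources 200      # observed 200m CPU  -> request=150m limit=300m
#   suggest_resources 512 Mi   # observed 512Mi RAM -> request=384Mi limit=768Mi
```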

3. Review and Update Monitoring Dashboards


# Access Grafana
open https://grafana.homelab.int.zengarden.space
 
# Review dashboards:
# - Are metrics still relevant?
# - Add new panels for new services
# - Remove panels for removed services

Quarterly Tasks (Manual)

Full Security Audit

Goal: Comprehensive security review

Penetration Testing (optional)
- Use tools like kube-bench for CIS Kubernetes Benchmark
- Run kube-hunter for attack surface analysis
Review Threat Model
- Update threat model document
- Review mitigation strategies
- Identify new threats
Security Compliance Check
- OWASP Kubernetes Top 10
- CIS Kubernetes Benchmark
- Pod Security Standards

Performance Testing

Goal: Identify performance bottlenecks

Load Testing
- Use k6 or locust to simulate load
- Identify resource bottlenecks
- Optimize as needed
Storage Performance
- Test Ceph read/write performance
- Use fio for benchmarking
- Compare to baseline
Network Performance
- Test pod-to-pod latency
- Test ingress throughput
- Identify bottlenecks

Documentation Review

Goal: Keep documentation current

Update Architecture Diagrams
- Reflect current state
- Add new components
- Remove deprecated components
Update Runbooks
- Review troubleshooting procedures
- Add new procedures for new services
- Remove outdated procedures
Review This Maintenance Guide
- Are tasks still relevant?
- Add new tasks as needed
- Remove obsolete tasks

Incident-Driven Maintenance

Post-Incident Review

After any significant incident:

Document Timeline
- When did it start?
- When was it detected?
- When was it resolved?
Root Cause Analysis
- What was the root cause?
- Why wasn’t it caught earlier?
Preventive Measures
- What changes prevent recurrence?
- Add monitoring/alerting
- Update runbooks
Update Documentation
- Add to troubleshooting guide
- Update architecture if changed
- Share learnings

Maintenance Calendar

Example maintenance schedule:


Week 1:
  Mon: Review & Triage (weekly)
  Mon: Software Updates (monthly)
  Wed: Security Review (weekly)
  Fri: Capacity Planning (weekly)

Week 2:
  Mon: Review & Triage (weekly)
  Mon: Backup Verification (monthly)
  Wed: Security Review (weekly)
  Fri: Capacity Planning (weekly)

Week 3:
  Mon: Review & Triage (weekly)
  Mon: Security Audit (monthly)
  Wed: Security Review (weekly)
  Fri: Capacity Planning (weekly)

Week 4:
  Mon: Review & Triage (weekly)
  Mon: Cleanup & Optimization (monthly)
  Wed: Security Review (weekly)
  Fri: Capacity Planning (weekly)

Quarter End:
  Full Security Audit
  Performance Testing
  Documentation Review
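
The monthly rotation in this calendar follows from the date alone: the Nth Monday of the month selects the Nth monthly task. A minimal sketch, assuming GNU `date` (for `-d` and `%-d`); the function name `monthly_task` is illustrative. A fifth Monday has no monthly task assigned.

```shell
# Print the monthly task scheduled for a given date (default: today).
# Argument (optional): date in YYYY-MM-DD form.
monthly_task() {
  local day="${1:-$(date +%F)}" dow dom
  dow="$(date -d "$day" +%u)"    # day of week, 1 = Monday
  dom="$(date -d "$day" +%-d)"   # day of month without leading zero
  if [ "$dow" -ne 1 ]; then
    echo "none"
    return
  fi
  case $(( (dom - 1) / 7 + 1 )) in   # which Monday of the month this is
    1) echo "Software Updates" ;;
    2) echo "Backup Verification" ;;
    3) echo "Security Audit" ;;
    4) echo "Cleanup & Optimization" ;;
    *) echo "none" ;;
  esac
}

# Usage:
#   monthly_task              # task for today
#   monthly_task 2025-01-13   # second Monday of January 2025
```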

Maintenance Checklist Template

Copy this for each maintenance session:


# Maintenance - [DATE]
 
## Pre-Maintenance
- [ ] Notify users (if applicable)
- [ ] Review last maintenance notes
- [ ] Backup etcd snapshot
 
## Tasks
- [ ] Check cluster health
- [ ] Review ArgoCD status
- [ ] Review Grafana dashboards
- [ ] Review AlertManager
- [ ] [Add task-specific items]
 
## Issues Found
- [ ] Issue 1: [description]
- [ ] Issue 2: [description]
 
## Actions Taken
- [ ] Action 1: [description]
- [ ] Action 2: [description]
 
## Post-Maintenance
- [ ] Verify cluster health
- [ ] Update documentation
- [ ] Schedule follow-up if needed
 
## Notes
[Additional notes]

Automation Opportunities

Consider automating:

Weekly health report
- Script to check cluster health
- Email/notify summary
Certificate expiration alerts
- Already handled by AlertManager
- Could add email notifications
Backup verification
- Automated etcd restore test (separate cluster)
Dependency updates
- Renovate bot for Helm charts
- Dependabot for application dependencies

Next Steps

Review Monitoring for dashboard and alert setup
See main Operations Guide for troubleshooting procedures

Regular maintenance ensures long-term reliability, security, and performance of the homelab platform.