# Tools & Technology Selection

## Tool Selection Criteria
Every tool in this homelab was chosen based on these criteria:
- Production-Grade: Must be used in enterprise environments
- Open Source: Prefer FOSS for learning and customization
- Active Community: Strong community support and documentation
- Resource Efficiency: Must run on ARM64 with limited RAM
- Learning Value: Should teach transferable skills
## Infrastructure as Code

### Ansible
Purpose: Bare metal automation and OS-level configuration
Why Ansible:
| Criterion | Evaluation |
|---|---|
| Agentless | ⭐⭐⭐⭐⭐ SSH-based, no agents to manage |
| Idempotent | ⭐⭐⭐⭐ Safe to re-run playbooks |
| Learning Curve | ⭐⭐⭐⭐ YAML-based, human-readable |
| Community | ⭐⭐⭐⭐⭐ Massive ecosystem, roles, modules |
| ARM64 Support | ⭐⭐⭐⭐⭐ Python-based, platform agnostic |
Alternatives Considered:
- Terraform: Better for cloud, but overkill for bare metal SSH tasks
- SaltStack: Requires agent installation
- Chef/Puppet: Overly complex for homelab scale
Use Cases in This Homelab:
- Partitioning NVMe drives for local-path
- Installing and configuring K3s
- Deploying restrictive HTTP proxy
- Setting up SSH keys across nodes
- Configuring systemd services
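As a sketch of what these playbooks look like (host group, device path, and mount point are hypothetical), an idempotent partitioning task might be written as:

```yaml
# Hypothetical sketch: idempotent NVMe partitioning for local-path storage.
# Safe to re-run: parted and mount only change state when needed.
- hosts: blades
  become: true
  tasks:
    - name: Create data partition on NVMe drive
      community.general.parted:
        device: /dev/nvme0n1
        number: 1
        state: present
        fs_type: ext4

    - name: Mount partition for local-path storage
      ansible.posix.mount:
        path: /var/lib/rancher/local-path
        src: /dev/nvme0n1p1
        fstype: ext4
        state: mounted
```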
Key Ansible Files:
```text
ansible/
├── install-k3s/
│   ├── install.yaml          # Main K3s installation playbook
│   ├── hosts.yaml            # Inventory (blade001-005)
│   └── .env                  # Google OIDC credentials
├── install-restrictive-proxy/
│   ├── install.yaml          # Proxy deployment playbook
│   └── server.js             # Node.js proxy implementation
└── partition-nvme-drives/
    └── partition.yaml        # NVMe partitioning for etcd + local-path
```

### Helmfile
Purpose: Kubernetes infrastructure orchestration and deployment
Why Helmfile over Plain Helm:
| Feature | Helmfile | Plain Helm |
|---|---|---|
| Multi-release management | ✅ Single file | ❌ Manual scripting |
| Dependency ordering | ✅ Automatic | ❌ Manual |
| Environment management | ✅ Built-in | ❌ Values files |
| Templating | ✅ Go templates | ⚠️ Limited |
| Declarative | ✅ GitOps-friendly | ⚠️ Imperative |
Why NOT Terraform/Crossplane:
- Terraform requires state management (adds complexity)
- Crossplane heavier than needed for homelab
- Helmfile simpler, Helm-native, sufficient for this scale
Architecture:

```text
helmfile/
├── helmfile.yaml             # Root orchestration
├── integrations.yaml         # Shared credentials
└── */
    ├── helmfile.yaml.gotmpl  # Component-specific helmfile
    ├── env.yaml              # Component environment vars
    └── charts/               # Custom charts
```

Key Features Used:
- Dependency ordering: `needs: [metallb-system, secrets-system]`
- Go templating: Dynamic values from `.env` files
- Namespace management: Automatic namespace creation
- Hook support: Pre/post-install hooks
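A minimal sketch of a release definition using these features (release names and the IP value are illustrative, not taken from the actual helmfiles):

```yaml
# Hypothetical sketch: a component helmfile with dependency ordering.
# ingress-nginx only installs after MetalLB is ready.
releases:
  - name: ingress-nginx
    namespace: ingress-nginx
    chart: ingress-nginx/ingress-nginx
    needs:
      - metallb-system/metallb   # namespace/release that must deploy first
    values:
      - controller:
          service:
            loadBalancerIP: 192.168.77.200
```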
## GitOps & Continuous Deployment

### ArgoCD
Purpose: Declarative GitOps continuous deployment
Why ArgoCD:
| Feature | Benefit |
|---|---|
| Declarative | Git as single source of truth |
| Automatic sync | Self-healing on configuration drift |
| Multi-cluster | Supports multiple K8s clusters (future) |
| RBAC & SSO | Google OIDC integration |
| ApplicationSet | Automatic app discovery from Git |
| UI + CLI | Great visibility and debugging |
Alternatives Considered:
- FluxCD: More Kubernetes-native, but less mature UI
- Jenkins X: Opinionated CI/CD, heavier resource footprint
- Spinnaker: Enterprise-grade but over-engineered for homelab
Configuration Highlights:

```yaml
argocd:
  server:
    config:
      url: https://argocd.homelab.int.zengarden.space
      dex.config: |
        connectors:
          - type: oidc
            id: google
            name: Google
            config:
              issuer: https://accounts.google.com
              clientID: $GOOGLE_CLIENT_ID
              clientSecret: $GOOGLE_CLIENT_SECRET
    rbacConfig:
      policy.csv: |
        g, [email protected], role:admin
        g, role:readonly, role:readonly
```

ApplicationSet for Auto-Discovery:
```yaml
apiVersion: argoproj.io/v1alpha1
kind: ApplicationSet
metadata:
  name: applications
spec:
  generators:
    - git:
        repoURL: https://gitea.homelab.int.zengarden.space/zengarden-space/manifests.git
        revision: main
        directories:
          - path: manifests/*
  template:
    spec:
      source:
        repoURL: https://gitea.homelab.int.zengarden.space/zengarden-space/manifests.git
        targetRevision: main
        path: '{{path}}'
      destination:
        server: https://kubernetes.default.svc
        namespace: '{{path.basename}}'
      syncPolicy:
        automated:
          prune: true
          selfHeal: true
        syncOptions:
          - CreateNamespace=true
```

### Gitea
Purpose: Self-hosted Git server with CI/CD capabilities
Why Gitea:
| Criterion | Evaluation |
|---|---|
| Lightweight | ⭐⭐⭐⭐⭐ Single binary, ~50MB RAM |
| GitHub-compatible | ⭐⭐⭐⭐ API compatibility for tooling |
| Built-in CI/CD | ⭐⭐⭐⭐ Gitea Actions (GitHub Actions clone) |
| Self-hosted | ⭐⭐⭐⭐⭐ Full data ownership |
| ARM64 Support | ⭐⭐⭐⭐⭐ Native binaries |
Alternatives Considered:
- GitLab: Too resource-intensive (~4GB RAM minimum)
- Gogs: Gitea fork with fewer features
- GitHub: Cloud-hosted (doesn’t meet self-hosted goal)
Integration Flow:

```text
GitHub (zengarden-space org)
        │
        │ (Bidirectional sync via Gitea Automation chart)
        ▼
Gitea (gitea.homelab.int.zengarden.space/zengarden-space/)
        │
        │ (Git push event → webhook)
        ▼
Gitea Actions (CI/CD pipeline)
        │
        │ (Build + test + push to registry)
        ▼
manifests repository (gitea.homelab.int.zengarden.space/zengarden-space/manifests)
        │
        │ (ArgoCD watches)
        ▼
Kubernetes deployment
```

Gitea Actions Example:
```yaml
name: Build and Deploy
on: [push]
jobs:
  build:
    runs-on: ubuntu-latest
    steps:
      - uses: actions/checkout@v3
      - name: Build Docker image
        run: docker build -t gitea.homelab.int.zengarden.space/app:${{ github.sha }} .
      - name: Push to Gitea registry
        run: docker push gitea.homelab.int.zengarden.space/app:${{ github.sha }}
      - name: Update manifests
        run: |
          git clone https://gitea.homelab.int.zengarden.space/zengarden-space/manifests
          cd manifests/app
          kustomize edit set image app=gitea.homelab.int.zengarden.space/app:${{ github.sha }}
          git commit -am "Update app to ${{ github.sha }}"
          git push
```

## Networking & Ingress

### Cilium (CNI)
Purpose: Container Network Interface with eBPF-based networking
Why Cilium:
| Feature | Benefit |
|---|---|
| eBPF-based | Bypasses iptables, ~10-15% better performance |
| NetworkPolicy | Native K8s NetworkPolicy support |
| kube-proxy replacement | eBPF load balancing |
| Hubble | Network observability and flow visualization |
| Encryption | Optional WireGuard pod-to-pod encryption |
Alternatives Considered:
- Flannel: Simpler but fewer features (no NetworkPolicy)
- Calico: iptables-based, higher resource usage
- Weave: Deprecated
Key Configuration:

```yaml
cilium:
  k8sServiceHost: 192.168.77.170
  k8sServicePort: 6443
  kubeProxyReplacement: strict   # Replace kube-proxy with eBPF
  bpf:
    masquerade: true
  hubble:
    enabled: true
    ui:
      enabled: true
```

### MetalLB
Purpose: Load balancer for bare-metal Kubernetes
Why MetalLB:
| Criterion | Evaluation |
|---|---|
| Bare-metal support | ⭐⭐⭐⭐⭐ Purpose-built for bare metal |
| Layer 2 mode | ⭐⭐⭐⭐⭐ Simple ARP-based (no BGP needed) |
| IP pool management | ⭐⭐⭐⭐ IP address allocation |
| LoadBalancer services | ⭐⭐⭐⭐⭐ Native K8s LoadBalancer type |
Alternatives:
- K3s ServiceLB: Limited to single IP per service
- Kube-VIP: More complex setup
- External load balancer: Requires additional hardware
Configuration:

```yaml
apiVersion: metallb.io/v1beta1
kind: IPAddressPool
metadata:
  name: default
  namespace: metallb-system
spec:
  addresses:
    - 192.168.77.200-192.168.77.254
```

### ingress-nginx
Purpose: HTTP(S) ingress controller
Why ingress-nginx:
| Feature | Benefit |
|---|---|
| Mature | ⭐⭐⭐⭐⭐ Production-proven |
| Performance | ⭐⭐⭐⭐ Nginx-based |
| Features | ⭐⭐⭐⭐ WebSocket, TCP/UDP, ModSecurity |
| Community | ⭐⭐⭐⭐⭐ CNCF project |
Dual Ingress Strategy:

- Internal ingress (class: `internal`)
  - MetalLB IP: 192.168.77.200
  - Used for: Gitea, ArgoCD, Grafana, Metabase
  - No ModSecurity (trusted internal network)
- External ingress (class: `external`)
  - Cloudflare Tunnel backend
  - Used for: Public-facing services
  - ModSecurity WAF enabled
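As a sketch of how a service opts into the internal class (the hostname and service details are illustrative, not copied from the actual manifests):

```yaml
# Hypothetical sketch: an Ingress served by the internal controller
apiVersion: networking.k8s.io/v1
kind: Ingress
metadata:
  name: grafana
  namespace: victoria-metrics
spec:
  ingressClassName: internal   # routed via the MetalLB IP 192.168.77.200
  rules:
    - host: grafana.homelab.int.zengarden.space
      http:
        paths:
          - path: /
            pathType: Prefix
            backend:
              service:
                name: grafana
                port:
                  number: 80
```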
### cert-manager
Purpose: Automated TLS certificate management
Why cert-manager:
- ACME support: Let’s Encrypt integration
- DNS-01 challenge: Cloudflare DNS validation
- Internal CA: Custom CA for internal domains
- Automatic renewal: No manual certificate management
- CRD-based: Kubernetes-native
Configuration:

```yaml
apiVersion: cert-manager.io/v1
kind: ClusterIssuer
metadata:
  name: letsencrypt-prod
spec:
  acme:
    server: https://acme-v02.api.letsencrypt.org/directory
    email: [email protected]
    privateKeySecretRef:
      name: letsencrypt-prod
    solvers:
      - dns01:
          cloudflare:
            email: [email protected]
            apiTokenSecretRef:
              name: cloudflare-api-token
              key: api-token
```

### external-dns
Purpose: Automatic DNS record synchronization
Why external-dns:
- Watches Ingress/Service resources
- Creates DNS records automatically
- Supports multiple providers (Cloudflare, MikroTik via webhook)
- Eliminates manual DNS management
MikroTik Integration (via webhook):

```yaml
external-dns:
  provider:
    name: webhook
    webhook:
      image: ghcr.io/mirceanton/external-dns-provider-mikrotik:v1.4.4
      env:
        MIKROTIK_BASEURL: http://mikrotik-proxy.homelab.int.zengarden.space
        MIKROTIK_SKIP_TLS_VERIFY: "true"
  sources:
    - ingress
    - service
  domainFilters:
    - homelab.int.zengarden.space
```

## Secrets Management

### External Secrets Operator (ESO)
Purpose: Synchronize external secrets into Kubernetes
Why ESO:
- Fetches secrets from external sources (Kubernetes Secrets in the `integrations` namespace)
- Creates Kubernetes Secrets in target namespaces
- Centralized credential management
- Supports secret rotation
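On the consuming side, an ExternalSecret pulls a key out of the store into a local Secret. This is a sketch with hypothetical secret and key names:

```yaml
# Hypothetical sketch: consume a credential from the integrations store
apiVersion: external-secrets.io/v1beta1
kind: ExternalSecret
metadata:
  name: cloudflare-api-token
  namespace: argocd
spec:
  refreshInterval: 1h
  secretStoreRef:
    name: integrations
    kind: SecretStore
  target:
    name: cloudflare-api-token      # Secret created in this namespace
  data:
    - secretKey: api-token
      remoteRef:
        key: cloudflare-api-token   # source Secret in the integrations namespace
        property: api-token
```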
Configuration:

```yaml
apiVersion: external-secrets.io/v1beta1
kind: SecretStore
metadata:
  name: integrations
  namespace: argocd
spec:
  provider:
    kubernetes:
      server:
        url: "https://kubernetes.default.svc"
        caProvider:
          type: ConfigMap
          name: kube-root-ca.crt
          key: ca.crt
      auth:
        serviceAccount:
          name: external-secrets
      remoteNamespace: integrations
```

### DerivedSecrets Operator (Custom)
Purpose: Derive deterministic passwords from a master key using Argon2id
Why Custom Operator:
- Security: Single master password derives all component passwords
- Deterministic: Same input always produces same output
- Cryptographically secure: Argon2id is OWASP-recommended KDF
- Namespace isolation: Each component gets unique derived secrets
How It Works:

```text
Master Password (stored in derived-secret-operator namespace)
        │
        │ Argon2id(password, salt=context, iterations=4, memory=65536)
        ▼
DerivedSecret CRD:
  spec:
    password: 32    # Generate 32-character password
    apiToken: 48    # Generate 48-character API token
        │
        ▼
Kubernetes Secret:
  data:
    password: <base64-encoded derived password>
    apiToken: <base64-encoded derived token>
```

CRD Example:
```yaml
apiVersion: zengarden.space/v1
kind: DerivedSecret
metadata:
  name: gitea-admin
  namespace: gitea
spec:
  password: 32
  apiToken: 48
```

Alternatives Considered:
- Sealed Secrets: Encrypted secrets in Git (not deterministic, complex key management)
- HashiCorp Vault: Over-engineered for homelab, resource intensive
- SOPS: Good for GitOps but manual encryption/decryption
## Monitoring & Observability

### Victoria Metrics
Purpose: Metrics collection, storage, and querying
Why Victoria Metrics over Prometheus:
| Feature | Victoria Metrics | Prometheus |
|---|---|---|
| Resource usage | ⭐⭐⭐⭐⭐ 50% less RAM | ⭐⭐⭐ |
| Storage efficiency | ⭐⭐⭐⭐⭐ 7× compression | ⭐⭐⭐ |
| PromQL compatibility | ⭐⭐⭐⭐⭐ 100% compatible | ⭐⭐⭐⭐⭐ Native |
| High availability | ⭐⭐⭐⭐ Built-in clustering | ⭐⭐⭐ Requires federation |
| Performance | ⭐⭐⭐⭐⭐ Faster queries | ⭐⭐⭐⭐ |
Components:
- vmagent: Metrics collection (Prometheus scraping)
- vmstorage: Time-series storage
- vmselect: Query engine
- vmalert: Alerting rules
- vmauth: Authentication proxy
### Grafana
Purpose: Metrics visualization and dashboards
Why Grafana:
- Industry standard for metrics dashboards
- Prometheus/VictoriaMetrics data source support
- Alerting integration
- Template variables and dynamic dashboards
- OIDC authentication support
Dashboards:
- Node Exporter (system metrics)
- Kubernetes cluster metrics
- local-path storage metrics
- Application-specific dashboards
### AlertManager + Gotify
Purpose: Alert routing and push notifications
Why This Stack:
- AlertManager: Industry-standard alert routing
- Gotify: Self-hosted push notifications (vs. cloud services)
- alertmanager-gotify-nodejs: Custom webhook bridge
Alert Flow:

```text
vmalert (evaluates rules)
        │
        ▼
AlertManager (routes, deduplicates, groups)
        │
        ▼
alertmanager-gotify-nodejs (webhook → Gotify API)
        │
        ▼
Gotify (push notifications to mobile/web)
```

### Loki + Promtail
Purpose: Log aggregation and querying
Why Loki:
| Feature | Benefit |
|---|---|
| Lightweight | ⭐⭐⭐⭐⭐ Lower resource usage than ELK |
| Grafana integration | ⭐⭐⭐⭐⭐ Unified metrics + logs UI |
| Label-based | ⭐⭐⭐⭐ Similar to PromQL |
| Storage efficiency | ⭐⭐⭐⭐ Only indexes labels, not content |
Components:
- Loki: Log aggregation storage (SingleBinary mode)
- Promtail: Log shipping agent (DaemonSet)
- Loki Gateway: Query routing
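To illustrate the label-based model (label names are hypothetical), a LogQL query selects a label stream first and only then filters line content:

```logql
# Error lines from one namespace; only the labels are indexed,
# the |= filter scans matching streams at query time
{namespace="gitea", app="gitea"} |= "error"
```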
Why NOT ELK/OpenSearch:
- ElasticSearch/OpenSearch are resource-intensive (~4GB+ RAM)
- Full-text indexing unnecessary for homelab scale
- Loki’s label-based approach sufficient for filtering
- Better Grafana integration (unified observability)
## Database

### CloudNativePG (CNPG)
Purpose: PostgreSQL operator for Kubernetes
Why CNPG:
| Feature | Benefit |
|---|---|
| Native operator | ⭐⭐⭐⭐⭐ True Kubernetes operator |
| HA clustering | ⭐⭐⭐⭐ Automatic failover |
| Backup/restore | ⭐⭐⭐⭐ Built-in backup to S3/PVC |
| Connection pooling | ⭐⭐⭐⭐ PgBouncer integration |
| Monitoring | ⭐⭐⭐⭐ Prometheus metrics |
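A minimal CNPG cluster is a single declarative resource. This is a sketch with hypothetical names and sizes, not a manifest from this repo:

```yaml
# Hypothetical sketch: a small HA PostgreSQL cluster via CNPG
apiVersion: postgresql.cnpg.io/v1
kind: Cluster
metadata:
  name: gitea-db
  namespace: gitea
spec:
  instances: 2              # primary + one replica, automatic failover
  storage:
    size: 10Gi
    storageClass: local-path
  monitoring:
    enablePodMonitor: true  # expose Prometheus-compatible metrics
```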
Alternatives Considered:
- Zalando Postgres Operator: More complex, less actively maintained
- Crunchy Data: Enterprise-focused, heavier
- Manual StatefulSet: No HA, manual management
### Metabase
Purpose: Data analytics and visualization
Why Metabase:
- Self-hosted alternative to cloud BI tools
- Automatic CNPG database discovery (custom operator)
- SQL + visual query builder
- User-friendly for non-technical users
- Low resource footprint
## Custom Tooling

### Restrictive HTTP Proxy (Node.js)
Purpose: Limit MikroTik API access to specific paths
Why Custom:
- Security: Credential blast radius mitigation
- Path restrictions: Only allow DNS API calls (`/rest/ip/dns/static**`)
- Watch mode: Log-only mode for testing
- Simplicity: 200 lines of Node.js
Implementation:

```javascript
const { minimatch } = require('minimatch');

// Allow a request only if its path matches one of the configured
// glob patterns for that HTTP method, e.g.
// { GET: ['/rest/ip/dns/static**'], PUT: ['/rest/ip/dns/static**'] }
const isPathAllowed = (method, path) => {
  const restrictions = config.restrictions;
  const patterns = restrictions[method] || [];
  return patterns.some(pattern => minimatch(path, pattern));
};
```

### Gitea Automation (Playwright + Bash)
Purpose: Automate OAuth setup and GitHub synchronization
Components:

- OAuth Setup Job (Playwright):
  - Automates browser interaction with Gitea UI
  - Configures Google OAuth provider
  - Retries until successful (60 attempts × 10s)
- Sync Job (Bash + curl):
  - Creates organization
  - Generates personal access tokens
  - Syncs repositories bidirectionally (GitHub ↔ Gitea)
  - Sets up push mirrors (Gitea → GitHub, 8h interval)
  - Creates ArgoCD webhook
### PartialIngress Operator (Shell-Operator + Python)
Purpose: Enable partial environment deployments for PR/CI environments
Why Custom:
- Reduces resource usage in PR environments by 50-80%
- Automatically replicates missing services from base environments
- Allows deploying only changed components
- Provides seamless routing between PR and base services
How It Works:

1. Deploy only the frontend in a PR environment (ci-pr-123-myapp)
2. PartialIngress detects the missing backend service
3. Operator replicates the backend Ingress from the dev environment
4. Nginx merges rules: PR frontend + dev backend
5. Result: full-stack environment with a partial deployment

CRDs:
- PartialIngress: Drop-in replacement for standard Ingress with auto-replication
- CompositeIngressHost: Declares base environment and hostname pattern
Implementation:
- Shell-operator for event watching
- Python sidecar for complex Kubernetes API operations
- File-based IPC between containers
- Finalizers for cross-namespace cleanup
Example:

```yaml
# Base environment (dev-myapp)
apiVersion: networking.zengarden.space/v1
kind: CompositeIngressHost
metadata:
  name: myapp-composite
  namespace: dev-myapp
spec:
  baseHost: "myapp.dev.domain.com"
  hostPattern: "myapp-*.domain.com"
  ingressClassName: internal
---
# PR environment (ci-pr-123-myapp) - only frontend deployed
apiVersion: networking.zengarden.space/v1
kind: PartialIngress
metadata:
  name: frontend
  namespace: ci-pr-123-myapp
spec:
  ingressClassName: internal
  rules:
    - host: myapp-pr-123.domain.com
      http:
        paths:
          - path: /
            pathType: Prefix
            backend:
              service:
                name: frontend
                port:
                  number: 80
# The operator automatically creates a backend Ingress in dev-myapp
# with the myapp-pr-123.domain.com hostname
```

### Grafana Alert Operator (Shell-Operator + Python)
Purpose: Manage Grafana alerting resources declaratively via Kubernetes CRDs
Why Custom:
- GitOps-native alerting: Alerts live in Git alongside application manifests
- Declarative configuration: Define alerts, notification policies, mute timings as K8s resources
- Version control: Full change history for alerting configuration
- Namespace isolation: Teams manage their own alerts
- Automated reconciliation: Syncs CRDs to Grafana via HTTP API
How It Works:

1. Developer defines a GrafanaAlertRule CRD
2. Shell-operator detects the CRD change (Add/Modify/Delete)
3. Python service reconciles via the Grafana Provisioning API
4. Grafana evaluates alert rules and fires to AlertManager
5. Operator updates the CRD status with UID and sync status

CRDs:
- GrafanaAlertRule: Prometheus-style alert rules with queries
- GrafanaNotificationPolicy: Alert routing and grouping policies
- GrafanaMuteTiming: Schedule-based alert muting
- GrafanaNotificationTemplate: Custom notification message templates
Implementation:
- Shell-operator for event watching
- Python service for Grafana API operations
- File-based IPC between containers (`/shared/binding-context.json`)
- Service account token authentication
- Finalizers for cleanup on deletion
- Status subresources for observability
Example CRD:

```yaml
apiVersion: monitoring.zengarden.space/v1
kind: GrafanaAlertRule
metadata:
  name: high-cpu-usage
  namespace: monitoring
spec:
  grafanaRef:
    name: grafana-service-account
    namespace: victoria-metrics
  folderUID: homelab-alerts
  ruleGroup: infrastructure
  title: High CPU Usage
  condition: B
  data:
    - refId: A
      queryType: prometheus
      expr: 100 - (avg(irate(node_cpu_seconds_total{mode="idle"}[5m])) * 100)
    - refId: B
      queryType: math
      expression: $A > 85
status:
  uid: alert_rule_abc123    # Managed by operator
  syncStatus: Synced
  lastSynced: "2025-11-01T10:00:00Z"
```

Token Provisioning:
A companion `grafana-service-account` chart provisions API tokens:
- Helm post-install/post-upgrade hook job
- Creates service account in Grafana via HTTP API
- Generates long-lived token with `admin` permissions
- Stores token in Kubernetes Secret for operator consumption
### Metabase CNPG Operator (Shell-Operator)
Purpose: Auto-register CNPG databases in Metabase
Why Custom:
- Eliminates manual database addition in Metabase UI
- Watches `databases.postgresql.cnpg.io` CRDs
- Creates Metabase connections via API
- Enables auto-sync and caching
Hook Configuration:

```yaml
configVersion: v1
kubernetes:
  - apiVersion: postgresql.cnpg.io/v1   # API group/version of the Database CRD
    kind: Database
    executeHookOnEvent: [Added, Modified]
    jqFilter: .metadata.name
```

## Summary of Tool Choices
| Category | Tool | Key Reason |
|---|---|---|
| Bare Metal IaC | Ansible | Agentless, idempotent, simple YAML |
| K8s IaC | Helmfile | Multi-release management, dependencies |
| K8s Distribution | K3s | Lightweight, ARM64-optimized, batteries included |
| CNI | Cilium | eBPF performance, NetworkPolicy, Hubble |
| Storage | K3s local-path | Simple, node-local, zero-overhead provisioning |
| Load Balancer | MetalLB | Bare-metal support, simple layer-2 mode |
| Ingress | ingress-nginx | Mature, feature-rich, ModSecurity support |
| Certificates | cert-manager | ACME automation, internal CA |
| DNS | external-dns | Automatic record sync, multi-provider |
| GitOps | ArgoCD | ApplicationSet auto-discovery, SSO |
| Git Server | Gitea | Lightweight, GitHub-compatible, Actions |
| Secrets | ESO + DerivedSecrets | External integration + deterministic derivation |
| Metrics | Victoria Metrics | Resource-efficient, PromQL-compatible |
| Dashboards | Grafana | Industry standard, flexible |
| Logs | Loki + Promtail | Lightweight, label-based, Grafana-integrated |
| Alerts | AlertManager + Gotify | Alert routing + self-hosted notifications |
| Database | CloudNativePG | Native operator, HA, backups |
| Analytics | Metabase | Self-hosted, user-friendly BI |
## Next Steps
- Review Security to understand how these tools are hardened
- Proceed to Deployment for step-by-step installation
- Explore Operations for ongoing management
Tool selection prioritizes production-grade capabilities, resource efficiency, and learning value to create a sustainable platform engineering environment.