# Tools & Technology Selection

## Tool Selection Criteria
Every tool in this homelab was chosen based on these criteria:
- Production-Grade: Must be used in enterprise environments
- Open Source: Prefer FOSS for learning and customization
- Active Community: Strong community support and documentation
- Resource Efficiency: Must run on ARM64 with limited RAM
- Learning Value: Should teach transferable skills
## Infrastructure as Code

### Ansible
Purpose: Bare metal automation and OS-level configuration
Why Ansible:
| Criterion | Evaluation |
|---|---|
| Agentless | ⭐⭐⭐⭐⭐ SSH-based, no agents to manage |
| Idempotent | ⭐⭐⭐⭐ Safe to re-run playbooks |
| Learning Curve | ⭐⭐⭐⭐ YAML-based, human-readable |
| Community | ⭐⭐⭐⭐⭐ Massive ecosystem, roles, modules |
| ARM64 Support | ⭐⭐⭐⭐⭐ Python-based, platform agnostic |
Alternatives Considered:
- Terraform: Better for cloud, but overkill for bare metal SSH tasks
- SaltStack: Requires agent installation
- Chef/Puppet: Overly complex for homelab scale
Use Cases in This Homelab:
- Partitioning NVMe drives for local-path
- Installing and configuring K3s
- Deploying restrictive HTTP proxy
- Setting up SSH keys across nodes
- Configuring systemd services
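As a sketch of what these playbooks look like (host group, device path, and mount point are hypothetical), an idempotent partitioning task might be written as:

```yaml
# Hypothetical sketch: idempotent NVMe partitioning for local-path storage.
# Safe to re-run: parted and mount only change state when needed.
- hosts: blades
  become: true
  tasks:
    - name: Create data partition on NVMe drive
      community.general.parted:
        device: /dev/nvme0n1
        number: 1
        state: present
        fs_type: ext4

    - name: Mount partition for local-path storage
      ansible.posix.mount:
        path: /var/lib/rancher/local-path
        src: /dev/nvme0n1p1
        fstype: ext4
        state: mounted
```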
Key Ansible Files:
```text
ansible/
├── install-k3s/
│   ├── install.yaml          # Main K3s installation playbook
│   ├── hosts.yaml            # Inventory (blade001-005)
│   └── .env                  # Google OIDC credentials
├── install-restrictive-proxy/
│   ├── install.yaml          # Proxy deployment playbook
│   └── server.js             # Node.js proxy implementation
└── partition-nvme-drives/
    └── partition.yaml        # NVMe partitioning for etcd + local-path
```

### Helmfile
Purpose: Kubernetes infrastructure orchestration and deployment
Why Helmfile over Plain Helm:
| Feature | Helmfile | Plain Helm |
|---|---|---|
| Multi-release management | ✅ Single file | ❌ Manual scripting |
| Dependency ordering | ✅ Automatic | ❌ Manual |
| Environment management | ✅ Built-in | ❌ Values files |
| Templating | ✅ Go templates | ⚠️ Limited |
| Declarative | ✅ GitOps-friendly | ⚠️ Imperative |
Why NOT Terraform/Crossplane:
- Terraform requires state management (adds complexity)
- Crossplane heavier than needed for homelab
- Helmfile simpler, Helm-native, sufficient for this scale
Architecture:

```text
helmfile/
├── helmfile.yaml             # Root orchestration
├── integrations.yaml         # Shared credentials
└── */
    ├── helmfile.yaml.gotmpl  # Component-specific helmfile
    ├── env.yaml              # Component environment vars
    └── charts/               # Custom charts
```

Key Features Used:
- Dependency ordering: `needs: [metallb-system, secrets-system]`
- Go templating: Dynamic values from `.env` files
- Namespace management: Automatic namespace creation
- Hook support: Pre/post-install hooks
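A minimal sketch of a release definition using these features (release names and the IP value are illustrative, not taken from the actual helmfiles):

```yaml
# Hypothetical sketch: a component helmfile with dependency ordering.
# ingress-nginx only installs after MetalLB is ready.
releases:
  - name: ingress-nginx
    namespace: ingress-nginx
    chart: ingress-nginx/ingress-nginx
    needs:
      - metallb-system/metallb   # namespace/release that must deploy first
    values:
      - controller:
          service:
            loadBalancerIP: 192.168.77.200
```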
## GitOps & Continuous Deployment

### ArgoCD
Purpose: Declarative GitOps continuous deployment
Why ArgoCD:
| Feature | Benefit |
|---|---|
| Declarative | Git as single source of truth |
| Automatic sync | Self-healing on configuration drift |
| Multi-cluster | Supports multiple K8s clusters (future) |
| RBAC & SSO | Google OIDC integration |
| ApplicationSet | Automatic app discovery from Git |
| UI + CLI | Great visibility and debugging |
Alternatives Considered:
- FluxCD: More Kubernetes-native, but less mature UI
- Jenkins X: Opinionated CI/CD, heavier resource footprint
- Spinnaker: Enterprise-grade but over-engineered for homelab
Configuration Highlights:

```yaml
argocd:
  server:
    config:
      url: https://argocd.homelab.int.zengarden.space
      dex.config: |
        connectors:
          - type: oidc
            id: google
            name: Google
            config:
              issuer: https://accounts.google.com
              clientID: $GOOGLE_CLIENT_ID
              clientSecret: $GOOGLE_CLIENT_SECRET
    rbacConfig:
      policy.csv: |
        g, [email protected], role:admin
        g, role:readonly, role:readonly
```

ApplicationSet for Auto-Discovery:
```yaml
apiVersion: argoproj.io/v1alpha1
kind: ApplicationSet
metadata:
  name: applications
spec:
  generators:
    - git:
        repoURL: https://gitea.homelab.int.zengarden.space/zengarden-space/manifests.git
        revision: main
        directories:
          - path: manifests/*
  template:
    spec:
      source:
        repoURL: https://gitea.homelab.int.zengarden.space/zengarden-space/manifests.git
        targetRevision: main
        path: '{{path}}'
      destination:
        server: https://kubernetes.default.svc
        namespace: '{{path.basename}}'
      syncPolicy:
        automated:
          prune: true
          selfHeal: true
        syncOptions:
          - CreateNamespace=true
```

### Gitea
Purpose: Self-hosted Git server with CI/CD capabilities
Why Gitea:
| Criterion | Evaluation |
|---|---|
| Lightweight | ⭐⭐⭐⭐⭐ Single binary, ~50MB RAM |
| GitHub-compatible | ⭐⭐⭐⭐ API compatibility for tooling |
| Built-in CI/CD | ⭐⭐⭐⭐ Gitea Actions (GitHub Actions clone) |
| Self-hosted | ⭐⭐⭐⭐⭐ Full data ownership |
| ARM64 Support | ⭐⭐⭐⭐⭐ Native binaries |
Alternatives Considered:
- GitLab: Too resource-intensive (~4GB RAM minimum)
- Gogs: Gitea fork with fewer features
- GitHub: Cloud-hosted (doesn’t meet self-hosted goal)
Integration Flow:

```text
GitHub (zengarden-space org)
        │
        │ (Bidirectional sync via Gitea Automation chart)
        ▼
Gitea (gitea.homelab.int.zengarden.space/zengarden-space/)
        │
        │ (Git push event → webhook)
        ▼
Gitea Actions (CI/CD pipeline)
        │
        │ (Build + test + push to registry)
        ▼
manifests repository (gitea.homelab.int.zengarden.space/zengarden-space/manifests)
        │
        │ (ArgoCD watches)
        ▼
Kubernetes deployment
```

Gitea Actions Example:
```yaml
name: Build and Deploy
on: [push]
jobs:
  build:
    runs-on: ubuntu-latest
    steps:
      - uses: actions/checkout@v3
      - name: Build Docker image
        run: docker build -t gitea.homelab.int.zengarden.space/app:${{ github.sha }} .
      - name: Push to Gitea registry
        run: docker push gitea.homelab.int.zengarden.space/app:${{ github.sha }}
      - name: Update manifests
        run: |
          git clone https://gitea.homelab.int.zengarden.space/zengarden-space/manifests
          cd manifests/app
          kustomize edit set image app=gitea.homelab.int.zengarden.space/app:${{ github.sha }}
          git commit -am "Update app to ${{ github.sha }}"
          git push
```

## Networking & Ingress

### Cilium (CNI)
Purpose: Container Network Interface with eBPF-based networking
Why Cilium:
| Feature | Benefit |
|---|---|
| eBPF-based | Bypasses iptables, ~10-15% better performance |
| NetworkPolicy | Native K8s NetworkPolicy support |
| kube-proxy replacement | eBPF load balancing |
| Hubble | Network observability and flow visualization |
| Encryption | Optional WireGuard pod-to-pod encryption |
Alternatives Considered:
- Flannel: Simpler but fewer features (no NetworkPolicy)
- Calico: iptables-based, higher resource usage
- Weave: Deprecated
Key Configuration:

```yaml
cilium:
  k8sServiceHost: 192.168.77.170
  k8sServicePort: 6443
  kubeProxyReplacement: strict   # Replace kube-proxy with eBPF
  bpf:
    masquerade: true
  hubble:
    enabled: true
    ui:
      enabled: true
```

### MetalLB
Purpose: Load balancer for bare-metal Kubernetes
Why MetalLB:
| Criterion | Evaluation |
|---|---|
| Bare-metal support | ⭐⭐⭐⭐⭐ Purpose-built for bare metal |
| Layer 2 mode | ⭐⭐⭐⭐⭐ Simple ARP-based (no BGP needed) |
| IP pool management | ⭐⭐⭐⭐ IP address allocation |
| LoadBalancer services | ⭐⭐⭐⭐⭐ Native K8s LoadBalancer type |
Alternatives:
- K3s ServiceLB: Limited to single IP per service
- Kube-VIP: More complex setup
- External load balancer: Requires additional hardware
Configuration:

```yaml
apiVersion: metallb.io/v1beta1
kind: IPAddressPool
metadata:
  name: default
  namespace: metallb-system
spec:
  addresses:
    - 192.168.77.200-192.168.77.254
```

### ingress-nginx
Purpose: HTTP(S) ingress controller
Why ingress-nginx:
| Feature | Benefit |
|---|---|
| Mature | ⭐⭐⭐⭐⭐ Production-proven |
| Performance | ⭐⭐⭐⭐ Nginx-based |
| Features | ⭐⭐⭐⭐ WebSocket, TCP/UDP, ModSecurity |
| Community | ⭐⭐⭐⭐⭐ CNCF project |
Dual Ingress Strategy:

- Internal ingress (class: `internal`)
  - MetalLB IP: 192.168.77.200
  - Used for: Gitea, ArgoCD, Grafana, Metabase
  - No ModSecurity (trusted internal network)
- External ingress (class: `external`)
  - Cloudflare Tunnel backend
  - Used for: Public-facing services
  - ModSecurity WAF enabled
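As a sketch of how a service opts into the internal class (the hostname and service details are illustrative, not copied from the actual manifests):

```yaml
# Hypothetical sketch: an Ingress served by the internal controller
apiVersion: networking.k8s.io/v1
kind: Ingress
metadata:
  name: grafana
  namespace: victoria-metrics
spec:
  ingressClassName: internal   # routed via the MetalLB IP 192.168.77.200
  rules:
    - host: grafana.homelab.int.zengarden.space
      http:
        paths:
          - path: /
            pathType: Prefix
            backend:
              service:
                name: grafana
                port:
                  number: 80
```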
### cert-manager
Purpose: Automated TLS certificate management
Why cert-manager:
- ACME support: Let’s Encrypt integration
- DNS-01 challenge: Cloudflare DNS validation
- Internal CA: Custom CA for internal domains
- Automatic renewal: No manual certificate management
- CRD-based: Kubernetes-native
Configuration:

```yaml
apiVersion: cert-manager.io/v1
kind: ClusterIssuer
metadata:
  name: letsencrypt-prod
spec:
  acme:
    server: https://acme-v02.api.letsencrypt.org/directory
    email: [email protected]
    privateKeySecretRef:
      name: letsencrypt-prod
    solvers:
      - dns01:
          cloudflare:
            email: [email protected]
            apiTokenSecretRef:
              name: cloudflare-api-token
              key: api-token
```

### external-dns
Purpose: Automatic DNS record synchronization
Why external-dns:
- Watches Ingress/Service resources
- Creates DNS records automatically
- Supports multiple providers (Cloudflare, MikroTik via webhook)
- Eliminates manual DNS management
MikroTik Integration (via webhook):

```yaml
external-dns:
  provider:
    name: webhook
    webhook:
      image: ghcr.io/mirceanton/external-dns-provider-mikrotik:v1.4.4
      env:
        MIKROTIK_BASEURL: http://mikrotik-proxy.homelab.int.zengarden.space
        MIKROTIK_SKIP_TLS_VERIFY: "true"
  sources:
    - ingress
    - service
  domainFilters:
    - homelab.int.zengarden.space
```

## Secrets Management

### External Secrets Operator (ESO)
Purpose: Synchronize external secrets into Kubernetes
Why ESO:
- Fetches secrets from external sources (Kubernetes Secrets in the `integrations` namespace)
- Creates Kubernetes Secrets in target namespaces
- Centralized credential management
- Supports secret rotation
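On the consuming side, an ExternalSecret pulls a key out of the store into a local Secret. This is a sketch with hypothetical secret and key names:

```yaml
# Hypothetical sketch: consume a credential from the integrations store
apiVersion: external-secrets.io/v1beta1
kind: ExternalSecret
metadata:
  name: cloudflare-api-token
  namespace: argocd
spec:
  refreshInterval: 1h
  secretStoreRef:
    name: integrations
    kind: SecretStore
  target:
    name: cloudflare-api-token      # Secret created in this namespace
  data:
    - secretKey: api-token
      remoteRef:
        key: cloudflare-api-token   # source Secret in the integrations namespace
        property: api-token
```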
Configuration:

```yaml
apiVersion: external-secrets.io/v1beta1
kind: SecretStore
metadata:
  name: integrations
  namespace: argocd
spec:
  provider:
    kubernetes:
      server:
        url: "https://kubernetes.default.svc"
        caProvider:
          type: ConfigMap
          name: kube-root-ca.crt
          key: ca.crt
      auth:
        serviceAccount:
          name: external-secrets
      remoteNamespace: integrations
```

### DerivedSecrets Operator (Custom)
Purpose: Derive deterministic passwords from a master key using Argon2id
Why Custom Operator:
- Security: Single master password derives all component passwords
- Deterministic: Same input always produces same output
- Cryptographically secure: Argon2id is OWASP-recommended KDF
- Namespace isolation: Each component gets unique derived secrets
How It Works:

```text
Master Password (stored in derived-secret-operator namespace)
        │
        │ Argon2id(password, salt=context, iterations=4, memory=65536)
        ▼
DerivedSecret CRD:
  spec:
    password: 32    # Generate 32-character password
    apiToken: 48    # Generate 48-character API token
        │
        ▼
Kubernetes Secret:
  data:
    password: <base64-encoded derived password>
    apiToken: <base64-encoded derived token>
```

CRD Example:
```yaml
apiVersion: zengarden.space/v1
kind: DerivedSecret
metadata:
  name: gitea-admin
  namespace: gitea
spec:
  password: 32
  apiToken: 48
```

Alternatives Considered:
- Sealed Secrets: Encrypted secrets in Git (not deterministic, complex key management)
- HashiCorp Vault: Over-engineered for homelab, resource intensive
- SOPS: Good for GitOps but manual encryption/decryption
## Monitoring & Observability

### Victoria Metrics
Purpose: Metrics collection, storage, and querying
Why Victoria Metrics over Prometheus:
| Feature | Victoria Metrics | Prometheus |
|---|---|---|
| Resource usage | ⭐⭐⭐⭐⭐ 50% less RAM | ⭐⭐⭐ |
| Storage efficiency | ⭐⭐⭐⭐⭐ 7× compression | ⭐⭐⭐ |
| PromQL compatibility | ⭐⭐⭐⭐⭐ 100% compatible | ⭐⭐⭐⭐⭐ Native |
| High availability | ⭐⭐⭐⭐ Built-in clustering | ⭐⭐⭐ Requires federation |
| Performance | ⭐⭐⭐⭐⭐ Faster queries | ⭐⭐⭐⭐ |
Components:
- vmagent: Metrics collection (Prometheus scraping)
- vmstorage: Time-series storage
- vmselect: Query engine
- vmalert: Alerting rules
- vmauth: Authentication proxy
### Grafana
Purpose: Metrics visualization and dashboards
Why Grafana:
- Industry standard for metrics dashboards
- Prometheus/VictoriaMetrics data source support
- Alerting integration
- Template variables and dynamic dashboards
- OIDC authentication support
Dashboards:
- Node Exporter (system metrics)
- Kubernetes cluster metrics
- local-path storage metrics
- Application-specific dashboards
### AlertManager + Gotify
Purpose: Alert routing and push notifications
Why This Stack:
- AlertManager: Industry-standard alert routing
- Gotify: Self-hosted push notifications (vs. cloud services)
- alertmanager-gotify-nodejs: Custom webhook bridge
Alert Flow:

```text
vmalert (evaluates rules)
        │
        ▼
AlertManager (routes, deduplicates, groups)
        │
        ▼
alertmanager-gotify-nodejs (webhook → Gotify API)
        │
        ▼
Gotify (push notifications to mobile/web)
```

### Loki + Promtail
Purpose: Log aggregation and querying
Why Loki:
| Feature | Benefit |
|---|---|
| Lightweight | ⭐⭐⭐⭐⭐ Lower resource usage than ELK |
| Grafana integration | ⭐⭐⭐⭐⭐ Unified metrics + logs UI |
| Label-based | ⭐⭐⭐⭐ Similar to PromQL |
| Storage efficiency | ⭐⭐⭐⭐ Only indexes labels, not content |
Components:
- Loki: Log aggregation storage (SingleBinary mode)
- Promtail: Log shipping agent (DaemonSet)
- Loki Gateway: Query routing
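To illustrate the label-based model (label names are hypothetical), a LogQL query selects a label stream first and only then filters line content:

```logql
# Error lines from one namespace; only the labels are indexed,
# the |= filter scans matching streams at query time
{namespace="gitea", app="gitea"} |= "error"
```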
Why NOT ELK/OpenSearch:
- ElasticSearch/OpenSearch are resource-intensive (~4GB+ RAM)
- Full-text indexing unnecessary for homelab scale
- Loki’s label-based approach sufficient for filtering
- Better Grafana integration (unified observability)
## Database

### CloudNativePG (CNPG)
Purpose: PostgreSQL operator for Kubernetes
Why CNPG:
| Feature | Benefit |
|---|---|
| Native operator | ⭐⭐⭐⭐⭐ True Kubernetes operator |
| HA clustering | ⭐⭐⭐⭐ Automatic failover |
| Backup/restore | ⭐⭐⭐⭐ Built-in backup to S3/PVC |
| Connection pooling | ⭐⭐⭐⭐ PgBouncer integration |
| Monitoring | ⭐⭐⭐⭐ Prometheus metrics |
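A minimal CNPG cluster is a single declarative resource. This is a sketch with hypothetical names and sizes, not a manifest from this repo:

```yaml
# Hypothetical sketch: a small HA PostgreSQL cluster via CNPG
apiVersion: postgresql.cnpg.io/v1
kind: Cluster
metadata:
  name: gitea-db
  namespace: gitea
spec:
  instances: 2              # primary + one replica, automatic failover
  storage:
    size: 10Gi
    storageClass: local-path
  monitoring:
    enablePodMonitor: true  # expose Prometheus-compatible metrics
```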
Alternatives Considered:
- Zalando Postgres Operator: More complex, less actively maintained
- Crunchy Data: Enterprise-focused, heavier
- Manual StatefulSet: No HA, manual management
### Metabase
Purpose: Data analytics and visualization
Why Metabase:
- Self-hosted alternative to cloud BI tools
- Automatic CNPG database discovery (custom operator)
- SQL + visual query builder
- User-friendly for non-technical users
- Low resource footprint
## Custom Tooling

### Restrictive HTTP Proxy (Node.js)
Purpose: Limit MikroTik API access to specific paths
Why Custom:
- Security: Credential blast radius mitigation
- Path restrictions: Only allow DNS API calls (`/rest/ip/dns/static**`)
- Watch mode: Log-only mode for testing
- Simplicity: 200 lines of Node.js
Implementation:

```javascript
const { minimatch } = require('minimatch');

// Allow a request only if its path matches one of the configured
// glob patterns for that HTTP method, e.g.
// { GET: ['/rest/ip/dns/static**'], PUT: ['/rest/ip/dns/static**'] }
const isPathAllowed = (method, path) => {
  const restrictions = config.restrictions;
  const patterns = restrictions[method] || [];
  return patterns.some(pattern => minimatch(path, pattern));
};
```

### Gitea Automation (Playwright + Bash)
Purpose: Automate OAuth setup and GitHub synchronization
Components:

- OAuth Setup Job (Playwright):
  - Automates browser interaction with Gitea UI
  - Configures Google OAuth provider
  - Retries until successful (60 attempts × 10s)
- Sync Job (Bash + curl):
  - Creates organization
  - Generates personal access tokens
  - Syncs repositories bidirectionally (GitHub ↔ Gitea)
  - Sets up push mirrors (Gitea → GitHub, 8h interval)
  - Creates ArgoCD webhook
### PartialIngress Operator (Shell-Operator + Python)
Purpose: Enable partial environment deployments for PR/CI environments
Why Custom:
- Reduces resource usage in PR environments by 50-80%
- Automatically replicates missing services from base environments
- Allows deploying only changed components
- Provides seamless routing between PR and base services
How It Works:

1. Deploy only the frontend in a PR environment (ci-pr-123-myapp)
2. PartialIngress detects the missing backend service
3. Operator replicates the backend Ingress from the dev environment
4. Nginx merges rules: PR frontend + dev backend
5. Result: full-stack environment with a partial deployment

CRDs:
- PartialIngress: Drop-in replacement for standard Ingress with auto-replication
- CompositeIngressHost: Declares base environment and hostname pattern
Implementation:
- Shell-operator for event watching
- Python sidecar for complex Kubernetes API operations
- File-based IPC between containers
- Finalizers for cross-namespace cleanup
Example:

```yaml
# Base environment (dev-myapp)
apiVersion: networking.zengarden.space/v1
kind: CompositeIngressHost
metadata:
  name: myapp-composite
  namespace: dev-myapp
spec:
  baseHost: "myapp.dev.domain.com"
  hostPattern: "myapp-*.domain.com"
  ingressClassName: internal
---
# PR environment (ci-pr-123-myapp) - only frontend deployed
apiVersion: networking.zengarden.space/v1
kind: PartialIngress
metadata:
  name: frontend
  namespace: ci-pr-123-myapp
spec:
  ingressClassName: internal
  rules:
    - host: myapp-pr-123.domain.com
      http:
        paths:
          - path: /
            pathType: Prefix
            backend:
              service:
                name: frontend
                port:
                  number: 80
# The operator automatically creates a backend Ingress in dev-myapp
# with the myapp-pr-123.domain.com hostname
```

### Grafana Alert Operator (Shell-Operator + Python)
Purpose: Manage Grafana alerting resources declaratively via Kubernetes CRDs
Why Custom:
- GitOps-native alerting: Alerts live in Git alongside application manifests
- Declarative configuration: Define alerts, notification policies, mute timings as K8s resources
- Version control: Full change history for alerting configuration
- Namespace isolation: Teams manage their own alerts
- Automated reconciliation: Syncs CRDs to Grafana via HTTP API
How It Works:

1. Developer defines a GrafanaAlertRule CRD
2. Shell-operator detects the CRD change (Add/Modify/Delete)
3. Python service reconciles via the Grafana Provisioning API
4. Grafana evaluates alert rules and fires to AlertManager
5. Operator updates the CRD status with UID and sync status

CRDs:
- GrafanaAlertRule: Prometheus-style alert rules with queries
- GrafanaNotificationPolicy: Alert routing and grouping policies
- GrafanaMuteTiming: Schedule-based alert muting
- GrafanaNotificationTemplate: Custom notification message templates
Implementation:
- Shell-operator for event watching
- Python service for Grafana API operations
- File-based IPC between containers (`/shared/binding-context.json`)
- Service account token authentication
- Finalizers for cleanup on deletion
- Status subresources for observability
Example CRD:

```yaml
apiVersion: monitoring.zengarden.space/v1
kind: GrafanaAlertRule
metadata:
  name: high-cpu-usage
  namespace: monitoring
spec:
  grafanaRef:
    name: grafana-service-account
    namespace: victoria-metrics
  folderUID: homelab-alerts
  ruleGroup: infrastructure
  title: High CPU Usage
  condition: B
  data:
    - refId: A
      queryType: prometheus
      expr: 100 - (avg(irate(node_cpu_seconds_total{mode="idle"}[5m])) * 100)
    - refId: B
      queryType: math
      expression: $A > 85
status:
  uid: alert_rule_abc123    # Managed by operator
  syncStatus: Synced
  lastSynced: "2025-11-01T10:00:00Z"
```

Token Provisioning:
A companion `grafana-service-account` chart provisions API tokens:
- Helm post-install/post-upgrade hook job
- Creates service account in Grafana via HTTP API
- Generates long-lived token with `admin` permissions
- Stores token in Kubernetes Secret for operator consumption
### Metabase CNPG Operator (Shell-Operator)
Purpose: Auto-register CNPG databases in Metabase
Why Custom:
- Eliminates manual database addition in Metabase UI
- Watches `databases.postgresql.cnpg.io` CRDs
- Creates Metabase connections via API
- Enables auto-sync and caching
Hook Configuration:

```yaml
configVersion: v1
kubernetes:
  - apiVersion: postgresql.cnpg.io/v1   # API group/version of the Database CRD
    kind: Database
    executeHookOnEvent: [Added, Modified]
    jqFilter: .metadata.name
```

## Summary of Tool Choices
| Category | Tool | Key Reason |
|---|---|---|
| Bare Metal IaC | Ansible | Agentless, idempotent, simple YAML |
| K8s IaC | Helmfile | Multi-release management, dependencies |
| K8s Distribution | K3s | Lightweight, ARM64-optimized, batteries included |
| CNI | Cilium | eBPF performance, NetworkPolicy, Hubble |
| Storage | K3s local-path | Simple, node-local, zero-overhead provisioning |
| Load Balancer | MetalLB | Bare-metal support, simple layer-2 mode |
| Ingress | ingress-nginx | Mature, feature-rich, ModSecurity support |
| Certificates | cert-manager | ACME automation, internal CA |
| DNS | external-dns | Automatic record sync, multi-provider |
| GitOps | ArgoCD | ApplicationSet auto-discovery, SSO |
| Git Server | Gitea | Lightweight, GitHub-compatible, Actions |
| Secrets | ESO + DerivedSecrets | External integration + deterministic derivation |
| Metrics | Victoria Metrics | Resource-efficient, PromQL-compatible |
| Dashboards | Grafana | Industry standard, flexible |
| Logs | Loki + Promtail | Lightweight, label-based, Grafana-integrated |
| Alerts | AlertManager + Gotify | Alert routing + self-hosted notifications |
| Database | CloudNativePG | Native operator, HA, backups |
| Analytics | Metabase | Self-hosted, user-friendly BI |
## Next Steps
- Review Security to understand how these tools are hardened
- Proceed to Deployment for step-by-step installation
- Explore Operations for ongoing management
Tool selection prioritizes production-grade capabilities, resource efficiency, and learning value to create a sustainable platform engineering environment.