Architecture Design
System Architecture Overview
This homelab implements a multi-layered, security-hardened architecture designed for resilience, scalability, and operational simplicity. The design accounts for network isolation, component interactions, failure domains, and security boundaries.
Architectural Layers
Layer 1: Physical & Network Infrastructure
Components:
- 5x Raspberry Pi Compute Module 5 (8GB RAM each)
- NVMe SSDs for storage (etcd + local-path OSDs)
- MikroTik Router (RouterOS 7) for network management
- Zyxel 5-port PoE switch
Network Topology:
┌────────────────────────────────────────────────────────────────┐
│ Internet (Optical WAN) │
└─────────────────────────────┬──────────────────────────────────┘
│ eth1
┌─────────────────────────────┴──────────────────────────────────┐
│ MikroTik Chateau LTE18 ax (Router) │
│ │
│ Firewall Rules │ Network Routing │ DNS │ DHCP │ WireGuard │
├────────────┬────────────────┬──────────────────┬───────────────┤
│ Cluster │ Home │ Management │ WireGuard VPN│
│ Network │ Network │ Network │ 192.168.216.0/24 │
│ eth2+eth3 │ eth4 │ eth5 │ │
└─────┬──────┴────────────────┴──────────────────┴───────────────┘
│
├─ Zyxel Switch (PoE) ───┬─ blade001 (192.168.77.170)
│ ├─ blade002 (192.168.77.171)
│ ├─ blade003 (192.168.77.172)
│ └─ blade004 (192.168.77.173)
│
      └─ Direct Connection ──── blade005 (192.168.77.174)

Network Segmentation:
| Network | Subnet | Purpose | Access Rules |
|---|---|---|---|
| Cluster | 192.168.77.0/24 | K3s cluster nodes | Internet access, isolated from home network, limited inbound access |
| Home | 192.168.88.0/24 | Home devices (WiFi, laptops) | Internet access, limited cluster service access via MetalLB |
| Management | 192.168.100.0/24 | Management workstation | Full access to all networks, no inbound access allowed |
| WireGuard | 192.168.216.0/24 | VPN clients | Access to cluster + pod/service networks, blocked from home/mgmt |
Firewall Strategy:
- Default deny-all with explicit allow rules
- Management → All (management access)
- Home → Cluster:80,443 (web services only)
- WireGuard → Cluster + Pod/Service CIDRs
- Cluster → Internet (for updates and external integrations)
Layer 2: Kubernetes Infrastructure
K3s Cluster Topology:
┌────────────────────────────────────────────────────────┐
│ K3s Control Plane (HA) │
│ ┌──────────────┐ ┌──────────────┐ ┌──────────────┐ │
│ │ blade001 │ │ blade002 │ │ blade003 │ │
│ │ (Master 1) │ │ (Master 2) │ │ (Master 3) │ │
│ │ │ │ │ │ │ │
│ │ etcd member │ │ etcd member │ │ etcd member │ │
│ │ API Server │ │ API Server │ │ API Server │ │
│ │ Scheduler │ │ Scheduler │ │ Scheduler │ │
│ │ Controller │ │ Controller │ │ Controller │ │
│ └──────────────┘ └──────────────┘ └──────────────┘ │
└────────────────────────────────────────────────────────┘
┌──────────────────────────────────────────────────────────────────┐
│ Worker Nodes │
│ ┌────────────────────────────────┐ ┌────────────────────────┐ │
│ │ blade004 (Worker 1) │ │ blade005 (Worker 2) │ │
│  │ - Application workloads        │  │ - Application workloads│  │
│ │ - local-path OSD │ │ - local-path OSD │ │
│ └────────────────────────────────┘ └────────────────────────┘ │
└──────────────────────────────────────────────────────────────────┘

Node Roles:
- blade001-003: Control plane + etcd + worker capabilities + local-path monitors
- blade004-005: Dedicated workers + local-path OSDs
Kubernetes Configuration:
- Version: K3s v1.32.4+k3s1
- CNI: Cilium 1.17.4 (replaces default flannel)
- Storage: local-path RBD via Rook operator
- etcd: Stored on NVMe (10GiB partition per node)
- Pod Security: Admission controller in Restricted mode
- RBAC: Google OIDC for authentication
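The configuration above maps to K3s server options; a sketch of what `/etc/rancher/k3s/config.yaml` might look like (file path is standard, but the exact values are assumptions, not taken from the actual cluster):

```yaml
# /etc/rancher/k3s/config.yaml (sketch -- values are assumptions)
cluster-init: true                 # first server bootstraps embedded etcd
flannel-backend: "none"            # Cilium replaces the default flannel CNI
disable-network-policy: true       # Cilium enforces NetworkPolicy instead
disable:
  - servicelb                      # MetalLB provides LoadBalancer services
  - traefik                        # ingress controllers are deployed separately
kube-apiserver-arg:
  - "oidc-issuer-url=https://accounts.google.com"
  - "oidc-client-id=<google-oauth-client-id>"
  - "oidc-username-claim=email"
```

The remaining servers would use the same file with `server:` and `token:` keys pointing at the first node instead of `cluster-init`.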
Layer 3: Platform Services
Platform services provide core capabilities required by all applications.
┌───────────────────────────────────────────────────────────┐
│ Ingress Layer │
│ ┌──────────────────┐ ┌──────────────────┐ │
│ │ External Ingress │ │ Internal Ingress │ │
│ │ (ModSecurity) │ │ (nginx) │ │
│ │ Class: external │ │ Class: internal │ │
│ └────────┬─────────┘ └────────┬─────────┘ │
│ │ │ │
│ ┌────────┴─────────┐ ┌────────┴─────────┐ │
│ │ Cloudflare Tunnel│ │ MetalLB │ │
│ │ (Public Access) │ │ 192.168.77.200- │ │
│ │ │ │ 192.168.77.254 │ │
│ └──────────────────┘ └──────────────────┘ │
└───────────────────────────────────────────────────────────┘
┌────────────────────────────────────────────────────────┐
│ Core Platform │
│ ┌──────────────┐ ┌──────────────┐ ┌──────────────┐ │
│ │ cert-manager │ │ external-dns │ │ MetalLB │ │
│ │ (TLS Certs) │ │ (DNS Sync) │ │ (LoadBalancer│ │
│ └──────────────┘ └──────────────┘ └──────────────┘ │
└────────────────────────────────────────────────────────┘
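The MetalLB pool shown in the ingress diagram can be declared with MetalLB's standard CRDs; a sketch matching the 192.168.77.200–254 range (resource names are assumptions):

```yaml
apiVersion: metallb.io/v1beta1
kind: IPAddressPool
metadata:
  name: cluster-pool           # name is an assumption
  namespace: metallb-system
spec:
  addresses:
    - 192.168.77.200-192.168.77.254
---
apiVersion: metallb.io/v1beta1
kind: L2Advertisement
metadata:
  name: cluster-l2
  namespace: metallb-system
spec:
  ipAddressPools:
    - cluster-pool
```

The `L2Advertisement` makes MetalLB answer ARP for pool addresses on the cluster network, which is why home-network clients can reach services through the firewall's port 80/443 allow rule.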
┌───────────────────────────────────────────────────────┐
│ Secrets Management │
│ ┌──────────────────┐ ┌──────────────────┐ │
│ │ External Secrets │ │ DerivedSecrets │ │
│ │ Operator │ │ Operator │ │
│ │ (Integration │ │ (Argon2 KDF) │ │
│ │ sync) │ │ │ │
│ └──────────────────┘ └──────────────────┘ │
└───────────────────────────────────────────────────────┘

DNS Management:
┌─────────────────────────────────────────────────────────────┐
│ DNS Zones & Authority │
├─────────────────────────────────────────────────────────────┤
│ PUBLIC │
│ ┌──────────────────────────────────────────────────────┐ │
│ │ zengarden.space (Cloudflare DNS) │ │
│ │ - Authoritative: Cloudflare │ │
│ │ - Writer: external-dns from K3s │ │
│ │ - Ingress: Oracle Cloud public IP │ │
│ └──────────────────────────────────────────────────────┘ │
│ │
│ ┌──────────────────────────────────────────────────────┐ │
│ │ homelab.zengarden.space (Cloudflare Tunnel) │ │
│ │ - Authoritative: Cloudflare │ │
│ │ - Writer: cloudflared │ │
│ │ - Ingress: Tunnel (transparent) │ │
│ └──────────────────────────────────────────────────────┘ │
├─────────────────────────────────────────────────────────────┤
│ INTERNAL │
│ ┌──────────────────────────────────────────────────────┐ │
│ │ homelab.int.zengarden.space (MikroTik DNS) │ │
│ │ - Authoritative: MikroTik (192.168.77.1) │ │
│ │ - Writer: external-dns → restrictive-proxy → │ │
│ │ MikroTik REST API │ │
│ │ - Ingress: MetalLB (192.168.77.200-254) │ │
│ └──────────────────────────────────────────────────────┘ │
└─────────────────────────────────────────────────────────────┘

Layer 4: GitOps & Automation
GitOps Flow:
┌─────────────────────────────────────────────────────────────┐
│ Source of Truth │
│ │
│ GitHub: github.com/zengarden-space/* │
│ │ │
│ │ (Bidirectional Sync) │
│ ▼ │
│ Gitea: gitea.homelab.int.zengarden.space/zengarden-space/* │
│ │ │
│ │ (Git Push Event) │
│ ▼ │
│ ┌────────────────────────────────────────────┐ │
│ │ Gitea Actions (CI/CD Pipeline) │ │
│ │ - Build container images │ │
│ │ - Run tests │ │
│ │ - Push to Gitea registry │ │
│ │ - Render Helm charts │ │
│ │ - Push manifests to homelab/{type}/{env}/ │ │
│ └──────────────────┬─────────────────────────┘ │
│ │ │
│ │ (Git Commit to manifests/homelab/) │
│ ▼ │
│ ┌────────────────────────────────────────────┐ │
│ │ ArgoCD ApplicationSet Controller │ │
│ │ - Watches manifests/homelab/*/*/* │ │
│ │ - Auto-discovers applications by path │ │
│ │ - Creates Application per directory │ │
│ │ - Routes to correct AppProject │ │
│ └──────────────────┬─────────────────────────┘ │
│ │ │
│ │ (kubectl apply) │
│ ▼ │
│ ┌────────────────────────────────────────────┐ │
│ │ Kubernetes Cluster │ │
│ │ - Namespace: {type}-{env}-{projectname} │ │
│ │ - Deploys applications │ │
│ │ - Creates services │ │
│ │ - Provisions ingress │ │
│ └────────────────────────────────────────────┘ │
└─────────────────────────────────────────────────────────────┘

Manifest Organization:
manifests/homelab/
├── apps/ # AppProject: apps (application workloads)
│ ├── dev/ # Namespace prefix: apps-dev-*
│ ├── prod/ # Namespace prefix: apps-prod-*
│ └── ci-*/ # Namespace prefix: apps-ci-*-*
├── system/ # AppProject: system (infrastructure services)
│ ├── dev/ # Namespace prefix: system-dev-*
│ └── prod/ # Namespace prefix: system-prod-*
└── platform/ # AppProject: platform (core platform)
    └── prod/            # Namespace prefix: platform-prod-*

ApplicationSet Pattern:
- Single unified ApplicationSet watches the homelab/*/*/* pattern
- Auto-generates Applications named homelab.{type}.{env}.{projectname}
- Dynamically assigns AppProject based on directory structure
- Namespace: {type}-{env}-{projectname}
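This pattern can be expressed with an ArgoCD ApplicationSet using the git directory generator; a sketch under assumed repository URL and resource names:

```yaml
# Sketch of the unified ApplicationSet (repo URL and names are assumptions)
apiVersion: argoproj.io/v1alpha1
kind: ApplicationSet
metadata:
  name: homelab
  namespace: argocd
spec:
  goTemplate: true
  generators:
    - git:
        repoURL: https://gitea.homelab.int.zengarden.space/zengarden-space/manifests.git
        revision: main
        directories:
          - path: homelab/*/*/*
  template:
    metadata:
      # path segments: homelab/{type}/{env}/{projectname}
      name: "homelab.{{index .path.segments 1}}.{{index .path.segments 2}}.{{index .path.segments 3}}"
    spec:
      project: "{{index .path.segments 1}}"
      source:
        repoURL: https://gitea.homelab.int.zengarden.space/zengarden-space/manifests.git
        targetRevision: main
        path: "{{.path.path}}"
      destination:
        server: https://kubernetes.default.svc
        namespace: "{{index .path.segments 1}}-{{index .path.segments 2}}-{{index .path.segments 3}}"
      syncPolicy:
        automated:
          prune: true
          selfHeal: true
```

The generator emits one Application per matched directory, so adding a new app is just a new directory in the manifests repo.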
Custom Operators:
- DerivedSecrets Operator
  - Derives deterministic passwords from a master key using Argon2id
  - Creates Kubernetes Secrets based on DerivedSecret CRDs
  - Enables a single master password to generate all component credentials
- PartialIngress Operator
  - Enables partial environment deployments for PR/CI environments
  - Automatically replicates missing services from base environments
  - Reduces resource usage by deploying only changed components
  - Uses finalizers for proper cleanup across namespaces
- Metabase CNPG Operator
  - Watches CloudNativePG Database resources
  - Automatically registers databases in Metabase
  - Syncs metadata and enables caching
- Gitea Automation
  - OAuth setup (Playwright-based browser automation)
  - Repository synchronization (GitHub ↔ Gitea)
  - Push mirror configuration
  - ArgoCD webhook creation
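The DerivedSecrets operator is custom, so its schema is not public; a hypothetical DerivedSecret resource, with the API group and field names assumed here (the field shape `spec.password: 32` / `spec.apiToken: 48` comes from the Secret Flow section later in this document):

```yaml
# Hypothetical DerivedSecret (custom CRD; group/version and fields are assumptions)
apiVersion: derived-secrets.zengarden.space/v1alpha1
kind: DerivedSecret
metadata:
  name: grafana-credentials
  namespace: system-prod-grafana
spec:
  # requested length of each derived value; each value would be computed as
  # Argon2id(masterKey, context = namespace/name/field)
  password: 32
  apiToken: 48
```

Because the derivation is deterministic, re-creating the resource after a cluster rebuild regenerates the same credentials without any backup of the individual secrets.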
Layer 5: Observability
Monitoring Architecture:
┌───────────────────────────────────────────────┐
│ Metrics Collection │
│ │
│ ┌────────────────┐ ┌────────────────┐ │
│ │ Prometheus │ │ Node Exporter │ │
│ │ Node Exporters │ │ (per node) │ │
│ └────────┬───────┘ └────────┬───────┘ │
│ │ │ │
│ │ (scrape) │ (scrape) │
│ ▼ ▼ │
│ ┌────────────────────────────────────────┐ │
│ │ Victoria Metrics (vmagent) │ │
│ │ - Metrics ingestion │ │
│ │ - Time series storage │ │
│ │ - Query engine │ │
│ └────────────────┬───────────────────────┘ │
│ │ │
│ │ (PromQL queries) │
│ ▼ │
│ ┌────────────────────────────────────────┐ │
│ │ Grafana Dashboards │ │
│ │ - Infrastructure metrics │ │
│ │ - Application metrics │ │
│ │ - Custom dashboards │ │
│ └────────────────────────────────────────┘ │
└───────────────────────────────────────────────┘
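vmagent consumes Prometheus-compatible scrape configuration; a minimal sketch for the per-node exporters (job name assumed, targets from the node IPs above, 9100 is node-exporter's default port):

```yaml
# Prometheus-compatible scrape config for vmagent (sketch)
scrape_configs:
  - job_name: node-exporter
    static_configs:
      - targets:
          - 192.168.77.170:9100   # blade001
          - 192.168.77.171:9100   # blade002
          - 192.168.77.172:9100   # blade003
          - 192.168.77.173:9100   # blade004
          - 192.168.77.174:9100   # blade005
```

In practice the victoria-metrics-k8s-stack discovers these targets via service discovery rather than static lists; the static form is shown only to make the scrape relationship explicit.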
┌──────────────────────────────────────────────┐
│ Alerting Pipeline │
│ │
│ ┌────────────────┐ │
│ │ vmalert │ │
│ │ - Rule eval │ │
│ └────────┬───────┘ │
│ │ (alert) │
│ ▼ │
│ ┌────────────────────────────────────────┐ │
│ │ AlertManager │ │
│ │ - Alert routing │ │
│ │ - Deduplication │ │
│ │ - Grouping │ │
│ └────────────────┬───────────────────────┘ │
│ │ │
│ │ (webhook) │
│ ▼ │
│ ┌────────────────────────────────────────┐ │
│ │ alertmanager-gotify-nodejs │ │
│ │ (Webhook → Gotify converter) │ │
│ └────────────────┬───────────────────────┘ │
│ │ │
│ │ (HTTP POST) │
│ ▼ │
│ ┌────────────────────────────────────────┐ │
│ │ Gotify (Push Notifications) │ │
│ │ - Mobile push notifications │ │
│ │ - Web UI for alerts │ │
│ └────────────────────────────────────────┘ │
└──────────────────────────────────────────────┘

Component Interaction Patterns
Secret Flow
1. Master Password (stored in derived-secret-operator namespace)
│
│ (Argon2id KDF with context)
▼
2. DerivedSecret CRD (spec.password: 32, spec.apiToken: 48)
│
│ (operator watches)
▼
3. Kubernetes Secret (data.password, data.apiToken)
│
│ (referenced by pod)
▼
4. Application Pod (env vars from secret)

Certificate Provisioning Flow
1. Ingress with annotation: cert-manager.io/cluster-issuer: letsencrypt-prod
│
│ (cert-manager watches)
▼
2. Certificate resource created
│
│ (ACME challenge via Cloudflare DNS)
▼
3. Secret with TLS cert/key created
│
│ (ingress controller references)
▼
4. TLS termination at ingress

DNS Synchronization Flow
1. Ingress/Service with external-dns annotation
│
│ (external-dns watches)
▼
2. Webhook call to external-dns-provider-mikrotik
│
│ (HTTP POST to restrictive-proxy)
▼
3. Restrictive Proxy validates path (POST /rest/ip/dns/static)
│
│ (allowed, forwards with auth)
▼
4. MikroTik REST API creates DNS record
│
▼
5. DNS record available (homelab.int.zengarden.space)

Scalability Considerations
Horizontal Scalability
What scales:
- Worker nodes: Add blade006, blade007, etc.
- Application pods: HPA based on CPU/memory
- local-path OSDs: Add more NVMe drives
- MetalLB IP pool: Expand 192.168.77.200-254 range
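The pod-level scaling mentioned above uses the standard HorizontalPodAutoscaler; a sketch with placeholder names and thresholds:

```yaml
apiVersion: autoscaling/v2
kind: HorizontalPodAutoscaler
metadata:
  name: example-app              # placeholder
  namespace: apps-prod-example   # placeholder
spec:
  scaleTargetRef:
    apiVersion: apps/v1
    kind: Deployment
    name: example-app
  minReplicas: 1
  maxReplicas: 4                 # bounded by the small node count
  metrics:
    - type: Resource
      resource:
        name: cpu
        target:
          type: Utilization
          averageUtilization: 80
```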
What doesn’t scale:
- etcd cluster (3 nodes recommended, max 5 for latency)
- Control plane nodes (limited by master node count)
Vertical Scalability
Upgradeable:
- RAM: CM5 is available in 2GB, 4GB, 8GB, and 16GB variants (currently 8GB)
- Storage: NVMe SSDs can be upgraded to larger capacity
- Network: 2.5Gbps possible with appropriate NICs
High Availability
HA Components:
- 3 master nodes for control plane HA
- 3 etcd members (quorum: 2/3)
- ArgoCD: 2 replicas (server, repo-server)
- Redis HA for ArgoCD
Single Points of Failure:
- MikroTik router (acceptable for homelab)
- Internet connection (LTE failover planned)
- Power supply (UPS planned)
Security Architecture
Trust Boundaries
┌──────────────────────────────────────────────────────────┐
│ Internet (Untrusted) │
└─────────────────────────┬────────────────────────────────┘
│
┌────────┴───────┐
│ Cloudflare │ (TLS termination, DDoS protection)
└────────┬───────┘
│
┌─────────────────────────┴────────────────────────────────┐
│ MikroTik Firewall (Trust Boundary 1) │
│ - Stateful inspection │
│ - Network isolation │
│ - NAT │
└─────────────────────────┬────────────────────────────────┘
│
┌──────────────┼─────────────┐
│ │ │
┌────┴────┐ ┌─────┴─────┐ ┌────┴────┐
│ Cluster │ │ Home │ │ Mgmt │
│ Network │ │ Network │ │ Network │
└────┬────┘ └───────────┘ └─────────┘
│
┌──────────┴─────────────────────────────────────┐
│ Ingress Controller (Trust Boundary 2) │
│ - ModSecurity (external) │
│ - TLS termination │
│ - Request validation │
└─────────────────────────┬──────────────────────┘
│
┌─────────────────────────┴──────────────────────┐
│ Network Policies (Trust Boundary 3) │
│ - Cilium NetworkPolicy │
│ - Pod-to-pod restrictions │
│ - Namespace isolation │
└─────────────────────────┬──────────────────────┘
│
┌──────────────┴──────────────┐
│ │
┌────┴────────┐ ┌─────────┴────────┐
│ Application │ │ Data Plane │
│ Pods │ │ (Databases) │
└─────────────┘        └──────────────────┘

Defense in Depth
Layer 1: Network
- Network segmentation
- Firewall rules (default deny)
- WireGuard VPN encryption
Layer 2: Ingress
- ModSecurity WAF (external ingress)
- TLS 1.3 only
- Rate limiting
Layer 3: Kubernetes
- Pod Security Admission (Restricted)
- Network Policies (Cilium)
- RBAC with OIDC
- Secrets encryption at rest
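Namespace isolation at this layer can be expressed with standard NetworkPolicy resources, which Cilium enforces; a default-deny sketch (namespace is a placeholder):

```yaml
# Default-deny ingress for a namespace; traffic must be explicitly allowed
apiVersion: networking.k8s.io/v1
kind: NetworkPolicy
metadata:
  name: default-deny-ingress
  namespace: apps-prod-example   # placeholder namespace
spec:
  podSelector: {}                # empty selector matches every pod in the namespace
  policyTypes:
    - Ingress
```

Additional policies then allowlist specific flows (e.g. ingress-controller → app, app → database), mirroring the firewall's deny-by-default stance at Layer 1.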
Layer 4: Application
- Non-root containers
- Read-only root filesystems
- Capability dropping
- AppArmor/seccomp profiles
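The Layer 4 controls translate into a container-level securityContext along these lines (a sketch, not the actual manifests; the UID is a placeholder):

```yaml
securityContext:
  runAsNonRoot: true
  runAsUser: 1000                 # placeholder non-root UID
  readOnlyRootFilesystem: true
  allowPrivilegeEscalation: false
  capabilities:
    drop: ["ALL"]                 # drop every Linux capability
  seccompProfile:
    type: RuntimeDefault          # default seccomp filter
```

This is also the shape the Restricted Pod Security Admission profile from Layer 3 expects, so non-compliant pods are rejected at admission time.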
Layer 5: Audit
- Kubernetes audit logging
- Victoria Metrics for security events
- AlertManager for anomaly detection
Disaster Recovery
Backup Strategy
What’s backed up:
- etcd snapshots (daily)
- Persistent volumes (local-path snapshots)
- Git repositories (GitHub + Gitea)
- Configuration (Git-tracked)
What’s not backed up:
- Stateless applications (recreatable from Git)
- Metrics data (ephemeral)
- Logs (ephemeral)
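K3s can take the daily etcd snapshots natively; a sketch of the relevant server configuration keys (the schedule and retention values are assumptions):

```yaml
# /etc/rancher/k3s/config.yaml (snapshot-related keys only; values assumed)
etcd-snapshot-schedule-cron: "0 3 * * *"   # daily at 03:00
etcd-snapshot-retention: 7                 # keep roughly one week of snapshots
```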
Recovery Procedures
Scenario 1: Single node failure
- K3s: Cluster continues with 2/3 masters
- Action: Replace blade, re-join cluster
Scenario 2: Cluster failure
- Restore from Git repositories
- Run Ansible playbooks to rebuild K3s
- Deploy helmfile to restore platform
- ArgoCD self-heals applications
Scenario 3: Data corruption
- Restore etcd from backup
- Replay Git commits if needed
Design Patterns
Infrastructure as Code Pattern
- All infrastructure in Git
- Declarative manifests (YAML)
- Immutable deployments
- Version control for changes
GitOps Pattern
- Git as single source of truth
- Automated synchronization
- Self-healing on drift
- Audit trail in Git history
Operator Pattern
- Custom controllers for domain-specific logic
- CRDs for declarative config
- Reconciliation loops
- Kubernetes-native automation
Performance Considerations
Bottlenecks
Identified:
- Network: 1Gbps per blade (acceptable for homelab)
- Storage IOPS: NVMe limited by USB3 throughput on CM5
- CPU: ARM Cortex-A76 (acceptable for most workloads)
Optimizations:
- etcd on NVMe (low-latency storage)
- Cilium eBPF (bypasses iptables overhead)
- local-path: Direct NVMe access for OSDs
Resource Allocation
Reserved Resources:
- etcd: 10GiB NVMe partition per node
- System: 1GB RAM per node (kubelet, systemd)
- local-path: Remaining NVMe space
Available Resources:
- Total RAM: 40GB (5 nodes × 8GB)
- Total CPU: 20 cores (5 nodes × 4 cores)
- Total Storage: ~1TB NVMe (local-path)
Conclusion
This architecture demonstrates how production-grade platform engineering principles can be applied to minimal hardware:
- Resilience: HA control plane, distributed storage, self-healing
- Security: Multiple trust boundaries, defense in depth, least privilege
- Scalability: Horizontal scaling for nodes and applications
- Observability: Comprehensive metrics, logs, and alerts
- Automation: GitOps, custom operators, zero manual steps
The modular design ensures each component can be understood, replaced, or upgraded independently while maintaining overall system integrity.
Next Steps
- Review Infrastructure Planning for hardware details
- Explore Tools & Technology for justification of tool choices
- Understand Security protocols in depth