
Infrastructure Planning

Hardware & Software Overview

This section provides comprehensive infrastructure planning details, including hardware specifications, cost analysis, power consumption metrics, and software stack decisions.

Hardware Infrastructure

Compute Nodes

Selected Platform: Raspberry Pi Compute Module 5 (CM5) Blades

| Specification | Details | Reasoning |
| --- | --- | --- |
| CPU | ARM Cortex-A76 (4 cores) | Balance of performance and power efficiency |
| RAM | 16GB LPDDR4X (blade001-004), 8GB (blade005) | Ample for K8s control plane + workloads |
| Architecture | ARM64 (aarch64) | Low power consumption, modern instruction set |
| Form Factor | Compute Blade | Space-efficient, centralized cooling |
| Quantity | 5 nodes | 3 masters (HA) + 2 workers |
| Power per Node | ~5-8W under load | Critical for 24/7 operation cost |

Node Roles:

  • blade001-003: K3s masters + etcd + worker capabilities (16GB RAM each)
  • blade004: Worker (16GB RAM)
  • blade005: Worker (8GB RAM)

Total Compute Resources:

  • CPU: 20 cores (5 nodes × 4 cores)
  • RAM: 72GB (4 nodes × 16GB + 1 node × 8GB)
  • Power: 25-40W total cluster (5 nodes × 5-8W)

Storage Infrastructure

Initial Approach (Abandoned)

Original Selection: Samsung 990 PRO NVMe SSDs

  • Capacity: 512GB per drive
  • Performance: 7,000 MB/s read, 5,000 MB/s write
  • Problem: Excessive heat generation in enclosed blade chassis
  • Outcome: Thermal throttling, reliability concerns

Final Selection

Adopted Solution: SK hynix BC711 NVMe SSDs

  • Capacity: 256GB per drive
  • Performance: 2,000-3,000 MB/s (sufficient for homelab)
  • Form Factor: M.2 2242 (compact size)
  • Temperature: Minimal heat generation
  • Cost: ~$20-30 per drive
  • Quantity: 5 drives (1 per node)

Storage Configuration:

```
/dev/nvme0n1 (256GB) → mounted at /var/lib/rancher/k3s
```

  • No partitioning: Single filesystem for simplicity
  • K3s data: etcd, containerd, kubelet all on NVMe
  • Local-path storage: K3s built-in local-path provisioner

Storage Architecture:

  • Provisioner: rancher.io/local-path (K3s built-in)
  • Total Raw: ~1.2TB (5 × 256GB)
  • Usable: ~1.2TB (no replication, node-local storage)
  • Performance: Direct NVMe access, low latency
  • Note: PVCs are node-local, no cross-node redundancy

Networking Infrastructure

MikroTik Router

Model: MikroTik Chateau LTE18 ax (LTE Cat18 with Wi-Fi 6)

| Feature | Specification | Usage |
| --- | --- | --- |
| Ports | 5× Gigabit Ethernet + 1× SFP | WAN, networks, management |
| CPU | Dual-core 880 MHz MIPS | Routing, firewall, DNS |
| RAM | 256MB | RouterOS + tables |
| PoE | PoE-out on port 5 | Powering devices |
| Power | ~5W | Minimal overhead |

Port Assignment:

  • eth1: Optical WAN (fiber internet, failover to LTE)
  • eth2: Cluster network trunk (to Zyxel switch → blades 001-004)
  • eth3: Cluster network access (direct to blade005)
  • eth4: Home network
  • eth5: Management network (management workstation, 1Gbps)

Features Used:

  • Network segmentation (802.1Q)
  • Stateful firewall
  • DHCP server (per network)
  • DNS server (internal zone)
  • WireGuard VPN server
  • NAT (masquerade)
  • LTE failover (built-in modem for internet redundancy)

Zyxel PoE Switch

Model: 5-port Gigabit PoE+ switch

| Feature | Specification |
| --- | --- |
| Ports | 5× Gigabit PoE+ |
| PoE Budget | 60W total |
| Usage | Powers CM5 blades |
| Power | ~3W (idle) + PoE consumption |

Connected Devices:

  • blade001: PoE powered
  • blade002: PoE powered
  • blade003: PoE powered
  • blade004: PoE powered
  • Uplink: MikroTik eth2

Note: blade005 connected directly to MikroTik eth3 (separate access port)

Network Topology

```
┌─────────────────────────────────────────────────────────────┐
│                   Internet (Fiber Optic)                    │
└──────────────────────────┬──────────────────────────────────┘
┌──────────────────────────┴──────────────────────────────────┐
│             MikroTik Chateau LTE18 ax (Router)              │
│  ┌──────────────────────────────────────────────────────┐   │
│  │ Firewall │ Network Routing │ DNS │ DHCP │ WireGuard  │   │
│  └──────────────────────────────────────────────────────┘   │
├────────────┬──────────────┬──────────────┬──────────────────┤
│  Cluster   │    Home      │  Management  │  WireGuard VPN   │
│  Network   │   Network    │   Network    │    (Remote)      │
│  .77.0/24  │   .88.0/24   │  .100.0/24   │    .216.0/24     │
└─────┬──────┴──────────────┴──────────────┴──────────────────┘
      ├─── Zyxel PoE Switch ───┬─── blade001 (.77.11)
      │                        ├─── blade002 (.77.12)
      │                        ├─── blade003 (.77.13)
      │                        └─── blade004 (.77.14)
      └─── Direct Connection ─── blade005 (.77.15)
```

Network Design

| Network | Subnet | Gateway | DNS | Purpose | Firewall Policy |
| --- | --- | --- | --- | --- | --- |
| Cluster | 192.168.77.0/24 | .77.1 | .77.1 | K3s cluster nodes | Allow internet, limited inbound from home network (80/443 only) |
| Home | 192.168.88.0/24 | .88.1 | .88.1 | Home devices | Allow internet, allow cluster:80/443 (web services) |
| Management | 192.168.100.0/24 | .100.1 | .100.1 | Admin workstation | Full access (all networks + internet), no inbound |
| WireGuard | 192.168.216.0/24 | .77.15 | .77.1 | VPN clients | Access cluster + pod/service networks, blocked from home/mgmt |
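As an illustration, the home-to-cluster policy above could be expressed in RouterOS along these lines (a hedged sketch using this document's subnets, not the complete ruleset):

```
/ip firewall filter
add chain=forward src-address=192.168.88.0/24 dst-address=192.168.77.0/24 \
    protocol=tcp dst-port=80,443 action=accept comment="home -> cluster: web only"
add chain=forward src-address=192.168.88.0/24 dst-address=192.168.77.0/24 \
    action=drop comment="home -> cluster: drop everything else"
```

Rule order matters in RouterOS: the accept rule must precede the drop rule, since the first matching rule wins.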

Power Budget

| Component | Power (W) | Quantity | Total (W) | Notes |
| --- | --- | --- | --- | --- |
| CM5 Blades | 5-8W | 5 | 25-40W | Varies with load |
| MikroTik Router | ~5W | 1 | 5W | Constant |
| Zyxel Switch | ~3W + PoE | 1 | 28W | 3W base + ~25W PoE to blades |
| Total | | | 58-73W | Measured: 60-65W typical |

Annual Energy Consumption:

  • Average: 62W × 24h × 365 days = 543 kWh/year
  • Cost: 543 kWh × $0.14/kWh = ~$76/year (~$6/month)
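The annual figures above can be reproduced directly (a throwaway sketch; the $0.14/kWh rate is the document's own assumption):

```shell
#!/usr/bin/env bash
# Recompute annual energy use and cost from the measured ~62W average draw.
watts=62
rate=0.14                                  # $/kWh, as assumed above
kwh=$(awk -v w="$watts" 'BEGIN { printf "%.0f", w * 24 * 365 / 1000 }')
cost=$(awk -v k="$kwh" -v r="$rate" 'BEGIN { printf "%.0f", k * r }')
echo "${kwh} kWh/year, ~\$${cost}/year"    # → 543 kWh/year, ~$76/year
```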

Comparison:

  • Single desktop PC: ~150-250W (3-4× higher)
  • Cloud equivalent: $200-400/month (40-80× higher cost)

Software Infrastructure

Operating System

Selection: Ubuntu Server 24.04 LTS (ARM64)

Reasoning:

  • Long-term support (5 years)
  • Excellent ARM64 support
  • Ansible compatibility
  • Systemd for service management
  • Well-documented

Configuration:

  • Minimal installation (no GUI)
  • Predictable network interface names (eth0)
  • SSH enabled for remote management
  • Automatic security updates disabled (manual control)

Kubernetes Distribution

Selection: K3s v1.32.4+k3s1

| Alternative | Reason NOT Chosen |
| --- | --- |
| kubeadm (vanilla K8s) | Higher resource overhead, complex setup |
| MicroK8s | Snap-based (additional overhead), less flexibility |
| k0s | Less mature, smaller community |
| RKE2 | More resource intensive than K3s |

K3s Advantages:

  • Lightweight: Single binary, minimal dependencies
  • ARM64 optimized: First-class ARM support
  • Embedded etcd: No external etcd cluster needed
  • Batteries included: Traefik, ServiceLB, local-path (removed in favor of custom)
  • Production-ready: CNCF certified Kubernetes

K3s Configuration:

```yaml
cluster-init: true              # Bootstrap etcd cluster
disable:
  - traefik                     # Use ingress-nginx instead
  - servicelb                   # Use MetalLB instead
flannel-backend: none           # Use Cilium CNI
disable-network-policy: true    # Cilium handles this
secrets-encryption: true        # Encrypt secrets at rest
kube-apiserver-arg:
  - "enable-admission-plugins=NodeRestriction,PodSecurity"
  - "audit-log-path=/var/log/kubernetes/audit.log"
  - "audit-log-maxage=30"
  - "audit-log-maxbackup=10"
  - "audit-log-maxsize=100"
  - "oidc-issuer-url=https://accounts.google.com"
  - "oidc-client-id=<google-client-id>"
  - "oidc-username-claim=email"
```

Note: K3s includes local-path provisioner by default, providing node-local persistent storage.

Container Network Interface (CNI)

Selection: Cilium 1.17.4

| Feature | Benefit |
| --- | --- |
| eBPF-based | Bypasses iptables overhead, better performance |
| NetworkPolicy | Native support for Kubernetes NetworkPolicy |
| Hubble | Observability for network flows |
| Encryption | Optional WireGuard-based pod-to-pod encryption |
| kube-proxy replacement | eBPF-based service load balancing |
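A hedged sketch of Helm values that would pair Cilium 1.17.4 with the K3s flags used here (`flannel-backend: none`, kube-proxy replacement); the server address is an assumption, not from this document:

```yaml
# values.yaml for the cilium/cilium chart (illustrative homelab values)
kubeProxyReplacement: true        # eBPF service load balancing instead of kube-proxy
k8sServiceHost: 192.168.77.11     # assumption: first master's cluster address
k8sServicePort: 6443
hubble:
  enabled: true                   # network-flow observability
  relay:
    enabled: true
```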

Alternatives Considered:

  • Flannel: Simple but limited features
  • Calico: More resource intensive, iptables-based
  • Weave: Deprecated

Storage

Selection: K3s local-path (built-in)

Architecture:

  • Provisioner: rancher.io/local-path (K3s default)
  • Storage Location: /var/lib/rancher/k3s/storage on each node’s NVMe
  • Binding: WaitForFirstConsumer (binds to node where pod is scheduled)
  • Reclaim Policy: Delete (PV deleted when PVC deleted)

Characteristics:

  • Node-local: PVCs are bound to the node where first pod is scheduled
  • No replication: Data exists only on one node (no cross-node redundancy)
  • Performance: Direct NVMe access, very low latency
  • Simplicity: No additional operators or complexity
  • Limitation: Pods requiring PVC cannot move between nodes
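As a minimal sketch, a claim against the built-in provisioner looks like this (names and size are illustrative):

```yaml
apiVersion: v1
kind: PersistentVolumeClaim
metadata:
  name: demo-data              # illustrative name
spec:
  accessModes:
    - ReadWriteOnce            # node-local volumes are inherently single-node
  storageClassName: local-path
  resources:
    requests:
      storage: 5Gi
```

Because the class uses WaitForFirstConsumer, the PV is created on whichever node the first consuming pod is scheduled to, and the workload is effectively pinned there afterwards.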

Why local-path for Homelab:

  • Sufficient for stateless apps: Most apps are stateless or use external databases
  • Simpler operations: No distributed storage complexity
  • Better performance: Direct NVMe access without network overhead
  • Lower resource usage: No Ceph daemons consuming CPU/RAM
  • Acceptable risk: Critical data backed up externally (Git, cloud)

Core Platform Services

| Service | Purpose | Version |
| --- | --- | --- |
| MetalLB | Load balancer for bare metal | v0.14.5 |
| ingress-nginx | HTTP(S) ingress controller | Latest |
| cert-manager | TLS certificate automation | v1.18.1 |
| external-dns | DNS synchronization | v1.15.0 |
| External Secrets Operator | External secret integration | v0.10.4 |
| ArgoCD | GitOps continuous deployment | Latest |
| Gitea | Self-hosted Git server | v12.4.0 |
| CloudNativePG | PostgreSQL operator | Latest |
| Victoria Metrics | Metrics and monitoring | Latest |
| Metabase | Data analytics and visualization | Latest |

Custom Operators

| Operator | Purpose | Implementation |
| --- | --- | --- |
| DerivedSecrets | Derive passwords from master key using Argon2id | Shell-operator (Bash + argon2) |
| PartialIngress | Partial environment deployments with auto-replication | Shell-operator (Bash + Python) |
| Metabase CNPG | Auto-register CNPG databases in Metabase | Shell-operator (Bash + curl) |
| Gitea Automation | OAuth setup + GitHub sync | Helm hooks (Playwright + Bash) |
| Restrictive Proxy | Path-restricted HTTP proxy for MikroTik API | Node.js systemd service |

Cost Analysis

Initial Capital Expenditure (CapEx)

| Item | Quantity | Unit Price | Total | Notes |
| --- | --- | --- | --- | --- |
| Raspberry Pi CM5 Blades | 5 | ~$120 | $600 | 8GB RAM version |
| NVMe SSDs (final) | 5 | ~$25 | $125 | Budget M.2 2242 drives |
| NVMe SSDs (initial, abandoned) | 2 | ~$80 | $160 | Samsung 990 PRO (too hot) |
| MikroTik Chateau LTE18 ax | 1 | ~$200 | $200 | Router + firewall + LTE |
| Zyxel PoE Switch | 1 | ~$200 | $200 | 5-port PoE+ |
| Cables, Power | Various | | $100 | Ethernet cables, power supplies |
| Total Initial | | | $1,385 | One-time cost |

Lessons Learned:

  • High-end NVMe SSDs ($160) were overkill and caused thermal issues
  • Budget NVMe SSDs ($125) perform adequately and run cool
  • Sunk cost: $160 on the Samsung drives (since repurposed for other projects)

Operational Expenditure (OpEx)

Monthly Costs:

| Item | Calculation | Monthly Cost |
| --- | --- | --- |
| Electricity | 62W × 24h × 30d × $0.14/kWh | ~$6.20 |
| Internet | Existing connection | $0 (no additional cost) |
| Domain | zengarden.space | $1.00 (annual / 12) |
| Cloudflare | Free tier | $0 |
| Total Monthly | | ~$7.20 |

Annual Costs:

| Item | Annual Cost |
| --- | --- |
| Electricity | ~$75 |
| Domain | ~$12 |
| Total Annual | ~$87 |

Total Cost of Ownership (TCO)

5-Year TCO:

| Item | Cost |
| --- | --- |
| Initial CapEx | $1,385 |
| 5 Years OpEx | $435 (5 × $87) |
| Total | $1,820 |

Cloud Equivalent (AWS):

  • 3× t4g.medium (2vCPU, 4GB): ~$75/month
  • 2× t4g.small (2vCPU, 2GB): ~$40/month
  • 500GB EBS storage: ~$50/month
  • Load balancer: ~$20/month
  • Data transfer: ~$15/month
  • Total: ~$200/month × 60 months = $12,000

Savings: $10,180 over 5 years (6.6× cheaper)

Cost Comparison Analysis

Break-even Point:

  • CapEx: $1,385
  • Monthly savings vs cloud: $200 - $7 = $193
  • Break-even: 1,385 / 193 = 7.2 months
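The break-even arithmetic can be checked with a one-liner (numbers taken from the tables above):

```shell
#!/usr/bin/env bash
# Break-even: CapEx divided by monthly savings vs. the cloud estimate.
capex=1385
monthly_savings=$((200 - 7))               # cloud ~$200/mo minus ~$7/mo OpEx
months=$(awk -v c="$capex" -v s="$monthly_savings" 'BEGIN { printf "%.1f", c / s }')
echo "Break-even after ${months} months"   # → Break-even after 7.2 months
```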

After 1 Year:

  • Total cost: $1,385 + (12 × $7) = $1,469
  • Cloud cost: 12 × $200 = $2,400
  • Savings: $931

After 5 Years:

  • Total cost: $1,820
  • Cloud cost: $12,000
  • Savings: $10,180
  • ROI: 559%

Capacity Planning

Resource Allocation

Total Cluster Resources:

  • CPU: 20 cores (5 nodes × 4 cores)
  • RAM: 72GB (4 × 16GB + 1 × 8GB)
  • Storage: ~1.2TB (5 × 256GB NVMe)

Reserved for System:

  • CPU: ~2 cores (kubelet, systemd per node)
  • RAM: ~12GB (OS + K3s overhead, ~2-3GB per node)
  • Storage: ~100GB (OS, logs, etcd, container images)

Available for Applications:

  • CPU: ~18 cores
  • RAM: ~60GB
  • Storage: ~1.1TB (node-local PVCs)

Workload Planning

Current Workloads:

| Application | CPU Request | RAM Request | Replicas | Total CPU | Total RAM |
| --- | --- | --- | --- | --- | --- |
| ArgoCD | 200m | 512Mi | 4 | 800m | 2Gi |
| Gitea | 500m | 1Gi | 1 | 500m | 1Gi |
| Metabase | 500m | 1Gi | 1 | 500m | 1Gi |
| Victoria Metrics | 500m | 2Gi | 3 | 1.5 | 6Gi |
| Ingress | 200m | 256Mi | 2 | 400m | 512Mi |
| External DNS | 50m | 128Mi | 1 | 50m | 128Mi |
| Cert-Manager | 100m | 256Mi | 3 | 300m | 768Mi |
| Custom Operators | 100m | 128Mi | 3 | 300m | 384Mi |
| Applications | Variable | Variable | Variable | ~5 | ~10Gi |
| Total | | | | ~9.35 | ~22Gi |

Headroom: ~8.5 cores CPU, ~38GB RAM (~60GB available − ~22Gi requested)
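As a sanity check, request-based utilization can be computed from the workload-table totals (integer percentages for brevity):

```shell
#!/usr/bin/env bash
# Request-based utilization from the workload table totals above.
used_cpu_m=9350; alloc_cpu_m=18000         # millicores requested vs. available
used_ram_gi=22;  alloc_ram_gi=60           # GiB requested vs. available
cpu_pct=$(( used_cpu_m * 100 / alloc_cpu_m ))
ram_pct=$(( used_ram_gi * 100 / alloc_ram_gi ))
echo "CPU ${cpu_pct}% requested, RAM ${ram_pct}% requested"
```

Both figures sit comfortably below the 70%/80% scaling thresholds discussed under "When to Scale".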

Scaling Considerations

Horizontal Scaling (add nodes):

  • Limit: Power budget, network ports, physical space
  • Recommendation: Max 7-10 nodes before needing more infrastructure

Vertical Scaling (upgrade nodes):

  • RAM: CM5 available in 2GB, 4GB, 8GB, 16GB (blade005 could upgrade from 8GB to 16GB)
  • Storage: Upgrade NVMe to 512GB or 1TB drives (M.2 2242 form factor)
  • CPU: Fixed at 4 cores (cannot upgrade)

When to Scale:

  • CPU utilization >70% sustained
  • RAM utilization >80% sustained
  • Storage >75% used on any node

Infrastructure Design Decisions

Why Raspberry Pi CM5?

| Decision Factor | Evaluation |
| --- | --- |
| Power Efficiency | ⭐⭐⭐⭐⭐ 5-8W per node (best in class) |
| Cost | ⭐⭐⭐⭐ $120/node (affordable) |
| Performance | ⭐⭐⭐ ARM Cortex-A76 (adequate for homelab) |
| Availability | ⭐⭐⭐⭐ Good supply (unlike Pi4 shortage era) |
| Form Factor | ⭐⭐⭐⭐⭐ Compact blades (space-efficient) |
| Community | ⭐⭐⭐⭐⭐ Massive Pi community support |

Alternatives Considered:

  • Intel NUC: 3-5× power, 2× cost, better performance
  • Used enterprise servers: Cheap but 10× power consumption
  • Mini PCs: Similar cost, higher power, better CPU

Why K3s over Vanilla Kubernetes?

| Factor | K3s | Vanilla K8s |
| --- | --- | --- |
| Memory footprint | ~500MB per node | ~1.5GB per node |
| Installation complexity | Simple (single binary) | Complex (kubeadm) |
| ARM64 support | First-class | Requires compilation |
| Batteries included | Ingress, LB, storage | Requires addons |
| Production readiness | CNCF certified | CNCF certified |

Why local-path over Distributed Storage?

| Feature | local-path | Ceph/Rook | Longhorn |
| --- | --- | --- | --- |
| Replication | ❌ None | ✅ 3-way | ✅ 3-way |
| Performance | ⭐⭐⭐⭐⭐ (direct NVMe) | ⭐⭐⭐ (network overhead) | ⭐⭐⭐ (network overhead) |
| Resource overhead | ⭐⭐⭐⭐⭐ (minimal) | ⭐⭐ (Ceph daemons) | ⭐⭐⭐ (replication) |
| Complexity | ⭐⭐⭐⭐⭐ (trivial) | ⭐⭐ (complex) | ⭐⭐⭐ (moderate) |
| Failure tolerance | ❌ Node-local | ✅ 1 node failure | ✅ 1 node failure |

Decision: local-path was chosen for its simplicity, performance, and resource efficiency in a homelab environment where critical data is backed up externally.

Next Steps

Now that infrastructure is planned:

  1. Review Tools & Technology for detailed tool selection rationale
  2. Understand Security planning from infrastructure up
  3. Proceed to Deployment to implement this plan

This infrastructure planning balances cost, power efficiency, and production-grade capabilities to create a sustainable homelab platform.