Infrastructure Planning
Hardware & Software Overview
This section provides comprehensive infrastructure planning details, including hardware specifications, cost analysis, power consumption metrics, and software stack decisions.
Hardware Infrastructure
Compute Nodes
Selected Platform: Raspberry Pi Compute Module 5 (CM5) Blades
| Specification | Details | Reasoning |
|---|---|---|
| CPU | ARM Cortex-A76 (4 cores) | Balance of performance and power efficiency |
| RAM | 16GB LPDDR4X (blade001-004), 8GB (blade005) | Ample for K8s control plane + workloads |
| Architecture | ARM64 (aarch64) | Low power consumption, modern instruction set |
| Form Factor | Compute Blade | Space-efficient, centralized cooling |
| Quantity | 5 nodes | 3 masters (HA) + 2 workers |
| Power per Node | ~5-8W under load | Critical for 24/7 operation cost |
Node Roles:
- blade001-003: K3s masters + etcd + worker capabilities (16GB RAM each)
- blade004: Worker (16GB RAM)
- blade005: Worker (8GB RAM)
Total Compute Resources:
- CPU: 20 cores (5 nodes × 4 cores)
- RAM: 72GB (4 nodes × 16GB + 1 node × 8GB)
- Power: 25-40W total cluster
Storage Infrastructure
Initial Approach (Abandoned)
Original Selection: Samsung 990 PRO NVMe SSDs
- Capacity: 512GB per drive
- Performance: 7,000 MB/s read, 5,000 MB/s write
- Problem: Excessive heat generation in enclosed blade chassis
- Outcome: Thermal throttling, reliability concerns
Final Selection
Adopted Solution: SK hynix BC711 NVMe SSDs
- Capacity: 256GB per drive
- Performance: 2,000-3,000 MB/s (sufficient for homelab)
- Form Factor: M.2 2242 (compact size)
- Temperature: Minimal heat generation
- Cost: ~$20-30 per drive
- Quantity: 5 drives (1 per node)
Storage Configuration:
- /dev/nvme0n1 (256GB) → mounted at /var/lib/rancher/k3s
- No partitioning: single filesystem for simplicity
- K3s data: etcd, containerd, kubelet all on NVMe
- Local-path storage: K3s built-in local-path provisioner
Storage Architecture:
- Provisioner: rancher.io/local-path (K3s built-in)
- Total Raw: ~1.2TB (5 × 256GB)
- Usable: ~1.2TB (no replication, node-local storage)
- Performance: Direct NVMe access, low latency
- Note: PVCs are node-local, no cross-node redundancy
Networking Infrastructure
MikroTik Router
Model: MikroTik Chateau LTE18 ax (LTE Cat18 with Wi-Fi 6)
| Feature | Specification | Usage |
|---|---|---|
| Ports | 5× Gigabit Ethernet + 1× SFP | WAN, networks, management |
| CPU | Dual-core 880 MHz MIPS | Routing, firewall, DNS |
| RAM | 256MB | RouterOS + tables |
| PoE | PoE-out on port 5 | Powering devices |
| Power | ~5W | Minimal overhead |
Port Assignment:
- eth1: Optical WAN (fiber internet, failover to LTE)
- eth2: Cluster network trunk (to Zyxel switch → blades 001-004)
- eth3: Cluster network access (direct to blade005)
- eth4: Home network
- eth5: Management network (management workstation, 1Gbps)
Features Used:
- Network segmentation (802.1Q)
- Stateful firewall
- DHCP server (per network)
- DNS server (internal zone)
- WireGuard VPN server
- NAT (masquerade)
- LTE failover (built-in modem for internet redundancy)
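The segmentation, DHCP, and firewall features above can be sketched in RouterOS syntax. This is a hypothetical fragment, not the actual router configuration: interface names (ether2) and the WAN interface list are illustrative assumptions.

```
# Hypothetical RouterOS sketch: cluster subnet, its DHCP server, and a
# default-drop forward policy (interface/list names are illustrative).
/ip address add address=192.168.77.1/24 interface=ether2 comment="cluster"
/ip pool add name=cluster-pool ranges=192.168.77.100-192.168.77.199
/ip dhcp-server add name=cluster-dhcp interface=ether2 address-pool=cluster-pool
/ip dhcp-server network add address=192.168.77.0/24 gateway=192.168.77.1 dns-server=192.168.77.1
/ip firewall filter add chain=forward connection-state=established,related action=accept
/ip firewall filter add chain=forward in-interface=ether2 out-interface-list=WAN action=accept comment="cluster -> internet"
/ip firewall filter add chain=forward action=drop comment="default drop"
```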
Zyxel PoE Switch
Model: 5-port Gigabit PoE+ switch
| Feature | Specification |
|---|---|
| Ports | 5× Gigabit PoE+ |
| PoE Budget | 60W total |
| Usage | Powers CM5 blades |
| Power | ~3W (idle) + PoE consumption |
Connected Devices:
- blade001: PoE powered
- blade002: PoE powered
- blade003: PoE powered
- blade004: PoE powered
- Uplink: MikroTik eth2
Note: blade005 connected directly to MikroTik eth3 (separate access port)
Network Topology
┌─────────────────────────────────────────────────────────────┐
│ Internet (Fiber Optic) │
└──────────────────────────┬──────────────────────────────────┘
│
┌──────────────────────────┴──────────────────────────────────┐
│ MikroTik Chateau LTE18 ax (Router) │
│ ┌──────────────────────────────────────────────────────┐ │
│ │ Firewall │ Network Routing │ DNS │ DHCP │ WireGuard │ │
│ └──────────────────────────────────────────────────────┘ │
├────────────┬──────────────┬──────────────┬──────────────────┤
│ Cluster │ Home │ Management │ WireGuard VPN │
│ Network │ Network │ Network │ (Remote) │
│ .77.0/24 │ .88.0/24 │ .100.0/24 │ .216.0/24 │
└─────┬──────┴──────────────┴──────────────┴──────────────────┘
│
├─── Zyxel PoE Switch ───┬─── blade001 (.77.11)
│ ├─── blade002 (.77.12)
│ ├─── blade003 (.77.13)
│ └─── blade004 (.77.14)
│
└─── Direct Connection ─── blade005 (.77.15)
Network Design
| Network | Subnet | Gateway | DNS | Purpose | Firewall Policy |
|---|---|---|---|---|---|
| Cluster | 192.168.77.0/24 | .77.1 | .77.1 | K3s cluster nodes | Allow internet, limited inbound from home network (80/443 only) |
| Home | 192.168.88.0/24 | .88.1 | .88.1 | Home devices | Allow internet, allow cluster:80/443 (web services) |
| Management | 192.168.100.0/24 | .100.1 | .100.1 | Admin workstation | Full access (all networks + internet), no inbound |
| WireGuard | 192.168.216.0/24 | .77.15 | .77.1 | VPN clients | Access cluster + pod/service networks, blocked from home/mgmt |
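A remote client's WireGuard configuration under this design could look like the following sketch. Keys and the endpoint address are placeholders; the 10.42.0.0/16 and 10.43.0.0/16 AllowedIPs assume K3s's default pod and service CIDRs, and 13231 assumes RouterOS's default WireGuard port.

```
# Hypothetical WireGuard client config for the .216.0/24 VPN network.
[Interface]
Address = 192.168.216.10/24
PrivateKey = <client-private-key>
DNS = 192.168.77.1

[Peer]
PublicKey = <router-public-key>
AllowedIPs = 192.168.77.0/24, 10.42.0.0/16, 10.43.0.0/16
Endpoint = <router-public-address>:13231
PersistentKeepalive = 25
```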
Power Budget
| Component | Power (W) | Quantity | Total (W) | Notes |
|---|---|---|---|---|
| CM5 Blades | 5-8W | 5 | 25-40W | Varies with load |
| MikroTik Router | ~5W | 1 | 5W | Constant |
| Zyxel Switch | ~3W + PoE | 1 | 28W | 3W base + ~25W PoE to blades |
| Total | | | 58-73W | Measured: 60-65W typical |
Annual Energy Consumption:
- Average: 62W × 24h × 365 days = 543 kWh/year
- Cost: 543 kWh × $0.14/kWh = ~$76/year (~$6/month)
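The annual figures above can be reproduced with a short calculation, using the 62W measured average and the $0.14/kWh rate assumed in this section:

```python
# Annual energy use and cost from average draw (watts) and electricity rate ($/kWh).
def annual_energy_cost(avg_watts: float, rate_per_kwh: float) -> tuple[float, float]:
    kwh_per_year = avg_watts * 24 * 365 / 1000   # watt-hours over a year -> kWh
    return kwh_per_year, kwh_per_year * rate_per_kwh

kwh, cost = annual_energy_cost(62, 0.14)
print(f"{kwh:.0f} kWh/year, ~${cost:.0f}/year (~${cost / 12:.0f}/month)")
# -> 543 kWh/year, ~$76/year (~$6/month)
```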
Comparison:
- Single desktop PC: ~150-250W (3-4× higher)
- Cloud equivalent: $200-400/month (40-80× higher cost)
Software Infrastructure
Operating System
Selection: Ubuntu Server 24.04 LTS (ARM64)
Reasoning:
- Long-term support (5 years)
- Excellent ARM64 support
- Ansible compatibility
- Systemd for service management
- Well-documented
Configuration:
- Minimal installation (no GUI)
- Classic network interface naming (eth0) for consistency across nodes
- SSH enabled for remote management
- Automatic security updates disabled (manual control)
Kubernetes Distribution
Selection: K3s v1.32.4+k3s1
| Alternative | Reason NOT Chosen |
|---|---|
| kubeadm (vanilla K8s) | Higher resource overhead, complex setup |
| MicroK8s | Snap-based (additional overhead), less flexibility |
| k0s | Less mature, smaller community |
| RKE2 | More resource intensive than K3s |
K3s Advantages:
- Lightweight: Single binary, minimal dependencies
- ARM64 optimized: First-class ARM support
- Embedded etcd: No external etcd cluster needed
- Batteries included: Traefik, ServiceLB, local-path (Traefik and ServiceLB disabled in favor of ingress-nginx and MetalLB; local-path retained)
- Production-ready: CNCF certified Kubernetes
K3s Configuration:
cluster-init: true # Bootstrap etcd cluster
disable:
  - traefik # Use ingress-nginx instead
  - servicelb # Use MetalLB instead
flannel-backend: none # Use Cilium CNI
disable-network-policy: true # Cilium handles this
secrets-encryption: true # Encrypt secrets at rest
kube-apiserver-arg:
  - "enable-admission-plugins=NodeRestriction,PodSecurity"
  - "audit-log-path=/var/log/kubernetes/audit.log"
  - "audit-log-maxage=30"
  - "audit-log-maxbackup=10"
  - "audit-log-maxsize=100"
  - "oidc-issuer-url=https://accounts.google.com"
  - "oidc-client-id=<google-client-id>"
  - "oidc-username-claim=email"

Note: K3s includes the local-path provisioner by default, providing node-local persistent storage.
Container Network Interface (CNI)
Selection: Cilium 1.17.4
| Feature | Benefit |
|---|---|
| eBPF-based | Bypasses iptables overhead, better performance |
| NetworkPolicy | Native support for Kubernetes NetworkPolicy |
| Hubble | Observability for network flows |
| Encryption | Optional WireGuard-based pod-to-pod encryption |
| kube-proxy replacement | eBPF-based service load balancing |
Alternatives Considered:
- Flannel: Simple but limited features
- Calico: More resource intensive, iptables-based
- Weave: Deprecated
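The Cilium features in the table map onto Helm chart values roughly as follows. This is a sketch against Cilium's chart, not the deployed values file; exact option names should be verified against the 1.17 documentation.

```yaml
# Illustrative Cilium Helm values for the features above (assumed, not verbatim).
kubeProxyReplacement: true    # eBPF-based service load balancing
hubble:
  enabled: true               # network flow observability
  relay:
    enabled: true
encryption:
  enabled: false              # optional WireGuard pod-to-pod encryption
  type: wireguard
```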
Storage
Selection: K3s local-path (built-in)
Architecture:
- Provisioner: rancher.io/local-path (K3s default)
- Storage Location: /var/lib/rancher/k3s/storage on each node's NVMe
- Binding: WaitForFirstConsumer (binds to node where pod is scheduled)
- Reclaim Policy: Delete (PV deleted when PVC deleted)
Characteristics:
- Node-local: PVCs are bound to the node where first pod is scheduled
- No replication: Data exists only on one node (no cross-node redundancy)
- Performance: Direct NVMe access, very low latency
- Simplicity: No additional operators or complexity
- Limitation: Pods requiring PVC cannot move between nodes
Why local-path for Homelab:
- Sufficient for stateless apps: Most apps are stateless or use external databases
- Simpler operations: No distributed storage complexity
- Better performance: Direct NVMe access without network overhead
- Lower resource usage: No Ceph daemons consuming CPU/RAM
- Acceptable risk: Critical data backed up externally (Git, cloud)
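Consuming this storage is a plain claim against the built-in class. The manifest below is illustrative; the claim name and size are placeholders (WaitForFirstConsumer binding comes from the local-path StorageClass itself, not the PVC):

```yaml
# Hypothetical PVC against the K3s local-path provisioner.
apiVersion: v1
kind: PersistentVolumeClaim
metadata:
  name: example-data          # placeholder name
spec:
  accessModes: ["ReadWriteOnce"]
  storageClassName: local-path
  resources:
    requests:
      storage: 5Gi
```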
Core Platform Services
| Service | Purpose | Version |
|---|---|---|
| MetalLB | Load balancer for bare metal | v0.14.5 |
| ingress-nginx | HTTP(S) ingress controller | Latest |
| cert-manager | TLS certificate automation | v1.18.1 |
| external-dns | DNS synchronization | v1.15.0 |
| External Secrets Operator | External secret integration | v0.10.4 |
| ArgoCD | GitOps continuous deployment | Latest |
| Gitea | Self-hosted Git server | v12.4.0 |
| CloudNativePG | PostgreSQL operator | Latest |
| Victoria Metrics | Metrics and monitoring | Latest |
| Metabase | Data analytics and visualization | Latest |
Custom Operators
| Operator | Purpose | Implementation |
|---|---|---|
| DerivedSecrets | Derive passwords from master key using Argon2id | Shell-operator (Bash + argon2) |
| PartialIngress | Partial environment deployments with auto-replication | Shell-operator (Bash + Python) |
| Metabase CNPG | Auto-register CNPG databases in Metabase | Shell-operator (Bash + curl) |
| Gitea Automation | OAuth setup + GitHub sync | Helm hooks (Playwright + Bash) |
| Restrictive Proxy | Path-restricted HTTP proxy for MikroTik API | Node.js systemd service |
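For context, a shell-operator hook declares its subscriptions by printing a configuration when invoked with --config. The fragment below is a hypothetical example resembling how these operators subscribe to cluster events, not their actual code; the binding and label names are invented.

```yaml
# Illustrative shell-operator hook config (printed on `hook.sh --config`).
configVersion: v1
kubernetes:
  - name: watch-master-secrets        # hypothetical binding name
    apiVersion: v1
    kind: Secret
    executeHookOnEvent: ["Added", "Modified"]
    labelSelector:
      matchLabels:
        derived-secrets/watch: "true" # hypothetical label
```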
Cost Analysis
Initial Capital Expenditure (CapEx)
| Item | Quantity | Unit Price | Total | Notes |
|---|---|---|---|---|
| Raspberry Pi CM5 Blades | 5 | ~$120 | $600 | Mix of 16GB (×4) and 8GB (×1) versions |
| NVMe SSDs (final) | 5 | ~$25 | $125 | Budget M.2 2242 drives |
| NVMe SSDs (initial, abandoned) | 2 | ~$80 | $160 | Samsung 990 PRO (too hot) |
| MikroTik Chateau LTE18 ax | 1 | ~$200 | $200 | Router + firewall + LTE |
| Zyxel PoE Switch | 1 | ~$200 | $200 | 5-port PoE+ |
| Cables, Power | Various | — | $100 | Ethernet cables, power supplies |
| Total Initial | | | $1,385 | One-time cost |
Lessons Learned:
- High-end NVMe SSDs ($160) were overkill and caused thermal issues
- Budget NVMe SSDs ($125) perform adequately and run cool
- Wasted: $160 on Samsung drives (repurposed for other projects)
Operational Expenditure (OpEx)
Monthly Costs:
| Item | Calculation | Monthly Cost |
|---|---|---|
| Electricity | 62W × 24h × 30d × $0.14/kWh | ~$6.20 |
| Internet | Existing connection | $0 (no additional cost) |
| Domain | zengarden.space | $1.00 (annual / 12) |
| Cloudflare | Free tier | $0 |
| Total Monthly | | ~$7.20 |
Annual Costs:
| Item | Annual Cost |
|---|---|
| Electricity | ~$75 |
| Domain | ~$12 |
| Total Annual | ~$87 |
Total Cost of Ownership (TCO)
5-Year TCO:
| Item | Cost |
|---|---|
| Initial CapEx | $1,385 |
| 5 Years OpEx | $435 (5 × $87) |
| Total | $1,820 |
Cloud Equivalent (AWS):
- 3× t4g.medium (2vCPU, 4GB): ~$75/month
- 2× t4g.small (2vCPU, 2GB): ~$40/month
- 500GB EBS storage: ~$50/month
- Load balancer: ~$20/month
- Data transfer: ~$15/month
- Total: ~$200/month × 60 months = $12,000
Savings: $10,180 over 5 years (6.6× cheaper)
Cost Comparison Analysis
Break-even Point:
- CapEx: $1,385
- Monthly savings vs cloud: $200 - $7 = $193
- Break-even: 1,385 / 193 = 7.2 months
After 1 Year:
- Total cost: $1,385 + (12 × $7) = $1,469
- Cloud cost: 12 × $200 = $2,400
- Savings: $931
After 5 Years:
- Total cost: $1,820
- Cloud cost: $12,000
- Savings: $10,180
- ROI: 559%
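The break-even arithmetic above can be checked directly, rounding the ~$7.20 monthly OpEx to $7 as the figures above do:

```python
# Break-even point and cumulative savings vs. the cloud estimate.
CAPEX = 1385.0          # initial hardware cost ($)
OPEX_MONTH = 7.0        # homelab monthly running cost ($)
CLOUD_MONTH = 200.0     # estimated AWS-equivalent monthly cost ($)

def breakeven_months() -> float:
    # Months until accumulated cloud savings cover the hardware outlay.
    return CAPEX / (CLOUD_MONTH - OPEX_MONTH)

def savings_after(months: int) -> float:
    homelab_total = CAPEX + months * OPEX_MONTH
    return months * CLOUD_MONTH - homelab_total

print(f"break-even: {breakeven_months():.1f} months")   # -> 7.2 months
print(f"1-year savings: ${savings_after(12):.0f}")      # -> $931
```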
Capacity Planning
Resource Allocation
Total Cluster Resources:
- CPU: 20 cores (5 nodes × 4 cores)
- RAM: 72GB (4 × 16GB + 1 × 8GB)
- Storage: ~1.2TB (5 × 256GB NVMe)
Reserved for System:
- CPU: ~2 cores (kubelet, systemd per node)
- RAM: ~12GB (OS + K3s overhead, ~2-3GB per node)
- Storage: ~100GB (OS, logs, etcd, container images)
Available for Applications:
- CPU: ~18 cores
- RAM: ~60GB
- Storage: ~1.1TB (node-local PVCs)
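The allocatable figures follow from the node inventory minus the system reservations listed above; a quick check of the numbers:

```python
# Cluster capacity after system reservations (figures from this section).
nodes = [{"cpu": 4, "ram_gb": 16}] * 4 + [{"cpu": 4, "ram_gb": 8}]
total_cpu = sum(n["cpu"] for n in nodes)       # 20 cores
total_ram = sum(n["ram_gb"] for n in nodes)    # 72 GB
reserved_cpu, reserved_ram = 2, 12             # kubelet/OS/K3s overhead
print(f"allocatable: {total_cpu - reserved_cpu} cores, "
      f"{total_ram - reserved_ram} GB RAM")
# -> allocatable: 18 cores, 60 GB RAM
```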
Workload Planning
Current Workloads:
| Application | CPU Request | RAM Request | Replicas | Total CPU | Total RAM |
|---|---|---|---|---|---|
| ArgoCD | 200m | 512Mi | 4 | 800m | 2Gi |
| Gitea | 500m | 1Gi | 1 | 500m | 1Gi |
| Metabase | 500m | 1Gi | 1 | 500m | 1Gi |
| Victoria Metrics | 500m | 2Gi | 3 | 1.5 | 6Gi |
| Ingress | 200m | 256Mi | 2 | 400m | 512Mi |
| External DNS | 50m | 128Mi | 1 | 50m | 128Mi |
| Cert-Manager | 100m | 256Mi | 3 | 300m | 768Mi |
| Custom Operators | 100m | 128Mi | 3 | 300m | 384Mi |
| Applications | Variable | Variable | Variable | ~5 | ~10Gi |
| Total | | | | ~9.35 | ~22Gi |
Headroom: ~8.6 cores CPU, ~38GB RAM (against ~18 cores / ~60GB available)
Scaling Considerations
Horizontal Scaling (add nodes):
- Limit: Power budget, network ports, physical space
- Recommendation: Max 7-10 nodes before needing more infrastructure
Vertical Scaling (upgrade nodes):
- RAM: CM5 available in 2GB, 4GB, 8GB, 16GB (blade005 could upgrade from 8GB to 16GB)
- Storage: Upgrade NVMe to 512GB or 1TB drives (M.2 2242 form factor)
- CPU: Fixed at 4 cores (cannot upgrade)
When to Scale:
- CPU utilization >70% sustained
- RAM utilization >80% sustained
- Storage >75% used on any node
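With Victoria Metrics already deployed, these thresholds can be encoded as alerting rules. The fragment below is a sketch assuming node_exporter metrics and Prometheus-compatible rule syntax (as consumed by vmalert); group and alert names are illustrative.

```yaml
# Illustrative alerting rules for the scaling thresholds above.
groups:
  - name: capacity
    rules:
      - alert: NodeCPUSustainedHigh
        expr: 1 - avg by (instance) (rate(node_cpu_seconds_total{mode="idle"}[5m])) > 0.70
        for: 30m
        labels:
          severity: warning
      - alert: NodeMemoryHigh
        expr: 1 - node_memory_MemAvailable_bytes / node_memory_MemTotal_bytes > 0.80
        for: 30m
        labels:
          severity: warning
      - alert: NodeDiskHigh
        expr: 1 - node_filesystem_avail_bytes{mountpoint="/"} / node_filesystem_size_bytes{mountpoint="/"} > 0.75
        for: 15m
        labels:
          severity: warning
```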
Infrastructure Design Decisions
Why Raspberry Pi CM5?
| Decision Factor | Evaluation |
|---|---|
| Power Efficiency | ⭐⭐⭐⭐⭐ 5-8W per node (best in class) |
| Cost | ⭐⭐⭐⭐ $120/node (affordable) |
| Performance | ⭐⭐⭐ ARM Cortex-A76 (adequate for homelab) |
| Availability | ⭐⭐⭐⭐ Good supply (unlike Pi4 shortage era) |
| Form Factor | ⭐⭐⭐⭐⭐ Compact blades (space-efficient) |
| Community | ⭐⭐⭐⭐⭐ Massive Pi community support |
Alternatives Considered:
- Intel NUC: 3-5× power, 2× cost, better performance
- Used enterprise servers: Cheap but 10× power consumption
- Mini PCs: Similar cost, higher power, better CPU
Why K3s over Vanilla Kubernetes?
| Factor | K3s | Vanilla K8s |
|---|---|---|
| Memory footprint | ~500MB per node | ~1.5GB per node |
| Installation complexity | Simple (single binary) | Complex (kubeadm) |
| ARM64 support | First-class | Requires compilation |
| Batteries included | Ingress, LB, storage | Requires addons |
| Production readiness | CNCF certified | CNCF certified |
Why local-path over Distributed Storage?
| Feature | local-path | Ceph/Rook | Longhorn |
|---|---|---|---|
| Replication | ❌ None | ✅ 3-way | ✅ 3-way |
| Performance | ⭐⭐⭐⭐⭐ (direct NVMe) | ⭐⭐⭐ (network overhead) | ⭐⭐⭐ (network overhead) |
| Resource overhead | ⭐⭐⭐⭐⭐ (minimal) | ⭐⭐ (Ceph daemons) | ⭐⭐⭐ (replication) |
| Complexity | ⭐⭐⭐⭐⭐ (trivial) | ⭐⭐ (complex) | ⭐⭐⭐ (moderate) |
| Failure tolerance | ❌ Node-local | ✅ 1 node failure | ✅ 1 node failure |
Decision: local-path for simplicity, performance, and resource efficiency in homelab environment where critical data is backed up externally
Next Steps
Now that infrastructure is planned:
- Review Tools & Technology for detailed tool selection rationale
- Understand Security planning from infrastructure up
- Proceed to Deployment to implement this plan
This infrastructure planning balances cost, power efficiency, and production-grade capabilities to create a sustainable homelab platform.