Monitoring
Observability Stack
This homelab implements comprehensive monitoring to ensure system health, detect anomalies, and enable quick troubleshooting.
Architecture
┌────────────────────────────────────────────────────────────┐
│ Metrics Collection │
│ │
│ ┌─────────────┐ ┌──────────────┐ ┌──────────────┐ │
│ │ Node │ │ Kubernetes │ │ Application │ │
│ │ Exporter │ │ Metrics │ │ Metrics │ │
│ │ (per node) │ │ (API server) │ │ (custom) │ │
│ └──────┬──────┘ └──────┬───────┘ └──────┬───────┘ │
│ │ │ │ │
│ └──────────────────┴───────────────────┘ │
│ │ │
│ ▼ │
│ ┌──────────────────────┐ │
│ │ vmagent (scraper) │ │
│ └──────────┬───────────┘ │
│ │ │
│ ▼ │
│ ┌──────────────────────┐ │
│ │ Victoria Metrics DB │ │
│ │ (time-series storage)│ │
│ └──────────┬───────────┘ │
└────────────────────────────┼───────────────────────────────┘
│
┌───────────────────┼───────────────────┐
│ │ │
▼ ▼ ▼
┌─────────────────┐ ┌─────────────────┐ ┌──────────────────┐
│ Grafana │ │ vmalert │ │ AlertManager │
│ (dashboards) │ │ (rule engine) │ │ (alert routing) │
└─────────────────┘ └────────┬────────┘ └────────┬─────────┘
│ │
└───────┬───────────┘
▼
┌──────────────────────┐
│ alertmanager-gotify │
│ (webhook converter) │
└──────────┬───────────┘
▼
┌──────────────────────┐
│ Gotify │
│ (push notifications) │
└──────────────────────┘

Components
Victoria Metrics
Purpose: Prometheus-compatible metrics storage
Why Victoria Metrics over Prometheus:
- 50% less RAM usage
- 7× better compression
- Faster queries
- 100% PromQL compatible
Components:
- vmagent: Scrapes metrics from exporters
- vmstorage: Time-series database
- vmselect: Query engine
- vmalert: Alert rule evaluation
Access:
# VMSelect (query interface)
http://vmselect.victoria-metrics.svc.cluster.local:8481
# Example query
curl 'http://vmselect.victoria-metrics.svc.cluster.local:8481/select/0/prometheus/api/v1/query?query=up'

Grafana
Purpose: Metrics visualization and dashboards
Access:
URL: https://grafana.homelab.int.zengarden.space
Username: admin
Password: <from victoria-metrics/env.yaml>

Pre-installed Dashboards:
- Node Exporter Full (system metrics)
- Kubernetes Cluster Monitoring
- Kubernetes Pods Monitoring
- Kubernetes Persistent Volumes
- Victoria Metrics Dashboard
- local-path Cluster Monitoring
Grafana Alert Operator
Purpose: Kubernetes operator for managing Grafana alerting resources declaratively
Why Grafana Alert Operator:
- GitOps-native: Define alerts, notification policies, and mute timings as Kubernetes CRDs
- Declarative: Alerts live in Git alongside application manifests
- Automated: Reconciles Grafana configuration via HTTP API
- Version controlled: Alert changes tracked in Git history
Architecture:
Kubernetes CRDs (GrafanaAlertRule, etc.)
│
▼
Grafana Alert Operator (shell-operator + Python)
│
├─> Reads CRD changes via Kubernetes API
├─> Reconciles via Grafana Provisioning HTTP API
└─> Updates CRD status
│
▼
Grafana Alert Manager (built-in)

Custom Resource Definitions:
- GrafanaAlertRule: Define Prometheus-style alert rules
- GrafanaNotificationPolicy: Route alerts to specific receivers
- GrafanaMuteTiming: Schedule alert muting windows
- GrafanaNotificationTemplate: Custom notification templates
Example Alert Rule:
apiVersion: monitoring.zengarden.space/v1
kind: GrafanaAlertRule
metadata:
  name: high-cpu-usage
  namespace: monitoring
spec:
  grafanaRef:
    name: grafana-service-account
    namespace: victoria-metrics
  folderUID: homelab-alerts
  ruleGroup: infrastructure
  title: High CPU Usage
  condition: B
  noDataState: NoData
  execErrState: Alerting
  for: 5m
  annotations:
    summary: "High CPU usage on {{ $labels.instance }}"
    description: "CPU usage is {{ $value }}% for more than 5 minutes."
  labels:
    severity: warning
  data:
    - refId: A
      queryType: prometheus
      datasourceUid: victoria-metrics
      expr: |
        100 - (avg by (instance) (irate(node_cpu_seconds_total{mode="idle"}[5m])) * 100)
      intervalMs: 1000
      maxDataPoints: 43200
    - refId: B
      queryType: math
      datasourceUid: __expr__
      expression: $A > 85
      reducer: last

Example Notification Policy:
apiVersion: monitoring.zengarden.space/v1
kind: GrafanaNotificationPolicy
metadata:
  name: default-routing
  namespace: monitoring
spec:
  grafanaRef:
    name: grafana-service-account
    namespace: victoria-metrics
  receiver: gotify-notifications
  groupBy:
    - alertname
    - cluster
  groupWait: 30s
  groupInterval: 5m
  repeatInterval: 4h
  routes:
    - receiver: gotify-critical
      matchers:
        - severity = critical
      groupWait: 10s
      repeatInterval: 1h
    - receiver: gotify-warnings
      matchers:
        - severity = warning
      repeatInterval: 4h

Token Provisioning:
The operator uses a Grafana service account token for API authentication. A separate grafana-service-account chart provisions the token:
# Deployed automatically as part of Victoria Metrics stack
apiVersion: batch/v1
kind: Job
metadata:
  name: grafana-service-account-provision
  namespace: victoria-metrics
  annotations:
    helm.sh/hook: post-install,post-upgrade

The job creates a service account in Grafana and stores the token in a Kubernetes Secret (grafana-service-account-token), which the operator references.
Accessing Grafana Alerts:
URL: https://grafana.homelab.int.zengarden.space/alerting

Operator Logs:
# Shell-operator container (watches Kubernetes events)
kubectl -n victoria-metrics logs -l app.kubernetes.io/name=grafana-alert-operator -c shell-operator
# Handler service container (reconciles Grafana state)
kubectl -n victoria-metrics logs -l app.kubernetes.io/name=grafana-alert-operator -c handler-service

Benefits over Manual Configuration:
- GitOps workflow: Alerts reviewed via pull requests
- Namespace isolation: Teams manage their own alerts
- Consistency: Alerts defined using familiar Kubernetes patterns
- Auditability: Full change history in Git
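At its core, the operator's reconciliation amounts to diffing the desired state (CRDs in Git) against the actual state reported by Grafana's provisioning API. A minimal illustrative Python sketch of that diff step, with hypothetical rule dicts keyed by title (not the operator's actual code):

```python
def diff_rules(desired, actual):
    """Compare desired vs. actual alert rules, keyed by rule title.

    Returns (to_create, to_update, to_delete) — the three action lists
    a reconciler would apply against the Grafana provisioning API.
    """
    to_create = [t for t in desired if t not in actual]
    to_update = [t for t in desired if t in actual and desired[t] != actual[t]]
    to_delete = [t for t in actual if t not in desired]
    return to_create, to_update, to_delete

# Hypothetical states: Git declares two rules; Grafana has a stale one
# and an out-of-date copy of another.
desired = {"High CPU Usage": {"for": "5m"}, "Node Down": {"for": "5m"}}
actual = {"High CPU Usage": {"for": "10m"}, "Stale Alert": {"for": "1m"}}
print(diff_rules(desired, actual))
# (['Node Down'], ['High CPU Usage'], ['Stale Alert'])
```

The real operator additionally writes the outcome back into the CRD status, so `kubectl get` reflects whether reconciliation succeeded.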
AlertManager
Purpose: Alert routing, grouping, and deduplication
Note: For Grafana-managed alerts (defined via GrafanaAlertRule CRDs), use Grafana’s built-in AlertManager. The standalone AlertManager below is used for vmalert rules.
Access:
URL: https://alerts.homelab.int.zengarden.space
# No authentication by default (internal only)

Configuration:
route:
  receiver: gotify-notifications
  group_by: [alertname, cluster, namespace]
  group_wait: 30s
  group_interval: 5m
  repeat_interval: 4h
receivers:
  - name: gotify-notifications
    webhook_configs:
      - url: http://alertmanager-gotify-nodejs.victoria-metrics.svc:3000/webhook
        send_resolved: true
  - name: devnull # For silenced alerts

Gotify
Purpose: Push notifications to mobile/web
Access:
URL: https://notifications.zengarden.space
Username: admin
Password: <from victoria-metrics/env.yaml>

Mobile App:
- Android: Google Play Store
- iOS: Not officially available (use web UI)
Key Metrics
Cluster Health Metrics
| Metric | Description | Query |
|---|---|---|
| Node Status | Number of Ready nodes | kube_node_status_condition{condition="Ready",status="true"} |
| Pod Status | Pods in Running state | kube_pod_status_phase{phase="Running"} |
| Pod Restarts | Container restart count | rate(kube_pod_container_status_restarts_total[5m]) |
| API Server Errors | API server error rate | rate(apiserver_request_total{code=~"5.."}[5m]) |
| etcd Health | etcd member health | etcd_server_has_leader |
Resource Utilization Metrics
| Metric | Description | Query |
|---|---|---|
| CPU Usage | Node CPU utilization | 100 - (avg(irate(node_cpu_seconds_total{mode="idle"}[5m])) * 100) |
| RAM Usage | Node memory utilization | (1 - (node_memory_MemAvailable_bytes / node_memory_MemTotal_bytes)) * 100 |
| Disk Usage | Filesystem usage | 100 - ((node_filesystem_avail_bytes / node_filesystem_size_bytes) * 100) |
| Network TX | Network transmit bytes | rate(node_network_transmit_bytes_total[5m]) |
| Network RX | Network receive bytes | rate(node_network_receive_bytes_total[5m]) |
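The RAM-usage expression in the table is plain arithmetic: `(1 - available/total) * 100`. A quick sanity check of the formula with illustrative numbers (not real node values):

```python
def mem_usage_pct(avail_bytes, total_bytes):
    # Mirrors the PromQL expression:
    # (1 - (node_memory_MemAvailable_bytes / node_memory_MemTotal_bytes)) * 100
    return (1 - avail_bytes / total_bytes) * 100

# 8 GiB available out of 32 GiB total -> 75% used
print(mem_usage_pct(8 * 2**30, 32 * 2**30))  # 75.0
```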
Application Metrics
| Metric | Description | Query |
|---|---|---|
| HTTP Requests | Request rate | rate(nginx_ingress_controller_requests[5m]) |
| HTTP Errors | 5xx error rate | rate(nginx_ingress_controller_requests{status=~"5.."}[5m]) |
| Response Time | p95 latency | histogram_quantile(0.95, rate(nginx_ingress_controller_request_duration_seconds_bucket[5m])) |
| ArgoCD Sync | Out-of-sync apps | argocd_app_info{sync_status!="Synced"} |
Storage Metrics (local-path)
| Metric | Description | Query |
|---|---|---|
| Cluster Health | local-path health status | ceph_health_status (0=OK, 1=WARN, 2=ERR) |
| OSD Status | OSDs up/down | ceph_osd_up, ceph_osd_in |
| Storage Used | Used capacity | ceph_cluster_total_used_bytes / ceph_cluster_total_bytes * 100 |
| IOPS | Read/write ops | rate(ceph_pool_wr[5m]), rate(ceph_pool_rd[5m]) |
Dashboards
Node Exporter Full Dashboard
Purpose: System-level metrics per node
Panels:
- CPU usage (user, system, iowait)
- Memory usage (used, cached, buffers)
- Disk I/O (read/write bytes, IOPS)
- Network I/O (TX/RX bytes, errors)
- Filesystem usage
- System load (1m, 5m, 15m)
- Context switches, interrupts
- NVMe temperature (if available)
How to Access:
- Grafana → Dashboards → Browse
- Search: “Node Exporter Full”
- Select node from dropdown
Kubernetes Cluster Monitoring Dashboard
Purpose: Cluster-wide Kubernetes metrics
Panels:
- Node status (Ready/NotReady)
- Pod count (Running/Pending/Failed)
- CPU/RAM requests vs limits vs usage
- Namespace resource usage
- Persistent volume claims
- Top pods by CPU/RAM
- Network traffic by namespace
Kubernetes Pods Monitoring Dashboard
Purpose: Per-pod metrics
Panels:
- Pod CPU usage
- Pod memory usage
- Pod network I/O
- Container restarts
- Pod phase (Running/Pending/Failed)
- Resource requests vs actual usage
Variables:
- Namespace (dropdown)
- Pod (dropdown)
Victoria Metrics Dashboard
Purpose: Monitor Victoria Metrics itself
Panels:
- Ingestion rate (samples/sec)
- Query rate (queries/sec)
- Storage size
- Memory usage
- CPU usage
- Slow queries
Custom Dashboard Example
Creating a custom dashboard:
- Grafana → Dashboards → New Dashboard
- Add Panel
- Select Data Source: VictoriaMetrics
- Enter PromQL query
- Configure visualization (graph, gauge, stat)
- Save dashboard
Example Panel: Application Response Time
histogram_quantile(0.95,
rate(
nginx_ingress_controller_request_duration_seconds_bucket{
exported_namespace="my-app"
}[5m]
)
)

Panel Type: Time series graph
Unit: seconds
Legend: p95 response time
Alerts
Alert Rules
Location: Victoria Metrics vmalert configuration
Example Alert Rules:
groups:
  - name: cluster-health
    interval: 30s
    rules:
      - alert: NodeDown
        expr: up{job="node-exporter"} == 0
        for: 5m
        labels:
          severity: critical
        annotations:
          summary: "Node {{ $labels.instance }} is down"
          description: "Node has been unreachable for more than 5 minutes."
      - alert: PodCrashLooping
        expr: rate(kube_pod_container_status_restarts_total[15m]) > 0
        for: 5m
        labels:
          severity: warning
        annotations:
          summary: "Pod {{ $labels.namespace }}/{{ $labels.pod }} is crash looping"
          description: "Pod has restarted {{ $value }} times in the last 15 minutes."
      - alert: HighCPUUsage
        expr: |
          100 - (avg by (instance) (irate(node_cpu_seconds_total{mode="idle"}[5m])) * 100) > 85
        for: 10m
        labels:
          severity: warning
        annotations:
          summary: "High CPU usage on {{ $labels.instance }}"
          description: "CPU usage is {{ $value }}% for more than 10 minutes."
      - alert: HighMemoryUsage
        expr: |
          (1 - (node_memory_MemAvailable_bytes / node_memory_MemTotal_bytes)) * 100 > 90
        for: 10m
        labels:
          severity: warning
        annotations:
          summary: "High memory usage on {{ $labels.instance }}"
          description: "Memory usage is {{ $value }}% for more than 10 minutes."
      - alert: DiskSpaceLow
        expr: |
          100 - ((node_filesystem_avail_bytes{mountpoint="/"} / node_filesystem_size_bytes{mountpoint="/"}) * 100) > 85
        for: 5m
        labels:
          severity: warning
        annotations:
          summary: "Low disk space on {{ $labels.instance }}"
          description: "Disk usage is {{ $value }}% on root filesystem."
      - alert: CertificateExpiringSoon
        expr: |
          (cert_manager_certificate_expiration_timestamp_seconds - time()) / 86400 < 30
        for: 1h
        labels:
          severity: warning
        annotations:
          summary: "Certificate {{ $labels.namespace }}/{{ $labels.name }} expiring soon"
          description: "Certificate expires in {{ $value }} days."
      - alert: ArgocdAppOutOfSync
        expr: argocd_app_info{sync_status!="Synced"} > 0
        for: 15m
        labels:
          severity: warning
        annotations:
          summary: "ArgoCD application {{ $labels.name }} out of sync"
          description: "Application has been out of sync for more than 15 minutes."
      - alert: LocalPathHealthError
        expr: ceph_health_status == 2
        for: 5m
        labels:
          severity: critical
        annotations:
          summary: "local-path cluster health is ERROR"
          description: "local-path cluster health has been in ERROR state for 5 minutes."

Alert Severity Levels
| Severity | Description | Response Time | Examples |
|---|---|---|---|
| Critical | Service outage or data loss | Immediate (wake up) | Node down, etcd quorum lost, local-path health ERR |
| Warning | Degraded performance or upcoming issue | Within 1 hour | High CPU, certificate expiring soon, pod crash looping |
| Info | Informational, no action required | Review during next maintenance | Deployment updated, backup completed |
Alert Routing
AlertManager routes alerts based on severity:
route:
  receiver: gotify-notifications
  routes:
    - match:
        severity: critical
      receiver: gotify-notifications
      continue: true
    - match:
        severity: warning
      receiver: gotify-notifications
      continue: true
    - match:
        alertname: InfoInhibitor
      receiver: devnull
    - match:
        alertname: Watchdog
      receiver: devnull

Gotify Integration:
- Critical alerts: High priority (red notification)
- Warning alerts: Medium priority (yellow notification)
- Info alerts: Low priority (blue notification)
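The severity-to-priority mapping is handled by the alertmanager-gotify-nodejs converter; the sketch below only illustrates the idea, and the numeric priority values are assumptions (check the converter's configuration for the real mapping):

```python
# Hypothetical severity -> Gotify priority mapping. Gotify renders higher
# priorities more prominently; the real values live in the
# alertmanager-gotify-nodejs converter, not here.
PRIORITY = {"critical": 8, "warning": 5, "info": 2}

def gotify_priority(alert):
    """Pick a Gotify priority from an AlertManager alert payload."""
    severity = alert.get("labels", {}).get("severity", "info")
    return PRIORITY.get(severity, PRIORITY["info"])

print(gotify_priority({"labels": {"severity": "critical"}}))  # 8
```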
Notification Setup
Gotify Mobile App
Android:
- Install Gotify app from Google Play Store
- Open app → Settings → Add Server
- URL: https://notifications.zengarden.space
- Username: admin
- Password: <from victoria-metrics/env.yaml>
- Save
- Create Application: “homelab-alerts”
- Copy token
- Update AlertManager configuration with token
Web UI:
- Navigate to https://notifications.zengarden.space
- Login: admin / <password>
- View alerts in real-time
Custom Metrics
Exposing Custom Metrics
For applications to expose metrics:
- Implement /metrics endpoint
  - Use Prometheus client library
  - Expose on a port (e.g., 8080/metrics)
- Add a ServiceMonitor CRD:

apiVersion: monitoring.coreos.com/v1
kind: ServiceMonitor
metadata:
  name: my-app
  namespace: default
spec:
  selector:
    matchLabels:
      app: my-app
  endpoints:
    - port: metrics
      path: /metrics
      interval: 30s

- Verify scraping:

# Check vmagent targets
curl http://vmagent.victoria-metrics.svc:8429/targets
Example: Python Application Metrics
from prometheus_client import Counter, Histogram, generate_latest
from flask import Flask, Response

app = Flask(__name__)

# Define metrics
request_count = Counter('app_requests_total', 'Total requests', ['method', 'endpoint'])
request_duration = Histogram('app_request_duration_seconds', 'Request duration', ['method', 'endpoint'])

@app.route('/metrics')
def metrics():
    return Response(generate_latest(), mimetype='text/plain')

@app.route('/api/data')
@request_duration.labels(method='GET', endpoint='/api/data').time()
def get_data():
    request_count.labels(method='GET', endpoint='/api/data').inc()
    return {'data': 'example'}

Log Aggregation
Loki + Promtail
Purpose: Centralized log aggregation and querying
Architecture:
┌────────────────────────────────────────────────────────────┐
│ Log Collection │
│ │
│ ┌─────────────┐ ┌──────────────┐ ┌──────────────┐ │
│ │ Promtail │ │ Promtail │ │ Promtail │ │
│ │ (node 1) │ │ (node 2) │ │ (node 3) │ │
│ │ DaemonSet │ │ DaemonSet │ │ DaemonSet │ │
│ └──────┬──────┘ └──────┬───────┘ └──────┬───────┘ │
│ │ │ │ │
│ │ Scrapes logs from /var/log/pods/ │ │
│ └──────────────────┴───────────────────┘ │
│ │ │
│ ▼ │
│ ┌──────────────────────┐ │
│ │ Loki Gateway │ │
│ └──────────┬───────────┘ │
│ │ │
│ ▼ │
│ ┌──────────────────────┐ │
│ │ Loki (SingleBinary) │ │
│ │ (log storage) │ │
│ └──────────┬───────────┘ │
└────────────────────────────┼───────────────────────────────┘
│
▼
┌──────────────────┐
│ Grafana │
│ (log exploration)│
└──────────────────┘

Components:
Loki:
- Log aggregation system inspired by Prometheus
- Label-based indexing (not full-text)
- Filesystem storage (10Gi PVC)
- SingleBinary deployment mode (simplified architecture)
- TSDB schema for efficient storage
Promtail:
- Log shipping agent (DaemonSet on each node)
- Scrapes container logs from /var/log/pods/
- Kubernetes service discovery
- Automatic label extraction (namespace, pod, container, app)
- CRI log format parsing
Access:
# Grafana with Loki datasource
URL: https://grafana.homelab.int.zengarden.space
Data Source: Loki (preconfigured)

Query Examples:
# View all logs from a namespace
{namespace="argocd"}
# Filter by pod name
{namespace="argocd", pod=~"argocd-server.*"}
# Search for error messages
{namespace="argocd"} |= "error"
# Count errors per pod
sum by (pod) (count_over_time({namespace="argocd"} |= "error" [5m]))
# Logs from specific container
{namespace="victoria-metrics", container="loki"}
# Multiple filters
{namespace="gitea", app="gitea"} |= "authentication" != "success"

LogQL Syntax:
- {label="value"} - Label selector
- |= - Line contains string
- != - Line does not contain
- |~ "regex" - Regex match
- !~ "regex" - Regex not match
- | json - Parse JSON logs
- | logfmt - Parse logfmt logs
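The `|=` and `!=` line filters behave like chained grep/grep -v over the label-selected stream. A rough Python analogy of that semantics (purely illustrative, not how Loki is implemented):

```python
def logql_filter(lines, contains=(), not_contains=()):
    """Rough analogy of LogQL line filters:
    |= keeps lines containing a substring, != drops them."""
    for needle in contains:
        lines = [line for line in lines if needle in line]
    for needle in not_contains:
        lines = [line for line in lines if needle not in line]
    return lines

logs = ["authentication failed", "authentication success", "request served"]
# Analogous to: {...} |= "authentication" != "success"
print(logql_filter(logs, contains=["authentication"], not_contains=["success"]))
# ['authentication failed']
```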
Grafana Explore:
- Navigate to Explore in Grafana
- Select Loki data source
- Use Log browser to select labels
- Or write LogQL queries directly
- View logs in table or stream format
- Inspect individual log lines
Benefits:
- Lightweight: Lower resource usage than ELK stack
- Grafana integration: Unified metrics + logs UI
- Label-based: Similar query language to PromQL
- Cost-effective: Efficient storage (no full-text indexing)
Limitations:
- Not full-text search: Must use labels for filtering (use |= for grep-like search)
- Label cardinality: Avoid high-cardinality labels (e.g., don’t use log level as a label)
Troubleshooting Monitoring
No Metrics in Grafana
Check vmagent scraping:
kubectl -n victoria-metrics logs -l app.kubernetes.io/name=vmagent --tail=50
# Check targets
kubectl -n victoria-metrics port-forward svc/vmagent 8429:8429
curl http://localhost:8429/targets | jq

Check Victoria Metrics storage:
kubectl -n victoria-metrics logs -l app.kubernetes.io/name=vmstorage --tail=50

Alerts Not Firing
Check vmalert:
kubectl -n victoria-metrics logs -l app.kubernetes.io/name=vmalert --tail=50
# Check rules loaded
kubectl -n victoria-metrics port-forward svc/vmalert 8880:8880
curl http://localhost:8880/api/v1/rules | jq

Check AlertManager:
kubectl -n victoria-metrics logs -l app.kubernetes.io/name=alertmanager --tail=50
# Check alerts
curl http://alerts.homelab.int.zengarden.space/api/v2/alerts

Gotify Not Receiving Alerts
Check alertmanager-gotify-nodejs:
kubectl -n victoria-metrics logs -l app=alertmanager-gotify-nodejs --tail=50

Verify webhook URL:
kubectl -n victoria-metrics get configmap alertmanager -o yaml | grep gotify

Test webhook manually:
curl -X POST http://alertmanager-gotify-nodejs.victoria-metrics.svc:3000/webhook \
-H "Content-Type: application/json" \
-d '{
"alerts": [{
"status": "firing",
"labels": {"alertname": "TestAlert", "severity": "warning"},
"annotations": {"summary": "Test alert", "description": "This is a test"}
}]
}'

Security Auditing
Automated Security Audit Script
Purpose: Automated security breach detection and system integrity checks
Location: system/ansible/install-security-audit/
Deployment: Ansible playbook deploys to ~/bin/security-audit.sh on all blade nodes
Installation:
cd system/ansible/install-security-audit
./install.sh

This will deploy the script to all blade nodes (blade001-blade005).
What it checks:
- Authentication logs - Failed logins, invalid user attempts
- User accounts - New users, UID 0 users, shell access
- SSH keys - Authorized keys for all users
- Network activity - Listening ports, external connections
- Running processes - Suspicious patterns, /tmp execution
- Scheduled tasks - Root and user crontabs
- File integrity - System file modifications, SUID/SGID files
- System logs - Security keywords (breach, attack, exploit, malware)
- System resources - Disk, memory, CPU usage
- Restrictive proxy - Unauthorized access attempts (custom homelab security)
Usage:
# Run basic security audit
ssh blade001 '~/bin/security-audit.sh'
# Run with verbose output
ssh blade001 '~/bin/security-audit.sh --verbose'
# Get JSON output for parsing
ssh blade001 '~/bin/security-audit.sh --json'

Run on all nodes:
for blade in blade001 blade002 blade003 blade004 blade005; do
echo "=== $blade ==="
ssh $blade '~/bin/security-audit.sh | tail -15'
done

Exit codes:
- 0 - System secure, no issues detected
- 1 - Warnings detected, review recommended
- 2 - Critical issues detected, immediate investigation required
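These exit codes lend themselves to scripting around the audit. A small illustrative helper mapping codes to labels (the label strings here are assumptions that mirror the audit summary, not output from the script itself):

```python
# Hypothetical mapping of the audit script's exit codes to status labels,
# useful when collecting results from many nodes.
LABELS = {0: "SECURE", 1: "WARNING", 2: "CRITICAL"}

def status_label(exit_code):
    """Translate a security-audit exit code into a readable label."""
    return LABELS.get(exit_code, "UNKNOWN")

print(status_label(2))  # CRITICAL
```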
Example output:
==========================================
Homelab Security Audit
blade001 - 2025-10-28 12:48:07
==========================================
========================================
1. Authentication Analysis
========================================
[✓] No failed login attempts in last 24 hours
========================================
2. User Account Analysis
========================================
[✓] No recently added users
[✓] Only root has UID 0
========================================
Security Audit Summary
========================================
Hostname: blade001
Timestamp: 2025-10-28 12:48:07
Status: SECURE
Critical Issues: 0
Warnings: 0
✓ System appears secure. No security breaches detected.

Automated scheduling:
To run security audits automatically, add to cron:
# Add to root crontab on each blade
# Run security audit daily at 3 AM
0 3 * * * /home/oleksiyp/bin/security-audit.sh --json > /tmp/security-audit-$(date +\%Y\%m\%d).json

Integration with monitoring:
You can export security audit results to Victoria Metrics using a custom exporter or parse JSON output with a script:
#!/bin/bash
# security-audit-exporter.sh
# Parse security audit JSON and expose as metrics
RESULT=$(~/bin/security-audit.sh --json)
ISSUES=$(echo "$RESULT" | jq -r '.issues')
WARNINGS=$(echo "$RESULT" | jq -r '.warnings')
STATUS=$(echo "$RESULT" | jq -r '.status')
# Expose as Prometheus metrics
cat <<EOF
# HELP security_audit_issues Number of critical security issues detected
# TYPE security_audit_issues gauge
security_audit_issues{hostname="$(hostname)"} $ISSUES
# HELP security_audit_warnings Number of security warnings detected
# TYPE security_audit_warnings gauge
security_audit_warnings{hostname="$(hostname)"} $WARNINGS
# HELP security_audit_status Security audit status (0=SECURE, 1=COMPROMISED)
# TYPE security_audit_status gauge
security_audit_status{hostname="$(hostname)"} $([ "$STATUS" == "SECURE" ] && echo 0 || echo 1)
EOF

Security best practices:
- Run audits daily on all nodes
- Review warnings promptly (within 24 hours)
- Investigate critical issues immediately
- Keep audit logs for compliance/forensics
- Update the script as new threats emerge
Best Practices
Dashboard Design
- Use template variables for namespace, pod, node selection
- Show rates, not absolute counters (use rate() or irate())
- Set appropriate time ranges (5m for real-time, 24h for trends)
- Use percentiles for latency (p50, p95, p99)
- Set thresholds and alerts on panels
Alert Design
- Alert on symptoms, not causes (e.g., “service down” not “pod restarting”)
- Set appropriate for duration (avoid flapping)
- Include actionable annotations (what to do, where to look)
- Group related alerts (avoid alert storms)
- Test alerts (manually trigger to verify)
Metric Naming
Follow Prometheus naming conventions:
- Counters: *_total suffix (e.g., requests_total)
- Gauges: no suffix (e.g., memory_usage_bytes)
- Histograms: *_bucket, *_sum, *_count (e.g., request_duration_seconds_bucket)
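These conventions can be enforced mechanically before a metric ships. An illustrative (unofficial) checker combining the Prometheus metric-name charset with the suffix rules above:

```python
import re

# Prometheus metric names must match this charset; the suffix rules below
# encode the naming conventions listed above. Illustrative only.
NAME_RE = re.compile(r"[a-zA-Z_:][a-zA-Z0-9_:]*")

def naming_ok(name, kind):
    """Check a metric name against charset + suffix conventions."""
    if not NAME_RE.fullmatch(name):
        return False
    if kind == "counter":
        return name.endswith("_total")
    if kind == "histogram_bucket":
        return name.endswith("_bucket")
    return True  # gauges: no required suffix

print(naming_ok("requests_total", "counter"))    # True
print(naming_ok("requests", "counter"))          # False
```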
Next Steps
- Configure custom dashboards for your applications
- Set up additional alerts for specific use cases
- Integrate with log aggregation (Loki) for complete observability
- Review Maintenance for regular monitoring tasks
Comprehensive monitoring enables proactive issue detection and fast troubleshooting.