
Monitoring

Observability Stack

This homelab implements comprehensive monitoring to ensure system health, detect anomalies, and enable quick troubleshooting.

Architecture

```
┌─────────────────────────────────────────────────────────────┐
│                     Metrics Collection                      │
│                                                             │
│  ┌─────────────┐   ┌──────────────┐   ┌──────────────┐      │
│  │    Node     │   │  Kubernetes  │   │ Application  │      │
│  │  Exporter   │   │   Metrics    │   │   Metrics    │      │
│  │ (per node)  │   │ (API server) │   │   (custom)   │      │
│  └──────┬──────┘   └──────┬───────┘   └──────┬───────┘      │
│         └─────────────────┼──────────────────┘              │
│                           ▼                                 │
│               ┌──────────────────────┐                      │
│               │  vmagent (scraper)   │                      │
│               └──────────┬───────────┘                      │
│                          ▼                                  │
│               ┌──────────────────────┐                      │
│               │ Victoria Metrics DB  │                      │
│               │(time-series storage) │                      │
│               └──────────┬───────────┘                      │
└──────────────────────────┼──────────────────────────────────┘
           ┌───────────────┼────────────────┐
           ▼               ▼                ▼
┌─────────────────┐ ┌─────────────────┐ ┌──────────────────┐
│     Grafana     │ │     vmalert     │ │   AlertManager   │
│  (dashboards)   │ │  (rule engine)  │ │ (alert routing)  │
└─────────────────┘ └────────┬────────┘ └────────┬─────────┘
                             └─────────┬─────────┘
                                       ▼
                         ┌──────────────────────┐
                         │ alertmanager-gotify  │
                         │ (webhook converter)  │
                         └──────────┬───────────┘
                                    ▼
                         ┌──────────────────────┐
                         │        Gotify        │
                         │ (push notifications) │
                         └──────────────────────┘
```

Components

Victoria Metrics

Purpose: Prometheus-compatible metrics storage

Why Victoria Metrics over Prometheus:

  • 50% less RAM usage
  • 7× better compression
  • Faster queries
  • 100% PromQL compatible

Components:

  • vmagent: Scrapes metrics from exporters
  • vmstorage: Time-series database
  • vmselect: Query engine
  • vmalert: Alert rule evaluation

Access:

```shell
# vmselect (query interface)
http://vmselect.victoria-metrics.svc.cluster.local:8481

# Example query
curl 'http://vmselect.victoria-metrics.svc.cluster.local:8481/select/0/prometheus/api/v1/query?query=up'
```
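Because the query endpoint speaks the standard Prometheus HTTP API, responses can be consumed with any Prometheus-aware tooling. As a minimal sketch, here is how an instant-vector JSON response is shaped and parsed; the sample payload is illustrative, not captured from this cluster:

```python
# Parse the JSON returned by the Prometheus-compatible /api/v1/query
# endpoint (VictoriaMetrics implements the same response format).

def parse_instant_vector(payload: dict) -> dict:
    """Map each series' label set to its float sample value."""
    if payload.get("status") != "success":
        raise ValueError(f"query failed: {payload}")
    out = {}
    for series in payload["data"]["result"]:
        labels = tuple(sorted(series["metric"].items()))
        ts, value = series["value"]  # [unix_timestamp, "value-as-string"]
        out[labels] = float(value)
    return out

# Illustrative response for the query `up`
sample = {
    "status": "success",
    "data": {
        "resultType": "vector",
        "result": [
            {"metric": {"job": "node-exporter", "instance": "blade001:9100"},
             "value": [1700000000, "1"]},
        ],
    },
}

print(parse_instant_vector(sample))
```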

Grafana

Purpose: Metrics visualization and dashboards

Access:

```
URL: https://grafana.homelab.int.zengarden.space
Username: admin
Password: <from victoria-metrics/env.yaml>
```

Pre-installed Dashboards:

  • Node Exporter Full (system metrics)
  • Kubernetes Cluster Monitoring
  • Kubernetes Pods Monitoring
  • Kubernetes Persistent Volumes
  • Victoria Metrics Dashboard
  • local-path Cluster Monitoring

Grafana Alert Operator

Purpose: Kubernetes operator for managing Grafana alerting resources declaratively

Why Grafana Alert Operator:

  • GitOps-native: Define alerts, notification policies, and mute timings as Kubernetes CRDs
  • Declarative: Alerts live in Git alongside application manifests
  • Automated: Reconciles Grafana configuration via HTTP API
  • Version controlled: Alert changes tracked in Git history

Architecture:

```
Kubernetes CRDs (GrafanaAlertRule, etc.)
        │
        ▼
Grafana Alert Operator (shell-operator + Python)
  ├─> Reads CRD changes via Kubernetes API
  ├─> Reconciles via Grafana Provisioning HTTP API
  └─> Updates CRD status
        │
        ▼
Grafana Alert Manager (built-in)
```

Custom Resource Definitions:

  1. GrafanaAlertRule: Define Prometheus-style alert rules
  2. GrafanaNotificationPolicy: Route alerts to specific receivers
  3. GrafanaMuteTiming: Schedule alert muting windows
  4. GrafanaNotificationTemplate: Custom notification templates

Example Alert Rule:

```yaml
apiVersion: monitoring.zengarden.space/v1
kind: GrafanaAlertRule
metadata:
  name: high-cpu-usage
  namespace: monitoring
spec:
  grafanaRef:
    name: grafana-service-account
    namespace: victoria-metrics
  folderUID: homelab-alerts
  ruleGroup: infrastructure
  title: High CPU Usage
  condition: B
  noDataState: NoData
  execErrState: Alerting
  for: 5m
  annotations:
    summary: "High CPU usage on {{ $labels.instance }}"
    description: "CPU usage is {{ $value }}% for more than 5 minutes."
  labels:
    severity: warning
  data:
    - refId: A
      queryType: prometheus
      datasourceUid: victoria-metrics
      expr: |
        100 - (avg by (instance) (irate(node_cpu_seconds_total{mode="idle"}[5m])) * 100)
      intervalMs: 1000
      maxDataPoints: 43200
    - refId: B
      queryType: math
      datasourceUid: __expr__
      expression: $A > 85
      reducer: last
```

Example Notification Policy:

```yaml
apiVersion: monitoring.zengarden.space/v1
kind: GrafanaNotificationPolicy
metadata:
  name: default-routing
  namespace: monitoring
spec:
  grafanaRef:
    name: grafana-service-account
    namespace: victoria-metrics
  receiver: gotify-notifications
  groupBy:
    - alertname
    - cluster
  groupWait: 30s
  groupInterval: 5m
  repeatInterval: 4h
  routes:
    - receiver: gotify-critical
      matchers:
        - severity = critical
      groupWait: 10s
      repeatInterval: 1h
    - receiver: gotify-warnings
      matchers:
        - severity = warning
      repeatInterval: 4h
```

Token Provisioning:

The operator uses a Grafana service account token for API authentication. A separate grafana-service-account chart provisions the token:

```yaml
# Deployed automatically as part of Victoria Metrics stack
apiVersion: batch/v1
kind: Job
metadata:
  name: grafana-service-account-provision
  namespace: victoria-metrics
  annotations:
    helm.sh/hook: post-install,post-upgrade
```

The job creates a service account in Grafana and stores the token in a Kubernetes Secret (grafana-service-account-token), which the operator references.
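As a sketch of how that Secret is consumed: Kubernetes stores Secret values base64-encoded, so a reconciler decodes the token and sends it as a Bearer token to Grafana's HTTP API. The Secret key name (`token`) and the sample value below are assumptions based on the chart described above:

```python
import base64

def bearer_header_from_secret(secret_data: dict, key: str = "token") -> dict:
    """Decode a base64-encoded Secret value and build the Authorization
    header Grafana's HTTP API expects."""
    token = base64.b64decode(secret_data[key]).decode()
    return {"Authorization": f"Bearer {token}"}

# e.g. the .data field of:
#   kubectl -n victoria-metrics get secret grafana-service-account-token -o json
# (sample token here is made up)
secret_data = {"token": base64.b64encode(b"glsa_example_token").decode()}
headers = bearer_header_from_secret(secret_data)
print(headers)

# The operator would then call Grafana's provisioning API with these
# headers, e.g. (hypothetical endpoint usage, requires the requests lib):
# requests.get("http://grafana/api/v1/provisioning/alert-rules", headers=headers)
```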

Accessing Grafana Alerts:

URL: https://grafana.homelab.int.zengarden.space/alerting

Operator Logs:

```shell
# Shell-operator container (watches Kubernetes events)
kubectl -n victoria-metrics logs -l app.kubernetes.io/name=grafana-alert-operator -c shell-operator

# Handler service container (reconciles Grafana state)
kubectl -n victoria-metrics logs -l app.kubernetes.io/name=grafana-alert-operator -c handler-service
```

Benefits over Manual Configuration:

  • GitOps workflow: Alerts reviewed via pull requests
  • Namespace isolation: Teams manage their own alerts
  • Consistency: Alerts defined using familiar Kubernetes patterns
  • Auditability: Full change history in Git

AlertManager

Purpose: Alert routing, grouping, and deduplication

Note: For Grafana-managed alerts (defined via GrafanaAlertRule CRDs), use Grafana’s built-in AlertManager. The standalone AlertManager below is used for vmalert rules.

Access:

```
URL: https://alerts.homelab.int.zengarden.space
# No authentication by default (internal only)
```

Configuration:

```yaml
route:
  receiver: gotify-notifications
  group_by: [alertname, cluster, namespace]
  group_wait: 30s
  group_interval: 5m
  repeat_interval: 4h
receivers:
  - name: gotify-notifications
    webhook_configs:
      - url: http://alertmanager-gotify-nodejs.victoria-metrics.svc:3000/webhook
        send_resolved: true
  - name: devnull  # For silenced alerts
```

Gotify

Purpose: Push notifications to mobile/web

Access:

```
URL: https://notifications.zengarden.space
Username: admin
Password: <from victoria-metrics/env.yaml>
```

Mobile App: See the Gotify Mobile App instructions under Notification Setup below.

Key Metrics

Cluster Health Metrics

| Metric | Description | Query |
|--------|-------------|-------|
| Node Status | Number of Ready nodes | `kube_node_status_condition{condition="Ready",status="true"}` |
| Pod Status | Pods in Running state | `kube_pod_status_phase{phase="Running"}` |
| Pod Restarts | Container restart count | `rate(kube_pod_container_status_restarts_total[5m])` |
| API Server Errors | API server error rate | `rate(apiserver_request_total{code=~"5.."}[5m])` |
| etcd Health | etcd member health | `etcd_server_has_leader` |

Resource Utilization Metrics

| Metric | Description | Query |
|--------|-------------|-------|
| CPU Usage | Node CPU utilization | `100 - (avg(irate(node_cpu_seconds_total{mode="idle"}[5m])) * 100)` |
| RAM Usage | Node memory utilization | `(1 - (node_memory_MemAvailable_bytes / node_memory_MemTotal_bytes)) * 100` |
| Disk Usage | Filesystem usage | `100 - ((node_filesystem_avail_bytes / node_filesystem_size_bytes) * 100)` |
| Network TX | Network transmit bytes | `rate(node_network_transmit_bytes_total[5m])` |
| Network RX | Network receive bytes | `rate(node_network_receive_bytes_total[5m])` |
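As a worked example of the RAM and Disk formulas above, the same arithmetic in plain Python (byte counts are made up for illustration):

```python
# Memory utilization: (1 - MemAvailable / MemTotal) * 100
def memory_used_percent(mem_available_bytes: int, mem_total_bytes: int) -> float:
    return (1 - mem_available_bytes / mem_total_bytes) * 100

# Disk utilization, same shape as the Disk Usage query:
# 100 - (avail / size) * 100
def disk_used_percent(avail_bytes: int, size_bytes: int) -> float:
    return 100 - (avail_bytes / size_bytes) * 100

GiB = 1024 ** 3
print(memory_used_percent(4 * GiB, 16 * GiB))   # 75.0
print(disk_used_percent(30 * GiB, 100 * GiB))   # 70.0
```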

Application Metrics

| Metric | Description | Query |
|--------|-------------|-------|
| HTTP Requests | Request rate | `rate(nginx_ingress_controller_requests[5m])` |
| HTTP Errors | 5xx error rate | `rate(nginx_ingress_controller_requests{status=~"5.."}[5m])` |
| Response Time | p95 latency | `histogram_quantile(0.95, rate(nginx_ingress_controller_request_duration_seconds_bucket[5m]))` |
| ArgoCD Sync | Out-of-sync apps | `argocd_app_info{sync_status!="Synced"}` |
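To make the p95 query above less magical, this sketch shows what `histogram_quantile` does with cumulative buckets: it finds the first bucket whose cumulative count reaches the target rank, then interpolates linearly inside it. Bucket data is illustrative:

```python
def histogram_quantile(q: float, buckets: list[tuple[float, float]]) -> float:
    """buckets: sorted (upper_bound_le, cumulative_count) pairs,
    ending with the +Inf bucket, as Prometheus histograms expose them."""
    total = buckets[-1][1]
    rank = q * total
    lower_bound, lower_count = 0.0, 0.0
    for le, count in buckets:
        if count >= rank:
            if le == float("inf"):
                return lower_bound  # rank falls in the open +Inf bucket
            # Linear interpolation within the bucket
            fraction = (rank - lower_count) / (count - lower_count)
            return lower_bound + fraction * (le - lower_bound)
        lower_bound, lower_count = le, count
    return lower_bound

# 100 requests: 50 under 0.1s, 90 under 0.5s, all under 1s
buckets = [(0.1, 50), (0.5, 90), (1.0, 100), (float("inf"), 100)]
print(histogram_quantile(0.95, buckets))  # 0.75: rank 95 falls in the (0.5, 1.0] bucket
```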

Storage Metrics (local-path)

| Metric | Description | Query |
|--------|-------------|-------|
| Cluster Health | local-path health status | `ceph_health_status` (0=OK, 1=WARN, 2=ERR) |
| OSD Status | OSDs up/down | `ceph_osd_up`, `ceph_osd_in` |
| Storage Used | Used capacity | `ceph_cluster_total_used_bytes / ceph_cluster_total_bytes * 100` |
| IOPS | Read/write ops | `rate(ceph_pool_wr[5m])`, `rate(ceph_pool_rd[5m])` |

Dashboards

Node Exporter Full Dashboard

Purpose: System-level metrics per node

Panels:

  • CPU usage (user, system, iowait)
  • Memory usage (used, cached, buffers)
  • Disk I/O (read/write bytes, IOPS)
  • Network I/O (TX/RX bytes, errors)
  • Filesystem usage
  • System load (1m, 5m, 15m)
  • Context switches, interrupts
  • NVMe temperature (if available)

How to Access:

  1. Grafana → Dashboards → Browse
  2. Search: “Node Exporter Full”
  3. Select node from dropdown

Kubernetes Cluster Monitoring Dashboard

Purpose: Cluster-wide Kubernetes metrics

Panels:

  • Node status (Ready/NotReady)
  • Pod count (Running/Pending/Failed)
  • CPU/RAM requests vs limits vs usage
  • Namespace resource usage
  • Persistent volume claims
  • Top pods by CPU/RAM
  • Network traffic by namespace

Kubernetes Pods Monitoring Dashboard

Purpose: Per-pod metrics

Panels:

  • Pod CPU usage
  • Pod memory usage
  • Pod network I/O
  • Container restarts
  • Pod phase (Running/Pending/Failed)
  • Resource requests vs actual usage

Variables:

  • Namespace (dropdown)
  • Pod (dropdown)

Victoria Metrics Dashboard

Purpose: Monitor Victoria Metrics itself

Panels:

  • Ingestion rate (samples/sec)
  • Query rate (queries/sec)
  • Storage size
  • Memory usage
  • CPU usage
  • Slow queries

Custom Dashboard Example

Creating a custom dashboard:

  1. Grafana → Dashboards → New Dashboard
  2. Add Panel
  3. Select Data Source: VictoriaMetrics
  4. Enter PromQL query
  5. Configure visualization (graph, gauge, stat)
  6. Save dashboard

Example Panel: Application Response Time

```
histogram_quantile(0.95, rate(
  nginx_ingress_controller_request_duration_seconds_bucket{
    exported_namespace="my-app"
  }[5m]
))
```

Panel Type: Time series graph
Unit: seconds
Legend: p95 response time

Alerts

Alert Rules

Location: Victoria Metrics vmalert configuration

Example Alert Rules:

```yaml
groups:
  - name: cluster-health
    interval: 30s
    rules:
      - alert: NodeDown
        expr: up{job="node-exporter"} == 0
        for: 5m
        labels:
          severity: critical
        annotations:
          summary: "Node {{ $labels.instance }} is down"
          description: "Node has been unreachable for more than 5 minutes."

      - alert: PodCrashLooping
        expr: rate(kube_pod_container_status_restarts_total[15m]) > 0
        for: 5m
        labels:
          severity: warning
        annotations:
          summary: "Pod {{ $labels.namespace }}/{{ $labels.pod }} is crash looping"
          description: "Pod has restarted {{ $value }} times in the last 15 minutes."

      - alert: HighCPUUsage
        expr: |
          100 - (avg by (instance) (irate(node_cpu_seconds_total{mode="idle"}[5m])) * 100) > 85
        for: 10m
        labels:
          severity: warning
        annotations:
          summary: "High CPU usage on {{ $labels.instance }}"
          description: "CPU usage is {{ $value }}% for more than 10 minutes."

      - alert: HighMemoryUsage
        expr: |
          (1 - (node_memory_MemAvailable_bytes / node_memory_MemTotal_bytes)) * 100 > 90
        for: 10m
        labels:
          severity: warning
        annotations:
          summary: "High memory usage on {{ $labels.instance }}"
          description: "Memory usage is {{ $value }}% for more than 10 minutes."

      - alert: DiskSpaceLow
        expr: |
          100 - ((node_filesystem_avail_bytes{mountpoint="/"} / node_filesystem_size_bytes{mountpoint="/"}) * 100) > 85
        for: 5m
        labels:
          severity: warning
        annotations:
          summary: "Low disk space on {{ $labels.instance }}"
          description: "Disk usage is {{ $value }}% on root filesystem."

      - alert: CertificateExpiringSoon
        expr: |
          (cert_manager_certificate_expiration_timestamp_seconds - time()) / 86400 < 30
        for: 1h
        labels:
          severity: warning
        annotations:
          summary: "Certificate {{ $labels.namespace }}/{{ $labels.name }} expiring soon"
          description: "Certificate expires in {{ $value }} days."

      - alert: ArgocdAppOutOfSync
        expr: argocd_app_info{sync_status!="Synced"} > 0
        for: 15m
        labels:
          severity: warning
        annotations:
          summary: "ArgoCD application {{ $labels.name }} out of sync"
          description: "Application has been out of sync for more than 15 minutes."

      # Alert names must match [a-zA-Z_:][a-zA-Z0-9_:]*, so no hyphens
      - alert: LocalPathHealthError
        expr: ceph_health_status == 2
        for: 5m
        labels:
          severity: critical
        annotations:
          summary: "local-path cluster health is ERROR"
          description: "local-path cluster health has been in ERROR state for 5 minutes."
```

Alert Severity Levels

| Severity | Description | Response Time | Examples |
|----------|-------------|---------------|----------|
| Critical | Service outage or data loss | Immediate (wake up) | Node down, etcd quorum lost, local-path health ERR |
| Warning | Degraded performance or upcoming issue | Within 1 hour | High CPU, certificate expiring soon, pod crash looping |
| Info | Informational, no action required | Review during next maintenance | Deployment updated, backup completed |

Alert Routing

AlertManager routes alerts based on severity:

```yaml
route:
  receiver: gotify-notifications
  routes:
    - match:
        severity: critical
      receiver: gotify-notifications
      continue: true
    - match:
        severity: warning
      receiver: gotify-notifications
      continue: true
    - match:
        alertname: InfoInhibitor
      receiver: devnull
    - match:
        alertname: Watchdog
      receiver: devnull
```

Gotify Integration:

  • Critical alerts: High priority (red notification)
  • Warning alerts: Medium priority (yellow notification)
  • Info alerts: Low priority (blue notification)
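The severity-to-priority translation above can be sketched as follows. The numeric priorities are assumptions (Gotify treats higher numbers as more intrusive), not values read from the alertmanager-gotify-nodejs source:

```python
# Hypothetical severity -> Gotify priority mapping; the numbers 8/5/2
# are assumed for illustration, not the converter's actual values.
PRIORITY = {"critical": 8, "warning": 5, "info": 2}

def to_gotify_message(alert: dict) -> dict:
    """Turn one AlertManager webhook alert into a Gotify message payload."""
    labels = alert.get("labels", {})
    annotations = alert.get("annotations", {})
    return {
        "title": f'[{alert.get("status", "firing").upper()}] {labels.get("alertname", "unknown")}',
        "message": annotations.get("description", annotations.get("summary", "")),
        "priority": PRIORITY.get(labels.get("severity", "info"), 2),
    }

alert = {
    "status": "firing",
    "labels": {"alertname": "NodeDown", "severity": "critical"},
    "annotations": {"summary": "Node blade001 is down"},
}
print(to_gotify_message(alert))
```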

Notification Setup

Gotify Mobile App

Android:

  1. Install Gotify app from Google Play Store
  2. Open app → Settings → Add Server
  3. URL: https://notifications.zengarden.space
  4. Username: admin
  5. Password: <from victoria-metrics/env.yaml>
  6. Save
  7. Create Application: “homelab-alerts”
  8. Copy token
  9. Update AlertManager configuration with token

Web UI:

  1. Navigate to https://notifications.zengarden.space
  2. Login: admin / <password>
  3. View alerts in real-time

Custom Metrics

Exposing Custom Metrics

For applications to expose metrics:

  1. Implement /metrics endpoint

    • Use Prometheus client library
    • Expose on port (e.g., 8080/metrics)
  2. Add ServiceMonitor CRD

    apiVersion: monitoring.coreos.com/v1
    kind: ServiceMonitor
    metadata:
      name: my-app
      namespace: default
    spec:
      selector:
        matchLabels:
          app: my-app
      endpoints:
        - port: metrics
          path: /metrics
          interval: 30s
  3. Verify scraping

    # Check vmagent targets
    curl http://vmagent.victoria-metrics.svc:8429/targets

Example: Python Application Metrics

```python
from prometheus_client import Counter, Histogram, generate_latest
from flask import Flask, Response

app = Flask(__name__)

# Define metrics
request_count = Counter('app_requests_total', 'Total requests',
                        ['method', 'endpoint'])
request_duration = Histogram('app_request_duration_seconds', 'Request duration',
                             ['method', 'endpoint'])

@app.route('/metrics')
def metrics():
    return Response(generate_latest(), mimetype='text/plain')

@app.route('/api/data')
@request_duration.labels(method='GET', endpoint='/api/data').time()
def get_data():
    request_count.labels(method='GET', endpoint='/api/data').inc()
    return {'data': 'example'}
```

Log Aggregation

Loki + Promtail

Purpose: Centralized log aggregation and querying

Architecture:

```
┌─────────────────────────────────────────────────────────────┐
│                       Log Collection                        │
│                                                             │
│  ┌─────────────┐   ┌──────────────┐   ┌──────────────┐      │
│  │  Promtail   │   │   Promtail   │   │   Promtail   │      │
│  │  (node 1)   │   │   (node 2)   │   │   (node 3)   │      │
│  │  DaemonSet  │   │  DaemonSet   │   │  DaemonSet   │      │
│  └──────┬──────┘   └──────┬───────┘   └──────┬───────┘      │
│         │   Scrapes logs from /var/log/pods/ │              │
│         └─────────────────┼──────────────────┘              │
│                           ▼                                 │
│               ┌──────────────────────┐                      │
│               │     Loki Gateway     │                      │
│               └──────────┬───────────┘                      │
│                          ▼                                  │
│               ┌──────────────────────┐                      │
│               │ Loki (SingleBinary)  │                      │
│               │    (log storage)     │                      │
│               └──────────┬───────────┘                      │
└──────────────────────────┼──────────────────────────────────┘
                           ▼
                 ┌──────────────────┐
                 │     Grafana      │
                 │ (log exploration)│
                 └──────────────────┘
```

Components:

Loki:

  • Log aggregation system inspired by Prometheus
  • Label-based indexing (not full-text)
  • Filesystem storage (10Gi PVC)
  • SingleBinary deployment mode (simplified architecture)
  • TSDB schema for efficient storage

Promtail:

  • Log shipping agent (DaemonSet on each node)
  • Scrapes container logs from /var/log/pods/
  • Kubernetes service discovery
  • Automatic label extraction (namespace, pod, container, app)
  • CRI log format parsing

Access:

```
# Grafana with Loki datasource
URL: https://grafana.homelab.int.zengarden.space
Data Source: Loki (preconfigured)
```

Query Examples:

```
# View all logs from a namespace
{namespace="argocd"}

# Filter by pod name
{namespace="argocd", pod=~"argocd-server.*"}

# Search for error messages
{namespace="argocd"} |= "error"

# Count errors per pod
sum by (pod) (count_over_time({namespace="argocd"} |= "error" [5m]))

# Logs from specific container
{namespace="victoria-metrics", container="loki"}

# Multiple filters
{namespace="gitea", app="gitea"} |= "authentication" != "success"
```

LogQL Syntax:

  • {label="value"} - Label selector
  • |= - Line contains string
  • != - Line does not contain
  • |~ "regex" - Regex match
  • !~ "regex" - Regex not match
  • | json - Parse JSON logs
  • | logfmt - Parse logfmt logs
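The line filters above behave like chained greps. As an illustration (plain Python standing in for LogQL semantics, not Loki's actual implementation):

```python
# |= keeps lines containing a substring, != drops them,
# |~ keeps lines matching a regex.
import re

def logql_filter(lines, contains=None, not_contains=None, regex=None):
    """Apply LogQL-style line filters in order, like chained greps."""
    for line in lines:
        if contains is not None and contains not in line:
            continue
        if not_contains is not None and not_contains in line:
            continue
        if regex is not None and not re.search(regex, line):
            continue
        yield line

logs = [
    "level=info msg=ok",
    "level=error msg=authentication failed",
    "level=error msg=timeout",
]

# Equivalent to: {...} |= "error" != "timeout"
print(list(logql_filter(logs, contains="error", not_contains="timeout")))
# ['level=error msg=authentication failed']
```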

Grafana Explore:

  1. Navigate to Explore in Grafana
  2. Select Loki data source
  3. Use Log browser to select labels
  4. Or write LogQL queries directly
  5. View logs in table or stream format
  6. Inspect individual log lines

Benefits:

  • Lightweight: Lower resource usage than ELK stack
  • Grafana integration: Unified metrics + logs UI
  • Label-based: Similar query language to PromQL
  • Cost-effective: Efficient storage (no full-text indexing)

Limitations:

  • Not full-text search: Must use labels for filtering (use |= for grep-like search)
  • Label cardinality: Avoid high-cardinality labels (e.g., don’t use log level as label)
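The cardinality warning is easy to quantify: every unique label combination becomes its own Loki stream, with its own index entry and chunks, so a rough upper bound on stream count is the product of each label's distinct values:

```python
from math import prod

def max_streams(label_cardinalities: dict[str, int]) -> int:
    """Upper bound on Loki streams: product of per-label cardinalities."""
    return prod(label_cardinalities.values())

# Reasonable labels: bounded sets like namespace/pod/container
print(max_streams({"namespace": 20, "pod": 200, "container": 2}))  # 8000

# Adding a per-request label like trace_id explodes the stream count
print(max_streams({"namespace": 20, "pod": 200, "container": 2,
                   "trace_id": 1_000_000}))  # 8000000000
```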

Troubleshooting Monitoring

No Metrics in Grafana

Check vmagent scraping:

```shell
kubectl -n victoria-metrics logs -l app.kubernetes.io/name=vmagent --tail=50

# Check targets (the /api/v1/targets endpoint returns JSON)
kubectl -n victoria-metrics port-forward svc/vmagent 8429:8429
curl http://localhost:8429/api/v1/targets | jq
```

Check Victoria Metrics storage:

```shell
kubectl -n victoria-metrics logs -l app.kubernetes.io/name=vmstorage --tail=50
```

Alerts Not Firing

Check vmalert:

```shell
kubectl -n victoria-metrics logs -l app.kubernetes.io/name=vmalert --tail=50

# Check rules loaded
kubectl -n victoria-metrics port-forward svc/vmalert 8880:8880
curl http://localhost:8880/api/v1/rules | jq
```

Check AlertManager:

```shell
kubectl -n victoria-metrics logs -l app.kubernetes.io/name=alertmanager --tail=50

# Check alerts
curl https://alerts.homelab.int.zengarden.space/api/v2/alerts
```

Gotify Not Receiving Alerts

Check alertmanager-gotify-nodejs:

```shell
kubectl -n victoria-metrics logs -l app=alertmanager-gotify-nodejs --tail=50
```

Verify webhook URL:

```shell
kubectl -n victoria-metrics get configmap alertmanager -o yaml | grep gotify
```

Test webhook manually:

```shell
curl -X POST http://alertmanager-gotify-nodejs.victoria-metrics.svc:3000/webhook \
  -H "Content-Type: application/json" \
  -d '{
    "alerts": [{
      "status": "firing",
      "labels": {"alertname": "TestAlert", "severity": "warning"},
      "annotations": {"summary": "Test alert", "description": "This is a test"}
    }]
  }'
```

Security Auditing

Automated Security Audit Script

Purpose: Automated security breach detection and system integrity checks

Location: system/ansible/install-security-audit/

Deployment: Ansible playbook deploys to ~/bin/security-audit.sh on all blade nodes

Installation:

```shell
cd system/ansible/install-security-audit
./install.sh
```

This will deploy the script to all blade nodes (blade001-blade005).

What it checks:

  1. Authentication logs - Failed logins, invalid user attempts
  2. User accounts - New users, UID 0 users, shell access
  3. SSH keys - Authorized keys for all users
  4. Network activity - Listening ports, external connections
  5. Running processes - Suspicious patterns, /tmp execution
  6. Scheduled tasks - Root and user crontabs
  7. File integrity - System file modifications, SUID/SGID files
  8. System logs - Security keywords (breach, attack, exploit, malware)
  9. System resources - Disk, memory, CPU usage
  10. Restrictive proxy - Unauthorized access attempts (custom homelab security)

Usage:

```shell
# Run basic security audit
ssh blade001 '~/bin/security-audit.sh'

# Run with verbose output
ssh blade001 '~/bin/security-audit.sh --verbose'

# Get JSON output for parsing
ssh blade001 '~/bin/security-audit.sh --json'
```

Run on all nodes:

```shell
for blade in blade001 blade002 blade003 blade004 blade005; do
  echo "=== $blade ==="
  ssh $blade '~/bin/security-audit.sh | tail -15'
done
```

Exit codes:

  • 0 - System secure, no issues detected
  • 1 - Warnings detected, review recommended
  • 2 - Critical issues detected, immediate investigation required

Example output:

```
==========================================
Homelab Security Audit
blade001 - 2025-10-28 12:48:07
==========================================

========================================
1. Authentication Analysis
========================================
[✓] No failed login attempts in last 24 hours

========================================
2. User Account Analysis
========================================
[✓] No recently added users
[✓] Only root has UID 0

========================================
Security Audit Summary
========================================
Hostname:  blade001
Timestamp: 2025-10-28 12:48:07
Status:    SECURE

Critical Issues: 0
Warnings:        0

✓ System appears secure. No security breaches detected.
```

Automated scheduling:

To run security audits automatically, add to cron:

```shell
# Add to root crontab on each blade
# Run security audit daily at 3 AM
0 3 * * * /home/oleksiyp/bin/security-audit.sh --json > /tmp/security-audit-$(date +\%Y\%m\%d).json
```

Integration with monitoring:

You can export security audit results to Victoria Metrics using a custom exporter or parse JSON output with a script:

```shell
#!/bin/bash
# security-audit-exporter.sh
# Parse security audit JSON and expose as metrics

RESULT=$(~/bin/security-audit.sh --json)
ISSUES=$(echo "$RESULT" | jq -r '.issues')
WARNINGS=$(echo "$RESULT" | jq -r '.warnings')
STATUS=$(echo "$RESULT" | jq -r '.status')

# Expose as Prometheus metrics
cat <<EOF
# HELP security_audit_issues Number of critical security issues detected
# TYPE security_audit_issues gauge
security_audit_issues{hostname="$(hostname)"} $ISSUES
# HELP security_audit_warnings Number of security warnings detected
# TYPE security_audit_warnings gauge
security_audit_warnings{hostname="$(hostname)"} $WARNINGS
# HELP security_audit_status Security audit status (0=SECURE, 1=COMPROMISED)
# TYPE security_audit_status gauge
security_audit_status{hostname="$(hostname)"} $([ "$STATUS" == "SECURE" ] && echo 0 || echo 1)
EOF
```

Security best practices:

  • Run audits daily on all nodes
  • Review warnings promptly (within 24 hours)
  • Investigate critical issues immediately
  • Keep audit logs for compliance/forensics
  • Update the script as new threats emerge

Best Practices

Dashboard Design

  1. Use template variables for namespace, pod, node selection
  2. Show rates, not absolute counters (use rate() or irate())
  3. Set appropriate time ranges (5m for real-time, 24h for trends)
  4. Use percentiles for latency (p50, p95, p99)
  5. Set thresholds and alerts on panels
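Point 2 deserves a sketch: raw counters only ever climb (and drop to zero on restart), so dashboards graph the per-second rate instead. A simplified version of what `rate()` computes, including Prometheus-style counter-reset handling:

```python
def simple_rate(samples: list[tuple[float, float]]) -> float:
    """samples: (unix_timestamp, counter_value) pairs, oldest first.
    Returns the per-second increase over the window; a decrease is
    treated as a counter reset (the counter restarted from zero)."""
    increase = 0.0
    for (_, prev), (_, cur) in zip(samples, samples[1:]):
        # On reset, count the full post-restart value as new increase
        increase += cur - prev if cur >= prev else cur
    window = samples[-1][0] - samples[0][0]
    return increase / window

# 60s window; the counter resets between t=30 and t=45
samples = [(0, 100), (15, 160), (30, 220), (45, 40), (60, 100)]
print(simple_rate(samples))  # (60 + 60 + 40 + 60) / 60 ≈ 3.667 req/s
```

(Actual `rate()` also extrapolates to the window boundaries; this sketch skips that refinement.)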

Alert Design

  1. Alert on symptoms, not causes (e.g., “service down” not “pod restarting”)
  2. Set appropriate for duration (avoid flapping)
  3. Include actionable annotations (what to do, where to look)
  4. Group related alerts (avoid alert storms)
  5. Test alerts (manually trigger to verify)

Metric Naming

Follow Prometheus naming conventions:

  • Counters: `*_total` suffix (e.g., `requests_total`)
  • Gauges: No suffix (e.g., `memory_usage_bytes`)
  • Histograms: `*_bucket`, `*_sum`, `*_count` (e.g., `request_duration_seconds_bucket`)
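These conventions can be checked mechanically. A small sketch using Prometheus' metric-name grammar (`[a-zA-Z_:][a-zA-Z0-9_:]*`) plus the suffix rules above:

```python
import re

# Valid Prometheus metric name characters
METRIC_NAME = re.compile(r"^[a-zA-Z_:][a-zA-Z0-9_:]*$")

def check_name(name: str, kind: str) -> list[str]:
    """Return a list of convention violations for a metric name."""
    problems = []
    if not METRIC_NAME.match(name):
        problems.append("invalid characters for a Prometheus metric name")
    if kind == "counter" and not name.endswith("_total"):
        problems.append("counters should end in _total")
    if kind == "gauge" and name.endswith("_total"):
        problems.append("gauges should not use the _total suffix")
    return problems

print(check_name("app_requests_total", "counter"))  # [] -- conforms
print(check_name("requests", "counter"))            # missing _total suffix
print(check_name("memory-usage-bytes", "gauge"))    # hyphens are invalid
```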

Next Steps

  • Configure custom dashboards for your applications
  • Set up additional alerts for specific use cases
  • Integrate with log aggregation (Loki) for complete observability
  • Review Maintenance for regular monitoring tasks

Comprehensive monitoring enables proactive issue detection and fast troubleshooting.