Monitoring
Observability Stack
This homelab implements comprehensive monitoring to ensure system health, detect anomalies, and enable quick troubleshooting.
Architecture
┌────────────────────────────────────────────────────────────┐
│ Metrics Collection │
│ │
│ ┌─────────────┐ ┌──────────────┐ ┌──────────────┐ │
│ │ Node │ │ Kubernetes │ │ Application │ │
│ │ Exporter │ │ Metrics │ │ Metrics │ │
│ │ (per node) │ │ (API server) │ │ (custom) │ │
│ └──────┬──────┘ └──────┬───────┘ └──────┬───────┘ │
│ │ │ │ │
│ └──────────────────┴───────────────────┘ │
│ │ │
│ ▼ │
│ ┌──────────────────────┐ │
│ │ vmagent (scraper) │ │
│ └──────────┬───────────┘ │
│ │ │
│ ▼ │
│ ┌──────────────────────┐ │
│ │ Victoria Metrics DB │ │
│ │ (time-series storage)│ │
│ └──────────┬───────────┘ │
└────────────────────────────┼───────────────────────────────┘
│
┌───────────────────┼───────────────────┐
│ │ │
▼ ▼ ▼
┌─────────────────┐ ┌─────────────────┐ ┌──────────────────┐
│ Grafana │ │ vmalert │ │ AlertManager │
│ (dashboards) │ │ (rule engine) │ │ (alert routing) │
└─────────────────┘ └────────┬────────┘ └────────┬─────────┘
│ │
└───────┬───────────┘
▼
┌──────────────────────┐
│ alertmanager-gotify │
│ (webhook converter) │
└──────────┬───────────┘
▼
┌──────────────────────┐
│ Gotify │
│ (push notifications) │
└──────────────────────┘

Components
Victoria Metrics
Purpose: Prometheus-compatible metrics storage
Why Victoria Metrics over Prometheus:
- 50% less RAM usage
- 7× better compression
- Faster queries
- 100% PromQL compatible
Components:
- vmagent: Scrapes metrics from exporters
- vmstorage: Time-series database
- vmselect: Query engine
- vmalert: Alert rule evaluation
Access:
# VMSelect (query interface)
http://vmselect.victoria-metrics.svc.cluster.local:8481
# Example query
curl 'http://vmselect.victoria-metrics.svc.cluster.local:8481/select/0/prometheus/api/v1/query?query=up'

Grafana
Purpose: Metrics visualization and dashboards
Access:
URL: https://grafana.homelab.int.zengarden.space
Username: admin
Password: <from victoria-metrics/env.yaml>

Pre-installed Dashboards:
- Node Exporter Full (system metrics)
- Kubernetes Cluster Monitoring
- Kubernetes Pods Monitoring
- Kubernetes Persistent Volumes
- Victoria Metrics Dashboard
- local-path Cluster Monitoring
Grafana Alert Operator
Purpose: Kubernetes operator for managing Grafana alerting resources declaratively
Why Grafana Alert Operator:
- GitOps-native: Define alerts, notification policies, and mute timings as Kubernetes CRDs
- Declarative: Alerts live in Git alongside application manifests
- Automated: Reconciles Grafana configuration via HTTP API
- Version controlled: Alert changes tracked in Git history
Architecture:
Kubernetes CRDs (GrafanaAlertRule, etc.)
│
▼
Grafana Alert Operator (shell-operator + Python)
│
├─> Reads CRD changes via Kubernetes API
├─> Reconciles via Grafana Provisioning HTTP API
└─> Updates CRD status
│
▼
Grafana Alert Manager (built-in)

Custom Resource Definitions:
- GrafanaAlertRule: Define Prometheus-style alert rules
- GrafanaNotificationPolicy: Route alerts to specific receivers
- GrafanaMuteTiming: Schedule alert muting windows
- GrafanaNotificationTemplate: Custom notification templates
Example Alert Rule:
apiVersion: monitoring.zengarden.space/v1
kind: GrafanaAlertRule
metadata:
  name: high-cpu-usage
  namespace: monitoring
spec:
  grafanaRef:
    name: grafana-service-account
    namespace: victoria-metrics
  folderUID: homelab-alerts
  ruleGroup: infrastructure
  title: High CPU Usage
  condition: B
  noDataState: NoData
  execErrState: Alerting
  for: 5m
  annotations:
    summary: "High CPU usage on {{ $labels.instance }}"
    description: "CPU usage is {{ $value }}% for more than 5 minutes."
  labels:
    severity: warning
  data:
    - refId: A
      queryType: prometheus
      datasourceUid: victoria-metrics
      expr: |
        100 - (avg by (instance) (irate(node_cpu_seconds_total{mode="idle"}[5m])) * 100)
      intervalMs: 1000
      maxDataPoints: 43200
    - refId: B
      queryType: math
      datasourceUid: __expr__
      expression: $A > 85
      reducer: last

Example Notification Policy:
apiVersion: monitoring.zengarden.space/v1
kind: GrafanaNotificationPolicy
metadata:
  name: default-routing
  namespace: monitoring
spec:
  grafanaRef:
    name: grafana-service-account
    namespace: victoria-metrics
  receiver: gotify-notifications
  groupBy:
    - alertname
    - cluster
  groupWait: 30s
  groupInterval: 5m
  repeatInterval: 4h
  routes:
    - receiver: gotify-critical
      matchers:
        - severity = critical
      groupWait: 10s
      repeatInterval: 1h
    - receiver: gotify-warnings
      matchers:
        - severity = warning
      repeatInterval: 4h

Token Provisioning:
The operator uses a Grafana service account token for API authentication. A separate grafana-service-account chart provisions the token:
# Deployed automatically as part of Victoria Metrics stack
apiVersion: batch/v1
kind: Job
metadata:
  name: grafana-service-account-provision
  namespace: victoria-metrics
  annotations:
    helm.sh/hook: post-install,post-upgrade

The job creates a service account in Grafana and stores the token in a Kubernetes Secret (grafana-service-account-token), which the operator references.
Accessing Grafana Alerts:
URL: https://grafana.homelab.int.zengarden.space/alerting

Operator Logs:
# Shell-operator container (watches Kubernetes events)
kubectl -n victoria-metrics logs -l app.kubernetes.io/name=grafana-alert-operator -c shell-operator
# Handler service container (reconciles Grafana state)
kubectl -n victoria-metrics logs -l app.kubernetes.io/name=grafana-alert-operator -c handler-service

Benefits over Manual Configuration:
- GitOps workflow: Alerts reviewed via pull requests
- Namespace isolation: Teams manage their own alerts
- Consistency: Alerts defined using familiar Kubernetes patterns
- Auditability: Full change history in Git
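At its core, the operator's reconciliation amounts to diffing the desired state (CRDs in Git) against the actual state reported by Grafana's provisioning API. A minimal illustrative Python sketch of that diff step, with hypothetical rule dicts keyed by title (not the operator's actual code):

```python
def diff_rules(desired, actual):
    """Compare desired vs. actual alert rules, keyed by rule title.

    Returns (to_create, to_update, to_delete) — the three action lists
    a reconciler would apply against the Grafana provisioning API.
    """
    to_create = [t for t in desired if t not in actual]
    to_update = [t for t in desired if t in actual and desired[t] != actual[t]]
    to_delete = [t for t in actual if t not in desired]
    return to_create, to_update, to_delete

# Hypothetical states: Git declares two rules; Grafana has a stale one
# and an out-of-date copy of another.
desired = {"High CPU Usage": {"for": "5m"}, "Node Down": {"for": "5m"}}
actual = {"High CPU Usage": {"for": "10m"}, "Stale Alert": {"for": "1m"}}
print(diff_rules(desired, actual))
# (['Node Down'], ['High CPU Usage'], ['Stale Alert'])
```

The real operator additionally writes the outcome back into the CRD status, so `kubectl get` reflects whether reconciliation succeeded.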
AlertManager
Purpose: Alert routing, grouping, and deduplication
Note: For Grafana-managed alerts (defined via GrafanaAlertRule CRDs), use Grafana’s built-in AlertManager. The standalone AlertManager below is used for vmalert rules.
Access:
URL: https://alerts.homelab.int.zengarden.space
# No authentication by default (internal only)

Configuration:
route:
  receiver: gotify-notifications
  group_by: [alertname, cluster, namespace]
  group_wait: 30s
  group_interval: 5m
  repeat_interval: 4h
receivers:
  - name: gotify-notifications
    webhook_configs:
      - url: http://alertmanager-gotify-nodejs.victoria-metrics.svc:3000/webhook
        send_resolved: true
  - name: devnull # For silenced alerts

Gotify
Purpose: Push notifications to mobile/web
Access:
URL: https://notifications.zengarden.space
Username: admin
Password: <from victoria-metrics/env.yaml>

Mobile App:
- Android: Google Play Store
- iOS: Not officially available (use web UI)
Key Metrics
Cluster Health Metrics
| Metric | Description | Query |
|---|---|---|
| Node Status | Number of Ready nodes | kube_node_status_condition{condition="Ready",status="true"} |
| Pod Status | Pods in Running state | kube_pod_status_phase{phase="Running"} |
| Pod Restarts | Container restart count | rate(kube_pod_container_status_restarts_total[5m]) |
| API Server Errors | API server error rate | rate(apiserver_request_total{code=~"5.."}[5m]) |
| etcd Health | etcd member health | etcd_server_has_leader |
Resource Utilization Metrics
| Metric | Description | Query |
|---|---|---|
| CPU Usage | Node CPU utilization | 100 - (avg(irate(node_cpu_seconds_total{mode="idle"}[5m])) * 100) |
| RAM Usage | Node memory utilization | (1 - (node_memory_MemAvailable_bytes / node_memory_MemTotal_bytes)) * 100 |
| Disk Usage | Filesystem usage | 100 - ((node_filesystem_avail_bytes / node_filesystem_size_bytes) * 100) |
| Network TX | Network transmit bytes | rate(node_network_transmit_bytes_total[5m]) |
| Network RX | Network receive bytes | rate(node_network_receive_bytes_total[5m]) |
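The RAM-usage expression in the table is plain arithmetic: `(1 - available/total) * 100`. A quick sanity check of the formula with illustrative numbers (not real node values):

```python
def mem_usage_pct(avail_bytes, total_bytes):
    # Mirrors the PromQL expression:
    # (1 - (node_memory_MemAvailable_bytes / node_memory_MemTotal_bytes)) * 100
    return (1 - avail_bytes / total_bytes) * 100

# 8 GiB available out of 32 GiB total -> 75% used
print(mem_usage_pct(8 * 2**30, 32 * 2**30))  # 75.0
```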
Application Metrics
| Metric | Description | Query |
|---|---|---|
| HTTP Requests | Request rate | rate(nginx_ingress_controller_requests[5m]) |
| HTTP Errors | 5xx error rate | rate(nginx_ingress_controller_requests{status=~"5.."}[5m]) |
| Response Time | p95 latency | histogram_quantile(0.95, rate(nginx_ingress_controller_request_duration_seconds_bucket[5m])) |
| ArgoCD Sync | Out-of-sync apps | argocd_app_info{sync_status!="Synced"} |
Storage Metrics (local-path)
| Metric | Description | Query |
|---|---|---|
| Cluster Health | local-path health status | ceph_health_status (0=OK, 1=WARN, 2=ERR) |
| OSD Status | OSDs up/down | ceph_osd_up, ceph_osd_in |
| Storage Used | Used capacity | ceph_cluster_total_used_bytes / ceph_cluster_total_bytes * 100 |
| IOPS | Read/write ops | rate(ceph_pool_wr[5m]), rate(ceph_pool_rd[5m]) |
Dashboards
Node Exporter Full Dashboard
Purpose: System-level metrics per node
Panels:
- CPU usage (user, system, iowait)
- Memory usage (used, cached, buffers)
- Disk I/O (read/write bytes, IOPS)
- Network I/O (TX/RX bytes, errors)
- Filesystem usage
- System load (1m, 5m, 15m)
- Context switches, interrupts
- NVMe temperature (if available)
How to Access:
- Grafana → Dashboards → Browse
- Search: “Node Exporter Full”
- Select node from dropdown
Kubernetes Cluster Monitoring Dashboard
Purpose: Cluster-wide Kubernetes metrics
Panels:
- Node status (Ready/NotReady)
- Pod count (Running/Pending/Failed)
- CPU/RAM requests vs limits vs usage
- Namespace resource usage
- Persistent volume claims
- Top pods by CPU/RAM
- Network traffic by namespace
Kubernetes Pods Monitoring Dashboard
Purpose: Per-pod metrics
Panels:
- Pod CPU usage
- Pod memory usage
- Pod network I/O
- Container restarts
- Pod phase (Running/Pending/Failed)
- Resource requests vs actual usage
Variables:
- Namespace (dropdown)
- Pod (dropdown)
Victoria Metrics Dashboard
Purpose: Monitor Victoria Metrics itself
Panels:
- Ingestion rate (samples/sec)
- Query rate (queries/sec)
- Storage size
- Memory usage
- CPU usage
- Slow queries
Custom Dashboard Example
Creating a custom dashboard:
- Grafana → Dashboards → New Dashboard
- Add Panel
- Select Data Source: VictoriaMetrics
- Enter PromQL query
- Configure visualization (graph, gauge, stat)
- Save dashboard
Example Panel: Application Response Time
histogram_quantile(0.95,
rate(
nginx_ingress_controller_request_duration_seconds_bucket{
exported_namespace="my-app"
}[5m]
)
)

Panel Type: Time series graph
Unit: seconds
Legend: p95 response time
Alerts
Alert Rules
Location: Victoria Metrics vmalert configuration
Example Alert Rules:
groups:
  - name: cluster-health
    interval: 30s
    rules:
      - alert: NodeDown
        expr: up{job="node-exporter"} == 0
        for: 5m
        labels:
          severity: critical
        annotations:
          summary: "Node {{ $labels.instance }} is down"
          description: "Node has been unreachable for more than 5 minutes."
      - alert: PodCrashLooping
        expr: rate(kube_pod_container_status_restarts_total[15m]) > 0
        for: 5m
        labels:
          severity: warning
        annotations:
          summary: "Pod {{ $labels.namespace }}/{{ $labels.pod }} is crash looping"
          description: "Pod has restarted {{ $value }} times in the last 15 minutes."
      - alert: HighCPUUsage
        expr: |
          100 - (avg by (instance) (irate(node_cpu_seconds_total{mode="idle"}[5m])) * 100) > 85
        for: 10m
        labels:
          severity: warning
        annotations:
          summary: "High CPU usage on {{ $labels.instance }}"
          description: "CPU usage is {{ $value }}% for more than 10 minutes."
      - alert: HighMemoryUsage
        expr: |
          (1 - (node_memory_MemAvailable_bytes / node_memory_MemTotal_bytes)) * 100 > 90
        for: 10m
        labels:
          severity: warning
        annotations:
          summary: "High memory usage on {{ $labels.instance }}"
          description: "Memory usage is {{ $value }}% for more than 10 minutes."
      - alert: DiskSpaceLow
        expr: |
          100 - ((node_filesystem_avail_bytes{mountpoint="/"} / node_filesystem_size_bytes{mountpoint="/"}) * 100) > 85
        for: 5m
        labels:
          severity: warning
        annotations:
          summary: "Low disk space on {{ $labels.instance }}"
          description: "Disk usage is {{ $value }}% on root filesystem."
      - alert: CertificateExpiringSoon
        expr: |
          (cert_manager_certificate_expiration_timestamp_seconds - time()) / 86400 < 30
        for: 1h
        labels:
          severity: warning
        annotations:
          summary: "Certificate {{ $labels.namespace }}/{{ $labels.name }} expiring soon"
          description: "Certificate expires in {{ $value }} days."
      - alert: ArgocdAppOutOfSync
        expr: argocd_app_info{sync_status!="Synced"} > 0
        for: 15m
        labels:
          severity: warning
        annotations:
          summary: "ArgoCD application {{ $labels.name }} out of sync"
          description: "Application has been out of sync for more than 15 minutes."
      - alert: LocalPathHealthError
        expr: ceph_health_status == 2
        for: 5m
        labels:
          severity: critical
        annotations:
          summary: "local-path cluster health is ERROR"
          description: "local-path cluster health has been in ERROR state for 5 minutes."

Alert Severity Levels
| Severity | Description | Response Time | Examples |
|---|---|---|---|
| Critical | Service outage or data loss | Immediate (wake up) | Node down, etcd quorum lost, local-path health ERR |
| Warning | Degraded performance or upcoming issue | Within 1 hour | High CPU, certificate expiring soon, pod crash looping |
| Info | Informational, no action required | Review during next maintenance | Deployment updated, backup completed |
Alert Routing
AlertManager routes alerts based on severity:
route:
  receiver: gotify-notifications
  routes:
    - match:
        severity: critical
      receiver: gotify-notifications
      continue: true
    - match:
        severity: warning
      receiver: gotify-notifications
      continue: true
    - match:
        alertname: InfoInhibitor
      receiver: devnull
    - match:
        alertname: Watchdog
      receiver: devnull

Gotify Integration:
- Critical alerts: High priority (red notification)
- Warning alerts: Medium priority (yellow notification)
- Info alerts: Low priority (blue notification)
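The severity-to-priority mapping is handled by the alertmanager-gotify-nodejs converter; the sketch below only illustrates the idea, and the numeric priority values are assumptions (check the converter's configuration for the real mapping):

```python
# Hypothetical severity -> Gotify priority mapping. Gotify renders higher
# priorities more prominently; the real values live in the
# alertmanager-gotify-nodejs converter, not here.
PRIORITY = {"critical": 8, "warning": 5, "info": 2}

def gotify_priority(alert):
    """Pick a Gotify priority from an AlertManager alert payload."""
    severity = alert.get("labels", {}).get("severity", "info")
    return PRIORITY.get(severity, PRIORITY["info"])

print(gotify_priority({"labels": {"severity": "critical"}}))  # 8
```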
Notification Setup
Gotify Mobile App
Android:
- Install Gotify app from Google Play Store
- Open app → Settings → Add Server
- URL: https://notifications.zengarden.space
- Username: admin
- Password: <from victoria-metrics/env.yaml>
- Save
- Create Application: “homelab-alerts”
- Copy token
- Update AlertManager configuration with token
Web UI:
- Navigate to https://notifications.zengarden.space
- Login: admin / <password>
- View alerts in real-time
Custom Metrics
Exposing Custom Metrics
For applications to expose metrics:
- Implement /metrics endpoint
  - Use Prometheus client library
  - Expose on a port (e.g., 8080/metrics)
- Add a ServiceMonitor CRD:

apiVersion: monitoring.coreos.com/v1
kind: ServiceMonitor
metadata:
  name: my-app
  namespace: default
spec:
  selector:
    matchLabels:
      app: my-app
  endpoints:
    - port: metrics
      path: /metrics
      interval: 30s

- Verify scraping:

# Check vmagent targets
curl http://vmagent.victoria-metrics.svc:8429/targets
Example: Python Application Metrics
from prometheus_client import Counter, Histogram, generate_latest
from flask import Flask, Response

app = Flask(__name__)

# Define metrics
request_count = Counter('app_requests_total', 'Total requests', ['method', 'endpoint'])
request_duration = Histogram('app_request_duration_seconds', 'Request duration', ['method', 'endpoint'])

@app.route('/metrics')
def metrics():
    return Response(generate_latest(), mimetype='text/plain')

@app.route('/api/data')
@request_duration.labels(method='GET', endpoint='/api/data').time()
def get_data():
    request_count.labels(method='GET', endpoint='/api/data').inc()
    return {'data': 'example'}

Log Aggregation
Loki + Promtail
Purpose: Centralized log aggregation and querying
Architecture:
┌────────────────────────────────────────────────────────────┐
│ Log Collection │
│ │
│ ┌─────────────┐ ┌──────────────┐ ┌──────────────┐ │
│ │ Promtail │ │ Promtail │ │ Promtail │ │
│ │ (node 1) │ │ (node 2) │ │ (node 3) │ │
│ │ DaemonSet │ │ DaemonSet │ │ DaemonSet │ │
│ └──────┬──────┘ └──────┬───────┘ └──────┬───────┘ │
│ │ │ │ │
│ │ Scrapes logs from /var/log/pods/ │ │
│ └──────────────────┴───────────────────┘ │
│ │ │
│ ▼ │
│ ┌──────────────────────┐ │
│ │ Loki Gateway │ │
│ └──────────┬───────────┘ │
│ │ │
│ ▼ │
│ ┌──────────────────────┐ │
│ │ Loki (SingleBinary) │ │
│ │ (log storage) │ │
│ └──────────┬───────────┘ │
└────────────────────────────┼───────────────────────────────┘
│
▼
┌──────────────────┐
│ Grafana │
│ (log exploration)│
└──────────────────┘

Components:
Loki:
- Log aggregation system inspired by Prometheus
- Label-based indexing (not full-text)
- Filesystem storage (10Gi PVC)
- SingleBinary deployment mode (simplified architecture)
- TSDB schema for efficient storage
Promtail:
- Log shipping agent (DaemonSet on each node)
- Scrapes container logs from /var/log/pods/
- Kubernetes service discovery
- Automatic label extraction (namespace, pod, container, app)
- CRI log format parsing
Access:
# Grafana with Loki datasource
URL: https://grafana.homelab.int.zengarden.space
Data Source: Loki (preconfigured)

Query Examples:
# View all logs from a namespace
{namespace="argocd"}
# Filter by pod name
{namespace="argocd", pod=~"argocd-server.*"}
# Search for error messages
{namespace="argocd"} |= "error"
# Count errors per pod
sum by (pod) (count_over_time({namespace="argocd"} |= "error" [5m]))
# Logs from specific container
{namespace="victoria-metrics", container="loki"}
# Multiple filters
{namespace="gitea", app="gitea"} |= "authentication" != "success"

LogQL Syntax:
- {label="value"} - Label selector
- |= - Line contains string
- != - Line does not contain
- |~ "regex" - Regex match
- !~ "regex" - Regex not match
- | json - Parse JSON logs
- | logfmt - Parse logfmt logs
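The `|=` and `!=` line filters behave like chained grep/grep -v over the label-selected stream. A rough Python analogy of that semantics (purely illustrative, not how Loki is implemented):

```python
def logql_filter(lines, contains=(), not_contains=()):
    """Rough analogy of LogQL line filters:
    |= keeps lines containing a substring, != drops them."""
    for needle in contains:
        lines = [line for line in lines if needle in line]
    for needle in not_contains:
        lines = [line for line in lines if needle not in line]
    return lines

logs = ["authentication failed", "authentication success", "request served"]
# Analogous to: {...} |= "authentication" != "success"
print(logql_filter(logs, contains=["authentication"], not_contains=["success"]))
# ['authentication failed']
```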
Grafana Explore:
- Navigate to Explore in Grafana
- Select Loki data source
- Use Log browser to select labels
- Or write LogQL queries directly
- View logs in table or stream format
- Inspect individual log lines
Benefits:
- Lightweight: Lower resource usage than ELK stack
- Grafana integration: Unified metrics + logs UI
- Label-based: Similar query language to PromQL
- Cost-effective: Efficient storage (no full-text indexing)
Limitations:
- Not full-text search: Must use labels for filtering (use |= for grep-like search)
- Label cardinality: Avoid high-cardinality labels (e.g., don’t use log level as a label)
Troubleshooting Monitoring
No Metrics in Grafana
Check vmagent scraping:
kubectl -n victoria-metrics logs -l app.kubernetes.io/name=vmagent --tail=50
# Check targets
kubectl -n victoria-metrics port-forward svc/vmagent 8429:8429
curl http://localhost:8429/targets | jq

Check Victoria Metrics storage:
kubectl -n victoria-metrics logs -l app.kubernetes.io/name=vmstorage --tail=50

Alerts Not Firing
Check vmalert:
kubectl -n victoria-metrics logs -l app.kubernetes.io/name=vmalert --tail=50
# Check rules loaded
kubectl -n victoria-metrics port-forward svc/vmalert 8880:8880
curl http://localhost:8880/api/v1/rules | jq

Check AlertManager:
kubectl -n victoria-metrics logs -l app.kubernetes.io/name=alertmanager --tail=50
# Check alerts
curl http://alerts.homelab.int.zengarden.space/api/v2/alerts

Gotify Not Receiving Alerts
Check alertmanager-gotify-nodejs:
kubectl -n victoria-metrics logs -l app=alertmanager-gotify-nodejs --tail=50

Verify webhook URL:
kubectl -n victoria-metrics get configmap alertmanager -o yaml | grep gotify

Test webhook manually:
curl -X POST http://alertmanager-gotify-nodejs.victoria-metrics.svc:3000/webhook \
-H "Content-Type: application/json" \
-d '{
"alerts": [{
"status": "firing",
"labels": {"alertname": "TestAlert", "severity": "warning"},
"annotations": {"summary": "Test alert", "description": "This is a test"}
}]
}'

Security Auditing
Automated Security Audit Script
Purpose: Automated security breach detection and system integrity checks
Location: system/ansible/install-security-audit/
Deployment: Ansible playbook deploys to ~/bin/security-audit.sh on all blade nodes
Installation:
cd system/ansible/install-security-audit
./install.sh

This will deploy the script to all blade nodes (blade001-blade005).
What it checks:
- Authentication logs - Failed logins, invalid user attempts
- User accounts - New users, UID 0 users, shell access
- SSH keys - Authorized keys for all users
- Network activity - Listening ports, external connections
- Running processes - Suspicious patterns, /tmp execution
- Scheduled tasks - Root and user crontabs
- File integrity - System file modifications, SUID/SGID files
- System logs - Security keywords (breach, attack, exploit, malware)
- System resources - Disk, memory, CPU usage
- Restrictive proxy - Unauthorized access attempts (custom homelab security)
Usage:
# Run basic security audit
ssh blade001 '~/bin/security-audit.sh'
# Run with verbose output
ssh blade001 '~/bin/security-audit.sh --verbose'
# Get JSON output for parsing
ssh blade001 '~/bin/security-audit.sh --json'

Run on all nodes:
for blade in blade001 blade002 blade003 blade004 blade005; do
echo "=== $blade ==="
ssh $blade '~/bin/security-audit.sh | tail -15'
done

Exit codes:
- 0 - System secure, no issues detected
- 1 - Warnings detected, review recommended
- 2 - Critical issues detected, immediate investigation required
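These exit codes lend themselves to scripting around the audit. A small illustrative helper mapping codes to labels (the label strings here are assumptions that mirror the audit summary, not output from the script itself):

```python
# Hypothetical mapping of the audit script's exit codes to status labels,
# useful when collecting results from many nodes.
LABELS = {0: "SECURE", 1: "WARNING", 2: "CRITICAL"}

def status_label(exit_code):
    """Translate a security-audit exit code into a readable label."""
    return LABELS.get(exit_code, "UNKNOWN")

print(status_label(2))  # CRITICAL
```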
Example output:
==========================================
Homelab Security Audit
blade001 - 2025-10-28 12:48:07
==========================================
========================================
1. Authentication Analysis
========================================
[✓] No failed login attempts in last 24 hours
========================================
2. User Account Analysis
========================================
[✓] No recently added users
[✓] Only root has UID 0
========================================
Security Audit Summary
========================================
Hostname: blade001
Timestamp: 2025-10-28 12:48:07
Status: SECURE
Critical Issues: 0
Warnings: 0
✓ System appears secure. No security breaches detected.

Automated scheduling:
To run security audits automatically, add to cron:
# Add to root crontab on each blade
# Run security audit daily at 3 AM
0 3 * * * /home/oleksiyp/bin/security-audit.sh --json > /tmp/security-audit-$(date +\%Y\%m\%d).json

Integration with monitoring:
You can export security audit results to Victoria Metrics using a custom exporter or parse JSON output with a script:
#!/bin/bash
# security-audit-exporter.sh
# Parse security audit JSON and expose as metrics
RESULT=$(~/bin/security-audit.sh --json)
ISSUES=$(echo "$RESULT" | jq -r '.issues')
WARNINGS=$(echo "$RESULT" | jq -r '.warnings')
STATUS=$(echo "$RESULT" | jq -r '.status')
# Expose as Prometheus metrics
cat <<EOF
# HELP security_audit_issues Number of critical security issues detected
# TYPE security_audit_issues gauge
security_audit_issues{hostname="$(hostname)"} $ISSUES
# HELP security_audit_warnings Number of security warnings detected
# TYPE security_audit_warnings gauge
security_audit_warnings{hostname="$(hostname)"} $WARNINGS
# HELP security_audit_status Security audit status (0=SECURE, 1=COMPROMISED)
# TYPE security_audit_status gauge
security_audit_status{hostname="$(hostname)"} $([ "$STATUS" == "SECURE" ] && echo 0 || echo 1)
EOF

Security best practices:
- Run audits daily on all nodes
- Review warnings promptly (within 24 hours)
- Investigate critical issues immediately
- Keep audit logs for compliance/forensics
- Update the script as new threats emerge
Best Practices
Dashboard Design
- Use template variables for namespace, pod, node selection
- Show rates, not absolute counters (use rate() or irate())
- Set appropriate time ranges (5m for real-time, 24h for trends)
- Use percentiles for latency (p50, p95, p99)
- Set thresholds and alerts on panels
Alert Design
- Alert on symptoms, not causes (e.g., “service down” not “pod restarting”)
- Set appropriate for duration (avoid flapping)
- Include actionable annotations (what to do, where to look)
- Group related alerts (avoid alert storms)
- Test alerts (manually trigger to verify)
Metric Naming
Follow Prometheus naming conventions:
- Counters: *_total suffix (e.g., requests_total)
- Gauges: no suffix (e.g., memory_usage_bytes)
- Histograms: *_bucket, *_sum, *_count (e.g., request_duration_seconds_bucket)
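These conventions can be enforced mechanically before a metric ships. An illustrative (unofficial) checker combining the Prometheus metric-name charset with the suffix rules above:

```python
import re

# Prometheus metric names must match this charset; the suffix rules below
# encode the naming conventions listed above. Illustrative only.
NAME_RE = re.compile(r"[a-zA-Z_:][a-zA-Z0-9_:]*")

def naming_ok(name, kind):
    """Check a metric name against charset + suffix conventions."""
    if not NAME_RE.fullmatch(name):
        return False
    if kind == "counter":
        return name.endswith("_total")
    if kind == "histogram_bucket":
        return name.endswith("_bucket")
    return True  # gauges: no required suffix

print(naming_ok("requests_total", "counter"))    # True
print(naming_ok("requests", "counter"))          # False
```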
Next Steps
- Configure custom dashboards for your applications
- Set up additional alerts for specific use cases
- Integrate with log aggregation (Loki) for complete observability
- Review Maintenance for regular monitoring tasks
Comprehensive monitoring enables proactive issue detection and fast troubleshooting.