--- title: Monitoring Pipeline created: 2026-04-28 updated: 2026-04-29 type: concept tags: [concept, monitoring, alerting, docker] sources: [../../homelab/architecture.md] --- # Monitoring Pipeline Prometheus-based monitoring with Loki log aggregation, Grafana dashboards, and Telegram alerting via Hermes Gateway watchdog. All monitoring services run on [[ubuntu]]. ## Metrics Pipeline ``` Node Exporters (all hosts: ubuntu, grizzley, ice, proxmox, truenas, panda) → Prometheus (ubuntu:9090) → Grafana (ubuntu:3000) → Alertmanager (ubuntu:9093) → Hermes Gateway webhook → Telegram (@AigentZeroHermes) ``` **Alert routing:** - Alertmanager receives Prometheus alerts - Routes to Hermes Gateway webhook (POST to gateway endpoint) - Gateway sends Telegram to: topic 1033 "Cron Jobs" in AigentZeroHermes (-1003820156994) - Bot token: `836803270:AAH-Ac5Y` ## Log Pipeline ``` Docker containers (all hosts) → Promtail (Docker socket service discovery) → Loki (ubuntu:3100) → Grafana dashboards ``` Promtail runs as a Docker container on [[ubuntu]], reading container logs via the Docker socket. ## Scrape Targets Prometheus monitors: ubuntu (local), proxmox, truenas, grizzley, ice, panda. Scrape endpoints: - `prometheus` (9090) — Prometheus itself - `node-exporter` (9100) — host hardware metrics - `blackbox-exporter` (9115) — HTTP/TCP/ICMP probing - `cadvisor` (8080) — container metrics - `loki` (3100) — log metrics - Traefik instances (8080/metrics) ## Blackbox Exporter Targets 15+ HTTPS probe targets configured. See `homelab/ubuntu/docker/monitoring/` for the blackbox exporter config. ## Alert Rules Prometheus alert rules → Alertmanager → Hermes Gateway → Telegram. Key alerts: - `ContainerLogError` — Container logging errors detected by Promtail - `ServiceDown` — Blackbox-probed service unavailable - `JellyfinDown` — Jellyfin health check failed - `TraefikDown` — Traefik not responding See [[homelab-servicedown-triage]] and [[homelab-containerlogerror-triage]] skills for triage procedures. ## Hermes Gateway Watchdog Hermes Gateway is monitored by a watchdog script on both [[ice]] and [[grizzley]]: ``` /home/bear/hermes-gateway-watchdog.sh ``` Runs via **system cron** (not systemd user service) on both hosts: 1. Checks if hermes-gateway is responsive 2. On failure: direct restart → tmux+OpenCode rescue if still down 3. Sends Telegram notification on failure to topic 1033 "Cron Jobs" (bot: `836803270:AAH-Ac5Y`) **Note:** On [[grizzley]], the systemd override for the watchdog is deployed directly to `/etc/systemd/system/` (not tracked in the homelab repo — it's a system unit). ## External Uptime Monitoring - **Uptime Kuma** (grizzley:3001) — external/internal availability checks - **Blackbox Exporter** (ubuntu:9115) — 15+ HTTPS probe targets ## Dashboards - Grafana (ubuntu:3000) — metrics dashboards - Loki + Grafana — log exploration - Prometheus (ubuntu:9090) — expression browser, alertmanager ## Related - [[ubuntu]] — Hosts Prometheus, Grafana, Loki, Alertmanager - [[grizzley]] — Hosts Hermes Agent, Telegram webhook, Uptime Kuma - [[hermes-gateway]] — AI gateway with watchdog pattern - [[traefik]] — Traefik metrics