Files
hermes-ice/homelab/concepts/monitoring-pipeline.md
Hermes Agent e4d91aadf9 Initial commit: homelab infrastructure wiki
- Full Obsidian vault content
- Host configs (ice, grizzley, ubuntu, proxmox, truenas, panda, hyte)
- Media stack documentation
- Traefik HA setup
- Automation scripts
- Bachelor party planning
2026-05-24 16:08:40 -07:00

3.2 KiB

title, created, updated, type, tags, sources
title created updated type tags sources
Monitoring Pipeline 2026-04-28 2026-04-29 concept
concept
monitoring
alerting
docker
../../homelab/architecture.md

Monitoring Pipeline

Prometheus-based monitoring with Loki log aggregation, Grafana dashboards, and Telegram alerting via Hermes Gateway watchdog. All monitoring services run on ubuntu.

Metrics Pipeline

Node Exporters (all hosts: ubuntu, grizzley, ice, proxmox, truenas, panda)
    → Prometheus (ubuntu:9090)
    → Grafana (ubuntu:3000)
    → Alertmanager (ubuntu:9093)
    → Hermes Gateway webhook
    → Telegram (@AigentZeroHermes)

Alert routing:

  • Alertmanager receives Prometheus alerts
  • Routes to Hermes Gateway webhook (POST to gateway endpoint)
  • Gateway sends Telegram to: topic 1033 "Cron Jobs" in AigentZeroHermes (-1003820156994)
  • Bot token: 836803270:AAH-Ac5Y

Log Pipeline

Docker containers (all hosts)
    → Promtail (Docker socket service discovery)
    → Loki (ubuntu:3100)
    → Grafana dashboards

Promtail runs as a Docker container on ubuntu, reading container logs via the Docker socket.

Scrape Targets

Prometheus monitors: ubuntu (local), proxmox, truenas, grizzley, ice, panda.

Scrape endpoints:

  • prometheus (9090) — Prometheus itself
  • node-exporter (9100) — host hardware metrics
  • blackbox-exporter (9115) — HTTP/TCP/ICMP probing
  • cadvisor (8080) — container metrics
  • loki (3100) — log metrics
  • Traefik instances (8080/metrics)

Blackbox Exporter Targets

15+ HTTPS probe targets configured. See homelab/ubuntu/docker/monitoring/ for the blackbox exporter config.

Alert Rules

Prometheus alert rules → Alertmanager → Hermes Gateway → Telegram.

Key alerts:

  • ContainerLogError — Container logging errors detected by Promtail
  • ServiceDown — Blackbox-probed service unavailable
  • JellyfinDown — Jellyfin health check failed
  • TraefikDown — Traefik not responding

See homelab-servicedown-triage and homelab-containerlogerror-triage skills for triage procedures.

Hermes Gateway Watchdog

Hermes Gateway is monitored by a watchdog script on both ice and grizzley:

/home/bear/hermes-gateway-watchdog.sh

Runs via system cron (not systemd user service) on both hosts:

  1. Checks if hermes-gateway is responsive
  2. On failure: direct restart → tmux+OpenCode rescue if still down
  3. Sends Telegram notification on failure to topic 1033 "Cron Jobs" (bot: 836803270:AAH-Ac5Y)

Note: On grizzley, the systemd override for the watchdog is deployed directly to /etc/systemd/system/ (not tracked in the homelab repo — it's a system unit).

External Uptime Monitoring

  • Uptime Kuma (grizzley:3001) — external/internal availability checks
  • Blackbox Exporter (ubuntu:9115) — 15+ HTTPS probe targets

Dashboards

  • Grafana (ubuntu:3000) — metrics dashboards
  • Loki + Grafana — log exploration
  • Prometheus (ubuntu:9090) — expression browser, alertmanager
  • ubuntu — Hosts Prometheus, Grafana, Loki, Alertmanager
  • grizzley — Hosts Hermes Agent, Telegram webhook, Uptime Kuma
  • hermes-gateway — AI gateway with watchdog pattern
  • traefik — Traefik metrics