| title | created | updated | type | tags | sources |
|---|---|---|---|---|---|
| Monitoring Pipeline | 2026-04-28 | 2026-04-29 | concept | | |
Monitoring Pipeline
Prometheus-based monitoring with Loki log aggregation, Grafana dashboards, and Telegram alerting through the Hermes Gateway, which is itself guarded by a watchdog. The core monitoring services (Prometheus, Grafana, Loki, Alertmanager) run on ubuntu.
Metrics Pipeline
Node Exporters (all hosts: ubuntu, grizzley, ice, proxmox, truenas, panda)
→ Prometheus (ubuntu:9090)
→ Grafana (ubuntu:3000)
→ Alertmanager (ubuntu:9093)
→ Hermes Gateway webhook
→ Telegram (@AigentZeroHermes)
Alert routing:
- Alertmanager receives Prometheus alerts
- Routes to Hermes Gateway webhook (POST to gateway endpoint)
- Gateway posts the alert to Telegram topic 1033 "Cron Jobs" in AigentZeroHermes (chat ID -1003820156994)
- Bot token: 836803270:AAH-Ac5Y
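The routing steps above can be sketched as a small translation function. This is a minimal sketch and not the gateway's actual code: it assumes the gateway receives the standard Alertmanager webhook JSON (an `alerts` list with `labels` and `annotations`) and forwards it using the Telegram Bot API `sendMessage` fields `chat_id`, `message_thread_id`, and `text`. The function name and the message layout are hypothetical.

```python
import json

# Chat and topic IDs are taken from the note; everything else is illustrative.
CHAT_ID = -1003820156994  # AigentZeroHermes
TOPIC_ID = 1033           # "Cron Jobs" topic


def alertmanager_to_telegram(payload: dict) -> dict:
    """Turn an Alertmanager webhook payload into sendMessage parameters."""
    lines = []
    for alert in payload.get("alerts", []):
        labels = alert.get("labels", {})
        annotations = alert.get("annotations", {})
        status = alert.get("status", "unknown").upper()
        name = labels.get("alertname", "unknown")
        instance = labels.get("instance", "n/a")
        summary = annotations.get("summary", "")
        # Hypothetical one-line layout: [STATUS] alert on instance summary
        lines.append(f"[{status}] {name} on {instance} {summary}".rstrip())
    return {
        "chat_id": CHAT_ID,
        "message_thread_id": TOPIC_ID,
        "text": "\n".join(lines) or "Alertmanager webhook with no alerts",
    }


# Made-up payload in the shape Alertmanager sends to webhook receivers.
example = {
    "status": "firing",
    "alerts": [
        {
            "status": "firing",
            "labels": {"alertname": "TraefikDown", "instance": "ubuntu:8080"},
            "annotations": {"summary": "Traefik not responding"},
        }
    ],
}
message = alertmanager_to_telegram(example)
print(json.dumps(message, indent=2))
```

Keeping the translation as a pure function means the formatting can be checked without a bot token or network access; the actual HTTP call to Telegram would wrap the returned dictionary.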
Log Pipeline
Docker containers (all hosts)
→ Promtail (Docker socket service discovery)
→ Loki (ubuntu:3100)
→ Grafana dashboards
Promtail runs as a Docker container on ubuntu, reading container logs via the Docker socket.
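To spot-check what this pipeline has shipped, one option is to query Loki's HTTP API directly. The sketch below only builds the request URL for Loki's `/loki/api/v1/query_range` endpoint. The `container` label and the `jellyfin` container name are assumptions, since the label set depends on the Promtail relabel rules, which the note does not show.

```python
from urllib.parse import urlencode

LOKI_BASE = "http://ubuntu:3100"  # Loki address from the note


def loki_query_url(container: str, needle: str, limit: int = 50) -> str:
    """Build a query_range URL for lines from one container containing `needle`."""
    # Assumption: Promtail attaches a `container` label to each log stream.
    logql = f'{{container="{container}"}} |= "{needle}"'
    params = urlencode({"query": logql, "limit": limit})
    return f"{LOKI_BASE}/loki/api/v1/query_range?{params}"


# Hypothetical example: recent "error" lines from a container named jellyfin.
url = loki_query_url("jellyfin", "error")
print(url)
```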
Scrape Targets
Prometheus monitors: ubuntu (local), proxmox, truenas, grizzley, ice, panda.
Scrape endpoints:
- prometheus (9090): Prometheus itself
- node-exporter (9100): host hardware metrics
- blackbox-exporter (9115): HTTP/TCP/ICMP probing
- cadvisor (8080): container metrics
- loki (3100): log metrics
- Traefik instances (8080/metrics)
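As an illustration of how the host list maps onto scrape targets, the following sketch builds one job in the shape of a Prometheus `scrape_configs` entry. It is not the deployed configuration: it assumes each host name resolves as written and that node-exporter listens on 9100 everywhere, and the function name is hypothetical.

```python
import json

# Hosts and port are taken from the note.
HOSTS = ["ubuntu", "grizzley", "ice", "proxmox", "truenas", "panda"]
NODE_EXPORTER_PORT = 9100


def node_exporter_job(hosts, port):
    """Build one scrape job with a static target per host."""
    return {
        "job_name": "node-exporter",
        "static_configs": [{"targets": [f"{h}:{port}" for h in hosts]}],
    }


job = node_exporter_job(HOSTS, NODE_EXPORTER_PORT)
# Printed as JSON so it can be compared by eye with the real prometheus.yml.
print(json.dumps({"scrape_configs": [job]}, indent=2))
```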
Blackbox Exporter Targets
15+ HTTPS probe targets configured. See homelab/ubuntu/docker/monitoring/ for the blackbox exporter config.
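A single probe can also be reproduced by hand against the exporter's `/probe` endpoint, which takes `target` and `module` query parameters and reports the result in the `probe_success` metric. The sketch below is a hedged illustration: the module name `http_2xx` is an assumption (it is the name used in the exporter's example config, and the module names actually configured here are not shown in the note), and `example.com` is a placeholder target.

```python
from urllib.parse import urlencode

BLACKBOX_BASE = "http://ubuntu:9115"  # exporter address from the note


def probe_url(target: str, module: str = "http_2xx") -> str:
    """Build the URL Prometheus would scrape to probe `target`."""
    # Assumption: a module called http_2xx exists in the blackbox config.
    return f"{BLACKBOX_BASE}/probe?{urlencode({'target': target, 'module': module})}"


def probe_succeeded(metrics_text: str) -> bool:
    """Read the probe_success sample out of a /probe response body."""
    for line in metrics_text.splitlines():
        if line.startswith("probe_success "):
            return float(line.split()[1]) == 1.0
    return False  # metric missing: treat as a failed probe


print(probe_url("https://example.com"))
sample = "# TYPE probe_success gauge\nprobe_success 1\n"
print(probe_succeeded(sample))
```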
Alert Rules
Prometheus alert rules → Alertmanager → Hermes Gateway → Telegram.
Key alerts:
- ContainerLogError: container logging errors detected by Promtail
- ServiceDown: Blackbox-probed service unavailable
- JellyfinDown: Jellyfin health check failed
- TraefikDown: Traefik not responding
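During triage it can help to see which of these key alerts are firing right now. The sketch below filters a response in the shape returned by Prometheus's `/api/v1/alerts` endpoint (a `data.alerts` list whose entries carry `labels` and `state`). The function name and the sample response are made up for illustration; fetching the JSON from ubuntu:9090 is left out.

```python
# Alert names are taken from the note.
KEY_ALERTS = {"ContainerLogError", "ServiceDown", "JellyfinDown", "TraefikDown"}


def firing_key_alerts(api_response: dict) -> list[str]:
    """Return the sorted names of key alerts currently in the firing state."""
    alerts = api_response.get("data", {}).get("alerts", [])
    return sorted(
        {
            a["labels"]["alertname"]
            for a in alerts
            if a.get("state") == "firing"
            and a.get("labels", {}).get("alertname") in KEY_ALERTS
        }
    )


# Made-up response body for illustration only.
sample_response = {
    "status": "success",
    "data": {
        "alerts": [
            {"labels": {"alertname": "ServiceDown"}, "state": "firing"},
            {"labels": {"alertname": "JellyfinDown"}, "state": "pending"},
            {"labels": {"alertname": "SomeOtherAlert"}, "state": "firing"},
        ]
    },
}
print(firing_key_alerts(sample_response))
```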
See homelab-servicedown-triage and homelab-containerlogerror-triage skills for triage procedures.
Hermes Gateway Watchdog
Hermes Gateway is monitored by a watchdog script on both ice and grizzley:
/home/bear/hermes-gateway-watchdog.sh
It runs via system cron (not a systemd user service) on both hosts. On each run it:
- Checks if hermes-gateway is responsive
- On failure: restarts the gateway directly, then falls back to a tmux + OpenCode rescue if it is still down
- Sends a Telegram notification on failure to topic 1033 "Cron Jobs" (bot: 836803270:AAH-Ac5Y)
Note: On grizzley, the systemd override for the watchdog is deployed directly to /etc/systemd/system/ (not tracked in the homelab repo — it's a system unit).
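The escalation order the watchdog follows (check, notify, restart, then rescue) can be sketched as below. This is not the contents of hermes-gateway-watchdog.sh, which is a shell script the note does not reproduce. The four callables are hypothetical stand-ins for the health check, the restart command, the tmux + OpenCode rescue, and the Telegram notification.

```python
from typing import Callable


def run_watchdog(
    is_healthy: Callable[[], bool],
    restart: Callable[[], None],
    rescue: Callable[[], None],
    notify: Callable[[str], None],
) -> str:
    """Run one watchdog pass and report which path was taken."""
    if is_healthy():
        return "healthy"
    # Failure detected: notify first, then try the cheap fix.
    notify("hermes-gateway unresponsive, attempting restart")
    restart()
    if is_healthy():
        return "restarted"
    # Still down after a direct restart: escalate to the rescue step.
    rescue()
    return "rescue"


# Simulated pass: the first check fails, the check after the restart succeeds.
health = iter([False, True])
sent = []
outcome = run_watchdog(lambda: next(health), lambda: None, lambda: None, sent.append)
print(outcome, sent)
```

Passing the side effects in as callables keeps the decision logic testable without touching the real service, cron, or Telegram.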
External Uptime Monitoring
- Uptime Kuma (grizzley:3001) — external/internal availability checks
- Blackbox Exporter (ubuntu:9115) — 15+ HTTPS probe targets
Dashboards
- Grafana (ubuntu:3000) — metrics dashboards
- Loki + Grafana — log exploration
- Prometheus (ubuntu:9090) — expression browser, alertmanager
Related
- ubuntu — Hosts Prometheus, Grafana, Loki, Alertmanager
- grizzley — Hosts Hermes Agent, Telegram webhook, Uptime Kuma
- hermes-gateway — AI gateway with watchdog pattern
- traefik — Traefik metrics