---
title: Monitoring Pipeline
created: 2026-04-28
updated: 2026-04-29
type: concept
tags: [concept, monitoring, alerting, docker]
sources: [../../homelab/architecture.md]
---

# Monitoring Pipeline

Prometheus-based monitoring with Loki log aggregation, Grafana dashboards, and Telegram alerting via Hermes Gateway watchdog. All monitoring services run on [[ubuntu]].

## Metrics Pipeline

```
Node Exporters (all hosts: ubuntu, grizzley, ice, proxmox, truenas, panda)
    → Prometheus (ubuntu:9090)
    → Grafana (ubuntu:3000)
    → Alertmanager (ubuntu:9093)
    → Hermes Gateway webhook
    → Telegram (@AigentZeroHermes)
```

**Alert routing:**
- Alertmanager receives Prometheus alerts
- Routes to Hermes Gateway webhook (POST to gateway endpoint)
- Gateway sends Telegram to: topic 1033 "Cron Jobs" in AigentZeroHermes (-1003820156994)
- Bot token: `836803270:AAH-Ac5Y`

## Log Pipeline

```
Docker containers (all hosts)
    → Promtail (Docker socket service discovery)
    → Loki (ubuntu:3100)
    → Grafana dashboards
```

Promtail runs as a Docker container on [[ubuntu]], reading container logs via the Docker socket.

## Scrape Targets

Prometheus monitors: ubuntu (local), proxmox, truenas, grizzley, ice, panda.

Scrape endpoints:
- `prometheus` (9090) — Prometheus itself
- `node-exporter` (9100) — host hardware metrics
- `blackbox-exporter` (9115) — HTTP/TCP/ICMP probing
- `cadvisor` (8080) — container metrics
- `loki` (3100) — log metrics
- Traefik instances (8080/metrics)

## Blackbox Exporter Targets

15+ HTTPS probe targets configured. See `homelab/ubuntu/docker/monitoring/` for the blackbox exporter config.

## Alert Rules

Prometheus alert rules → Alertmanager → Hermes Gateway → Telegram.

Key alerts:
- `ContainerLogError` — Container logging errors detected by Promtail
- `ServiceDown` — Blackbox-probed service unavailable
- `JellyfinDown` — Jellyfin health check failed
- `TraefikDown` — Traefik not responding

See [[homelab-servicedown-triage]] and [[homelab-containerlogerror-triage]] skills for triage procedures.

## Hermes Gateway Watchdog

Hermes Gateway is monitored by a watchdog script on both [[ice]] and [[grizzley]]:

```
/home/bear/hermes-gateway-watchdog.sh
```

Runs via **system cron** (not systemd user service) on both hosts:
1. Checks if hermes-gateway is responsive
2. On failure: direct restart → tmux+OpenCode rescue if still down
3. Sends Telegram notification on failure to topic 1033 "Cron Jobs" (bot: `836803270:AAH-Ac5Y`)

**Note:** On [[grizzley]], the systemd override for the watchdog is deployed directly to `/etc/systemd/system/` (not tracked in the homelab repo — it's a system unit).

## External Uptime Monitoring

- **Uptime Kuma** (grizzley:3001) — external/internal availability checks
- **Blackbox Exporter** (ubuntu:9115) — 15+ HTTPS probe targets

## Dashboards

- Grafana (ubuntu:3000) — metrics dashboards
- Loki + Grafana — log exploration
- Prometheus (ubuntu:9090) — expression browser, alertmanager

## Related

- [[ubuntu]] — Hosts Prometheus, Grafana, Loki, Alertmanager
- [[grizzley]] — Hosts Hermes Agent, Telegram webhook, Uptime Kuma
- [[hermes-gateway]] — AI gateway with watchdog pattern
- [[traefik]] — Traefik metrics