Initial commit: homelab infrastructure wiki

- Full Obsidian vault content
- Host configs (ice, grizzley, ubuntu, proxmox, truenas, panda, hyte)
- Media stack documentation
- Traefik HA setup
- Automation scripts
- Bachelor party planning
101
homelab/concepts/monitoring-pipeline.md
Normal file
@@ -0,0 +1,101 @@
---
title: Monitoring Pipeline
created: 2026-04-28
updated: 2026-04-29
type: concept
tags: [concept, monitoring, alerting, docker]
sources: [../../homelab/architecture.md]
---

# Monitoring Pipeline

Prometheus-based monitoring with Loki log aggregation, Grafana dashboards, and Telegram alerting through the Hermes Gateway. The core monitoring services (Prometheus, Grafana, Loki, Alertmanager) run on [[ubuntu]].

## Metrics Pipeline

```
Node Exporters (all hosts: ubuntu, grizzley, ice, proxmox, truenas, panda)
  → Prometheus (ubuntu:9090)
    → Grafana (ubuntu:3000)
    → Alertmanager (ubuntu:9093)
      → Hermes Gateway webhook
        → Telegram (@AigentZeroHermes)
```

**Alert routing:**
- Alertmanager receives alerts from Prometheus
- Alertmanager routes them to the Hermes Gateway webhook (POST to the gateway endpoint)
- The gateway sends a Telegram message to topic 1033 "Cron Jobs" in AigentZeroHermes (-1003820156994)
- Bot token: `836803270:AAH-Ac5Y`
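
The routing steps above can be sketched in Python. This is a minimal illustration, not the Hermes Gateway's actual code: the function names `build_telegram_message` and `send` are hypothetical, and the sketch assumes the gateway receives Alertmanager's standard webhook JSON (an `alerts` list with `status`, `labels`, and `annotations`) and forwards it with the Telegram Bot API `sendMessage` method, using `message_thread_id` to target the topic.

```python
import json
from urllib import request

TELEGRAM_API = "https://api.telegram.org"


def build_telegram_message(webhook: dict, chat_id: int, topic_id: int) -> dict:
    """Turn an Alertmanager webhook body into sendMessage parameters."""
    lines = []
    for alert in webhook.get("alerts", []):
        name = alert.get("labels", {}).get("alertname", "unknown")
        status = alert.get("status", "unknown").upper()
        summary = alert.get("annotations", {}).get("summary", "")
        lines.append(f"[{status}] {name} {summary}".strip())
    return {
        "chat_id": chat_id,
        "message_thread_id": topic_id,  # forum topic inside the group chat
        "text": "\n".join(lines) or "Alertmanager webhook with no alerts",
    }


def send(payload: dict, bot_token: str) -> None:
    """POST the payload to the Bot API (network call, assumed reachable)."""
    req = request.Request(
        f"{TELEGRAM_API}/bot{bot_token}/sendMessage",
        data=json.dumps(payload).encode(),
        headers={"Content-Type": "application/json"},
    )
    request.urlopen(req, timeout=10).close()
```

Keeping the message formatting separate from the network call makes the formatting easy to check against a sample payload without contacting Telegram.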

## Log Pipeline

```
Docker containers (all hosts)
  → Promtail (Docker socket service discovery)
    → Loki (ubuntu:3100)
      → Grafana dashboards
```

Promtail runs as a Docker container on [[ubuntu]], reading container logs via the Docker socket.
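
To check that logs are arriving in Loki without going through Grafana, a query URL for Loki's `query_range` HTTP endpoint can be built as in the sketch below. The helper name `loki_query_url` is hypothetical, and the `container` label is an assumption: it exists only if the Promtail relabeling config maps the Docker container name to a label with that name.

```python
from urllib.parse import urlencode

LOKI_URL = "http://ubuntu:3100"  # Loki address from the pipeline above


def loki_query_url(container: str, contains: str,
                   start_ns: int, end_ns: int, limit: int = 100) -> str:
    """Build a query_range URL for log lines of one container that contain a string."""
    # LogQL: stream selector on the (assumed) container label, then a line filter.
    logql = f'{{container="{container}"}} |= "{contains}"'
    params = urlencode(
        {"query": logql, "start": start_ns, "end": end_ns, "limit": limit}
    )
    return f"{LOKI_URL}/loki/api/v1/query_range?{params}"
```

`start_ns` and `end_ns` are Unix timestamps in nanoseconds, which is one of the formats Loki accepts for these parameters.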

## Scrape Targets

Prometheus monitors: ubuntu (local), proxmox, truenas, grizzley, ice, panda.

Scrape endpoints:
- `prometheus` (9090) — Prometheus itself
- `node-exporter` (9100) — host hardware metrics
- `blackbox-exporter` (9115) — HTTP/TCP/ICMP probing
- `cadvisor` (8080) — container metrics
- `loki` (3100) — log metrics
- Traefik instances (8080/metrics)
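
One way to keep the node-exporter target list in sync with the host list is to generate a Prometheus file-based service discovery (`file_sd`) document from it. The sketch below is an illustration only: whether this setup uses `file_sd` or static targets is not stated here, the function name is hypothetical, and it assumes each host name resolves from the Prometheus container.

```python
import json

# Hosts and port as listed in this section.
HOSTS = ["ubuntu", "proxmox", "truenas", "grizzley", "ice", "panda"]
NODE_EXPORTER_PORT = 9100


def node_exporter_targets(hosts, port=NODE_EXPORTER_PORT):
    """Return a file_sd document: a list of target groups with optional labels."""
    return [
        {
            "targets": [f"{host}:{port}" for host in hosts],
            "labels": {"job": "node-exporter"},
        }
    ]


if __name__ == "__main__":
    # Print JSON that a file_sd_configs entry could point at.
    print(json.dumps(node_exporter_targets(HOSTS), indent=2))
```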

## Blackbox Exporter Targets

15+ HTTPS probe targets configured. See `homelab/ubuntu/docker/monitoring/` for the blackbox exporter config.

## Alert Rules

Prometheus alert rules → Alertmanager → Hermes Gateway → Telegram.

Key alerts:
- `ContainerLogError` — Container logging errors detected by Promtail
- `ServiceDown` — Blackbox-probed service unavailable
- `JellyfinDown` — Jellyfin health check failed
- `TraefikDown` — Traefik not responding

See [[homelab-servicedown-triage]] and [[homelab-containerlogerror-triage]] skills for triage procedures.
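
During triage it can help to ask Prometheus directly which probes are failing. The sketch below builds an instant-query URL for the Prometheus HTTP API and extracts the failing instances from a response. It assumes that `ServiceDown` is driven by the blackbox exporter's `probe_success` metric, which the rule definitions would need to confirm; the function names are hypothetical.

```python
from urllib.parse import urlencode

PROMETHEUS_URL = "http://ubuntu:9090"  # Prometheus address from this page


def instant_query_url(expr: str) -> str:
    """URL for an instant query against /api/v1/query."""
    return f"{PROMETHEUS_URL}/api/v1/query?{urlencode({'query': expr})}"


def failing_instances(response: dict) -> list:
    """Sorted instance labels from the result of `probe_success == 0`."""
    if response.get("status") != "success":
        raise ValueError(f"query failed: {response.get('error', 'unknown error')}")
    # The comparison filters the vector, so every returned series is a failure.
    return sorted(
        series["metric"].get("instance", "unknown")
        for series in response["data"]["result"]
    )
```

Fetching `instant_query_url("probe_success == 0")` and passing the decoded JSON to `failing_instances` gives the list of probe targets that are currently down.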

## Hermes Gateway Watchdog

Hermes Gateway is monitored by a watchdog script on both [[ice]] and [[grizzley]]:

```
/home/bear/hermes-gateway-watchdog.sh
```

The script runs via **system cron** (not a systemd user service) on both hosts:
1. Checks if hermes-gateway is responsive
2. On failure: direct restart → tmux+OpenCode rescue if still down
3. Sends Telegram notification on failure to topic 1033 "Cron Jobs" (bot: `836803270:AAH-Ac5Y`)

**Note:** On [[grizzley]], the systemd override for the watchdog is deployed directly to `/etc/systemd/system/` (not tracked in the homelab repo — it's a system unit).
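
The escalation order in the numbered steps can be sketched as a small Python function. This is not the content of `hermes-gateway-watchdog.sh`, which is a shell script not shown here. The four callables are hypothetical stand-ins for the health check, the direct restart, the tmux+OpenCode rescue, and the Telegram notification, and the point at which the notification fires is an assumption.

```python
from typing import Callable


def run_watchdog(
    is_healthy: Callable[[], bool],
    restart: Callable[[], None],
    rescue: Callable[[], None],
    notify: Callable[[str], None],
) -> str:
    """Run one watchdog pass and report which path was taken."""
    if is_healthy():
        return "healthy"

    # Failure detected: notify first, then try the cheapest fix.
    notify("hermes-gateway unresponsive, attempting direct restart")
    restart()
    if is_healthy():
        return "restarted"

    # Still down after the restart: hand over to the rescue session.
    rescue()
    return "rescue-started"
```

Passing the actions in as callables keeps the escalation logic testable without touching a real service; a cron entry would call this once per interval.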

## External Uptime Monitoring

- **Uptime Kuma** (grizzley:3001) — external/internal availability checks
- **Blackbox Exporter** (ubuntu:9115) — 15+ HTTPS probe targets

## Dashboards

- Grafana (ubuntu:3000) — metrics dashboards
- Loki + Grafana — log exploration
- Prometheus (ubuntu:9090) — expression browser, alertmanager

## Related

- [[ubuntu]] — Hosts Prometheus, Grafana, Loki, Alertmanager
- [[grizzley]] — Hosts Hermes Agent, Telegram webhook, Uptime Kuma
- [[hermes-gateway]] — AI gateway with watchdog pattern
- [[traefik]] — Traefik metrics