# Skill: infra-monitor

Monitors server health and watches all Agent OS skills for staleness or errors. Runs on a cron schedule on 172.27.40.3.

## Inputs

Reads before executing:

- `../../identity.md`
- `../../brain.md`
- `../../memory/persistent.md`
- `learnings.md` (this skill's improvement notes)

## What to check

### Docker health (on 172.27.40.3)

- All expected containers are running (not exited/restarting)
- Flag any container that has restarted more than 3 times in the last hour
- Expected containers (grouped by criticality):

**Critical (alert immediately if down):**

- nginx-proxy-manager (reverse proxy; everything depends on this)
- gitea (all code + docs)
- citadel-mcp (central tool server)
- raven-notify (notification hub)
- open-webui (chat UI)
- vaultwarden (password manager)

**Important (alert after 15 min down):**

- headscale (VPN)
- grafana (monitoring dashboards)
- influxdb (time-series data)
- portainer (Docker management)
- uptime-kuma (HTTP monitoring)
- maester-reports (CSF compliance)
- jon-snow (orchestrator)
- tarly-backup (backup monitoring)
- directus + directus-db + directus-redis (CRM)

**Normal (report in daily digest only):**

- hodor-gateway, sam-research, qyburn-coder, searxng
- homarr, headplane, headscale-ui
- plane-* (all Plane containers)
- netbox-* (all NetBox containers)
- nocodb, bni-scheduler, inventree-*, wetty, term-dash
- rustdesk-hbbs, rustdesk-hbbr
- iventoy, agent-sites

### Service reachability

Lightweight HTTP check (curl, timeout 5s) on each internal URL:

- http://172.27.40.3:9443 (Portainer)
- http://172.27.40.3:3002 (Uptime Kuma)
- http://172.27.40.3:3000 (Gitea)
- http://172.27.40.3:3010 (Open WebUI)
- http://172.27.40.3:7575 (Homarr)
- http://172.27.40.3:8300 (Citadel MCP)
- http://172.27.40.3:8400 (Raven)
- http://172.27.40.3:8800 (Maester)
- http://172.27.40.3:8900 (Jon Snow)
- http://172.27.40.3:3020 (Grafana)
- http://172.27.40.3:8100 (NetBox)
- http://172.27.40.3:8850 (Directus)
- http://172.27.40.20:11434 (Ollama on NxM-AI)

### Agent watchdog

For each agent log at `../../logs//last-run.json`:

- Check modification time: flag if older than expected schedule
- Check `status` field: flag if not "success"
- Expected agents and max staleness:
  - bran-changelog: 25 hours (daily)
  - varys-monitor: 20 minutes (every 15 min)
  - trmm-frappe-sync: 35 minutes (every 30 min)
  - tarly-backup: 25 hours (daily)
  - raven-notify: 25 hours (event-driven, check status only)
  - citadel-mcp, sam-research, qyburn-coder, jon-snow: check status only (on-demand)

### System resources (on 172.27.40.3)

- Disk usage on /: warn if >80%, critical if >90%
- Memory usage: flag if >85%
- Docker disk usage (`docker system df`): warn if reclaimable > 10GB

### Remote hosts (optional, best-effort)

- Ping 172.27.40.20 (Ollama host)
- Ping 172.27.40.30 (Hermes Native VM)
- Ping 172.27.40.2 (Proxmox)

## Output

Write a digest to `last-output.md` in this format:

- Summary line: X healthy, Y warnings, Z critical
- Section per category: Docker, Services, Agent Watchdog, System, Remote Hosts
- Each item: ✓ OK / ⚠ Warning / ✗ Critical + one line detail

Also write machine-readable output to `../../logs/infra-monitor/last-run.json`.

Pass anomalies to `context/handoff.md` for Raven notification.

## Wrap-up

After writing output:

1. Update `learnings.md` with anything that went wrong or could be improved
2. Append a one-line log entry to `../../logs/infra-monitor.log`: `YYYY-MM-DD HH:MM | status | summary`
3. Update `../../memory/notes-from-last-run.md`

## Schedule

- **Heartbeat:** every hour; checks Docker + Ollama + critical services only (fast, <30s)
- **Full digest:** daily at 07:00; all checks including remote hosts and disk usage
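
## Example: agent watchdog check

The agent watchdog rules above (modification time against a per-agent limit, plus the `status` field) could be sketched roughly as follows. This is an illustration under assumptions, not the skill's actual implementation: the function name `check_agent`, the `MAX_STALENESS` table layout, and the wording of the flag messages are hypothetical. The log path is passed in by the caller, and a missing or unreadable log is treated as a flag, which the text above does not specify.

```python
import json
import time
from pathlib import Path

# Max staleness in seconds, copied from the "Agent watchdog" list.
# None means "check status only" (on-demand agents).
MAX_STALENESS = {
    "bran-changelog": 25 * 3600,
    "varys-monitor": 20 * 60,
    "trmm-frappe-sync": 35 * 60,
    "tarly-backup": 25 * 3600,
    # The list gives 25 hours but also says "check status only";
    # keeping the 25-hour limit here is an assumption.
    "raven-notify": 25 * 3600,
    "citadel-mcp": None,
    "sam-research": None,
    "qyburn-coder": None,
    "jon-snow": None,
}


def check_agent(name, log_path, now=None):
    """Return a list of flag messages for one agent; an empty list means OK."""
    now = time.time() if now is None else now
    path = Path(log_path)
    try:
        age = now - path.stat().st_mtime
        data = json.loads(path.read_text(encoding="utf-8"))
    except (OSError, ValueError) as exc:
        # Assumption: a missing or unparsable log is reported as a flag.
        return [f"{name}: cannot read last-run.json ({exc})"]

    flags = []
    limit = MAX_STALENESS.get(name)
    if limit is not None and age > limit:
        flags.append(
            f"{name}: stale, last run {age / 60:.0f} min ago "
            f"(limit {limit / 60:.0f} min)"
        )

    status = data.get("status") if isinstance(data, dict) else None
    if status != "success":
        flags.append(f"{name}: status is {status!r}, expected 'success'")
    return flags
```

A caller would loop over the agent names, pass each agent's `last-run.json` path, and map any returned flags to ⚠ or ✗ lines in the Agent Watchdog section of the digest. The optional `now` argument exists only to make the check testable against a fixed clock.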