agent-os/skills/infra-monitor/skill.md

# Skill: infra-monitor

Monitors server health and watches all Agent OS skills for staleness or errors. Runs on a cron schedule on 172.27.40.3.

## Inputs

Reads before executing:
- `../../identity.md`
- `../../brain.md`
- `../../memory/persistent.md`
- `learnings.md` (this skill's improvement notes)

## What to check

### Docker health (on 172.27.40.3)
- All expected containers are running (not exited/restarting)
- Flag any container that has restarted more than 3 times in the last hour
- Expected containers (grouped by criticality):

**Critical (alert immediately if down):**
- nginx-proxy-manager (reverse proxy — everything depends on this)
- gitea (all code + docs)
- citadel-mcp (central tool server)
- raven-notify (notification hub)
- open-webui (chat UI)
- vaultwarden (password manager)

**Important (alert after 15 min down):**
- headscale (VPN)
- grafana (monitoring dashboards)
- influxdb (time-series data)
- portainer (Docker management)
- uptime-kuma (HTTP monitoring)
- maester-reports (CSF compliance)
- jon-snow (orchestrator)
- tarly-backup (backup monitoring)
- directus + directus-db + directus-redis (CRM)

**Normal (report in daily digest only):**
- hodor-gateway, sam-research, qyburn-coder, searxng
- homarr, headplane, headscale-ui
- plane-* (all Plane containers)
- netbox-* (all NetBox containers)
- nocodb, bni-scheduler, inventree-*, wetty, term-dash
- rustdesk-hbbs, rustdesk-hbbr
- iventoy, agent-sites
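
The tiered check above could be evaluated roughly as follows. This is a minimal sketch, assuming container state comes from `docker ps -a --format '{{json .}}'` (one JSON object per line, with `Names` and `State` fields). The tier table is abbreviated, and `parse_docker_ps`, `find_problems`, and `docker_states` are hypothetical helper names. The per-hour restart rule is not covered here; it would likely need state kept between runs.

```python
import json
import subprocess

# Abbreviated tier table (assumption: the full lists live in this skill file).
TIERS = {
    "critical": ["nginx-proxy-manager", "gitea", "citadel-mcp",
                 "raven-notify", "open-webui", "vaultwarden"],
    "important": ["headscale", "grafana", "influxdb", "portainer"],
}


def parse_docker_ps(output: str) -> dict[str, str]:
    """Map container name -> state from `docker ps -a --format '{{json .}}'`."""
    states = {}
    for line in output.splitlines():
        if line.strip():
            row = json.loads(line)
            states[row["Names"]] = row["State"]
    return states


def find_problems(states: dict[str, str]) -> list[tuple[str, str, str]]:
    """Return (tier, name, state) for every expected container not running."""
    problems = []
    for tier, names in TIERS.items():
        for name in names:
            # A container absent from `docker ps -a` is reported as "missing".
            state = states.get(name, "missing")
            if state != "running":
                problems.append((tier, name, state))
    return problems


def docker_states() -> dict[str, str]:
    """Query the local Docker daemon (requires the docker CLI on PATH)."""
    result = subprocess.run(
        ["docker", "ps", "-a", "--format", "{{json .}}"],
        capture_output=True, text=True, check=True,
    )
    return parse_docker_ps(result.stdout)
```

Keeping the parsing and the tier lookup separate from the `docker` call means the logic can be exercised with canned output, without a Docker daemon.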

### Service reachability
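Run a lightweight HTTP check (`curl`, 5 s timeout) against each internal URL:
- https://172.27.40.3:9443 (Portainer; 9443 is Portainer's default HTTPS port, so use `curl -k` if the certificate is self-signed)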
- http://172.27.40.3:3002 (Uptime Kuma)
- http://172.27.40.3:3000 (Gitea)
- http://172.27.40.3:3010 (Open WebUI)
- http://172.27.40.3:7575 (Homarr)
- http://172.27.40.3:8300 (Citadel MCP)
- http://172.27.40.3:8400 (Raven)
- http://172.27.40.3:8800 (Maester)
- http://172.27.40.3:8900 (Jon Snow)
- http://172.27.40.3:3020 (Grafana)
- http://172.27.40.3:8100 (NetBox)
- http://172.27.40.3:8850 (Directus)
- http://172.27.40.20:11434 (Ollama on NxM-AI)
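
For reference, the same reachability check can be sketched in Python instead of `curl`. It assumes that any HTTP response, including a 401 or a 404, counts as "reachable", since the aim is only to confirm the service answers. That policy and the name `check_url` are assumptions, not part of this skill.

```python
import urllib.error
import urllib.request


def check_url(url: str, timeout: float = 5.0) -> tuple[bool, str]:
    """Return (reachable, detail). Any HTTP response counts as reachable."""
    try:
        with urllib.request.urlopen(url, timeout=timeout) as resp:
            return True, f"HTTP {resp.status}"
    except urllib.error.HTTPError as exc:
        # The server answered, just not with a 2xx (e.g. 401 or 404).
        return True, f"HTTP {exc.code}"
    except (urllib.error.URLError, OSError) as exc:
        # Covers connection refused, DNS failure, and timeouts.
        return False, str(exc)
```

A caller that wants stricter semantics (for example, treating 5xx as a warning) can inspect the returned detail string.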

### Agent watchdog
For each agent log at `../../logs/<agent>/last-run.json`:
- Flag if the file's modification time is older than the agent's max staleness (listed below)
- Flag if the `status` field is anything other than "success"
- Expected agents and max staleness:
  - bran-changelog: 25 hours (daily)
  - varys-monitor: 20 minutes (every 15 min)
  - trmm-frappe-sync: 35 minutes (every 30 min)
  - tarly-backup: 25 hours (daily)
  - raven-notify: 25 hours (event-driven, check status only)
  - citadel-mcp, sam-research, qyburn-coder, jon-snow: check status only (on-demand)
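
The two watchdog rules above can be sketched as a single function. This is a minimal illustration, assuming `last-run.json` is a JSON object with a top-level `status` key. The table is abbreviated, and `check_agent` and `MAX_STALENESS` are hypothetical names.

```python
import json
import os
import time
from typing import Optional

# Max staleness in seconds; None means "check status only".
MAX_STALENESS = {
    "bran-changelog": 25 * 3600,
    "varys-monitor": 20 * 60,
    "trmm-frappe-sync": 35 * 60,
    "tarly-backup": 25 * 3600,
    "citadel-mcp": None,
}


def check_agent(path: str, max_age: Optional[float],
                now: Optional[float] = None) -> list[str]:
    """Return a list of problems for one last-run.json (empty means healthy)."""
    now = time.time() if now is None else now
    try:
        mtime = os.path.getmtime(path)
        with open(path, encoding="utf-8") as fh:
            data = json.load(fh)
    except (OSError, json.JSONDecodeError) as exc:
        return [f"unreadable: {exc}"]

    status = data.get("status") if isinstance(data, dict) else None
    problems = []
    if max_age is not None and now - mtime > max_age:
        problems.append(f"stale: last run {int((now - mtime) // 60)} min ago")
    if status != "success":
        problems.append(f"status is {status!r}")
    return problems
```

Passing `now` explicitly keeps the staleness rule testable without waiting for real time to pass.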

### System resources (on 172.27.40.3)
- Disk usage on / — warn if >80%, critical if >90%
- Memory usage — flag if >85%
- Docker disk usage (`docker system df`) — warn if reclaimable > 10GB
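
The disk and memory thresholds above can be sketched like this. It assumes a Linux host where memory figures come from `/proc/meminfo`, and it treats "memory usage" as `1 - MemAvailable / MemTotal`, which is one reasonable reading rather than a definition taken from this skill. `shutil.disk_usage` may report a slightly different percentage than `df`, which accounts for reserved blocks. The Docker reclaimable-space check is not covered here.

```python
import shutil


def classify_disk(percent_used: float) -> str:
    """Apply the disk thresholds: warn above 80%, critical above 90%."""
    if percent_used > 90:
        return "critical"
    if percent_used > 80:
        return "warning"
    return "ok"


def disk_percent(path: str = "/") -> float:
    """Percent of the filesystem at `path` that is in use."""
    usage = shutil.disk_usage(path)
    return 100.0 * usage.used / usage.total


def memory_percent(meminfo_text: str) -> float:
    """Percent of memory in use, from the text of /proc/meminfo."""
    fields = {}
    for line in meminfo_text.splitlines():
        key, _, rest = line.partition(":")
        if rest.strip():
            fields[key] = int(rest.split()[0])  # values are in kB
    return 100.0 * (1 - fields["MemAvailable"] / fields["MemTotal"])
```

On the monitored host, `memory_percent(open("/proc/meminfo").read()) > 85` would trigger the memory flag.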

### Remote hosts (optional, best-effort)
- Ping 172.27.40.20 (Ollama host)
- Ping 172.27.40.30 (Hermes Native VM)
- Ping 172.27.40.2 (Proxmox)
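
Because these checks are best-effort, a sketch of the ping step should never raise. The version below assumes the Linux `ping` binary (`-c` for count, `-W` for the reply timeout in seconds). The function names are hypothetical.

```python
import subprocess


def ping_command(host: str, timeout_s: int = 2) -> list[str]:
    """Build a single-packet ping command (Linux iputils flags)."""
    return ["ping", "-c", "1", "-W", str(timeout_s), host]


def is_reachable(host: str, timeout_s: int = 2) -> bool:
    """Best-effort ping: any failure, including a missing binary, is False."""
    try:
        result = subprocess.run(
            ping_command(host, timeout_s),
            stdout=subprocess.DEVNULL,
            stderr=subprocess.DEVNULL,
            timeout=timeout_s + 1,  # hard stop in case ping itself hangs
        )
    except (OSError, subprocess.TimeoutExpired):
        return False
    return result.returncode == 0
```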

## Output

Write a digest to `last-output.md` in this format:
- Summary line: X healthy, Y warnings, Z critical
- Section per category: Docker, Services, Agent Watchdog, System, Remote Hosts
- Each item: ✓ OK / ⚠ Warning / ✗ Critical + one line detail

Also write machine-readable output to `../../logs/infra-monitor/last-run.json`.

Pass anomalies to `context/handoff.md` for Raven notification.
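
One way the digest and the machine-readable output could be rendered is sketched below. It assumes each check yields a `(level, name, detail)` tuple grouped by category. Apart from the `status` field that the watchdog reads, the JSON layout shown is hypothetical.

```python
import json

SYMBOLS = {"ok": "✓ OK", "warning": "⚠ Warning", "critical": "✗ Critical"}

Results = dict[str, list[tuple[str, str, str]]]


def render_digest(results: Results) -> str:
    """Render the summary line plus one section per category."""
    counts = {level: 0 for level in SYMBOLS}
    for items in results.values():
        for level, _name, _detail in items:
            counts[level] += 1
    lines = [
        f"{counts['ok']} healthy, {counts['warning']} warnings, "
        f"{counts['critical']} critical",
        "",
    ]
    for category, items in results.items():
        lines.append(f"## {category}")
        for level, name, detail in items:
            lines.append(f"- {SYMBOLS[level]} {name}: {detail}")
        lines.append("")
    return "\n".join(lines)


def render_json(results: Results) -> str:
    """Serialize the same results for last-run.json (hypothetical schema)."""
    payload = {
        "status": "success",  # the run itself completed
        "checks": {
            category: [
                {"level": level, "name": name, "detail": detail}
                for level, name, detail in items
            ]
            for category, items in results.items()
        },
    }
    return json.dumps(payload, indent=2, ensure_ascii=False)
```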

## Wrap-up

After writing output:
1. Update `learnings.md` with anything that went wrong or could be improved
2. Append a one-line log entry to `../../logs/infra-monitor.log`: `YYYY-MM-DD HH:MM | status | summary`
3. Update `../../memory/notes-from-last-run.md`
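
The log entry format from step 2 can be produced as sketched below. The helper names are hypothetical, and the summary is collapsed to a single line so that each run stays on one row.

```python
from datetime import datetime


def format_log_line(when: datetime, status: str, summary: str) -> str:
    """Format `YYYY-MM-DD HH:MM | status | summary` on a single line."""
    one_line = " ".join(summary.split())  # collapse newlines and extra spaces
    return f"{when:%Y-%m-%d %H:%M} | {status} | {one_line}"


def append_log(path: str, line: str) -> None:
    """Append one entry to the log file, creating it if needed."""
    with open(path, "a", encoding="utf-8") as fh:
        fh.write(line + "\n")
```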

## Schedule

- **Heartbeat:** every hour — checks Docker + Ollama + critical services only (fast, <30s)
- **Full digest:** daily at 07:00 — all checks including remote hosts and disk usage