# Skill: infra-monitor

Monitors server health and watches all Agent OS skills for staleness or errors. Runs on a cron schedule on 172.27.40.3.

## Inputs

Reads before executing:

- `../../identity.md`
- `../../brain.md`
- `../../memory/persistent.md`
- `learnings.md` (this skill's improvement notes)

## What to check

### Docker health (on 172.27.40.3)

- All expected containers are running (not exited/restarting)
- Flag any container that has restarted more than 3 times in the last hour
- Expected containers: portainer, nginx-proxy-manager, uptime-kuma, gitea, headscale, netbird, vaultwarden, flowise, plane, zabbix, homarr

### Service reachability

Lightweight HTTP check (curl, timeout 5s) on each internal URL:

- http://172.27.40.3:9443 (Portainer)
- http://172.27.40.3:3002 (Uptime Kuma)
- http://172.27.40.3:3000 (Gitea)
- http://172.27.40.3:3010 (Flowise)
- http://172.27.40.3:7575 (Homarr)
- http://172.27.6.139:11434 (Ollama)

### Agent watchdog

For each skill directory under `../../skills/`:

- Check `last-output.md` modification time; flag if older than the expected schedule
- Check `../../logs//` for ERROR entries in last run
- Report: healthy / stale / erroring

### System resources (on 172.27.40.3)

- Disk usage on `/`: warn if >80%, critical if >90%
- Memory usage: flag if >85%

## Output

Write a digest to `last-output.md` in this format:

- Summary line: X healthy, Y warnings, Z critical
- Section per category: Docker, Services, Agent Watchdog, System
- Each item: ✓ OK / ⚠ Warning / ✗ Critical + one line detail

Pass anomalies to `context/handoff.md` for the notification skill (future).

## Wrap-up

After writing output:

1. Update `learnings.md` with anything that went wrong or could be improved
2. Append a one-line log entry to `../../logs/infra-monitor.log`: `YYYY-MM-DD HH:MM | status | summary`
3. Update `../../memory/notes-from-last-run.md`

## Schedule

- **Heartbeat:** every hour, checks Docker + Ollama only (fast, <30s)
- **Full digest:** daily at 07:00, all checks
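
## Example: threshold classification and summary line

The System resources thresholds and the digest summary line described above could be sketched roughly as follows. This is a minimal illustration, not the skill's actual implementation: the function names (`classify_disk`, `classify_memory`, `format_item`, `summary_line`) are hypothetical, and treating "flag if >85%" for memory as a warning-level status is an assumption. Only the thresholds and the output wording come from this document.

```python
import shutil

# Status markers as listed in the Output section of this document.
SYMBOLS = {"ok": "✓ OK", "warning": "⚠ Warning", "critical": "✗ Critical"}


def disk_percent(path: str = "/") -> float:
    """Return the percentage of disk space used on the filesystem holding `path`."""
    usage = shutil.disk_usage(path)
    return usage.used / usage.total * 100


def classify_disk(percent_used: float) -> str:
    """Disk usage: warn if >80%, critical if >90%."""
    if percent_used > 90:
        return "critical"
    if percent_used > 80:
        return "warning"
    return "ok"


def classify_memory(percent_used: float) -> str:
    """Memory usage: flag if >85% (assumed here to map to a warning)."""
    return "warning" if percent_used > 85 else "ok"


def format_item(status: str, detail: str) -> str:
    """Render one digest item: status marker plus a one-line detail."""
    return f"{SYMBOLS[status]} {detail}"


def summary_line(statuses: list[str]) -> str:
    """Build the 'X healthy, Y warnings, Z critical' summary line."""
    healthy = statuses.count("ok")
    warnings = statuses.count("warning")
    critical = statuses.count("critical")
    return f"{healthy} healthy, {warnings} warnings, {critical} critical"


if __name__ == "__main__":
    disk = disk_percent("/")
    status = classify_disk(disk)
    print(format_item(status, f"Disk usage on /: {disk:.0f}%"))
    print(summary_line([status]))
```

Keeping the classification functions separate from the measurement (`disk_percent`) means the thresholds can be checked without touching the host, and the same `summary_line` helper could count statuses from the Docker, Services, and Agent Watchdog sections as well.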