Skill: infra-monitor
Monitors server health and watches all Agent OS skills for staleness or errors. Runs on a cron schedule on 172.27.40.3.
Inputs
Reads before executing:
- ../../identity.md
- ../../brain.md
- ../../memory/persistent.md
- learnings.md (this skill's improvement notes)
What to check
Docker health (on 172.27.40.3)
- All expected containers are running (not exited/restarting)
- Flag any container that has restarted more than 3 times in the last hour
- Expected containers: portainer, nginx-proxy-manager, uptime-kuma, gitea, headscale, netbird, vaultwarden, flowise, plane, zabbix, homarr
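The running-state check above could be sketched as follows. This is a minimal illustration, not the skill's actual implementation: the function names are hypothetical, and it assumes the container names reported by Docker match the expected list exactly and that the `docker` CLI is available on the host.

```python
import subprocess

# Container names copied from the "Expected containers" list above.
EXPECTED = [
    "portainer", "nginx-proxy-manager", "uptime-kuma", "gitea", "headscale",
    "netbird", "vaultwarden", "flowise", "plane", "zabbix", "homarr",
]


def parse_docker_ps(output: str) -> dict[str, str]:
    """Parse lines of 'name<TAB>state' into a {name: state} mapping."""
    states = {}
    for line in output.splitlines():
        if "\t" in line:
            name, state = line.split("\t", 1)
            states[name.strip()] = state.strip()
    return states


def find_unhealthy(states: dict[str, str], expected=EXPECTED) -> dict[str, str]:
    """Return {name: state} for expected containers that are not running."""
    problems = {}
    for name in expected:
        # A container absent from the listing is reported as "missing".
        state = states.get(name, "missing")
        if state != "running":
            problems[name] = state
    return problems


def collect_states() -> dict[str, str]:
    """Ask the local Docker CLI for every container's name and state."""
    result = subprocess.run(
        ["docker", "ps", "-a", "--format", "{{.Names}}\t{{.State}}"],
        capture_output=True, text=True, check=True,
    )
    return parse_docker_ps(result.stdout)
```

Calling `find_unhealthy(collect_states())` on the host yields the containers to flag. The restart rule is not covered here: Docker's `RestartCount` is cumulative, so counting restarts within the last hour would likely need a snapshot saved by the previous run.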
Service reachability
Lightweight HTTP check (curl, timeout 5s) on each internal URL:
- http://172.27.40.3:9443 (Portainer)
- http://172.27.40.3:3002 (Uptime Kuma)
- http://172.27.40.3:3000 (Gitea)
- http://172.27.40.3:3010 (Flowise)
- http://172.27.40.3:7575 (Homarr)
- http://172.27.6.139:11434 (Ollama)
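As a rough equivalent of the curl check above, the sketch below uses Python's standard `urllib` instead of curl. The name `is_reachable` is hypothetical, and treating any HTTP response (including 4xx/5xx) as "reachable" is an assumption; the skill may want stricter status handling.

```python
import http.client
import urllib.error
import urllib.request

# URLs copied from the list above.
SERVICES = {
    "Portainer": "http://172.27.40.3:9443",
    "Uptime Kuma": "http://172.27.40.3:3002",
    "Gitea": "http://172.27.40.3:3000",
    "Flowise": "http://172.27.40.3:3010",
    "Homarr": "http://172.27.40.3:7575",
    "Ollama": "http://172.27.6.139:11434",
}


def is_reachable(url: str, timeout: float = 5.0) -> bool:
    """Return True if the server sends any HTTP response within the timeout."""
    try:
        with urllib.request.urlopen(url, timeout=timeout):
            return True
    except urllib.error.HTTPError:
        # The server answered with an error status; it is still reachable.
        return True
    except (urllib.error.URLError, http.client.HTTPException, OSError, ValueError):
        # Connection refused, DNS failure, timeout, bad response, or malformed URL.
        return False


def check_all(services=SERVICES, timeout: float = 5.0) -> dict[str, bool]:
    """Check every service and return {name: reachable}."""
    return {name: is_reachable(url, timeout) for name, url in services.items()}
```

The checks run sequentially, so with six URLs and a 5 s timeout the worst case is about 30 s if every service is down.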
Agent watchdog
For each skill directory under ../../skills/:
- Check last-output.md modification time: flag if older than expected schedule
- Check ../../logs/<skill-name>/ for ERROR entries in last run
- Report: healthy / stale / erroring
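The watchdog classification could look roughly like the sketch below. Several details are assumptions rather than part of the spec: "last run" is taken to mean the most recently modified file in the skill's log directory, the allowed age is passed in by the caller (the spec does not say where each skill's schedule is recorded), and "erroring" takes precedence over "stale". The function name is hypothetical.

```python
import time
from pathlib import Path
from typing import Optional


def classify_skill(skill_dir: Path, log_dir: Path, max_age_seconds: float,
                   now: Optional[float] = None) -> str:
    """Return 'healthy', 'stale', or 'erroring' for one skill directory."""
    now = time.time() if now is None else now

    # Erroring: the most recently modified log file contains an ERROR entry.
    if log_dir.is_dir():
        logs = [p for p in log_dir.iterdir() if p.is_file()]
        if logs:
            latest = max(logs, key=lambda p: p.stat().st_mtime)
            if "ERROR" in latest.read_text(errors="replace"):
                return "erroring"

    # Stale: last-output.md is missing or older than the allowed age.
    output = skill_dir / "last-output.md"
    if not output.exists() or now - output.stat().st_mtime > max_age_seconds:
        return "stale"
    return "healthy"
```

For an hourly skill, a caller might pass a `max_age_seconds` slightly above 3600 to tolerate small scheduling delays.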
System resources (on 172.27.40.3)
- Disk usage on / — warn if >80%, critical if >90%
- Memory usage — flag if >85%
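The thresholds above map onto a small classifier, sketched here with hypothetical function names. It assumes a Linux host and treats "memory usage" as MemTotal minus MemAvailable from /proc/meminfo, which is one common reading and not something the spec defines. Note that `shutil.disk_usage` can differ slightly from the Use% column in `df`, which accounts for reserved blocks.

```python
import shutil


def classify_disk(percent_used: float) -> str:
    """Warn above 80%, critical above 90%."""
    if percent_used > 90:
        return "critical"
    if percent_used > 80:
        return "warning"
    return "ok"


def classify_memory(percent_used: float) -> str:
    """Flag memory usage above 85%."""
    return "warning" if percent_used > 85 else "ok"


def disk_percent(path: str = "/") -> float:
    """Percentage of the filesystem at `path` that is in use."""
    usage = shutil.disk_usage(path)
    return usage.used / usage.total * 100


def memory_percent(meminfo_text: str) -> float:
    """Used-memory percentage computed from the text of /proc/meminfo."""
    fields = {}
    for line in meminfo_text.splitlines():
        key, _, rest = line.partition(":")
        parts = rest.split()
        if parts:
            fields[key.strip()] = int(parts[0])  # values are reported in kB
    total = fields["MemTotal"]
    available = fields["MemAvailable"]
    return (total - available) / total * 100
```

On the host, `classify_disk(disk_percent("/"))` and `classify_memory(memory_percent(open("/proc/meminfo").read()))` give the two statuses.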
Output
Write a digest to last-output.md in this format:
- Summary line: X healthy, Y warnings, Z critical
- Section per category: Docker, Services, Agent Watchdog, System
- Each item: ✓ OK / ⚠ Warning / ✗ Critical + one line detail
Pass anomalies to context/handoff.md for a future notification skill (not yet implemented).
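One possible rendering of the digest format is sketched below. The exact line layout (separators, blank lines between sections, the "Summary:" prefix) is an assumption, since the spec only names the parts; `render_digest` and the level keys `ok` / `warning` / `critical` are hypothetical.

```python
from collections import Counter

SYMBOLS = {"ok": "✓ OK", "warning": "⚠ Warning", "critical": "✗ Critical"}
CATEGORIES = ["Docker", "Services", "Agent Watchdog", "System"]


def render_digest(results: dict[str, list[tuple[str, str, str]]]) -> str:
    """Render a digest from {category: [(item, level, detail), ...]}.

    `level` must be one of the SYMBOLS keys.
    """
    counts = Counter(level for items in results.values() for _, level, _ in items)
    lines = [
        f"Summary: {counts['ok']} healthy, {counts['warning']} warnings, "
        f"{counts['critical']} critical",
        "",
    ]
    # Emit every category in a fixed order, even when it has no items.
    for category in CATEGORIES:
        lines.append(category)
        for item, level, detail in results.get(category, []):
            lines.append(f"- {SYMBOLS[level]}: {item}: {detail}")
        lines.append("")
    return "\n".join(lines).rstrip() + "\n"
```

The returned string can be written directly to last-output.md; items whose level is not `ok` are the natural candidates for context/handoff.md.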
Wrap-up
After writing output:
- Update learnings.md with anything that went wrong or could be improved
- Append a one-line log entry to ../../logs/infra-monitor.log: YYYY-MM-DD HH:MM | status | summary
- Update ../../memory/notes-from-last-run.md
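The log entry format from the wrap-up steps can be produced with a short helper like the one below. The function names are hypothetical, and collapsing newlines in the summary is an added safeguard (an assumption) to keep each entry on one line.

```python
from datetime import datetime
from pathlib import Path


def format_log_entry(when: datetime, status: str, summary: str) -> str:
    """Build 'YYYY-MM-DD HH:MM | status | summary' as a single line."""
    # Collapse any whitespace runs, including newlines, into single spaces.
    one_line = " ".join(summary.split())
    return f"{when:%Y-%m-%d %H:%M} | {status} | {one_line}"


def append_log_entry(log_path: Path, entry: str) -> None:
    """Append one entry to the log file, creating parent directories if needed."""
    log_path.parent.mkdir(parents=True, exist_ok=True)
    with log_path.open("a", encoding="utf-8") as handle:
        handle.write(entry + "\n")
```

For example, `append_log_entry(Path("../../logs/infra-monitor.log"), format_log_entry(datetime.now(), "ok", "11 healthy, 0 warnings, 0 critical"))` appends one line per run.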
Schedule
- Heartbeat: every hour — checks Docker + Ollama only (fast, <30s)
- Full digest: daily at 07:00 — all checks
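If a single cron entry invokes the skill at the top of every hour, the choice between the two modes could be made inside the script, as in this sketch. The check names and `checks_for` are hypothetical, and replacing the heartbeat with the full digest at 07:00 (instead of running both) is an assumption the schedule above does not settle.

```python
from datetime import datetime

# Hypothetical identifiers for the check groups described in this document.
HEARTBEAT_CHECKS = ["docker", "ollama"]
FULL_CHECKS = ["docker", "services", "agent_watchdog", "system"]


def checks_for(now: datetime) -> list[str]:
    """Pick the check set for a run started at `now` (server local time)."""
    if now.hour == 7 and now.minute == 0:
        return list(FULL_CHECKS)
    return list(HEARTBEAT_CHECKS)
```

An alternative is two separate cron entries, one per mode, which keeps the script free of time logic.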