Skill: infra-monitor

Monitors server health and watches all Agent OS skills for staleness or errors. Runs on a cron schedule on 172.27.40.3.

Inputs

Reads before executing:

  • ../../identity.md
  • ../../brain.md
  • ../../memory/persistent.md
  • learnings.md (this skill's improvement notes)

What to check

Docker health (on 172.27.40.3)

  • All expected containers are running (not exited/restarting)
  • Flag any container that has restarted more than 3 times in the last hour
  • Expected containers: portainer, nginx-proxy-manager, uptime-kuma, gitea, headscale, netbird, vaultwarden, flowise, plane, zabbix, homarr
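
The Docker rules above can be sketched in Python roughly as follows. This is a minimal sketch, not the skill's actual implementation: the function names, the severity mapping (missing or non-running containers treated as critical, excessive restarts as a warning), and the `restarts_last_hour` input are assumptions. Docker reports only a cumulative restart count per container, so the per-hour figure is assumed to be derived elsewhere, for example by comparing counts between heartbeat runs.

```python
import json
import subprocess

# Container names taken from the list above.
EXPECTED = [
    "portainer", "nginx-proxy-manager", "uptime-kuma", "gitea", "headscale",
    "netbird", "vaultwarden", "flowise", "plane", "zabbix", "homarr",
]


def list_container_states():
    """Return {name: state} for all containers, running or stopped.

    Assumes a Docker CLI whose `docker ps` JSON output includes the
    "Names" and "State" fields.
    """
    out = subprocess.run(
        ["docker", "ps", "--all", "--format", "{{json .}}"],
        capture_output=True, text=True, check=True,
    ).stdout
    states = {}
    for line in out.splitlines():
        if line.strip():
            row = json.loads(line)
            states[row["Names"]] = row["State"]
    return states


def classify_containers(states, restarts_last_hour, expected=EXPECTED, max_restarts=3):
    """Map each expected container to a (severity, detail) pair.

    restarts_last_hour: hypothetical {name: count} input computed elsewhere.
    """
    findings = {}
    for name in expected:
        state = states.get(name)
        restarts = restarts_last_hour.get(name, 0)
        if state is None:
            findings[name] = ("critical", "container not found")
        elif state != "running":
            findings[name] = ("critical", f"state is {state}")
        elif restarts > max_restarts:
            findings[name] = ("warning", f"{restarts} restarts in the last hour")
        else:
            findings[name] = ("ok", "running")
    return findings
```

Keeping the classification separate from the `docker` call lets the thresholds be exercised without a Docker daemon.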

Service reachability

Lightweight HTTP check (curl, timeout 5s) on each internal URL:
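
A single reachability check of this kind might look like the sketch below. It swaps curl for Python's standard `urllib` while keeping the 5 second timeout; the function name `check_url`, the injectable `opener` parameter, and the rule that any 2xx or 3xx response counts as reachable are assumptions, not part of the skill.

```python
import urllib.error
import urllib.request


def check_url(url, timeout=5.0, opener=urllib.request.urlopen):
    """Return (ok, detail) for one URL; network errors are reported, not raised."""
    try:
        with opener(url, timeout=timeout) as resp:
            status = resp.status
    except urllib.error.HTTPError as exc:
        # urlopen raises HTTPError for 4xx/5xx responses.
        return False, f"HTTP {exc.code}"
    except (urllib.error.URLError, OSError) as exc:
        # Covers DNS failures, refused connections, and timeouts.
        return False, f"unreachable: {exc}"
    return (200 <= status < 400), f"HTTP {status}"
```

The `opener` argument exists only so the function can be tested without network access; in normal use it is left at its default.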

Agent watchdog

For each skill directory under ../../skills/:

  • Check the modification time of last-output.md; flag the skill as stale if the file is older than that skill's scheduled interval
  • Check ../../logs/<skill-name>/ for ERROR entries from the last run
  • Report: healthy / stale / erroring
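
The watchdog classification above could be sketched like this. Several details are assumptions rather than part of the skill: the name `skill_status`, treating the most recently modified file in the log directory as "the last run", treating a missing last-output.md as stale, and letting "erroring" take precedence over "stale".

```python
import time
from pathlib import Path


def skill_status(skill_dir, log_dir, max_age_seconds, now=None):
    """Classify one skill as 'healthy', 'stale', or 'erroring'."""
    now = time.time() if now is None else now
    skill_dir, log_dir = Path(skill_dir), Path(log_dir)

    # Assumption: the newest file in the log directory is the last run's log.
    logs = []
    if log_dir.is_dir():
        logs = sorted(
            (p for p in log_dir.iterdir() if p.is_file()),
            key=lambda p: p.stat().st_mtime,
        )
    if logs:
        last_log = logs[-1].read_text(errors="replace")
        if any("ERROR" in line for line in last_log.splitlines()):
            return "erroring"

    # Stale: output file missing, or older than the allowed interval.
    output = skill_dir / "last-output.md"
    if not output.exists() or now - output.stat().st_mtime > max_age_seconds:
        return "stale"
    return "healthy"
```

For a skill expected to run hourly, a call might be `skill_status("../../skills/infra-monitor", "../../logs/infra-monitor", 3600)`, possibly with some slack added to the interval.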

System resources (on 172.27.40.3)

  • Disk usage on / — warn if >80%, critical if >90%
  • Memory usage — flag if >85%
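
These thresholds translate directly into code. The sketch below is one possible reading: function names are hypothetical, the memory figure is assumed to come from `MemAvailable` in `/proc/meminfo` (Linux only), and the memory flag is assumed to map to a warning. Note that `shutil.disk_usage` computes usage against the total size, which can differ slightly from the percentage `df` prints.

```python
import shutil


def classify_disk(percent_used):
    """Apply the disk thresholds: >90 critical, >80 warning."""
    if percent_used > 90:
        return "critical"
    if percent_used > 80:
        return "warning"
    return "ok"


def classify_memory(percent_used):
    """Apply the memory threshold: >85 is flagged (assumed to be a warning)."""
    return "warning" if percent_used > 85 else "ok"


def disk_percent(path="/"):
    """Percentage of the filesystem at `path` that is in use."""
    usage = shutil.disk_usage(path)
    return 100.0 * usage.used / usage.total


def memory_percent(meminfo_path="/proc/meminfo"):
    """Percentage of memory in use, based on MemTotal and MemAvailable."""
    fields = {}
    with open(meminfo_path) as fh:
        for line in fh:
            key, _, rest = line.partition(":")
            if rest.split():
                fields[key] = int(rest.split()[0])
    return 100.0 * (1 - fields["MemAvailable"] / fields["MemTotal"])
```

A run would then call `classify_disk(disk_percent("/"))` and `classify_memory(memory_percent())`.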

Output

Write a digest to last-output.md in this format:

  • Summary line: X healthy, Y warnings, Z critical
  • Section per category: Docker, Services, Agent Watchdog, System
  • Each item: ✓ OK / ⚠ Warning / ✗ Critical + one line detail
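
One way to render a digest in this shape is sketched below. The input structure (a dict of category name to `(name, severity, detail)` tuples), the function name `render_digest`, and the exact line layout are assumptions; only the summary wording, category names, and status symbols come from the format above.

```python
SYMBOLS = {"ok": "✓ OK", "warning": "⚠ Warning", "critical": "✗ Critical"}
CATEGORIES = ["Docker", "Services", "Agent Watchdog", "System"]


def render_digest(results):
    """Render results ({category: [(name, severity, detail), ...]}) as text."""
    counts = {"ok": 0, "warning": 0, "critical": 0}
    body = []
    for category in CATEGORIES:
        body.append(category)
        for name, severity, detail in results.get(category, []):
            counts[severity] += 1
            body.append(f"  {SYMBOLS[severity]} {name}: {detail}")
        body.append("")
    summary = (
        f"{counts['ok']} healthy, {counts['warning']} warnings, "
        f"{counts['critical']} critical"
    )
    return "\n".join([summary, ""] + body).rstrip() + "\n"
```

The returned string can be written to last-output.md as is.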

Write any anomalies to context/handoff.md so that a future notification skill can pick them up.

Wrap-up

After writing output:

  1. Update learnings.md with anything that went wrong or could be improved
  2. Append a one-line log entry to ../../logs/infra-monitor.log: YYYY-MM-DD HH:MM | status | summary
  3. Update ../../memory/notes-from-last-run.md
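
The log entry in step 2 could be produced with a small helper such as the one below. The function names are hypothetical, and collapsing whitespace in the summary is an added assumption to keep each entry on one line.

```python
from datetime import datetime


def format_log_entry(status, summary, when=None):
    """Build one 'YYYY-MM-DD HH:MM | status | summary' line."""
    when = when or datetime.now()
    one_line = " ".join(summary.split())  # collapse newlines and extra spaces
    return f"{when:%Y-%m-%d %H:%M} | {status} | {one_line}\n"


def append_log_entry(log_path, status, summary, when=None):
    """Append the formatted entry to the log file, creating it if needed."""
    with open(log_path, "a", encoding="utf-8") as fh:
        fh.write(format_log_entry(status, summary, when))
```

For example, `append_log_entry("../../logs/infra-monitor.log", "ok", "11 healthy, 0 warnings, 0 critical")` adds one timestamped line to the log.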

Schedule

  • Heartbeat: every hour — checks Docker + Ollama only (fast, <30s)
  • Full digest: daily at 07:00 — all checks