agent-os/skills/infra-monitor/skill.md
Claude Code 6cebab9a4a docs: comprehensive update — bring all Agent OS docs current for LLM onboarding
All files were 5-7 weeks stale. Updated brain.md (complete service/agent/VPN/cron
inventory), identity.md (current expertise + infra context), CLAUDE.md (full agent
ecosystem table, Citadel tool registry, gotchas), README.md (LLM quick-start guide),
all memory files (current projects, decisions, constraints, persistent facts), and
infra-monitor skill.md (current container list with criticality tiers).

Also fixed: git remote switched from HTTP+embedded-token to SSH, removed references
to decommissioned services (Netbird, WireGuard, Flowise, Zabbix), corrected Ollama
IP (172.27.40.20), TrueNAS IP (172.27.40.220), and added 20+ services/agents that
were built since the last commit.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
2026-06-19 17:15:45 +00:00

Skill: infra-monitor

Monitors server health and watches all Agent OS skills for staleness or errors. Runs on a cron schedule on 172.27.40.3.

Inputs

Reads before executing:

  • ../../identity.md
  • ../../brain.md
  • ../../memory/persistent.md
  • learnings.md (this skill's improvement notes)

What to check

Docker health (on 172.27.40.3)

  • All expected containers are running (not exited/restarting)
  • Flag any container that has restarted more than 3 times in the last hour
  • Expected containers (grouped by criticality):

Critical (alert immediately if down):

  • nginx-proxy-manager (reverse proxy — everything depends on this)
  • gitea (all code + docs)
  • citadel-mcp (central tool server)
  • raven-notify (notification hub)
  • open-webui (chat UI)
  • vaultwarden (password manager)

Important (alert after 15 min down):

  • headscale (VPN)
  • grafana (monitoring dashboards)
  • influxdb (time-series data)
  • portainer (Docker management)
  • uptime-kuma (HTTP monitoring)
  • maester-reports (CSF compliance)
  • jon-snow (orchestrator)
  • tarly-backup (backup monitoring)
  • directus + directus-db + directus-redis (CRM)

Normal (report in daily digest only):

  • hodor-gateway, sam-research, qyburn-coder, searxng
  • homarr, headplane, headscale-ui
  • plane-* (all Plane containers)
  • netbox-* (all NetBox containers)
  • nocodb, bni-scheduler, inventree-*, wetty, term-dash
  • rustdesk-hbbs, rustdesk-hbbr
  • iventoy, agent-sites
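
One way to apply these tiers is sketched below in Python. The helper name `find_problems`, the shape of its inputs, and the abbreviated `TIERS` table are assumptions for illustration, not part of the skill. Collecting container states and per-hour restart counts from Docker is left out.

```python
from fnmatch import fnmatchcase

# Tier -> container name patterns, abbreviated from the lists above.
# Entries ending in "-*" match every container with that prefix.
TIERS = {
    "critical": ["nginx-proxy-manager", "gitea", "citadel-mcp",
                 "raven-notify", "open-webui", "vaultwarden"],
    "important": ["headscale", "grafana", "influxdb", "portainer"],
    "normal": ["plane-*", "netbox-*", "searxng"],
}


def find_problems(states, restarts_last_hour, tiers=TIERS):
    """Return (tier, name, detail) tuples for containers needing attention.

    states: container name -> Docker state string such as "running" or "exited".
    restarts_last_hour: container name -> restarts seen in the last hour
    (hypothetical input; how it is measured is not specified by the skill).
    """
    problems = []
    for tier, patterns in tiers.items():
        for pattern in patterns:
            matches = sorted(n for n in states if fnmatchcase(n, pattern))
            if not matches:
                problems.append((tier, pattern, "no matching container found"))
            for name in matches:
                restarts = restarts_last_hour.get(name, 0)
                if states[name] != "running":
                    problems.append((tier, name, f"state is {states[name]}"))
                elif restarts > 3:
                    problems.append(
                        (tier, name, f"restarted {restarts} times in the last hour"))
    return problems
```

The returned tier can then drive the alerting delay: immediately for critical, after 15 minutes for important, and digest-only for normal.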

Service reachability

Lightweight HTTP check (curl, timeout 5s) on each internal URL:

Agent watchdog

For each agent log at ../../logs/<agent>/last-run.json:

  • Check modification time: flag if older than the agent's max staleness (listed below)
  • Check status field — flag if not "success"
  • Expected agents and max staleness:
    • bran-changelog: 25 hours (daily)
    • varys-monitor: 20 minutes (every 15 min)
    • trmm-frappe-sync: 35 minutes (every 30 min)
    • tarly-backup: 25 hours (daily)
    • raven-notify: 25 hours (event-driven, check status only)
    • citadel-mcp, sam-research, qyburn-coder, jon-snow: check status only (on-demand)
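
The watchdog rules above could be sketched roughly as follows. The function names `evaluate_run` and `check_agent` are hypothetical, and the only field assumed in `last-run.json` is the `status` field mentioned above. Agents missing from `MAX_STALENESS` are treated as status-only, which covers the on-demand agents.

```python
import json
import os
import time

HOUR = 3600

# Max staleness in seconds, taken from the list above.
# raven-notify is listed with 25 hours even though it is event-driven.
MAX_STALENESS = {
    "bran-changelog": 25 * HOUR,
    "varys-monitor": 20 * 60,
    "trmm-frappe-sync": 35 * 60,
    "tarly-backup": 25 * HOUR,
    "raven-notify": 25 * HOUR,
}


def evaluate_run(agent, age_seconds, status):
    """Return a list of flag strings; an empty list means the agent looks healthy."""
    flags = []
    limit = MAX_STALENESS.get(agent)  # None -> check status only
    if limit is not None and age_seconds > limit:
        flags.append(f"stale: last run {int(age_seconds // 60)} min ago")
    if status != "success":
        flags.append(f"status is {status!r}")
    return flags


def check_agent(agent, logs_dir, now=None):
    """Read <logs_dir>/<agent>/last-run.json and evaluate it."""
    path = os.path.join(logs_dir, agent, "last-run.json")
    if not os.path.exists(path):
        return ["last-run.json is missing"]
    now = time.time() if now is None else now
    try:
        with open(path, encoding="utf-8") as f:
            status = json.load(f).get("status")
    except (ValueError, AttributeError):
        status = None  # unreadable JSON, or the top level is not an object
    return evaluate_run(agent, now - os.path.getmtime(path), status)
```

Keeping the decision logic in `evaluate_run` separate from file access makes the thresholds easy to test without touching the log directory.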

System resources (on 172.27.40.3)

  • Disk usage on / — warn if >80%, critical if >90%
  • Memory usage — flag if >85%
  • Docker disk usage (docker system df) — warn if reclaimable > 10GB
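
A minimal sketch of the disk and memory thresholds, assuming a Linux host where memory is read from `/proc/meminfo`. The function names are illustrative, and the Docker reclaimable-space check is not covered here. Note that `shutil.disk_usage` can report a slightly different percentage than `df`, which accounts for reserved blocks.

```python
import shutil


def classify_disk(percent_used):
    """Map disk usage on / to the levels above: >90% critical, >80% warning."""
    if percent_used > 90:
        return "critical"
    if percent_used > 80:
        return "warning"
    return "ok"


def classify_memory(percent_used):
    """Memory is only flagged above 85%."""
    return "warning" if percent_used > 85 else "ok"


def disk_percent(path="/"):
    """Percentage of the filesystem at `path` that is in use."""
    usage = shutil.disk_usage(path)
    return 100 * usage.used / usage.total


def memory_percent(meminfo_text):
    """Compute used memory from the text of /proc/meminfo (values in kB)."""
    fields = {}
    for line in meminfo_text.splitlines():
        key, _, rest = line.partition(":")
        parts = rest.split()
        if parts:
            fields[key.strip()] = int(parts[0])
    return 100 * (1 - fields["MemAvailable"] / fields["MemTotal"])
```

For example, `classify_disk(disk_percent("/"))` yields the level for the root filesystem, and `memory_percent(open("/proc/meminfo").read())` gives the figure to pass to `classify_memory`.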

Remote hosts (optional, best-effort)

  • Ping 172.27.40.20 (Ollama host)
  • Ping 172.27.40.30 (Hermes Native VM)
  • Ping 172.27.40.2 (Proxmox)

Output

Write a digest to last-output.md in this format:

  • Summary line: X healthy, Y warnings, Z critical
  • Section per category: Docker, Services, Agent Watchdog, System, Remote Hosts
  • Each item: ✓ OK / ⚠ Warning / ✗ Critical + one line detail
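
The digest format above might be rendered along these lines. The tuple layout of `results`, the function name `render_digest`, and the exact spacing of the output are assumptions; only the summary line, the category names, and the three status markers come from the description above.

```python
from collections import Counter

SYMBOLS = {"ok": "✓ OK", "warning": "⚠ Warning", "critical": "✗ Critical"}
CATEGORIES = ["Docker", "Services", "Agent Watchdog", "System", "Remote Hosts"]


def render_digest(results):
    """Build the digest text.

    results: list of (category, item, level, detail) tuples, where level is
    one of "ok", "warning", "critical".
    """
    counts = Counter(level for _, _, level, _ in results)
    lines = [
        f"{counts['ok']} healthy, {counts['warning']} warnings, "
        f"{counts['critical']} critical",
        "",
    ]
    for category in CATEGORIES:
        lines.append(category)
        for cat, item, level, detail in results:
            if cat == category:
                lines.append(f"  {SYMBOLS[level]} {item}: {detail}")
        lines.append("")
    return "\n".join(lines)
```

The same `results` list could also be serialized with `json.dumps` for the machine-readable output, though the schema of that file is not specified here.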

Also write machine-readable output to ../../logs/infra-monitor/last-run.json.

Pass anomalies to context/handoff.md for Raven notification.

Wrap-up

After writing output:

  1. Update learnings.md with anything that went wrong or could be improved
  2. Append a one-line log entry to ../../logs/infra-monitor.log: YYYY-MM-DD HH:MM | status | summary
  3. Update ../../memory/notes-from-last-run.md
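
The log entry in step 2 can be produced with a small helper like the one below. The function names are hypothetical, and local time is assumed since the skill does not state a timezone.

```python
from datetime import datetime


def format_log_entry(status, summary, when=None):
    """Format one line as: YYYY-MM-DD HH:MM | status | summary"""
    when = when or datetime.now()
    # Collapse whitespace so the entry always stays on a single line.
    summary = " ".join(summary.split())
    return f"{when:%Y-%m-%d %H:%M} | {status} | {summary}"


def append_log_entry(path, status, summary):
    """Append the formatted entry to the log file at `path`."""
    with open(path, "a", encoding="utf-8") as f:
        f.write(format_log_entry(status, summary) + "\n")
```

Opening the file in append mode keeps earlier entries intact, so the log accumulates one line per run.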

Schedule

  • Heartbeat: every hour — checks Docker + Ollama + critical services only (fast, <30s)
  • Full digest: daily at 07:00 — all checks including remote hosts and disk usage