Claude Code 6cebab9a4a docs: comprehensive update — bring all Agent OS docs current for LLM onboarding

All files were 5-7 weeks stale. Updated brain.md (complete service/agent/VPN/cron
inventory), identity.md (current expertise + infra context), CLAUDE.md (full agent
ecosystem table, Citadel tool registry, gotchas), README.md (LLM quick-start guide),
all memory files (current projects, decisions, constraints, persistent facts), and
infra-monitor skill.md (current container list with criticality tiers).

Also fixed: git remote switched from HTTP+embedded-token to SSH, removed references
to decommissioned services (Netbird, WireGuard, Flowise, Zabbix), corrected Ollama
IP (172.27.40.20), TrueNAS IP (172.27.40.220), and added 20+ services/agents that
were built since the last commit.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>

2026-06-19 17:15:45 +00:00

3.7 KiB

Skill: infra-monitor

Monitors server health and watches all Agent OS skills for staleness or errors. Runs on a cron schedule on 172.27.40.3.

Inputs

Reads before executing:

../../identity.md
../../brain.md
../../memory/persistent.md
learnings.md (this skill's improvement notes)

What to check

Docker health (on 172.27.40.3)

All expected containers are running (not exited/restarting)
Flag any container that has restarted more than 3 times in the last hour
Expected containers (grouped by criticality):

Critical (alert immediately if down):

nginx-proxy-manager (reverse proxy — everything depends on this)
gitea (all code + docs)
citadel-mcp (central tool server)
raven-notify (notification hub)
open-webui (chat UI)
vaultwarden (password manager)

Important (alert after 15 min down):

headscale (VPN)
grafana (monitoring dashboards)
influxdb (time-series data)
portainer (Docker management)
uptime-kuma (HTTP monitoring)
maester-reports (CSF compliance)
jon-snow (orchestrator)
tarly-backup (backup monitoring)
directus + directus-db + directus-redis (CRM)

Normal (report in daily digest only):

hodor-gateway, sam-research, qyburn-coder, searxng
homarr, headplane, headscale-ui
plane-* (all Plane containers)
netbox-* (all NetBox containers)
nocodb, bni-scheduler, inventree-*, wetty, term-dash
rustdesk-hbbs, rustdesk-hbbr
iventoy, agent-sites
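
The tiering above can be sketched as a small lookup. This is only an illustration, not the skill's actual implementation: the function names (`tier_of`, `action_for`, `too_many_restarts`) are hypothetical, and it assumes entries such as `plane-*` are meant as shell-style wildcards.

```python
from fnmatch import fnmatch

# Names copied from the tier lists above. Treating "plane-*" style entries
# as shell-style wildcards is an assumption.
TIERS = {
    "critical": ["nginx-proxy-manager", "gitea", "citadel-mcp",
                 "raven-notify", "open-webui", "vaultwarden"],
    "important": ["headscale", "grafana", "influxdb", "portainer",
                  "uptime-kuma", "maester-reports", "jon-snow", "tarly-backup",
                  "directus", "directus-db", "directus-redis"],
    "normal": ["hodor-gateway", "sam-research", "qyburn-coder", "searxng",
               "homarr", "headplane", "headscale-ui", "plane-*", "netbox-*",
               "nocodb", "bni-scheduler", "inventree-*", "wetty", "term-dash",
               "rustdesk-hbbs", "rustdesk-hbbr", "iventoy", "agent-sites"],
}


def tier_of(name):
    """Return the tier for a container name, or None if it is not expected."""
    for tier, patterns in TIERS.items():
        if any(fnmatch(name, pattern) for pattern in patterns):
            return tier
    return None


def action_for(name, minutes_down):
    """Decide what to do about a container that is currently down."""
    tier = tier_of(name)
    if tier == "critical":
        return "alert"                      # alert immediately
    if tier == "important":
        return "alert" if minutes_down >= 15 else "wait"
    if tier == "normal":
        return "digest"                     # daily digest only
    return "unknown"


def too_many_restarts(restarts_last_hour):
    """More than 3 restarts in the last hour gets flagged."""
    return restarts_last_hour > 3
```

For example, `action_for("grafana", 10)` returns `"wait"`, while the same call with `15` returns `"alert"`.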

Service reachability

Lightweight HTTP check (curl, timeout 5s) on each internal URL:

http://172.27.40.3:9443 (Portainer)
http://172.27.40.3:3002 (Uptime Kuma)
http://172.27.40.3:3000 (Gitea)
http://172.27.40.3:3010 (Open WebUI)
http://172.27.40.3:7575 (Homarr)
http://172.27.40.3:8300 (Citadel MCP)
http://172.27.40.3:8400 (Raven)
http://172.27.40.3:8800 (Maester)
http://172.27.40.3:8900 (Jon Snow)
http://172.27.40.3:3020 (Grafana)
http://172.27.40.3:8100 (NetBox)
http://172.27.40.3:8850 (Directus)
http://172.27.40.20:11434 (Ollama on NxM-AI)
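
A minimal sketch of one reachability probe, using Python's standard library in place of curl. The helper name `check_url` and its `opener` parameter are hypothetical, and counting any HTTP response (including 4xx/5xx) as "reachable" is an assumption; tighten it if a service should return a specific status.

```python
import urllib.error
import urllib.request


def check_url(url, timeout=5.0, opener=urllib.request.urlopen):
    """Return (reachable, detail) for a single URL.

    `opener` exists only so the function can be exercised without a network;
    by default it is urllib.request.urlopen.
    """
    try:
        with opener(url, timeout=timeout) as resp:
            return True, f"HTTP {resp.status}"
    except urllib.error.HTTPError as exc:
        # The server answered with an error status, so the port is reachable.
        return True, f"HTTP {exc.code}"
    except OSError as exc:
        # URLError, timeouts, and connection errors are all OSError subclasses.
        return False, str(exc)
```

Calling `check_url("http://172.27.40.3:3000")` would probe Gitea with the 5-second timeout given above.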

Agent watchdog

For each agent log at ../../logs/<agent>/last-run.json:

Check modification time — flag if older than expected schedule
Check status field — flag if not "success"
Expected agents and max staleness:
- bran-changelog: 25 hours (daily)
- varys-monitor: 20 minutes (every 15 min)
- trmm-frappe-sync: 35 minutes (every 30 min)
- tarly-backup: 25 hours (daily)
- raven-notify: 25 hours (event-driven, check status only)
- citadel-mcp, sam-research, qyburn-coder, jon-snow: check status only (on-demand)
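
The watchdog rules above could look roughly like this. It is a sketch under assumptions: `check_agent` is a hypothetical name, `last-run.json` is assumed to be a JSON object with a top-level `status` key, and raven-notify is given the 25-hour bound even though it is also described as status-only.

```python
import json
import os
import time

# Max staleness in seconds, from the list above. Agents missing from this
# table (citadel-mcp, sam-research, qyburn-coder, jon-snow) are status-only.
MAX_STALENESS_SECONDS = {
    "bran-changelog": 25 * 3600,
    "varys-monitor": 20 * 60,
    "trmm-frappe-sync": 35 * 60,
    "tarly-backup": 25 * 3600,
    "raven-notify": 25 * 3600,  # assumption: keep the stated 25 h bound
}


def check_agent(agent, path, now=None):
    """Return a list of problems for one agent's last-run.json (empty = OK)."""
    now = time.time() if now is None else now
    try:
        age = now - os.path.getmtime(path)
        with open(path, encoding="utf-8") as fh:
            data = json.load(fh)
    except (OSError, ValueError) as exc:
        return [f"cannot read last-run.json: {exc}"]

    status = data.get("status") if isinstance(data, dict) else None
    problems = []
    limit = MAX_STALENESS_SECONDS.get(agent)
    if limit is not None and age > limit:
        problems.append(f"stale: {age / 60:.0f} min old")
    if status != "success":
        problems.append(f"status is {status!r}")
    return problems
```

A caller would pass something like `check_agent("varys-monitor", "../../logs/varys-monitor/last-run.json")` and treat a non-empty list as a warning.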

System resources (on 172.27.40.3)

Disk usage on / — warn if >80%, critical if >90%
Memory usage — flag if >85%
Docker disk usage (docker system df) — warn if reclaimable > 10GB
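
As a rough illustration, the thresholds above map onto a few small classifiers. The function names are hypothetical. Note that `shutil.disk_usage` reports used/total, which can differ slightly from the `Use%` column of `df` on filesystems with reserved blocks.

```python
import shutil


def disk_level(percent_used):
    """Disk usage on /: warn above 80%, critical above 90%."""
    if percent_used > 90:
        return "critical"
    if percent_used > 80:
        return "warning"
    return "ok"


def memory_level(percent_used):
    """Memory usage: flag above 85%."""
    return "warning" if percent_used > 85 else "ok"


def reclaimable_level(reclaimable_gb):
    """Docker reclaimable space (from `docker system df`): warn above 10 GB."""
    return "warning" if reclaimable_gb > 10 else "ok"


def root_disk_percent(path="/"):
    """Percentage of the filesystem at `path` that is in use."""
    usage = shutil.disk_usage(path)
    return 100.0 * usage.used / usage.total
```

`disk_level(root_disk_percent())` then gives the level for the root filesystem.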

Remote hosts (optional, best-effort)

Ping 172.27.40.20 (Ollama host)
Ping 172.27.40.30 (Hermes Native VM)
Ping 172.27.40.2 (Proxmox)

Output

Write a digest to last-output.md in this format:

Summary line: X healthy, Y warnings, Z critical
Section per category: Docker, Services, Agent Watchdog, System, Remote Hosts
Each item: ✓ OK / ⚠ Warning / ✗ Critical + one line detail
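
One possible rendering of that digest, as a sketch only. The exact layout (plain category headings, `name: detail` items) and the `render_digest` name are assumptions; only the summary line, the five categories, and the three status markers come from the format above.

```python
SYMBOLS = {"ok": "✓ OK", "warning": "⚠ Warning", "critical": "✗ Critical"}
CATEGORIES = ["Docker", "Services", "Agent Watchdog", "System", "Remote Hosts"]


def render_digest(results):
    """Build the digest text.

    `results` is a list of (category, name, level, detail) tuples, where
    level is one of "ok", "warning", "critical".
    """
    counts = {level: 0 for level in SYMBOLS}
    for _, _, level, _ in results:
        counts[level] += 1

    lines = [
        f"{counts['ok']} healthy, {counts['warning']} warnings, "
        f"{counts['critical']} critical",
        "",
    ]
    for category in CATEGORIES:
        lines.append(category)
        for cat, name, level, detail in results:
            if cat == category:
                lines.append(f"{SYMBOLS[level]} {name}: {detail}")
        lines.append("")
    return "\n".join(lines).rstrip() + "\n"
```

The returned string can be written directly to `last-output.md`.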

Also write machine-readable output to ../../logs/infra-monitor/last-run.json.

Pass anomalies to context/handoff.md for Raven notification.

Wrap-up

After writing output:

Update learnings.md with anything that went wrong or could be improved
Append a one-line log entry to ../../logs/infra-monitor.log: YYYY-MM-DD HH:MM | status | summary
Update ../../memory/notes-from-last-run.md
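
The log entry format above can be produced with a short helper. This is a sketch with hypothetical names (`log_line`, `append_log`); collapsing whitespace so the summary stays on one line is an added assumption.

```python
from datetime import datetime


def log_line(status, summary, when=None):
    """Format one entry as: YYYY-MM-DD HH:MM | status | summary."""
    when = when or datetime.now()
    one_line = " ".join(summary.split())  # keep the entry on a single line
    return f"{when:%Y-%m-%d %H:%M} | {status} | {one_line}"


def append_log(path, status, summary, when=None):
    """Append one entry to the log file at `path`."""
    with open(path, "a", encoding="utf-8") as fh:
        fh.write(log_line(status, summary, when) + "\n")
```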

Schedule

Heartbeat: every hour — checks Docker + Ollama + critical services only (fast, <30s)
Full digest: daily at 07:00 — all checks including remote hosts and disk usage
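
If a single hourly cron trigger drives both modes, the choice between them could be made from the start time. This sketch assumes the job fires at minute 0 in server local time; the function names and check-group labels are hypothetical.

```python
from datetime import datetime

# Check groups per mode, following the schedule above. Labels are illustrative.
HEARTBEAT_CHECKS = ("docker", "ollama", "critical-services")
FULL_CHECKS = ("docker", "services", "agent-watchdog", "system", "remote-hosts")


def run_mode(now):
    """Return "full" for the 07:00 run, "heartbeat" for every other run."""
    return "full" if (now.hour, now.minute) == (7, 0) else "heartbeat"


def checks_for(now):
    """Return the check groups to execute for a run starting at `now`."""
    return FULL_CHECKS if run_mode(now) == "full" else HEARTBEAT_CHECKS
```

Two separate cron entries would work just as well; this form only keeps the decision in one place.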
