docs: comprehensive update — bring all Agent OS docs current for LLM onboarding

All files were 5-7 weeks stale. Updated brain.md (complete service/agent/VPN/cron inventory), identity.md (current expertise + infra context), CLAUDE.md (full agent ecosystem table, Citadel tool registry, gotchas), README.md (LLM quick-start guide), all memory files (current projects, decisions, constraints, persistent facts), and infra-monitor skill.md (current container list with criticality tiers).

Also fixed: switched the git remote from HTTP+embedded-token to SSH, removed references to decommissioned services (Netbird, WireGuard, Flowise, Zabbix), corrected the Ollama IP (172.27.40.20) and TrueNAS IP (172.27.40.220), and added 20+ services/agents that were built since the last commit.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
2026-06-19 17:15:11 +00:00
parent 638b2edd56
commit 6cebab9a4a
9 changed files with 427 additions and 128 deletions
@@ -15,35 +15,84 @@ Reads before executing:
 ### Docker health (on 172.27.40.3)
 - All expected containers are running (not exited/restarting)
 - Flag any container that has restarted more than 3 times in the last hour
- Expected containers: portainer, nginx-proxy-manager, uptime-kuma, gitea, headscale, netbird, vaultwarden, flowise, plane, zabbix, homarr
+- Expected containers (grouped by criticality):
+
+**Critical (alert immediately if down):**
+- nginx-proxy-manager (reverse proxy — everything depends on this)
+- gitea (all code + docs)
+- citadel-mcp (central tool server)
+- raven-notify (notification hub)
+- open-webui (chat UI)
+- vaultwarden (password manager)
+
+**Important (alert after 15 min down):**
+- headscale (VPN)
+- grafana (monitoring dashboards)
+- influxdb (time-series data)
+- portainer (Docker management)
+- uptime-kuma (HTTP monitoring)
+- maester-reports (CSF compliance)
+- jon-snow (orchestrator)
+- tarly-backup (backup monitoring)
+- directus + directus-db + directus-redis (CRM)
+
+**Normal (report in daily digest only):**
+- hodor-gateway, sam-research, qyburn-coder, searxng
+- homarr, headplane, headscale-ui
+- plane-* (all Plane containers)
+- netbox-* (all NetBox containers)
+- nocodb, bni-scheduler, inventree-*, wetty, term-dash
+- rustdesk-hbbs, rustdesk-hbbr
+- iventoy, agent-sites

 ### Service reachability
 Lightweight HTTP check (curl, timeout 5s) on each internal URL:
 - http://172.27.40.3:9443 (Portainer)
 - http://172.27.40.3:3002 (Uptime Kuma)
 - http://172.27.40.3:3000 (Gitea)
- http://172.27.40.3:3010 (Flowise)
+- http://172.27.40.3:3010 (Open WebUI)
 - http://172.27.40.3:7575 (Homarr)
- http://172.27.6.139:11434 (Ollama)
+- http://172.27.40.3:8300 (Citadel MCP)
+- http://172.27.40.3:8400 (Raven)
+- http://172.27.40.3:8800 (Maester)
+- http://172.27.40.3:8900 (Jon Snow)
+- http://172.27.40.3:3020 (Grafana)
+- http://172.27.40.3:8100 (NetBox)
+- http://172.27.40.3:8850 (Directus)
+- http://172.27.40.20:11434 (Ollama on NxM-AI)

 ### Agent watchdog
-For each skill directory under `../../skills/`:
- Check `last-output.md` modification time — flag if older than expected schedule
- Check `../../logs/<skill-name>/` for ERROR entries in last run
- Report: healthy / stale / erroring
+For each agent log at `../../logs/<agent>/last-run.json`:
+- Check modification time — flag if older than expected schedule
+- Check `status` field — flag if not "success"
+- Expected agents and max staleness:
+  - bran-changelog: 25 hours (daily)
+  - varys-monitor: 20 minutes (every 15 min)
+  - trmm-frappe-sync: 35 minutes (every 30 min)
+  - tarly-backup: 25 hours (daily)
+  - raven-notify: 25 hours (event-driven, check status only)
+  - citadel-mcp, sam-research, qyburn-coder, jon-snow: check status only (on-demand)

 ### System resources (on 172.27.40.3)
 - Disk usage on / — warn if >80%, critical if >90%
 - Memory usage — flag if >85%
+- Docker disk usage (`docker system df`) — warn if reclaimable > 10GB
+
+### Remote hosts (optional, best-effort)
+- Ping 172.27.40.20 (Ollama host)
+- Ping 172.27.40.30 (Hermes Native VM)
+- Ping 172.27.40.2 (Proxmox)

 ## Output

 Write a digest to `last-output.md` in this format:
 - Summary line: X healthy, Y warnings, Z critical
- Section per category: Docker, Services, Agent Watchdog, System
+- Section per category: Docker, Services, Agent Watchdog, System, Remote Hosts
 - Each item: ✓ OK / ⚠ Warning / ✗ Critical + one line detail

-Pass anomalies to `context/handoff.md` for notification skill (future).
+Also write machine-readable output to `../../logs/infra-monitor/last-run.json`.
+
+Pass anomalies to `context/handoff.md` for Raven notification.

 ## Wrap-up

@@ -54,5 +103,5 @@ After writing output:

 ## Schedule

- **Heartbeat:** every hour — checks Docker + Ollama only (fast, <30s)
- **Full digest:** daily at 07:00 — all checks
+- **Heartbeat:** every hour — checks Docker + Ollama + critical services only (fast, <30s)
+- **Full digest:** daily at 07:00 — all checks including remote hosts and disk usage
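
Note on the new "Agent watchdog" rules: they can be sketched as a small Python check. This is a minimal sketch under stated assumptions, not the skill's actual implementation. The function name `check_agent` and the table `MAX_STALENESS` are hypothetical. It assumes each `last-run.json` is a JSON object with a top-level `status` field, that a missing file counts as stale, and that raven-notify is treated as status-only (the skill text lists both "25 hours" and "check status only" for it).

```python
import json
import time
from pathlib import Path

# Log root as written in skill.md, relative to the skill directory.
LOG_ROOT = Path("../../logs")

# Max staleness in seconds per agent, from the "Agent watchdog" list.
# None means "check status only" (on-demand or event-driven agents).
MAX_STALENESS = {
    "bran-changelog": 25 * 3600,    # daily
    "varys-monitor": 20 * 60,       # every 15 min
    "trmm-frappe-sync": 35 * 60,    # every 30 min
    "tarly-backup": 25 * 3600,      # daily
    "raven-notify": None,           # assumption: treated as status-only
    "citadel-mcp": None,
    "sam-research": None,
    "qyburn-coder": None,
    "jon-snow": None,
}


def check_agent(log_root, agent, max_age, now=None):
    """Return 'healthy', 'stale', or 'erroring' for one agent."""
    path = Path(log_root) / agent / "last-run.json"
    if not path.exists():
        # Assumption: an agent with no log at all is reported as stale.
        return "stale"
    now = time.time() if now is None else now
    try:
        data = json.loads(path.read_text())
    except (OSError, json.JSONDecodeError):
        return "erroring"
    # Flag anything whose status field is not exactly "success".
    if not isinstance(data, dict) or data.get("status") != "success":
        return "erroring"
    # Flag logs older than the expected schedule allows.
    if max_age is not None and now - path.stat().st_mtime > max_age:
        return "stale"
    return "healthy"


if __name__ == "__main__":
    for name, max_age in MAX_STALENESS.items():
        print(f"{name}: {check_agent(LOG_ROOT, name, max_age)}")
```

Checking `status` before the modification time means a recent but failed run is reported as erroring rather than healthy, which matches the healthy / stale / erroring report the previous version of this section used.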