docs: comprehensive update — bring all Agent OS docs current for LLM onboarding
All files were 5-7 weeks stale. Updated brain.md (complete service/agent/VPN/cron inventory), identity.md (current expertise + infra context), CLAUDE.md (full agent ecosystem table, Citadel tool registry, gotchas), README.md (LLM quick-start guide), all memory files (current projects, decisions, constraints, persistent facts), and infra-monitor skill.md (current container list with criticality tiers).

Also fixed: git remote switched from HTTP+embedded-token to SSH, removed references to decommissioned services (Netbird, WireGuard, Flowise, Zabbix), corrected Ollama IP (172.27.40.20) and TrueNAS IP (172.27.40.220), and added 20+ services/agents that were built since the last commit.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
@@ -15,35 +15,84 @@ Reads before executing:
 ### Docker health (on 172.27.40.3)
 - All expected containers are running (not exited/restarting)
 - Flag any container that has restarted more than 3 times in the last hour
-- Expected containers: portainer, nginx-proxy-manager, uptime-kuma, gitea, headscale, netbird, vaultwarden, flowise, plane, zabbix, homarr
+- Expected containers (grouped by criticality):
+
+**Critical (alert immediately if down):**
+- nginx-proxy-manager (reverse proxy; everything depends on this)
+- gitea (all code + docs)
+- citadel-mcp (central tool server)
+- raven-notify (notification hub)
+- open-webui (chat UI)
+- vaultwarden (password manager)
+
+**Important (alert after 15 min down):**
+- headscale (VPN)
+- grafana (monitoring dashboards)
+- influxdb (time-series data)
+- portainer (Docker management)
+- uptime-kuma (HTTP monitoring)
+- maester-reports (CSF compliance)
+- jon-snow (orchestrator)
+- tarly-backup (backup monitoring)
+- directus + directus-db + directus-redis (CRM)
+
+**Normal (report in daily digest only):**
+- hodor-gateway, sam-research, qyburn-coder, searxng
+- homarr, headplane, headscale-ui
+- plane-* (all Plane containers)
+- netbox-* (all NetBox containers)
+- nocodb, bni-scheduler, inventree-*, wetty, term-dash
+- rustdesk-hbbs, rustdesk-hbbr
+- iventoy, agent-sites

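The criticality tiers above map naturally to an alerting decision. The following is only a sketch of that mapping, not the skill's actual implementation: the names `CRITICAL`, `IMPORTANT`, `tier_of`, and `action_for` are hypothetical, any container not listed in the first two tiers is assumed to fall into the normal tier, and "after 15 min down" is assumed to mean 15 minutes or more.

```python
# Container names are copied from the tier lists in the diff.
# The set layout and the function names are illustrative assumptions.
CRITICAL = {
    "nginx-proxy-manager", "gitea", "citadel-mcp",
    "raven-notify", "open-webui", "vaultwarden",
}
IMPORTANT = {
    "headscale", "grafana", "influxdb", "portainer", "uptime-kuma",
    "maester-reports", "jon-snow", "tarly-backup",
    "directus", "directus-db", "directus-redis",
}


def tier_of(container: str) -> str:
    """Return the criticality tier; anything unlisted is treated as normal."""
    if container in CRITICAL:
        return "critical"
    if container in IMPORTANT:
        return "important"
    return "normal"


def action_for(container: str, minutes_down: float) -> str:
    """Decide what to do about a container that is currently down."""
    tier = tier_of(container)
    if tier == "critical":
        return "alert"  # alert immediately
    if tier == "important":
        # Assumption: the 15-minute grace period is inclusive.
        return "alert" if minutes_down >= 15 else "wait"
    return "digest"  # report in the daily digest only
```

Treating unlisted containers as normal keeps the wildcard groups (`plane-*`, `netbox-*`, `inventree-*`) out of the lookup tables, at the cost of also downgrading any container that was simply forgotten.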
 ### Service reachability
 Lightweight HTTP check (curl, timeout 5s) on each internal URL:
 - http://172.27.40.3:9443 (Portainer)
 - http://172.27.40.3:3002 (Uptime Kuma)
 - http://172.27.40.3:3000 (Gitea)
-- http://172.27.40.3:3010 (Flowise)
+- http://172.27.40.3:3010 (Open WebUI)
 - http://172.27.40.3:7575 (Homarr)
-- http://172.27.6.139:11434 (Ollama)
+- http://172.27.40.3:8300 (Citadel MCP)
+- http://172.27.40.3:8400 (Raven)
+- http://172.27.40.3:8800 (Maester)
+- http://172.27.40.3:8900 (Jon Snow)
+- http://172.27.40.3:3020 (Grafana)
+- http://172.27.40.3:8100 (NetBox)
+- http://172.27.40.3:8850 (Directus)
+- http://172.27.40.20:11434 (Ollama on NxM-AI)

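The skill text specifies curl with a 5 second timeout. As a rough equivalent, the check could be sketched in Python with the standard library. This is an illustration under assumptions, not the skill's own code: `is_reachable` and `check_all` are hypothetical names, the `opener` parameter exists only so the function can be exercised without a network, and counting an HTTP error status (4xx/5xx) as "reachable" is a design choice the source does not state.

```python
import urllib.error
import urllib.request


def is_reachable(url, timeout=5.0, opener=urllib.request.urlopen):
    """Return True if a server answers at `url` within `timeout` seconds."""
    try:
        with opener(url, timeout=timeout):
            return True
    except urllib.error.HTTPError:
        # The server responded with an error status, so it is up.
        # Assumption: for a reachability check this still counts as reachable.
        return True
    except (urllib.error.URLError, OSError):
        # DNS failure, connection refused, timeout, and similar.
        return False


def check_all(urls, **kwargs):
    """Map each URL to its reachability result."""
    return {url: is_reachable(url, **kwargs) for url in urls}
```

Calling `check_all(["http://172.27.40.3:3000"])` would return a one-entry dict whose value depends on whether Gitea answers on that port.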
 ### Agent watchdog
-For each skill directory under `../../skills/`:
-- Check `last-output.md` modification time; flag if older than expected schedule
-- Check `../../logs/<skill-name>/` for ERROR entries in last run
-- Report: healthy / stale / erroring
+For each agent log at `../../logs/<agent>/last-run.json`:
+- Check modification time; flag if older than expected schedule
+- Check `status` field; flag if not "success"
+- Expected agents and max staleness:
+  - bran-changelog: 25 hours (daily)
+  - varys-monitor: 20 minutes (every 15 min)
+  - trmm-frappe-sync: 35 minutes (every 30 min)
+  - tarly-backup: 25 hours (daily)
+  - raven-notify: 25 hours (event-driven, check status only)
+  - citadel-mcp, sam-research, qyburn-coder, jon-snow: check status only (on-demand)

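The watchdog rules above can be sketched as a small classifier. This is a minimal illustration under stated assumptions: the names `MAX_AGE_SECONDS`, `classify`, and `check_agent` are hypothetical, the result labels reuse the healthy / stale / erroring wording from the removed lines, and raven-notify is treated as status-only because the source lists it both with "25 hours" and "check status only".

```python
import json
import time
from pathlib import Path

# Max staleness per agent in seconds, taken from the list in the diff.
# None means "check status only" (on-demand or event-driven agents).
MAX_AGE_SECONDS = {
    "bran-changelog": 25 * 3600,
    "varys-monitor": 20 * 60,
    "trmm-frappe-sync": 35 * 60,
    "tarly-backup": 25 * 3600,
    "raven-notify": None,  # assumption: status-only, see lead-in
    "citadel-mcp": None,
    "sam-research": None,
    "qyburn-coder": None,
    "jon-snow": None,
}


def classify(agent, mtime, status, now=None):
    """Return 'erroring', 'stale', or 'healthy' for one agent log."""
    now = time.time() if now is None else now
    if status != "success":
        return "erroring"
    max_age = MAX_AGE_SECONDS.get(agent)
    if max_age is not None and now - mtime > max_age:
        return "stale"
    return "healthy"


def check_agent(agent, logs_dir):
    """Read <logs_dir>/<agent>/last-run.json and classify it."""
    path = Path(logs_dir) / agent / "last-run.json"
    if not path.exists():
        return "stale"  # assumption: a missing log is reported as stale
    status = json.loads(path.read_text()).get("status")
    return classify(agent, path.stat().st_mtime, status)
```

Checking the status before the age means a failed run is always reported as erroring, even when its log is also old.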
 ### System resources (on 172.27.40.3)
 - Disk usage on /: warn if >80%, critical if >90%
 - Memory usage: flag if >85%
+- Docker disk usage (`docker system df`): warn if reclaimable > 10GB

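The disk and memory thresholds above could be applied roughly as follows. This is a sketch, not the skill's implementation: the function names are hypothetical, memory usage is assumed to be derived from `MemTotal` and `MemAvailable` in Linux's `/proc/meminfo`, and the disk percentage from `shutil.disk_usage` can differ slightly from what `df` reports because `df` accounts for reserved blocks.

```python
import shutil


def disk_level(used_percent):
    """Apply the disk thresholds: warn above 80%, critical above 90%."""
    if used_percent > 90:
        return "critical"
    if used_percent > 80:
        return "warning"
    return "ok"


def memory_level(used_percent):
    """Apply the memory threshold: flag above 85%."""
    return "warning" if used_percent > 85 else "ok"


def disk_used_percent(path="/"):
    """Percentage of the filesystem at `path` that is in use."""
    usage = shutil.disk_usage(path)
    return 100.0 * usage.used / usage.total


def memory_used_percent(meminfo_text):
    """Compute used memory from the text of /proc/meminfo (Linux only)."""
    fields = {}
    for line in meminfo_text.splitlines():
        key, _, rest = line.partition(":")
        parts = rest.split()
        if parts:
            fields[key.strip()] = int(parts[0])  # values are in kB
    total = fields["MemTotal"]
    return 100.0 * (total - fields["MemAvailable"]) / total
```

On the monitored host, `disk_level(disk_used_percent("/"))` and `memory_level(memory_used_percent(open("/proc/meminfo").read()))` would give the two levels for the digest.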
+### Remote hosts (optional, best-effort)
+- Ping 172.27.40.20 (Ollama host)
+- Ping 172.27.40.30 (Hermes Native VM)
+- Ping 172.27.40.2 (Proxmox)
+
 ## Output

 Write a digest to `last-output.md` in this format:
 - Summary line: X healthy, Y warnings, Z critical
-- Section per category: Docker, Services, Agent Watchdog, System
+- Section per category: Docker, Services, Agent Watchdog, System, Remote Hosts
 - Each item: ✓ OK / ⚠ Warning / ✗ Critical + one line detail

-Pass anomalies to `context/handoff.md` for notification skill (future).
+Also write machine-readable output to `../../logs/infra-monitor/last-run.json`.
+
+Pass anomalies to `context/handoff.md` for Raven notification.

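One way to render a digest in the format described above is sketched here. The exact layout is an assumption: the source fixes only the summary wording, the per-category sections, and the three status symbols, so the `##` section headings, the `results` structure, and the name `render_digest` are all hypothetical.

```python
from collections import Counter

# Status symbols as given in the skill text.
SYMBOLS = {"ok": "✓ OK", "warning": "⚠ Warning", "critical": "✗ Critical"}


def render_digest(results):
    """Render a digest from {category: [(name, level, detail), ...]}."""
    counts = Counter(
        level for items in results.values() for _, level, _ in items
    )
    lines = [
        f"Summary: {counts['ok']} healthy, "
        f"{counts['warning']} warnings, {counts['critical']} critical",
        "",
    ]
    for category, items in results.items():
        lines.append(f"## {category}")  # heading style is an assumption
        for name, level, detail in items:
            lines.append(f"- {SYMBOLS[level]} {name}: {detail}")
        lines.append("")
    return "\n".join(lines)
```

The same `results` dict could be dumped with `json.dumps` to produce the machine-readable `last-run.json`, though the source does not define that file's schema.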
 ## Wrap-up

@@ -54,5 +103,5 @@ After writing output:

 ## Schedule

-- **Heartbeat:** every hour, checks Docker + Ollama only (fast, <30s)
-- **Full digest:** daily at 07:00, all checks
+- **Heartbeat:** every hour, checks Docker + Ollama + critical services only (fast, <30s)
+- **Full digest:** daily at 07:00, all checks including remote hosts and disk usage