docs: comprehensive update — bring all Agent OS docs current for LLM onboarding

All files were 5-7 weeks stale. Updated brain.md (complete service/agent/VPN/cron
inventory), identity.md (current expertise + infra context), CLAUDE.md (full agent
ecosystem table, Citadel tool registry, gotchas), README.md (LLM quick-start guide),
all memory files (current projects, decisions, constraints, persistent facts), and
infra-monitor skill.md (current container list with criticality tiers).

Also: switched the git remote from HTTP with an embedded token to SSH, removed
references to decommissioned services (Netbird, WireGuard, Flowise, Zabbix),
corrected the Ollama IP (172.27.40.20) and TrueNAS IP (172.27.40.220), and added
20+ services/agents that were built since the last commit.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
Claude Code
2026-06-19 17:15:11 +00:00
parent 638b2edd56
commit 6cebab9a4a
9 changed files with 427 additions and 128 deletions
+18 -14
@@ -1,7 +1,7 @@
# Active Projects
Current in-flight work. Update at the end of each session.
Last updated: 2026-05-16
Last updated: 2026-06-19
---
@@ -12,8 +12,8 @@ Phases 1 (NFS + mount) and 2 (identity interview) are complete.
**Phase 3 goal:** Docker container state monitoring + system resources. Complements Varys (HTTP reachability) — do not duplicate.
Pre-work before implementing:
- [ ] Update `skills/infra-monitor/skill.md` — container list is stale (has Flowise, missing Open WebUI + all new agents: citadel, varys, bran, sam, raven, qyburn, hodor, searxng, monitoring, bni-scheduler, nocodb)
- [ ] Correct Ollama URL in skill.md: now `http://172.27.40.20:11434` (moved from 172.27.6.139)
- [ ] Update `skills/infra-monitor/skill.md` — container list is stale (references Flowise/Netbird, missing 20+ current services)
- [ ] Correct Ollama URL in skill.md: now `http://172.27.40.20:11434` (moved from 172.27.6.139 → 172.27.40.20)
- [ ] Decide implementation: Docker one-shot container (consistent with bran/varys pattern) vs host cron + shell script
Implementation tasks:
@@ -26,23 +26,27 @@ Implementation tasks:
- [ ] Hourly heartbeat cron on 172.27.40.3
- [ ] Daily 07:00 full digest cron
- [ ] Notification channel: Raven (confirmed live at http://raven-notify:8400)
- [ ] Home Assistant integration (172.27.10.6) — optional, revisit after Phase 3
## Agent OS — Phase 5: Future Skills (Future)
- backup-monitor: TrueNAS migrated to new hardware (172.27.40.220) — skill ready to build
- Netbird/Headscale peer health: Netbird API at http://172.22.0.11:80/api/
- backup-monitor: extend Tarly with deeper TrueNAS integration
- Daily log digest: summarise /opt/agent-os/logs/ via Ollama
---
## Gitea Documentation Repos
- [x] nxm-infrastructure repo — Obsidian vault imported, CLAUDE.md added 2026-05-16
- [x] nexum-projects repo — Obsidian vault imported (on Kubuntu)
- [x] agent-os repo — scaffolding created, CLAUDE.md is global symlink
## Active Infrastructure Projects
| Project | Status | Next Step |
|---|---|---|
| **Monitoring** | bezhuis+mwp+coetzee alerts live | CPU/mem/WAN/ping Grafana rules pending |
| **OpenVPN S2S** | bezhuis/mwp/coetzee DONE | fwlaw pending |
| **Tarly Backup** | Hub working | bezhuis/mwp/coetzee API key fix (backup privilege) |
| **Directus CRM** | LIVE, 12 clients seeded | Manual data enrichment (contacts, renewals) |
| **InvenTree** | LIVE (testing) | SSL cert, production use |
| **Mailcow** | MAIL-1+2 done | Blocked on Mimecast (MAIL-3→9) |
| **Vexis** | nexum-private-customer-setup + office-install done | ESET/Evolve creds or standard-setup next |
| **Maester Phase 2** | Phase 1 live | Hermes narrative + .docx generation |
---
## Pending: Gitea SSH Key (security debt)
Server remote uses HTTP with embedded token. Before rotating:
1. Add SSH key for `nxm@172.27.40.3` to Gitea (Admin → Settings → SSH Keys)
2. `cd /opt/agent-os && git remote set-url origin gitea-local:admin/agent-os.git`
## Gitea SSH Key — DONE
Server remote switched from HTTP+token to SSH (`gitea-local:admin/agent-os.git`) on 2026-06-19.
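To confirm that a remote no longer carries an embedded token, a small check like the one below could be run against the configured URL. This is a minimal sketch: the function name `has_embedded_credentials` is hypothetical, and the example URLs in the comments are placeholders rather than the real remote.

```python
from urllib.parse import urlsplit


def has_embedded_credentials(remote_url: str) -> bool:
    """Return True when an HTTP(S) remote URL carries a username or password.

    SSH-style remotes such as `gitea-local:admin/agent-os.git` are not
    HTTP(S) URLs, so they are reported as clean.
    """
    parts = urlsplit(remote_url)
    if parts.scheme not in ("http", "https"):
        return False
    return parts.username is not None or parts.password is not None


# Placeholder URLs for illustration only:
#   has_embedded_credentials("https://user:secret@git.example.invalid/admin/agent-os.git")
#   has_embedded_credentials("gitea-local:admin/agent-os.git")
```

The URL to check can be read with `git remote get-url origin` from `/opt/agent-os`.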
+33 -5
@@ -1,13 +1,41 @@
# Constraints
Hard limits agents must respect. Never work around these without explicit user confirmation.
Last updated: 2026-04-30
Last updated: 2026-06-19
---
- Never take destructive or irreversible action without explicit confirmation (delete, overwrite, drop, reset, force push)
- Never store credentials in output files, logs, or generated markdown — reference their location instead
- Never skip git hooks or bypass signing
- TrueNAS is on new hardware — use 172.27.40.220 (Servers40) for services, 172.27.6.221 for management/API
## Destructive actions
- Never delete or overwrite files without explicit confirmation
- Never restart or stop services without explicit confirmation
- Never drop, reset, or modify databases without explicit confirmation
- Never force push to git or bypass hooks
- Never run `pfctl` commands on OPNsense (risk of locking out remote access)
## Credentials
- All credentials live in `~/.nxm-keys` (chmod 600) — ONLY location
- Never store credentials in output files, logs, generated markdown, .env files, or code
- Reference the file location, never the values
- TrueNAS IPs: 172.27.40.220 (Servers40 data) / 172.27.6.221 (management/API)
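The `chmod 600` requirement on the credentials file can be verified without ever reading its contents. A minimal sketch, assuming a POSIX host; the helper name `is_owner_only` is hypothetical:

```python
import stat
from pathlib import Path


def is_owner_only(path: Path) -> bool:
    """Return True when the file's permission bits are exactly 0o600."""
    mode = stat.S_IMODE(path.stat().st_mode)
    return mode == 0o600


if __name__ == "__main__":
    keys_file = Path.home() / ".nxm-keys"
    if keys_file.exists():
        # Only the permission bits are inspected; the file is never opened.
        print(f"{keys_file}: owner-only = {is_owner_only(keys_file)}")
    else:
        print(f"{keys_file}: not found")
```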
## Infrastructure
- Linux server (172.27.40.3) has no GPU — never schedule LLM inference to run locally there
- Ollama runs on 172.27.40.20 (Windows 11 Pro) — not on the Docker host
- Docker Compose only — no Kubernetes, no Swarm
- Docker proxy network (172.22.0.0/16) cannot reach OPNsense API at 172.27.6.1 — always run OPNsense API scripts from the host
- NPM handles SSL termination — internal services always use HTTP
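The Docker proxy network rule above can be expressed as a guard that an OPNsense API script runs before making any request. A minimal sketch using the standard `ipaddress` module; the function name `in_docker_proxy_network` is hypothetical, and how a script discovers its own source address is left out:

```python
import ipaddress

# Docker proxy network that cannot reach the OPNsense API (from the constraint above).
DOCKER_PROXY_NET = ipaddress.ip_network("172.22.0.0/16")


def in_docker_proxy_network(source_ip: str) -> bool:
    """Return True when source_ip belongs to the Docker proxy network.

    A True result means the caller should be moved to the host before
    talking to the OPNsense API.
    """
    return ipaddress.ip_address(source_ip) in DOCKER_PROXY_NET
```

A script could refuse to continue when this returns True for its own address, instead of failing later with an opaque HTTP error.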
## Agent-specific
- **maester-reports:** restart clears in-memory cache → re-parses all evidence PDFs via Claude Opus vision (Anthropic API cost). Avoid unnecessary restarts.
- **NocoDB:** RvDM personal birthday DB ONLY — never suggest for any Nexum project. Nexum data layer = Directus.
- **Open WebUI → Citadel MCP:** auth_type must be `none`. Empty bearer key generates illegal header → silent connection failure.
- **Qyburn task specs:** never embed code in the description field — use plain English only (14b models explain code instead of writing it)
## External communication
- Never send any external message (email, webhook, Discord notification) without explicit confirmation
- Notifications always route through Raven (http://raven-notify:8400)
- Never expose services publicly without confirming NPM + Cloudflare + firewall implications
## Naming
- S2S = Site-to-Site VPN: always suggest it (not Road Warrior) for permanent infrastructure endpoints
- Use `.50+` IP range for non-firewall infrastructure devices on S2S tunnels
+32 -6
@@ -1,18 +1,44 @@
# Persistent Memory
Facts that don't expire. If you'd have to re-explain it to a new agent every time, it belongs here.
Last updated: 2026-04-30
Last updated: 2026-06-19
---
## Infrastructure decisions
- RustDesk is self-hosted on 172.27.40.3 — clients connect to local server not public relay
- Netbird signal+management both route through NPM on port 443 — exposedAddress in /opt/stacks/netbird/config.yaml must be https://netbird.nxm.co.za:443 (caddy-netbird on :8443 exists but is not used externally)
- NPM handles all SSL termination — internal services use HTTP, NPM adds HTTPS
- Headscale v0.28: all write operations require numeric user ID, not username
- Tailscale on Windows overrides DNS — disconnect before testing split DNS changes
- Servers running Tailscale must run `sudo tailscale set --accept-dns=false` before joining Netbird
- Docker Compose only — no Kubernetes, no Swarm
- Docker → OPNsense API: HTTP 400 from Docker proxy network — always run OPNsense API scripts from the host
- All internal subdomains: gray-cloud CNAME → opnsense.nxm.co.za in Cloudflare. Proxied = 523 error.
- OPNsense split DNS: all subdomains resolve to 172.27.40.3 internally via Unbound host overrides
## Decommissioned services (do not reference)
- **Netbird:** Fully removed from server 2026-05-28. Orphaned clients on mwp/coetzee/b0qxxx/fwlaw firewalls pending removal.
- **WireGuard (N2W):** Fully removed 2026-05-30. Replaced by OpenVPN S2S.
- **Flowise:** Replaced by Open WebUI 2026-05-01.
- **Zabbix:** No longer running (monitoring moved to Grafana + InfluxDB + Telegraf).
## Agent OS build state
- Phase 1-2 (file structure + NFS + identity interview): not yet started
- First skill to build: infra-monitor (Docker health + agent watchdog)
- Notifications target: Home Assistant at 172.27.10.6
- Phase 1-2 complete (file structure + identity interview)
- Phase 3 (infra-monitor skill): spec written but stale, not yet implemented
- Notifications target: Raven at http://raven-notify:8400 (Discord + Gmail)
- All agent logs write to `/opt/agent-os/logs/<agent>/last-run.json`
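Because every agent writes to the same `logs/<agent>/last-run.json` layout, a digest or watchdog can collect all of them in one pass. A minimal sketch: the function name `load_last_runs` and the `{"error": ...}` fallback are hypothetical, and no assumption is made about the fields inside each file.

```python
import json
from pathlib import Path


def load_last_runs(logs_root: Path) -> dict:
    """Map each agent directory name to the parsed content of its last-run.json.

    Unreadable or malformed files are recorded as {"error": "<message>"} so
    that one broken agent log does not hide the others.
    """
    runs = {}
    for run_file in sorted(logs_root.glob("*/last-run.json")):
        agent = run_file.parent.name
        try:
            runs[agent] = json.loads(run_file.read_text(encoding="utf-8"))
        except (OSError, json.JSONDecodeError) as exc:
            runs[agent] = {"error": str(exc)}
    return runs


if __name__ == "__main__":
    for agent, data in load_last_runs(Path("/opt/agent-os/logs")).items():
        print(agent, data)
```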
## Credential policy
- All API keys and passwords: `~/.nxm-keys` (chmod 600)
- Never write credential values into output, logs, docs, or config files
- Reference credential location instead
## VPN topology
- **Headscale** (self-hosted Tailscale): remote access for admin devices
- **OpenVPN S2S:** site-to-site for client firewalls (bezhuis/mwp/coetzee done, fwlaw pending)
- Hub tunnel IPs: bezhuis=172.16.17.2, mwp=172.16.17.3, coetzee=172.16.17.4
## Ollama
- Host: 172.27.40.20 (Windows 11 Pro, NxM-AI), Vulkan GPU
- Models: gemma4, llama3.1:8b, phi4
- Auto-starts via Scheduled Task (S4U + AtStartup)
- Used by: hodor-gateway, sam-research, qyburn-coder, Open WebUI
+19 -6
@@ -1,11 +1,24 @@
# Recent Decisions
Decisions made in the last 30 days that affect current work. Archive when no longer relevant.
Last updated: 2026-04-30
Decisions made in the last 60 days that affect current work. Archive when no longer relevant.
Last updated: 2026-06-19
---
- **2026-04-30:** Chose Gitea (self-hosted git) over Obsidian for documentation — AI-writable, browser-accessible, version controlled
- **2026-04-30:** Agent OS files to live on 172.27.40.3 at /opt/agent-os/, accessed from Kubuntu via NFS
- **2026-04-29:** Chose Syncthing-free approach for Obsidian migration — NFS for Linux, SMB for Windows
- **2026-04-29:** infra-monitor will be first Agent OS skill — covers Docker health and agent watchdog in one skill
- **2026-06-19:** Agent OS git remote switched from HTTP+token to SSH (gitea-local:admin/agent-os.git) — security debt resolved
- **2026-06-19:** Comprehensive Agent OS documentation update — brain.md, identity.md, all memory files brought current for LLM onboarding
- **2026-06-18:** Coetzee OpenVPN S2S complete — monitoring-only + hub-side NAT for Active Backup to Synology DS423+
- **2026-06-18:** Tarly backup service live — OPNsense config backups to TrueNAS NFS, Proxmox monitoring
- **2026-06-17:** Directus CRM live — 6 collections, 12 clients seeded from TRMM, 5 Citadel MCP tools
- **2026-06-17:** MWP Netbird fully removed, WireGuard spoke cleaned
- **2026-06-12:** NxM-AI (Kubuntu) migrated to Windows 11 Pro — same IP 172.27.40.20, Ollama via Scheduled Task
- **2026-06-12:** Vexis office-install/uninstall scripts live-tested, windows-update scripts done
- **2026-06-11:** Workshop20 → Servers40 firewall rules (1677-1680) for TRMM + Vexis access
- **2026-06-10:** Frappe Helpdesk live — TRMM→HD sync, Citadel tools, Vexis wired
- **2026-06-10:** trmm_confirm_with_user proven working (incl. response-parsing bug fix)
- **2026-05-30:** WireGuard fully removed, replaced by OpenVPN S2S
- **2026-05-29:** Maester reports Phase 1 live — 8 automated CSF controls, Grafana dashboard
- **2026-05-28:** Netbird fully removed from server
- **2026-05-28:** ZenArmor → Grafana pipeline all 3 phases complete
- **2026-05-27:** Jon Snow Phase 3 complete — approval gate, Discord approve/reject
- **2026-04-30:** Agent OS architecture: plain markdown files at /opt/agent-os/, Gitea-tracked, cron-scheduled