docs: comprehensive update — bring all Agent OS docs current for LLM onboarding

All files were 5-7 weeks stale. Updated brain.md (complete service/agent/VPN/cron
inventory), identity.md (current expertise + infra context), CLAUDE.md (full agent
ecosystem table, Citadel tool registry, gotchas), README.md (LLM quick-start guide),
all memory files (current projects, decisions, constraints, persistent facts), and
infra-monitor skill.md (current container list with criticality tiers).

Also fixed: switched the git remote from HTTP+embedded-token to SSH, removed references
to decommissioned services (Netbird, WireGuard, Flowise, Zabbix), corrected the Ollama
IP (172.27.40.20) and TrueNAS IP (172.27.40.220), and added 20+ services/agents that
were built since the last commit.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
Claude Code
2026-06-19 17:15:11 +00:00
parent 638b2edd56
commit 6cebab9a4a
9 changed files with 427 additions and 128 deletions
+110 -33
@@ -1,28 +1,84 @@
# Agent OS — Project CLAUDE.md
## What This Project Is
Personal Agentic Operating System. Tool-agnostic AI foundation for scheduled skills, monitoring, and automation.
- Runtime: `/opt/agent-os/` on 172.27.40.3
- Gitea: `git.nxm.co.za/admin/agent-os`
- Edit clone (server): `/home/nxm/Documents/agent-os/` (clone pending)
Personal Agentic Operating System for NxM / Nexum SA infrastructure. Tool-agnostic AI foundation for scheduled skills, monitoring, and automation. Plain markdown files — no databases, no vendor lock-in.
- **Runtime:** `/opt/agent-os/` on 172.27.40.3
- **Gitea:** `git.nxm.co.za/admin/agent-os` (SSH: `gitea-local:admin/agent-os.git`)
- **Owner:** Jaco Bezuidenhout, Nexum SA (PTY) Ltd
## Current Phase
| Phase | Status |
|---|---|
| 1 — NFS export + Kubuntu mount | DONE 2026-05-01 (NFS no longer needed — consolidated to server) |
| 2 — Identity interview → identity.md populated | ✓ DONE 2026-05-01 |
| **3 — infra-monitor skill** | **NEXT** |
| 1 — NFS export + mount | DONE 2026-05-01 (NFS no longer needed — consolidated to server) |
| 2 — Identity interview → identity.md | DONE 2026-05-01 |
| 3 — infra-monitor skill | NEXT (spec at `skills/infra-monitor/skill.md`, needs update) |
| 4 — Cron scheduling (hourly heartbeat + daily digest) | Pending Phase 3 |
| 5 — Future skills (backup monitor, peer health, log digest) | Future |
| 5 — Future skills (backup monitor, log digest) | Future |
## Live Agent Ecosystem (as of 2026-06-19)
All agents run as Docker containers on 172.27.40.3 unless noted. Every agent writes to `/opt/agent-os/logs/<agent>/last-run.json`.
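The log convention above lends itself to a simple freshness check. The following is a minimal sketch, assuming only that each agent rewrites its `last-run.json` on every run; the file's internal schema is not described here, so the sketch looks at modification time alone. The function name `stale_agents` and the threshold are illustrative, not part of the existing tooling.

```python
import time
from pathlib import Path
from typing import List, Optional


def stale_agents(log_root: Path, max_age_seconds: float,
                 now: Optional[float] = None) -> List[str]:
    """Return agent directory names whose last-run.json is missing or too old."""
    now = time.time() if now is None else now
    stale = []
    for agent_dir in sorted(p for p in log_root.iterdir() if p.is_dir()):
        marker = agent_dir / "last-run.json"
        # A missing marker counts as stale: the agent has never reported.
        if not marker.exists() or now - marker.stat().st_mtime > max_age_seconds:
            stale.append(agent_dir.name)
    return stale
```

On the server this might be called as `stale_agents(Path("/opt/agent-os/logs"), 24 * 3600)`. A single threshold is a simplification, since agents on a 15-minute schedule and daily agents would likely need different limits.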
### Always-on agents
| Agent | Port | Stack Path | Role |
|---|---|---|---|
| citadel-mcp | 8300 | `/opt/stacks/citadel-mcp/` | MCP tool server (37 tools: Docker, Plane, TRMM, Directus, files, web search) |
| raven-notify | 8400 | `/opt/stacks/raven-notify/` | Notification hub — Discord webhook + Gmail SMTP |
| sam-research | 8500 | `/opt/stacks/sam-research/` | SearXNG + Ollama research agent |
| qyburn-coder | 8700 | `/opt/stacks/qyburn-coder/` | LLM coding agent with approve/reject workflow |
| maester-reports | 8800 | `/opt/stacks/maester-reports/` | NIST CSF compliance reports (⚠ restart = Anthropic API cost) |
| jon-snow | 8900 | `/opt/stacks/jon-snow/` | Chief of staff orchestrator, HMAC approval gate |
| hodor-gateway | 8200 | `/opt/stacks/hodor-gateway/` | Simple Ollama gateway (POST /ask) |
| tarly-backup | 8750 | `/opt/stacks/tarly-backup/` | Backup monitoring — OPNsense configs + Proxmox |
| hermes-cloud | 8643 | `/opt/stacks/hermes-cloud/` | Claude-sonnet brain, Citadel MCP wired |
| hermes-native | VM 108 | native on 172.27.40.30 | Primary conversational agent — OpenRouter, Honcho memory, WhatsApp, dashboard at hermes.nxm.co.za:9119 |
### Scheduled/one-shot agents
| Agent | Schedule | Stack Path | Role |
|---|---|---|---|
| bran-changelog | Daily 06:00 | `/opt/stacks/bran-changelog/` | Git changelog generator |
| varys-monitor | Every 15 min | `/opt/stacks/varys-monitor/` | HTTP reachability checks for all services |
### Support agents (via Hermes Native, VM 108)
| Agent | Role |
|---|---|
| vexis (workshop profile) | Nexum workshop agent — TRMM script execution on client devices |
### Integrations running as cron (not standalone agents)
| Job | Schedule | Script |
|---|---|---|
| ovpn-status.py | Every 1 min | `/opt/stacks/monitoring/ovpn-status.py` |
| trmm-frappe-sync.py | Every 30 min | `/opt/stacks/monitoring/trmm-frappe-sync.py` |
| zenarmor-pull.py | Daily 06:00 | `/opt/stacks/monitoring/zenarmor-pull.py` |
| hub-backup.sh | Daily 02:05 | `/opt/stacks/tarly-backup/hub-backup.sh` |
## Citadel MCP Tool Registry (37 tools)
The central tool server that other agents call via MCP protocol:
**File operations:** read_file, write_file, list_files, delete_file, propose_file_change
**Docker:** docker_list_containers, docker_container_stats, docker_stack_list, docker_rebuild
**Plane (project management):** plane_add_issue, plane_get_issues, plane_list_projects, plane_create_project, plane_create_page, plane_update_issue
**TRMM (remote management):** trmm_list_agents, trmm_get_agent, trmm_list_scripts, trmm_add_script, trmm_delete_script, trmm_run_script, trmm_confirm_with_user, trmm_sync_now
**Directus (CRM):** directus_list_clients, directus_get_client, directus_get_client_services, directus_get_renewals, directus_upcoming_renewals
**Other:** get_agent_status, get_agent_output, list_agents, qyburn_task, qyburn_status, qyburn_approve, sam_research, web_search, proxmox_backup_status
## Agent Web Pages
Static HTML dashboards served at `agents.nxm.co.za/<name>/` from `/opt/sites/`:
- agents-dashboard, bran, changelog, citadel, hermes, hermes-native, hodor, jon-snow, qyburn, raven, sam, security-review, setup, stock, swarm, tarly, varys, workflow-test
## Phase 3 — infra-monitor (NEXT)
Skill scaffold at `skills/infra-monitor/skill.md`. Ready to implement after spec update.
Skill scaffold at `skills/infra-monitor/skill.md`. **Spec is stale — needs update before building.**
**Goal:** Docker container state + system resource checks. Complements Varys (HTTP reachability) — do not duplicate.
**Before building:**
- Update `skills/infra-monitor/skill.md` — container list is stale (has Flowise, missing Open WebUI + all new agents)
- Correct Ollama URL: now `http://172.27.40.20:11434` (migrated from 172.27.6.139)
- Update `skills/infra-monitor/skill.md` — container list is stale (references Flowise/Netbird, missing 20+ services)
- Ollama URL is now `http://172.27.40.20:11434`
- Decide: Docker one-shot container (consistent with bran/varys) or host cron + shell script?
**Output targets:**
@@ -30,38 +86,59 @@ Skill scaffold at `skills/infra-monitor/skill.md`. Ready to implement after spec
- `/opt/agent-os/logs/infra-monitor/last-run.json` — machine-readable, read by Varys watchdog
- Raven alert on critical: `http://raven-notify:8400`
**Schedule:** hourly heartbeat (Docker + Ollama only) + daily 07:00 full digest
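As a starting point for the container-state half of this goal, here is a minimal sketch that parses the Docker CLI's JSON output. It assumes container names match the agent names used in this file, which the spec would need to confirm; the helper names (`parse_docker_ps`, `not_running`, `collect_states`) are hypothetical.

```python
import json
import subprocess
from typing import Dict, Iterable, List


def parse_docker_ps(output: str) -> Dict[str, str]:
    """Map container name to state from `docker ps -a --format '{{json .}}'` output."""
    states = {}
    for line in output.splitlines():
        if line.strip():
            row = json.loads(line)  # one JSON object per container, one per line
            states[row["Names"]] = row["State"]
    return states


def not_running(states: Dict[str, str], expected: Iterable[str]) -> List[str]:
    """Expected containers that are absent or in any state other than 'running'."""
    return sorted(name for name in expected if states.get(name) != "running")


def collect_states() -> Dict[str, str]:
    """Run the Docker CLI; the caller needs permission to reach the Docker daemon."""
    result = subprocess.run(
        ["docker", "ps", "-a", "--format", "{{json .}}"],
        capture_output=True, text=True, check=True,
    )
    return parse_docker_ps(result.stdout)
```

Keeping the parser separate from the subprocess call means the logic can be tested without Docker, and it works the same whether the skill ends up as a one-shot container or a host cron job.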
## Directory Structure
```
/opt/agent-os/
├── CLAUDE.md ← this file (project brief, tracked in Gitea)
├── identity.md ← populated Phase 2
├── brain.md
├── CLAUDE.md ← this file (project brief)
├── README.md ← onboarding for new LLMs
├── identity.md ← who the user is, hard limits
├── brain.md ← all infra facts, IPs, services, decisions
├── memory/
│ ├── active-projects.md ← update at end of each session
│ ├── persistent.md
│ ├── recent-decisions.md
│ ├── constraints.md
│ └── notes-from-last-run.md
├── context/
│ ├── active-projects.md ← what's in flight right now
│ ├── persistent.md ← facts that never expire
│ ├── recent-decisions.md ← decisions from last 30 days
│ ├── constraints.md ← hard limits agents must respect
│ └── notes-from-last-run.md ← cleared each session
├── claude-code/
│ └── memory/ ← Claude Code's persistent memory files (symlinked)
├── skills/
│ └── infra-monitor/ ← Phase 3 target
│ ├── skill.md ← spec (stale container list — update before building)
│ └── infra-monitor/ ← Phase 3 target (not yet built)
│ ├── skill.md
│ ├── learnings.md
│ ├── eval.json
│ ├── last-output.md
│ └── context/handoff.md
└── logs/
└── logs/ ← all agent log outputs
├── bran-changelog/
├── citadel-mcp/
├── jon-snow/
├── qyburn-coder/
├── raven-notify/
├── sam-research/
├── tarly-backup/
├── trmm-frappe-sync/
└── varys-monitor/
```
## Architecture
- LLM inference: Kubuntu Ollama at `http://172.27.40.20:11434`
- All agent output: `/opt/sites/<name>/` served at agents.nxm.co.za
- Log standard: `/opt/agent-os/logs/<skill>/last-run.json`
- Notifications: Raven at `http://raven-notify:8400`
## Pending — Gitea SSH Key (security debt)
Server remote uses HTTP with embedded token. Before next token rotation:
1. Add SSH key for `nxm@172.27.40.3` to Gitea (Admin → Settings → SSH Keys)
2. `cd /opt/agent-os && git remote set-url origin gitea-local:admin/agent-os.git`
- **LLM inference:** Ollama at `http://172.27.40.20:11434` (gemma4, llama3.1:8b, phi4) + Anthropic API (Claude Code, Hermes)
- **Agent output pages:** `/opt/sites/<name>/` served at agents.nxm.co.za
- **Log standard:** `/opt/agent-os/logs/<agent>/last-run.json`
- **Notifications:** Raven at `http://raven-notify:8400` (Discord + Gmail)
- **Task tracking:** Plane at plane.nxm.co.za
- **Client CRM:** Directus at directus.nxm.co.za
- **Client devices:** Tactical RMM at 172.27.40.4 (45 agents, 13 clients)
- **Helpdesk:** Frappe Helpdesk at helpdesk.nxm.co.za (VM 109)
- **Credentials:** `~/.nxm-keys` (chmod 600) — ONLY place credential values live
## Key Gotchas
- **maester-reports restart = Anthropic API cost** — cache is in-memory only
- **Open WebUI → Citadel MCP:** auth_type must be `none` (empty bearer key = silent failure)
- **Docker → OPNsense API:** Docker proxy network can't reach 172.27.6.1 (HTTP 400) — run from host
- **Headscale v0.28:** all write operations require numeric user ID, not username
- **Vaultwarden:** requires HTTPS — use vault.nxm.co.za, not LAN IP
- **Tailscale on Windows:** overrides DNS — disconnect when testing split DNS
- **NPM forward scheme:** HTTP even for HTTPS external — NPM handles SSL termination
- **NocoDB:** RvDM personal birthday DB only — never use for Nexum projects
+38 -6
@@ -10,14 +10,41 @@ Every agent interaction reads from and writes back to files in this repo. No dat
| Layer | File(s) | Purpose |
|---|---|---|
| Identity | `identity.md` | Who you are, communication style, values |
| Identity | `identity.md` | Who the user is, communication style, values, hard limits |
| Context | `context/` | Dated, task-specific working files |
| Brain | `brain.md` | Persistent facts — infra, people, decisions |
| Brain | `brain.md` | Persistent facts — infra, services, IPs, standing decisions |
| Memory | `memory/` | Short and long-term session notes |
| Skills | `skills/` | Repeatable workflows, each self-improving |
| Processes | `skills/*/context/handoff.md` | Output passed between chained skills |
| Automation | cron on 172.27.40.3 | Scheduled skill execution |
## Quick start for a new LLM
If you are an LLM reading this repo for the first time:
1. **Read `identity.md`** — who you're working for, hard limits, communication style
2. **Read `brain.md`** — all infrastructure facts: IPs, services, ports, agents, standing decisions
3. **Read `memory/active-projects.md`** — what's currently in flight
4. **Read `memory/constraints.md`** — things you must never do
5. **Read `CLAUDE.md`** — project status and architecture details
Do NOT take any action without reading `identity.md` first. The hard limits there are non-negotiable.
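For tooling that assembles this context programmatically, the reading order could be enforced along these lines. This is a sketch only: `load_context` is a hypothetical helper, and treating `identity.md` as the single mandatory file is an interpretation of the rule above.

```python
from pathlib import Path
from typing import List, Tuple

# Reading order taken from the numbered list above.
READING_ORDER = [
    "identity.md",
    "brain.md",
    "memory/active-projects.md",
    "memory/constraints.md",
    "CLAUDE.md",
]


def load_context(repo_root: Path) -> List[Tuple[str, str]]:
    """Return (relative path, text) pairs in reading order; fail if identity.md is absent."""
    if not (repo_root / "identity.md").is_file():
        raise FileNotFoundError("identity.md is required before any action")
    docs = []
    for rel in READING_ORDER:
        path = repo_root / rel
        if path.is_file():  # other files are skipped when absent
            docs.append((rel, path.read_text(encoding="utf-8")))
    return docs
```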
## Live agent ecosystem
The NxM infrastructure runs 12+ named agents across Docker containers and VMs. Every agent writes logs to `/opt/agent-os/logs/<agent>/last-run.json` and most publish web dashboards to `agents.nxm.co.za/<agent>/`.
Key agents:
- **Citadel MCP** (port 8300) — central tool server, 37 tools covering Docker, Plane, TRMM, Directus, file ops, web search
- **Raven** (port 8400) — notification hub (Discord + Gmail), all alerts route through here
- **Jon Snow** (port 8900) — chief of staff orchestrator with approval gates
- **Maester** (port 8800) — NIST CSF compliance reporting
- **Hermes Native** (VM 108) — primary conversational agent with WhatsApp + Honcho memory
- **Tarly** (port 8750) — backup monitoring (OPNsense configs + Proxmox)
- **Vexis** (via Hermes, VM 108) — workshop/TRMM scripting agent for client devices
See `brain.md` for the complete agent table with ports and schedules.
## Adding a new skill
1. Create `skills/<skill-name>/skill.md` — what the skill does and how
@@ -28,10 +55,11 @@ Every agent interaction reads from and writes back to files in this repo. No dat
## Runtime
- Files live on server: `/opt/agent-os/` (cloned from this repo)
- LLM inference: Ollama at `http://172.27.6.139:11434`
- Scheduled jobs: cron on `172.27.40.3`
- Local editing: `/home/nxm/Documents/agent-os/` on Kubuntu (this machine)
- **Server:** `/opt/agent-os/` on 172.27.40.3 (Ubuntu, Docker host)
- **Repo:** `git.nxm.co.za/admin/agent-os` (SSH: `gitea-local:admin/agent-os.git`)
- **LLM inference:** Ollama at `http://172.27.40.20:11434` (local) or Anthropic API (Claude Code/Hermes)
- **Scheduled jobs:** cron on 172.27.40.3
- **Agent web pages:** `/opt/sites/<name>/` → agents.nxm.co.za
## Infra reference
@@ -39,3 +67,7 @@ Cross-repo links to supporting documentation:
- [IP & Port Map](https://git.nxm.co.za/admin/nxm-infrastructure/src/branch/main/Quick%20Reference/IP%20%26%20Port%20Map.md)
- [Docker Stacks](https://git.nxm.co.za/admin/nxm-infrastructure/src/branch/main/Quick%20Reference/Docker%20Stacks.md)
- [Network Overview](https://git.nxm.co.za/admin/nxm-infrastructure/src/branch/main/Infrastructure/Network%20Overview.md)
## Credential policy
All API keys and passwords live in `~/.nxm-keys` (chmod 600). Never write credential values into code, config files, logs, or documentation. Reference the file location instead.
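A script that needs a credential can read it from that file at runtime instead of embedding it. The sketch below assumes the file holds `KEY=VALUE` lines, which is not confirmed anywhere in this repo, and both helper names are illustrative. It refuses to read the file if its permissions are looser than 600.

```python
import stat
from pathlib import Path
from typing import Dict


def is_owner_only(path: Path) -> bool:
    """True if the file mode is exactly 0o600 (owner read/write only)."""
    return stat.S_IMODE(path.stat().st_mode) == 0o600


def load_keys(path: Path) -> Dict[str, str]:
    """Parse KEY=VALUE lines (assumed format); reject files with looser permissions."""
    if not is_owner_only(path):
        raise PermissionError(f"{path} must be chmod 600")
    keys = {}
    for line in path.read_text(encoding="utf-8").splitlines():
        line = line.strip()
        if line and not line.startswith("#") and "=" in line:
            name, _, value = line.partition("=")
            keys[name.strip()] = value.strip()
    return keys
```

Typical use would be `load_keys(Path.home() / ".nxm-keys")`. The returned values should be passed straight to the client that needs them and never printed or logged.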
+99 -36
@@ -1,64 +1,127 @@
# Brain
Core facts read by all skills. Keep under 1000 words. Update when infrastructure changes.
Last updated: 2026-04-30
Core facts read by all skills. Keep under 1500 words. Update when infrastructure changes.
Last updated: 2026-06-19
---
## Infrastructure
**Primary server:** 172.27.40.3 — Ubuntu Server LTS, Docker host
**Kubuntu desktop:** 172.27.6.139 — NxM-AI, runs Ollama
**TrueNAS NAS:** 172.27.40.220 (Servers40), management: 172.27.6.221
**Firewall:** OPNsense at 172.27.6.1
**Primary server:** 172.27.40.3 — Ubuntu Server LTS, Docker host, all agent runtimes
**Ollama inference host:** 172.27.40.20 — Windows 11 Pro (NxM-AI), Vulkan GPU, Scheduled Task auto-start
**TrueNAS NAS:** 172.27.40.220 (data) / 172.27.6.221 (mgmt) — 35.6 TB, NFS shares for ISOs + Proxmox backups
**Firewall:** OPNsense at 172.27.6.1 (mgmt UI, not routed gateway)
**Proxmox VE:** 172.27.40.2 — PVE 9.1.1, 2× Xeon Gold 6138 (80 vCPUs), 252 GB RAM
**Hermes Native VM:** 172.27.40.30 (VM 108) — dedicated agent VM, Honcho memory, WhatsApp connected
**Tactical RMM:** 172.27.40.4 (VM 101) — remote management for all Nexum clients
**Home Assistant:** 172.27.10.6 (VM 100) — IoT automation
**Synology DS423+:** 172.27.40.80 — Coetzee off-site backup NAS, Active Backup via S2S
**VLANs:**
| VLAN | Name | Subnet |
|---|---|---|
| 40 | Servers40 | 172.27.40.0/24 |
| 20 | Workshop20 | 172.27.20.0/24 |
| 10 | IoT10 | 172.27.10.0/24 |
| VLAN | Name | Subnet | Gateway |
|---|---|---|---|
| 40 | Servers40 | 172.27.40.0/24 | 172.27.40.1 |
| 20 | Workshop20 | 172.27.20.0/24 | 172.27.20.1 |
| 10 | IoT10 | 172.27.10.0/24 | 172.27.10.1 |
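Automation that needs to know which VLAN an address belongs to can derive it from the table rather than hardcoding host lists. A minimal sketch using the standard `ipaddress` module follows; the `vlan_for` helper is hypothetical. Addresses in 172.27.6.0/24, such as the OPNsense management UI, fall outside the three listed VLANs and return `None`.

```python
import ipaddress
from typing import Optional

# Subnets copied from the VLAN table above.
VLANS = {
    "Servers40": ipaddress.ip_network("172.27.40.0/24"),
    "Workshop20": ipaddress.ip_network("172.27.20.0/24"),
    "IoT10": ipaddress.ip_network("172.27.10.0/24"),
}


def vlan_for(ip: str) -> Optional[str]:
    """Return the VLAN name containing this address, or None if it is in no listed VLAN."""
    addr = ipaddress.ip_address(ip)
    for name, network in VLANS.items():
        if addr in network:
            return name
    return None
```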
## Key Services (172.27.40.3)
| Service | Port | URL |
| Service | Port | URL | Role |
|---|---|---|---|
| Portainer | 9443 | https://172.27.40.3:9443 | Docker management |
| Nginx Proxy Manager | 80/81/443 | http://172.27.40.3:81 | Reverse proxy, SSL termination |
| Uptime Kuma | 3002 | kuma.nxm.co.za | HTTP monitoring |
| Gitea | 3000 | git.nxm.co.za | Self-hosted git, all docs + code |
| Headscale | 8080 | headscale.nxm.co.za | VPN (self-hosted Tailscale) |
| Vaultwarden | 8222 | vault.nxm.co.za | Password manager |
| Open WebUI | 3010 | chat.nxm.co.za | Chat UI for Ollama + MCP |
| Plane | 8095 | plane.nxm.co.za | Project/task tracking |
| Homarr | 7575 | http://172.27.40.3:7575 | Dashboard |
| Grafana | 3020 | grafana.nxm.co.za | Monitoring dashboards |
| InfluxDB | 8086 | internal | Time-series DB for monitoring |
| NetBox | 8100 | netbox.nxm.co.za | IPAM, network documentation |
| NocoDB | 8150 | rvd.nxm.co.za | RvDM birthday DB (personal, NOT Nexum) |
| InvenTree | 8160 | inventree.nxm.co.za | IT stock + BOM tracking (testing) |
| Directus | 8850 | directus.nxm.co.za | Nexum client CRM |
| Nextcloud | — | — | Phone backup |
| Wetty | 8450/8451 | terminal.nxm.co.za / term.nxm.co.za | Web SSH terminal |
| RustDesk | 21115-21119 | internal | Self-hosted remote desktop relay |
| SearXNG | 8600 | internal | Search backend for sam + citadel |
| iVentoy | 26000 | internal | PXE boot server |
## AI / Agent Stack
**LLM inference:**
- **Ollama** on 172.27.40.20:11434 — models: gemma4, llama3.1:8b, phi4
- **Claude Code** on 172.27.40.3 — primary AI assistant (Anthropic API)
- **Hermes Native** on 172.27.40.30 — OpenRouter, Honcho memory, WhatsApp
- **Hermes Cloud** on 172.27.40.3:8643 — claude-sonnet-4-6, Citadel MCP wired
**Named agents (all Docker on 172.27.40.3 unless noted):**
| Agent | Port | Role | Schedule |
|---|---|---|---|
| hodor-gateway | 8200 | Simple Ollama gateway (POST /ask) | On-demand |
| citadel-mcp | 8300 | MCP SSE+HTTP server, 37 tools | Always-on |
| raven-notify | 8400 | Discord + Gmail notifications | Always-on |
| sam-research | 8500 | SearXNG + Ollama research | On-demand |
| qyburn-coder | 8700 | LLM coding agent (approve/reject) | On-demand |
| maester-reports | 8800 | NIST CSF compliance reports | On-demand |
| jon-snow | 8900 | Chief of staff orchestrator | Always-on |
| bran-changelog | — | Git changelog generator | Daily 06:00 |
| varys-monitor | — | Service HTTP reachability checks | Cron every 15 min |
| tarly-backup | 8750 | OPNsense config + Proxmox backup monitor | Daily 04:00 SAST |
| hermes-cloud | 8643 | Claude-powered conversational agent | Always-on |
| hermes-native | VM 108 | Primary Hermes agent (WhatsApp) | Always-on |
| vexis (workshop) | VM 108 | Nexum workshop agent (TRMM scripts) | On-demand via Hermes |
**Citadel MCP tools (37):** file ops, Docker management, Plane issues/projects/pages, TRMM (agents/scripts/confirm), Directus CRM, Proxmox backups, Qyburn task/approve, Sam research, web search, propose_file_change.
## Cron Jobs (172.27.40.3)
| Schedule | Job | Log |
|---|---|---|
| Portainer | 9443 | https://172.27.40.3:9443 |
| Nginx Proxy Manager | 80/81/443 | http://172.27.40.3:81 |
| Uptime Kuma | 3002 | http://172.27.40.3:3002 |
| Gitea | 3000 | https://git.nxm.co.za |
| Headscale | 8080 | https://headscale.nxm.co.za |
| Netbird | 3479/udp | https://netbird.nxm.co.za |
| Vaultwarden | 8222 | https://vault.nxm.co.za |
| Flowise | 3010 | http://172.27.40.3:3010 |
| Plane | 8095 | https://plane.nxm.co.za |
| Zabbix | 8091 | https://zabbix.nxm.co.za |
| Homarr | 7575 | http://172.27.40.3:7575 |
| Daily 06:00 | bran-changelog/run.sh | logs/bran.log |
| Daily 06:00 | zenarmor-pull.py | monitoring/logs/zenarmor-pull.log |
| Daily 02:05 | tarly hub-backup.sh | logs/tarly-backup/hub-backup.log |
| Every 1 min | ovpn-status.py | logs/ovpn-status.log |
| Every 30 min | trmm-frappe-sync.py | logs/trmm-frappe-sync.log |
## AI Stack
## OpenVPN S2S Sites
- **Ollama** on 172.27.6.139:11434 (bound to 0.0.0.0)
- **Models:** gemma4, qwen2.5-coder:7b
- **Flowise** on 172.27.40.3:3010 — visual agent/flow builder
- **Claude Code** — primary AI assistant, runs on Kubuntu
| Site | Tunnel IP | Status | Notes |
|---|---|---|---|
| bezhuis | 172.16.17.2 | COMPLETE | NAT + DNS overrides, LAN access live |
| mwp | 172.16.17.3 | COMPLETE | Monitoring live |
| coetzee | 172.16.17.4 | COMPLETE | Monitoring-only + Active Backup to Synology |
| fwlaw | — | PENDING | Awaiting migration |
## Agent OS Runtime
- Files: `/opt/agent-os/` on 172.27.40.3
- Local edit path: `/home/nxm/Documents/agent-os/` on 172.27.6.139
- Repo: `https://git.nxm.co.za/admin/agent-os`
- Repo: `git.nxm.co.za/admin/agent-os` (SSH remote: `gitea-local:admin/agent-os.git`)
- Scheduled jobs: cron on 172.27.40.3
- LLM calls: `http://172.27.6.139:11434`
- LLM calls: `http://172.27.40.20:11434` (Ollama) or Anthropic API (Claude Code / Hermes)
- Agent web pages: `/opt/sites/<name>/` served at agents.nxm.co.za
## Key Paths on Server
- Docker stacks: `/opt/stacks/`
- Agent OS: `/opt/agent-os/`
- Agent web pages: `/opt/sites/`
- Credentials: `~/.nxm-keys` (chmod 600) — NEVER write values elsewhere
- SSH keys: `~/.ssh/` (ED25519)
- NxM infrastructure docs: `/home/nxm/Documents/NxM Linux Server/`
- Nexum project docs: `/home/nxm/Documents/Nexum Projects/`
## Standing Decisions
- TrueNAS will move to a dedicated server — avoid hardcoding 172.27.40.5 in automation
- NPM handles all SSL termination — internal services use HTTP, NPM adds HTTPS
- NFS preferred for Linux-to-Linux file sharing
- Docker Compose only (no Kubernetes)
- All destructive actions require explicit confirmation before execution
- NPM handles all SSL termination — internal services use HTTP
- Docker Compose only (no Kubernetes, no Swarm)
- All destructive actions require explicit confirmation
- Credentials only in `~/.nxm-keys` — never in output, logs, or config files
- Netbird fully removed (2026-05-28) — VPN is Headscale + OpenVPN S2S
- WireGuard fully removed (2026-05-30) — replaced by OpenVPN S2S
- Open WebUI → Citadel MCP: auth_type must be `none` (empty bearer = silent failure)
- Docker → OPNsense API: run from host, never from inside a container (HTTP 400)
- NocoDB = RvDM personal only — never use for Nexum projects
- Nexum client data layer = Directus CRM
+18 -11
@@ -1,6 +1,6 @@
# Identity
> **Status: COMPLETE** — Interview completed 2026-05-01.
> **Status: COMPLETE** — Interview completed 2026-05-01, updated 2026-06-19.
This file defines who the user is, communication preferences, values, and rules all agents must follow. Every skill reads this file before executing.
@@ -11,7 +11,9 @@ This file defines who the user is, communication preferences, values, and rules
- **Name:** Jaco Bezuidenhout
- **Company:** Nexum SA (PTY) Ltd — Mossel Bay, South Africa
- **Role:** Business owner, IT admin, network engineer
- **Primary focus:** Network monitoring for early problem detection; IT infrastructure management for clients
- **Primary focus:** Network monitoring, NIST CSF compliance reporting, IT infrastructure management for clients
- **Domain expertise:** VLANs, inter-VLAN routing, firewall rules (OPNsense), split DNS, VPN (Headscale/OpenVPN S2S), Docker Compose, Ubuntu Server admin, reverse proxy (NPM), IPAM (NetBox), monitoring (Grafana/Uptime Kuma/InfluxDB)
- **Not expert in:** Kubernetes, cloud platforms (AWS/Azure/GCP), advanced Python (learning), application development
---
@@ -19,9 +21,10 @@ This file defines who the user is, communication preferences, values, and rules
Priority order:
1. **Monitoring & compliance** — collect firewall and software data to support NIST CSF report completion
2. **Coding** — scripting, automation, tooling
3. **Summarising** — distil logs, changelogs, reports into concise output
4. **General automation** — recurring tasks, scheduled jobs
2. **Client management** — TRMM remote management, Directus CRM, Frappe Helpdesk ticketing
3. **Coding** — scripting, automation, tooling
4. **Summarising** — distil logs, changelogs, reports into concise output
5. **General automation** — recurring tasks, scheduled jobs, backups
---
@@ -48,7 +51,7 @@ Priority order:
- Send any external message (email, webhook, notification)
- Push to git or any remote repository
- Drop, reset, or modify databases
- **Never use a cloud-hosted LLM** (OpenAI, Anthropic API, Google, etc.) unless explicitly instructed. All inference stays on local Ollama (172.27.6.139:11434).
- Expose any service publicly without confirming NPM + Cloudflare + firewall implications
---
@@ -56,13 +59,17 @@ Priority order:
- Depends on the task — choose the format that fits the output type.
- **Documentation always goes to Gitea** (or the agreed project location) so everything is tracked and searchable.
- **Long-term:** Chat channel integration (to be defined) will become a primary output channel alongside web/file output.
- **Notifications route through Raven** (Discord + Gmail) at `http://raven-notify:8400`
- **Agent web output** goes to `/opt/sites/<name>/` served at agents.nxm.co.za
---
## Infrastructure Context
- Local LLM: Ollama at `http://172.27.6.139:11434` (gemma4, qwen2.5-coder:7b)
- Server: Ubuntu at `172.27.40.3` — Docker host, all agent runtimes
- Git: Gitea at `https://git.nxm.co.za` — all code and docs live here
- Agent OS runtime: `/opt/agent-os/` on 172.27.40.3, mounted at `/mnt/agent-os` on Kubuntu
- **Ollama:** `http://172.27.40.20:11434` — Windows 11 Pro (NxM-AI), models: gemma4, llama3.1:8b, phi4
- **Server:** Ubuntu at `172.27.40.3` — Docker host, all agent runtimes
- **Hermes Native:** VM 108 at `172.27.40.30` — OpenRouter LLM, Honcho memory, WhatsApp connected
- **Git:** Gitea at `https://git.nxm.co.za` — all code and docs
- **Agent OS runtime:** `/opt/agent-os/` on 172.27.40.3
- **Credentials:** `~/.nxm-keys` (chmod 600) — API keys for NPM, OPNsense, Proxmox, TrueNAS, Plane, Gitea, NetBox
- **Claude Code:** installed on 172.27.40.3, primary AI assistant
+18 -14
@@ -1,7 +1,7 @@
# Active Projects
Current in-flight work. Update at the end of each session.
Last updated: 2026-05-16
Last updated: 2026-06-19
---
@@ -12,8 +12,8 @@ Phases 1 (NFS + mount) and 2 (identity interview) are complete.
**Phase 3 goal:** Docker container state monitoring + system resources. Complements Varys (HTTP reachability) — do not duplicate.
Pre-work before implementing:
- [ ] Update `skills/infra-monitor/skill.md` — container list is stale (has Flowise, missing Open WebUI + all new agents: citadel, varys, bran, sam, raven, qyburn, hodor, searxng, monitoring, bni-scheduler, nocodb)
- [ ] Correct Ollama URL in skill.md: now `http://172.27.40.20:11434` (moved from 172.27.6.139)
- [ ] Update `skills/infra-monitor/skill.md` — container list is stale (references Flowise/Netbird, missing 20+ current services)
- [ ] Correct Ollama URL in skill.md: now `http://172.27.40.20:11434` (moved from 172.27.6.139)
- [ ] Decide implementation: Docker one-shot container (consistent with bran/varys pattern) vs host cron + shell script
Implementation tasks:
@@ -26,23 +26,27 @@ Implementation tasks:
- [ ] Hourly heartbeat cron on 172.27.40.3
- [ ] Daily 07:00 full digest cron
- [ ] Notification channel: Raven (confirmed live at http://raven-notify:8400)
- [ ] Home Assistant integration (172.27.10.6) — optional, revisit after Phase 3
## Agent OS — Phase 5: Future Skills (Future)
- backup-monitor: TrueNAS migrated to new hardware (172.27.40.220) — skill ready to build
- Netbird/Headscale peer health: Netbird API at http://172.22.0.11:80/api/
- backup-monitor: extend Tarly with deeper TrueNAS integration
- Daily log digest: summarise /opt/agent-os/logs/ via Ollama
---
## Gitea Documentation Repos
- [x] nxm-infrastructure repo — Obsidian vault imported, CLAUDE.md added 2026-05-16
- [x] nexum-projects repo — Obsidian vault imported (on Kubuntu)
- [x] agent-os repo — scaffolding created, CLAUDE.md is global symlink
## Active Infrastructure Projects
| Project | Status | Next Step |
|---|---|---|
| **Monitoring** | bezhuis+mwp+coetzee alerts live | CPU/mem/WAN/ping Grafana rules pending |
| **OpenVPN S2S** | bezhuis/mwp/coetzee DONE | fwlaw pending |
| **Tarly Backup** | Hub working | bezhuis/mwp/coetzee API key fix (backup privilege) |
| **Directus CRM** | LIVE, 12 clients seeded | Manual data enrichment (contacts, renewals) |
| **InvenTree** | LIVE (testing) | SSL cert, production use |
| **Mailcow** | MAIL-1+2 done | Blocked on Mimecast (MAIL-3→9) |
| **Vexis** | nexum-private-customer-setup + office-install done | ESET/Evolve creds or standard-setup next |
| **Maester Phase 2** | Phase 1 live | Hermes narrative + .docx generation |
---
## Pending: Gitea SSH Key (security debt)
Server remote uses HTTP with embedded token. Before rotating:
1. Add SSH key for `nxm@172.27.40.3` to Gitea (Admin → Settings → SSH Keys)
2. `cd /opt/agent-os && git remote set-url origin gitea-local:admin/agent-os.git`
## Gitea SSH Key — DONE
Server remote switched from HTTP+token to SSH (`gitea-local:admin/agent-os.git`) on 2026-06-19.
+33 -5
@@ -1,13 +1,41 @@
# Constraints
Hard limits agents must respect. Never work around these without explicit user confirmation.
Last updated: 2026-04-30
Last updated: 2026-06-19
---
- Never take destructive or irreversible action without explicit confirmation (delete, overwrite, drop, reset, force push)
- Never store credentials in output files, logs, or generated markdown — reference their location instead
- Never skip git hooks or bypass signing
- TrueNAS is on new hardware — use 172.27.40.220 (Servers40) for services, 172.27.6.221 for management/API
## Destructive actions
- Never delete or overwrite files without explicit confirmation
- Never restart or stop services without explicit confirmation
- Never drop, reset, or modify databases without explicit confirmation
- Never force push to git or bypass hooks
- Never run `pfctl` commands on OPNsense (risk of locking out remote access)
## Credentials
- All credentials live in `~/.nxm-keys` (chmod 600) — ONLY location
- Never store credentials in output files, logs, generated markdown, .env files, or code
- Reference the file location, never the values
- TrueNAS IPs: 172.27.40.220 (Servers40 data) / 172.27.6.221 (management/API)
## Infrastructure
- Linux server (172.27.40.3) has no GPU — never schedule LLM inference to run locally there
- Ollama runs on 172.27.40.20 (Windows 11 Pro) — not on the Docker host
- Docker Compose only — no Kubernetes, no Swarm
- Docker proxy network (172.22.0.0/16) cannot reach OPNsense API at 172.27.6.1 — always run OPNsense API scripts from the host
- NPM handles SSL termination — internal services always use HTTP
## Agent-specific
- **maester-reports:** restart clears in-memory cache → re-parses all evidence PDFs via Claude Opus vision (Anthropic API cost). Avoid unnecessary restarts.
- **NocoDB:** RvDM personal birthday DB ONLY — never suggest for any Nexum project. Nexum data layer = Directus.
- **Open WebUI → Citadel MCP:** auth_type must be `none`. Empty bearer key generates illegal header → silent connection failure.
- **Qyburn task specs:** never embed code in the description field — use plain English only (14b models explain code instead of writing it)
## External communication
- Never send any external message (email, webhook, Discord notification) without explicit confirmation
- Notifications always route through Raven (http://raven-notify:8400)
- Never expose services publicly without confirming NPM + Cloudflare + firewall implications
## Naming
- For permanent infrastructure endpoints, always suggest Site-to-Site (S2S) VPN, not Road Warrior
- Use `.50+` IP range for non-firewall infrastructure devices on S2S tunnels
+32 -6
@@ -1,18 +1,44 @@
# Persistent Memory
Facts that don't expire. If you'd have to re-explain it to a new agent every time, it belongs here.
Last updated: 2026-04-30
Last updated: 2026-06-19
---
## Infrastructure decisions
- RustDesk is self-hosted on 172.27.40.3 — clients connect to local server not public relay
- Netbird signal+management both route through NPM on port 443 — exposedAddress in /opt/stacks/netbird/config.yaml must be https://netbird.nxm.co.za:443 (caddy-netbird on :8443 exists but is not used externally)
- NPM handles all SSL termination — internal services use HTTP, NPM adds HTTPS
- Headscale v0.28: all write operations require numeric user ID, not username
- Tailscale on Windows overrides DNS — disconnect before testing split DNS changes
- Servers running Tailscale must run `sudo tailscale set --accept-dns=false` before joining Netbird
- Docker Compose only — no Kubernetes, no Swarm
- Docker → OPNsense API: HTTP 400 from Docker proxy network — always run OPNsense API scripts from the host
- All internal subdomains: gray-cloud CNAME → opnsense.nxm.co.za in Cloudflare. Proxied = 523 error.
- OPNsense split DNS: all subdomains resolve to 172.27.40.3 internally via Unbound host overrides
## Decommissioned services (do not reference)
- **Netbird:** Fully removed from server 2026-05-28. Orphaned clients on mwp/coetzee/b0qxxx/fwlaw firewalls pending removal.
- **WireGuard (N2W):** Fully removed 2026-05-30. Replaced by OpenVPN S2S.
- **Flowise:** Replaced by Open WebUI 2026-05-01.
- **Zabbix:** No longer running (monitoring moved to Grafana + InfluxDB + Telegraf).
## Agent OS build state
- Phase 1-2 (file structure + NFS + identity interview): not yet started
- First skill to build: infra-monitor (Docker health + agent watchdog)
- Notifications target: Home Assistant at 172.27.10.6
- Phase 1-2 complete (file structure + identity interview)
- Phase 3 (infra-monitor skill): spec written but stale, not yet implemented
- Notifications target: Raven at http://raven-notify:8400 (Discord + Gmail)
- All agent logs write to `/opt/agent-os/logs/<agent>/last-run.json`
## Credential policy
- All API keys and passwords: `~/.nxm-keys` (chmod 600)
- Never write credential values into output, logs, docs, or config files
- Reference credential location instead
## VPN topology
- **Headscale** (self-hosted Tailscale): remote access for admin devices
- **OpenVPN S2S:** site-to-site for client firewalls (bezhuis/mwp/coetzee done, fwlaw pending)
- Hub tunnel IPs: bezhuis=172.16.17.2, mwp=172.16.17.3, coetzee=172.16.17.4
## Ollama
- Host: 172.27.40.20 (Windows 11 Pro, NxM-AI), Vulkan GPU
- Models: gemma4, llama3.1:8b, phi4
- Auto-starts via Scheduled Task (S4U + AtStartup)
- Used by: hodor-gateway, sam-research, qyburn-coder, Open WebUI
+19 -6
@@ -1,11 +1,24 @@
# Recent Decisions
Decisions made in the last 30 days that affect current work. Archive when no longer relevant.
Last updated: 2026-04-30
Decisions made in the last 60 days that affect current work. Archive when no longer relevant.
Last updated: 2026-06-19
---
- **2026-04-30:** Chose Gitea (self-hosted git) over Obsidian for documentation — AI-writable, browser-accessible, version controlled
- **2026-04-30:** Agent OS files to live on 172.27.40.3 at /opt/agent-os/, accessed from Kubuntu via NFS
- **2026-04-29:** Chose Syncthing-free approach for Obsidian migration — NFS for Linux, SMB for Windows
- **2026-04-29:** infra-monitor will be first Agent OS skill — covers Docker health and agent watchdog in one skill
- **2026-06-19:** Agent OS git remote switched from HTTP+token to SSH (gitea-local:admin/agent-os.git) — security debt resolved
- **2026-06-19:** Comprehensive Agent OS documentation update — brain.md, identity.md, all memory files brought current for LLM onboarding
- **2026-06-18:** Coetzee OpenVPN S2S complete — monitoring-only + hub-side NAT for Active Backup to Synology DS423+
- **2026-06-18:** Tarly backup service live — OPNsense config backups to TrueNAS NFS, Proxmox monitoring
- **2026-06-17:** Directus CRM live — 6 collections, 12 clients seeded from TRMM, 5 Citadel MCP tools
- **2026-06-17:** MWP Netbird fully removed, WireGuard spoke cleaned
- **2026-06-12:** NxM-AI (Kubuntu) migrated to Windows 11 Pro — same IP 172.27.40.20, Ollama via Scheduled Task
- **2026-06-12:** Vexis office-install/uninstall scripts live-tested, windows-update scripts done
- **2026-06-11:** Workshop20 → Servers40 firewall rules (1677-1680) for TRMM + Vexis access
- **2026-06-10:** Frappe Helpdesk live — TRMM→HD sync, Citadel tools, Vexis wired
- **2026-06-10:** trmm_confirm_with_user proven working (incl. response-parsing bug fix)
- **2026-05-30:** WireGuard fully removed, replaced by OpenVPN S2S
- **2026-05-29:** Maester reports Phase 1 live — 8 automated CSF controls, Grafana dashboard
- **2026-05-28:** Netbird fully removed from server
- **2026-05-28:** ZenArmor → Grafana pipeline all 3 phases complete
- **2026-05-27:** Jon Snow Phase 3 complete — approval gate, Discord approve/reject
- **2026-04-30:** Agent OS architecture: plain markdown files at /opt/agent-os/, Gitea-tracked, cron-scheduled
+60 -11
@@ -15,35 +15,84 @@ Reads before executing:
### Docker health (on 172.27.40.3)
- All expected containers are running (not exited/restarting)
- Flag any container that has restarted more than 3 times in the last hour
- Expected containers: portainer, nginx-proxy-manager, uptime-kuma, gitea, headscale, netbird, vaultwarden, flowise, plane, zabbix, homarr
- Expected containers (grouped by criticality):
**Critical (alert immediately if down):**
- nginx-proxy-manager (reverse proxy — everything depends on this)
- gitea (all code + docs)
- citadel-mcp (central tool server)
- raven-notify (notification hub)
- open-webui (chat UI)
- vaultwarden (password manager)
**Important (alert after 15 min down):**
- headscale (VPN)
- grafana (monitoring dashboards)
- influxdb (time-series data)
- portainer (Docker management)
- uptime-kuma (HTTP monitoring)
- maester-reports (CSF compliance)
- jon-snow (orchestrator)
- tarly-backup (backup monitoring)
- directus + directus-db + directus-redis (CRM)
**Normal (report in daily digest only):**
- hodor-gateway, sam-research, qyburn-coder, searxng
- homarr, headplane, headscale-ui
- plane-* (all Plane containers)
- netbox-* (all NetBox containers)
- nocodb, bni-scheduler, inventree-*, wetty, term-dash
- rustdesk-hbbs, rustdesk-hbbr
- iventoy, agent-sites
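The criticality tiers above can be sketched as a small lookup. This is a minimal illustration, not the skill's actual implementation: the names `TIERS`, `tier_for` and `alert_action` are hypothetical, and treating unlisted containers as digest-only is an assumption.

```python
from fnmatch import fnmatchcase

# Tier membership copied from the list above. Glob patterns cover the
# "plane-*", "netbox-*" and "inventree-*" container groups.
TIERS = {
    "critical": [
        "nginx-proxy-manager", "gitea", "citadel-mcp",
        "raven-notify", "open-webui", "vaultwarden",
    ],
    "important": [
        "headscale", "grafana", "influxdb", "portainer", "uptime-kuma",
        "maester-reports", "jon-snow", "tarly-backup",
        "directus", "directus-db", "directus-redis",
    ],
    "normal": [
        "hodor-gateway", "sam-research", "qyburn-coder", "searxng",
        "homarr", "headplane", "headscale-ui", "plane-*", "netbox-*",
        "nocodb", "bni-scheduler", "inventree-*", "wetty", "term-dash",
        "rustdesk-hbbs", "rustdesk-hbbr", "iventoy", "agent-sites",
    ],
}


def tier_for(container: str) -> str:
    """Return the tier name for a container, or "unknown" if unlisted."""
    for tier, patterns in TIERS.items():
        if any(fnmatchcase(container, pattern) for pattern in patterns):
            return tier
    return "unknown"


def alert_action(container: str, minutes_down: float) -> str:
    """Map a down container to "alert", "wait" or "digest"."""
    tier = tier_for(container)
    if tier == "critical":
        return "alert"  # alert immediately
    if tier == "important":
        return "alert" if minutes_down >= 15 else "wait"
    # Assumption: normal and unlisted containers only appear in the daily digest.
    return "digest"
```

For example, `alert_action("grafana", 10)` returns `"wait"`, while `alert_action("nginx-proxy-manager", 0)` returns `"alert"`.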
### Service reachability
Lightweight HTTP check (curl, timeout 5s) on each internal URL:
- http://172.27.40.3:9443 (Portainer)
- http://172.27.40.3:3002 (Uptime Kuma)
- http://172.27.40.3:3000 (Gitea)
- http://172.27.40.3:3010 (Flowise)
- http://172.27.40.3:3010 (Open WebUI)
- http://172.27.40.3:7575 (Homarr)
- http://172.27.6.139:11434 (Ollama)
- http://172.27.40.3:8300 (Citadel MCP)
- http://172.27.40.3:8400 (Raven)
- http://172.27.40.3:8800 (Maester)
- http://172.27.40.3:8900 (Jon Snow)
- http://172.27.40.3:3020 (Grafana)
- http://172.27.40.3:8100 (NetBox)
- http://172.27.40.3:8850 (Directus)
- http://172.27.40.20:11434 (Ollama on NxM-AI)
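The reachability check above specifies curl with a 5 second timeout. An equivalent check could be sketched in Python's standard library as follows. The function names and the `SERVICES` subset are illustrative, and counting any HTTP response (including 4xx/5xx) as "reachable" is an assumption about what the check should mean.

```python
import urllib.error
import urllib.request

# A subset of the URLs listed above, for illustration only.
SERVICES = {
    "Gitea": "http://172.27.40.3:3000",
    "Open WebUI": "http://172.27.40.3:3010",
    "Citadel MCP": "http://172.27.40.3:8300",
    "Ollama on NxM-AI": "http://172.27.40.20:11434",
}


def check_url(url: str, timeout: float = 5.0):
    """Return (reachable, detail) for a single URL."""
    try:
        with urllib.request.urlopen(url, timeout=timeout) as resp:
            return True, f"HTTP {resp.status}"
    except urllib.error.HTTPError as exc:
        # The server answered, so the service is up even if the status is an error.
        return True, f"HTTP {exc.code}"
    except OSError as exc:
        # Covers URLError, connection refused, DNS failures and timeouts.
        return False, f"unreachable: {exc}"


def run_checks(services=SERVICES):
    """Check every service and return {name: (reachable, detail)}."""
    return {name: check_url(url) for name, url in services.items()}
```

Calling `run_checks()` on the Docker host returns one `(reachable, detail)` pair per service, which maps directly onto the ✓ / ✗ lines in the digest.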
### Agent watchdog
For each skill directory under `../../skills/`:
- Check `last-output.md` modification time — flag if older than expected schedule
- Check `../../logs/<skill-name>/` for ERROR entries in last run
- Report: healthy / stale / erroring
For each agent log at `../../logs/<agent>/last-run.json`:
- Check modification time — flag if older than expected schedule
- Check `status` field — flag if not "success"
- Expected agents and max staleness:
- bran-changelog: 25 hours (daily)
- varys-monitor: 20 minutes (every 15 min)
- trmm-frappe-sync: 35 minutes (every 30 min)
- tarly-backup: 25 hours (daily)
- raven-notify: 25 hours (event-driven, check status only)
- citadel-mcp, sam-research, qyburn-coder, jon-snow: check status only (on-demand)
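The watchdog rules above (modification time plus the `status` field) can be sketched like this. It is a rough sketch under stated assumptions: the labels "healthy", "stale", "erroring" and "missing" and the helper names are hypothetical, only the agents with an unambiguous staleness limit are listed, and the only field assumed in `last-run.json` is `status`.

```python
import json
import time
from pathlib import Path

MINUTE = 60
HOUR = 3600

# Max staleness in seconds; None means "check status only" (on-demand agents).
MAX_STALENESS = {
    "bran-changelog": 25 * HOUR,
    "varys-monitor": 20 * MINUTE,
    "trmm-frappe-sync": 35 * MINUTE,
    "tarly-backup": 25 * HOUR,
    "citadel-mcp": None,
    "sam-research": None,
    "qyburn-coder": None,
    "jon-snow": None,
}


def classify(age_seconds, status, max_age):
    """Pure decision rule: a bad status wins over staleness."""
    if status != "success":
        return "erroring"
    if max_age is not None and age_seconds > max_age:
        return "stale"
    return "healthy"


def check_agent(agent, logs_root=Path("/opt/agent-os/logs"), now=None):
    """Read <logs_root>/<agent>/last-run.json and classify the agent."""
    path = logs_root / agent / "last-run.json"
    try:
        raw = path.read_text()
        mtime = path.stat().st_mtime
    except OSError:
        return "missing"  # assumption: a missing log gets its own label
    try:
        data = json.loads(raw)
    except ValueError:
        return "erroring"  # unparseable log counts as an error
    status = data.get("status") if isinstance(data, dict) else None
    age = (time.time() if now is None else now) - mtime
    return classify(age, status, MAX_STALENESS.get(agent))
```

Keeping `classify` separate from the file access makes the staleness rule easy to test without touching `/opt/agent-os/logs`.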
### System resources (on 172.27.40.3)
- Disk usage on / — warn if >80%, critical if >90%
- Memory usage — flag if >85%
- Docker disk usage (`docker system df`) — warn if reclaimable > 10GB
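The disk and memory thresholds above could be applied roughly as follows. This is a sketch, not the skill's code: the function names are made up for illustration, and defining memory usage as `1 - MemAvailable / MemTotal` from `/proc/meminfo` is an assumption (Linux only).

```python
import shutil


def disk_percent(path: str = "/") -> float:
    """Percentage of the filesystem at `path` that is in use."""
    usage = shutil.disk_usage(path)
    return usage.used / usage.total * 100


def disk_level(percent: float) -> str:
    """Warn above 80%, critical above 90%."""
    if percent > 90:
        return "critical"
    if percent > 80:
        return "warning"
    return "ok"


def memory_percent(meminfo_path: str = "/proc/meminfo") -> float:
    """Memory in use, derived from MemTotal and MemAvailable (assumption)."""
    fields = {}
    with open(meminfo_path) as fh:
        for line in fh:
            key, _, rest = line.partition(":")
            fields[key] = int(rest.split()[0])
    return (1 - fields["MemAvailable"] / fields["MemTotal"]) * 100


def memory_level(percent: float) -> str:
    """Flag memory usage above 85%."""
    return "warning" if percent > 85 else "ok"
```

A full run would call `disk_level(disk_percent("/"))` and `memory_level(memory_percent())` on 172.27.40.3 and feed the results into the System section of the digest.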
### Remote hosts (optional, best-effort)
- Ping 172.27.40.20 (Ollama host)
- Ping 172.27.40.30 (Hermes Native VM)
- Ping 172.27.40.2 (Proxmox)
## Output
Write a digest to `last-output.md` in this format:
- Summary line: X healthy, Y warnings, Z critical
- Section per category: Docker, Services, Agent Watchdog, System
- Section per category: Docker, Services, Agent Watchdog, System, Remote Hosts
- Each item: ✓ OK / ⚠ Warning / ✗ Critical + one line detail
Pass anomalies to `context/handoff.md` for notification skill (future).
Also write machine-readable output to `../../logs/infra-monitor/last-run.json`.
Pass anomalies to `context/handoff.md` for Raven notification.
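The digest format described above might be rendered as in the sketch below. The exact markdown layout (heading level, bullet style) and the `render_digest` name are assumptions; only the summary line, the five categories and the ✓ / ⚠ / ✗ markers come from the spec.

```python
SYMBOLS = {"ok": "✓ OK", "warning": "⚠ Warning", "critical": "✗ Critical"}
CATEGORIES = ["Docker", "Services", "Agent Watchdog", "System", "Remote Hosts"]


def render_digest(results: dict) -> str:
    """Render {category: [(level, name, detail), ...]} as a markdown digest."""
    counts = {level: 0 for level in SYMBOLS}
    for category in CATEGORIES:
        for level, _name, _detail in results.get(category, []):
            counts[level] += 1

    # Summary line first: X healthy, Y warnings, Z critical.
    lines = [
        f"{counts['ok']} healthy, {counts['warning']} warnings, "
        f"{counts['critical']} critical",
        "",
    ]
    # One section per category, one line per checked item.
    for category in CATEGORIES:
        lines.append(f"## {category}")
        for level, name, detail in results.get(category, []):
            lines.append(f"- {SYMBOLS[level]}: {name}: {detail}")
        lines.append("")
    return "\n".join(lines)
```

The returned string can be written to `last-output.md` as is. The schema of the machine-readable `last-run.json` is not specified here beyond the `status` field, so it is left out of this sketch.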
## Wrap-up
@@ -54,5 +103,5 @@ After writing output:
## Schedule
- **Heartbeat:** every hour — checks Docker + Ollama only (fast, <30s)
- **Full digest:** daily at 07:00 — all checks
- **Heartbeat:** every hour — checks Docker + Ollama + critical services only (fast, <30s)
- **Full digest:** daily at 07:00 — all checks including remote hosts and disk usage