All files were 5-7 weeks stale. Updated:
- brain.md (complete service/agent/VPN/cron inventory)
- identity.md (current expertise + infra context)
- CLAUDE.md (full agent ecosystem table, Citadel tool registry, gotchas)
- README.md (LLM quick-start guide)
- all memory files (current projects, decisions, constraints, persistent facts)
- infra-monitor skill.md (current container list with criticality tiers)
Also fixed:
- git remote switched from HTTP+embedded-token to SSH
- removed references to decommissioned services (Netbird, WireGuard, Flowise, Zabbix)
- corrected Ollama IP (172.27.40.20) and TrueNAS IP (172.27.40.220)
- added 20+ services/agents that were built since the last commit
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
Agent OS — Project CLAUDE.md
What This Project Is
Personal Agentic Operating System for NxM / Nexum SA infrastructure. Tool-agnostic AI foundation for scheduled skills, monitoring, and automation. Plain markdown files — no databases, no vendor lock-in.
- Runtime: /opt/agent-os/ on 172.27.40.3
- Gitea: git.nxm.co.za/admin/agent-os (SSH: gitea-local:admin/agent-os.git)
- Owner: Jaco Bezuidenhout, Nexum SA (PTY) Ltd
Current Phase
| Phase | Status |
|---|---|
| 1 — NFS export + mount | DONE 2026-05-01 (NFS no longer needed — consolidated to server) |
| 2 — Identity interview → identity.md | DONE 2026-05-01 |
| 3 — infra-monitor skill | NEXT (spec at skills/infra-monitor/skill.md, needs update) |
| 4 — Cron scheduling (hourly heartbeat + daily digest) | Pending Phase 3 |
| 5 — Future skills (backup monitor, log digest) | Future |
Live Agent Ecosystem (as of 2026-06-19)
All agents run as Docker containers on 172.27.40.3 unless noted. Every agent writes to /opt/agent-os/logs/<agent>/last-run.json.
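The log convention above could be implemented along these lines. This is a minimal sketch only: the field names (`agent`, `status`, `finished_at`, `details`) and the helper name `write_last_run` are hypothetical, since the actual last-run.json schema is not specified in this file.

```python
import json
import os
import tempfile
from datetime import datetime, timezone
from pathlib import Path

LOG_ROOT = Path("/opt/agent-os/logs")  # log root named in this document


def write_last_run(agent: str, status: str, details: dict,
                   log_root: Path = LOG_ROOT) -> Path:
    """Atomically write <log_root>/<agent>/last-run.json and return its path.

    The payload fields are an assumption, not the agents' real schema.
    """
    target_dir = log_root / agent
    target_dir.mkdir(parents=True, exist_ok=True)
    payload = {
        "agent": agent,
        "status": status,
        "finished_at": datetime.now(timezone.utc).isoformat(),
        "details": details,
    }
    # Write to a temp file in the same directory, then rename it into place,
    # so a reader never sees a half-written file.
    fd, tmp_name = tempfile.mkstemp(dir=target_dir, suffix=".tmp")
    with os.fdopen(fd, "w", encoding="utf-8") as handle:
        json.dump(payload, handle, indent=2)
    target = target_dir / "last-run.json"
    os.replace(tmp_name, target)
    return target
```

The write-then-rename step matters when another process reads the file on a schedule, because it rules out reading a partially written document.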
Always-on agents
| Agent | Port | Stack Path | Role |
|---|---|---|---|
| citadel-mcp | 8300 | /opt/stacks/citadel-mcp/ | MCP tool server (37 tools: Docker, Plane, TRMM, Directus, files, web search) |
| raven-notify | 8400 | /opt/stacks/raven-notify/ | Notification hub: Discord webhook + Gmail SMTP |
| sam-research | 8500 | /opt/stacks/sam-research/ | SearXNG + Ollama research agent |
| qyburn-coder | 8700 | /opt/stacks/qyburn-coder/ | LLM coding agent with approve/reject workflow |
| maester-reports | 8800 | /opt/stacks/maester-reports/ | NIST CSF compliance reports (⚠ restart = Anthropic API cost) |
| jon-snow | 8900 | /opt/stacks/jon-snow/ | Chief of staff orchestrator, HMAC approval gate |
| hodor-gateway | 8200 | /opt/stacks/hodor-gateway/ | Simple Ollama gateway (POST /ask) |
| tarly-backup | 8750 | /opt/stacks/tarly-backup/ | Backup monitoring: OPNsense configs + Proxmox |
| hermes-cloud | 8643 | /opt/stacks/hermes-cloud/ | Claude-sonnet brain, Citadel MCP wired |
| hermes-native | VM 108 | native on 172.27.40.30 | Primary conversational agent — OpenRouter, Honcho memory, WhatsApp, dashboard at hermes.nxm.co.za:9119 |
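
The table lists an HMAC approval gate for jon-snow without describing it. As a rough illustration of how such a gate can work in general, the sketch below signs and verifies an approval with HMAC-SHA256. The message layout (`task_id:decision`), the function names, and the use of a hex tag are all assumptions for illustration, not jon-snow's actual protocol.

```python
import hashlib
import hmac


def sign_approval(secret: bytes, task_id: str, decision: str) -> str:
    """Return a hex HMAC-SHA256 tag over a hypothetical "task_id:decision" message."""
    message = f"{task_id}:{decision}".encode("utf-8")
    return hmac.new(secret, message, hashlib.sha256).hexdigest()


def verify_approval(secret: bytes, task_id: str, decision: str, tag: str) -> bool:
    """Recompute the tag and compare in constant time."""
    expected = sign_approval(secret, task_id, decision)
    # compare_digest avoids leaking how many leading characters matched.
    return hmac.compare_digest(expected, tag)
```

Because the decision is part of the signed message, a tag issued for "approve" does not verify for "reject" on the same task.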
Scheduled/one-shot agents
| Agent | Schedule | Stack Path | Role |
|---|---|---|---|
| bran-changelog | Daily 06:00 | /opt/stacks/bran-changelog/ | Git changelog generator |
| varys-monitor | Every 15 min | /opt/stacks/varys-monitor/ | HTTP reachability checks for all services |
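
An HTTP reachability check of the kind attributed to varys-monitor might look roughly like this. It is a standard-library sketch, not Varys's implementation: the function name, the result fields, and the choice to count any HTTP response (including 4xx/5xx) as "reachable" are assumptions.

```python
import urllib.error
import urllib.request


def check_url(url: str, timeout: float = 5.0) -> dict:
    """Return a small result dict describing whether `url` answered over HTTP."""
    try:
        with urllib.request.urlopen(url, timeout=timeout) as response:
            return {"url": url, "reachable": True, "status": response.status}
    except urllib.error.HTTPError as exc:
        # The server answered, just with an error status (4xx/5xx).
        return {"url": url, "reachable": True, "status": exc.code}
    except (urllib.error.URLError, OSError, ValueError) as exc:
        # DNS failure, refused connection, timeout, or a malformed URL.
        return {"url": url, "reachable": False, "error": str(exc)}
```

Catching `HTTPError` before `URLError` is deliberate: `HTTPError` is a subclass of `URLError`, so the order decides whether an error status is reported as "answered" or "down".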
Support agents (via Hermes Native, VM 108)
| Agent | Role |
|---|---|
| vexis (workshop profile) | Nexum workshop agent — TRMM script execution on client devices |
Integrations running as cron (not standalone agents)
| Job | Schedule | Script |
|---|---|---|
| ovpn-status.py | Every 1 min | /opt/stacks/monitoring/ovpn-status.py |
| trmm-frappe-sync.py | Every 30 min | /opt/stacks/monitoring/trmm-frappe-sync.py |
| zenarmor-pull.py | Daily 06:00 | /opt/stacks/monitoring/zenarmor-pull.py |
| hub-backup.sh | Daily 02:05 | /opt/stacks/tarly-backup/hub-backup.sh |
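
For reference, the schedules in the table translate to cron expressions roughly as follows. The sketch renders crontab lines from the table; the interpreter paths (`/usr/bin/python3`, `/bin/bash`) and the helper names are assumptions, and the real crontab entries may differ.

```python
# (cron expression, script path) pairs translated from the schedule table.
JOBS = [
    ("* * * * *", "/opt/stacks/monitoring/ovpn-status.py"),          # every 1 min
    ("*/30 * * * *", "/opt/stacks/monitoring/trmm-frappe-sync.py"),  # every 30 min
    ("0 6 * * *", "/opt/stacks/monitoring/zenarmor-pull.py"),        # daily 06:00
    ("5 2 * * *", "/opt/stacks/tarly-backup/hub-backup.sh"),         # daily 02:05
]


def interpreter_for(script: str) -> str:
    """Pick an interpreter by file extension (assumed paths, not the real crontab)."""
    return "/usr/bin/python3" if script.endswith(".py") else "/bin/bash"


def render_crontab(jobs) -> str:
    """Render one crontab line per job: '<schedule> <interpreter> <script>'."""
    lines = [f"{schedule} {interpreter_for(script)} {script}"
             for schedule, script in jobs]
    return "\n".join(lines) + "\n"
```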
Citadel MCP Tool Registry (37 tools)
The central tool server that other agents call via MCP protocol:
- File operations: read_file, write_file, list_files, delete_file, propose_file_change
- Docker: docker_list_containers, docker_container_stats, docker_stack_list, docker_rebuild
- Plane (project management): plane_add_issue, plane_get_issues, plane_list_projects, plane_create_project, plane_create_page, plane_update_issue
- TRMM (remote management): trmm_list_agents, trmm_get_agent, trmm_list_scripts, trmm_add_script, trmm_delete_script, trmm_run_script, trmm_confirm_with_user, trmm_sync_now
- Directus (CRM): directus_list_clients, directus_get_client, directus_get_client_services, directus_get_renewals, directus_upcoming_renewals
- Other: get_agent_status, get_agent_output, list_agents, qyburn_task, qyburn_status, qyburn_approve, sam_research, web_search, proxmox_backup_status
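
To keep the registry and the stated tool count in sync, the listing can be held as data and checked. The sketch below transcribes the tool names from this section; the dictionary keys and helper names are hypothetical and are not part of Citadel itself.

```python
# Tool names transcribed from the registry listing; category keys are illustrative.
TOOLS = {
    "files": ["read_file", "write_file", "list_files", "delete_file",
              "propose_file_change"],
    "docker": ["docker_list_containers", "docker_container_stats",
               "docker_stack_list", "docker_rebuild"],
    "plane": ["plane_add_issue", "plane_get_issues", "plane_list_projects",
              "plane_create_project", "plane_create_page", "plane_update_issue"],
    "trmm": ["trmm_list_agents", "trmm_get_agent", "trmm_list_scripts",
             "trmm_add_script", "trmm_delete_script", "trmm_run_script",
             "trmm_confirm_with_user", "trmm_sync_now"],
    "directus": ["directus_list_clients", "directus_get_client",
                 "directus_get_client_services", "directus_get_renewals",
                 "directus_upcoming_renewals"],
    "other": ["get_agent_status", "get_agent_output", "list_agents",
              "qyburn_task", "qyburn_status", "qyburn_approve",
              "sam_research", "web_search", "proxmox_backup_status"],
}


def tool_count(registry: dict) -> int:
    """Total number of tools across all categories."""
    return sum(len(names) for names in registry.values())


def category_of(registry: dict, tool: str):
    """Return the category a tool is listed under, or None if it is unknown."""
    for category, names in registry.items():
        if tool in names:
            return category
    return None


print(tool_count(TOOLS))  # 37, matching the heading
```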
Agent Web Pages
Static HTML dashboards served at agents.nxm.co.za/<name>/ from /opt/sites/:
- agents-dashboard, bran, changelog, citadel, hermes, hermes-native, hodor, jon-snow, qyburn, raven, sam, security-review, setup, stock, swarm, tarly, varys, workflow-test
Phase 3 — infra-monitor (NEXT)
Skill scaffold at skills/infra-monitor/skill.md. Spec is stale — needs update before building.
Goal: Docker container state + system resource checks. Complements Varys (HTTP reachability) — do not duplicate.
Before building:
- Update skills/infra-monitor/skill.md: container list is stale (references Flowise/Netbird, missing 20+ services)
- Ollama URL is now http://172.27.40.20:11434
- Decide: Docker one-shot container (consistent with bran/varys) or host cron + shell script?
Output targets:
- /opt/sites/infra-monitor/index.html: web dashboard at agents.nxm.co.za/infra-monitor/
- /opt/agent-os/logs/infra-monitor/last-run.json: machine-readable, read by Varys watchdog
- Raven alert on critical: http://raven-notify:8400
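
Since the skill is not built yet, the following is only a sketch of the decision step behind those outputs: given container states, derive an overall status and whether a Raven alert is warranted. The critical set, the field names, and the ok/warning/critical levels are hypothetical; the real tiers are expected to come from skills/infra-monitor/skill.md.

```python
# Hypothetical critical tier; the real tiers belong in skill.md.
CRITICAL = {"citadel-mcp", "raven-notify"}


def summarize(containers: list[dict], critical: set[str] = CRITICAL) -> dict:
    """Summarize container health.

    `containers` is a list of {"name": ..., "state": ...} dicts, where state
    is Docker's state string ("running", "exited", "restarting", ...).
    """
    down = sorted(c["name"] for c in containers if c["state"] != "running")
    critical_down = [name for name in down if name in critical]
    if critical_down:
        status = "critical"
    elif down:
        status = "warning"
    else:
        status = "ok"
    return {
        "status": status,
        "down": down,
        "critical_down": critical_down,
        "alert": bool(critical_down),  # only critical failures page Raven
    }
```

Keeping this step free of I/O means the same summary can feed the HTML page, last-run.json, and the alert call without re-querying Docker.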
Directory Structure
/opt/agent-os/
├── CLAUDE.md ← this file (project brief)
├── README.md ← onboarding for new LLMs
├── identity.md ← who the user is, hard limits
├── brain.md ← all infra facts, IPs, services, decisions
├── memory/
│ ├── active-projects.md ← what's in flight right now
│ ├── persistent.md ← facts that never expire
│ ├── recent-decisions.md ← decisions from last 30 days
│ ├── constraints.md ← hard limits agents must respect
│ └── notes-from-last-run.md ← cleared each session
├── claude-code/
│ └── memory/ ← Claude Code's persistent memory files (symlinked)
├── skills/
│ └── infra-monitor/ ← Phase 3 target (not yet built)
│ ├── skill.md
│ ├── learnings.md
│ ├── eval.json
│ ├── last-output.md
│ └── context/handoff.md
└── logs/ ← all agent log outputs
    ├── bran-changelog/
    ├── citadel-mcp/
    ├── jon-snow/
    ├── qyburn-coder/
    ├── raven-notify/
    ├── sam-research/
    ├── tarly-backup/
    ├── trmm-frappe-sync/
    └── varys-monitor/
Architecture
- LLM inference: Ollama at http://172.27.40.20:11434 (gemma4, llama3.1:8b, phi4) + Anthropic API (Claude Code, Hermes)
- Agent output pages: /opt/sites/<name>/ served at agents.nxm.co.za
- Log standard: /opt/agent-os/logs/<agent>/last-run.json
- Notifications: Raven at http://raven-notify:8400 (Discord + Gmail)
- Task tracking: Plane at plane.nxm.co.za
- Client CRM: Directus at directus.nxm.co.za
- Client devices: Tactical RMM at 172.27.40.4 (45 agents, 13 clients)
- Helpdesk: Frappe Helpdesk at helpdesk.nxm.co.za (VM 109)
- Credentials: ~/.nxm-keys (chmod 600), the ONLY place credential values live
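
The chmod 600 requirement on the credentials file is easy to verify before a script reads it. A minimal sketch, assuming a POSIX host; the helper name is hypothetical and no agent is known to perform this check today.

```python
import os
import stat
from pathlib import Path


def has_private_mode(path: Path) -> bool:
    """True if the file's permission bits are exactly 0o600 (owner read/write only)."""
    mode = stat.S_IMODE(os.stat(path).st_mode)
    return mode == 0o600
```

A script could call `has_private_mode(Path.home() / ".nxm-keys")` at startup and refuse to continue if it returns False.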
Key Gotchas
- maester-reports restart = Anthropic API cost — cache is in-memory only
- Open WebUI → Citadel MCP: auth_type must be none (empty bearer key = silent failure)
- Docker → OPNsense API: Docker proxy network can't reach 172.27.6.1 (HTTP 400); run from host
- Headscale v0.28: all write operations require numeric user ID, not username
- Vaultwarden: requires HTTPS — use vault.nxm.co.za, not LAN IP
- Tailscale on Windows: overrides DNS — disconnect when testing split DNS
- NPM forward scheme: HTTP even for HTTPS external — NPM handles SSL termination
- NocoDB: RvDM personal birthday DB only — never use for Nexum projects