Claude Code 6cebab9a4a docs: comprehensive update — bring all Agent OS docs current for LLM onboarding
All files were 5-7 weeks stale. Updated brain.md (complete service/agent/VPN/cron
inventory), identity.md (current expertise + infra context), CLAUDE.md (full agent
ecosystem table, Citadel tool registry, gotchas), README.md (LLM quick-start guide),
all memory files (current projects, decisions, constraints, persistent facts), and
infra-monitor skill.md (current container list with criticality tiers).

Also fixed: git remote switched from HTTP+embedded-token to SSH, removed references
to decommissioned services (Netbird, WireGuard, Flowise, Zabbix), corrected Ollama
IP (172.27.40.20), TrueNAS IP (172.27.40.220), and added 20+ services/agents that
were built since the last commit.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
2026-06-19 17:15:45 +00:00

Agent OS — Project CLAUDE.md

What This Project Is

Personal Agentic Operating System for NxM / Nexum SA infrastructure. Tool-agnostic AI foundation for scheduled skills, monitoring, and automation. Plain markdown files — no databases, no vendor lock-in.

  • Runtime: /opt/agent-os/ on 172.27.40.3
  • Gitea: git.nxm.co.za/admin/agent-os (SSH: gitea-local:admin/agent-os.git)
  • Owner: Jaco Bezuidenhout, Nexum SA (PTY) Ltd

Current Phase

| Phase | Status |
| --- | --- |
| 1: NFS export + mount | DONE 2026-05-01 (NFS no longer needed; consolidated to server) |
| 2: Identity interview → identity.md | DONE 2026-05-01 |
| 3: infra-monitor skill | NEXT (spec at skills/infra-monitor/skill.md, needs update) |
| 4: Cron scheduling (hourly heartbeat + daily digest) | Pending Phase 3 |
| 5: Future skills (backup monitor, log digest) | Future |

Live Agent Ecosystem (as of 2026-06-19)

All agents run as Docker containers on 172.27.40.3 unless noted. Every agent writes to /opt/agent-os/logs/<agent>/last-run.json.
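
The log convention above makes a simple freshness check possible. The following is a minimal sketch, not an existing tool: it assumes only the path pattern given here and uses the file's modification time, because the JSON schema inside last-run.json is not specified in this document. The function name `read_last_runs` and the one-hour threshold are illustrative.

```python
import json
import time
from pathlib import Path

LOG_ROOT = Path("/opt/agent-os/logs")  # log root as stated in this document


def read_last_runs(log_root: Path = LOG_ROOT, max_age_s: float = 3600.0) -> dict:
    """Return {agent: {"age_s", "stale", "data"}} for every <agent>/last-run.json.

    Staleness is judged from the file mtime, since the JSON schema is unknown.
    """
    now = time.time()
    report = {}
    for path in sorted(log_root.glob("*/last-run.json")):
        age_s = now - path.stat().st_mtime
        try:
            data = json.loads(path.read_text(encoding="utf-8"))
        except (OSError, json.JSONDecodeError):
            data = None  # unreadable or malformed file; the caller decides what to do
        report[path.parent.name] = {
            "age_s": age_s,
            "stale": age_s > max_age_s,
            "data": data,
        }
    return report
```

Calling `read_last_runs()` on the server would return one entry per agent directory. A per-agent threshold would be more realistic than a single `max_age_s`, since a daily job and a 15-minute job go stale at very different rates.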

Always-on agents

| Agent | Port | Stack Path | Role |
| --- | --- | --- | --- |
| citadel-mcp | 8300 | /opt/stacks/citadel-mcp/ | MCP tool server (37 tools: Docker, Plane, TRMM, Directus, files, web search) |
| raven-notify | 8400 | /opt/stacks/raven-notify/ | Notification hub: Discord webhook + Gmail SMTP |
| sam-research | 8500 | /opt/stacks/sam-research/ | SearXNG + Ollama research agent |
| qyburn-coder | 8700 | /opt/stacks/qyburn-coder/ | LLM coding agent with approve/reject workflow |
| maester-reports | 8800 | /opt/stacks/maester-reports/ | NIST CSF compliance reports (⚠ restart = Anthropic API cost) |
| jon-snow | 8900 | /opt/stacks/jon-snow/ | Chief of staff orchestrator, HMAC approval gate |
| hodor-gateway | 8200 | /opt/stacks/hodor-gateway/ | Simple Ollama gateway (POST /ask) |
| tarly-backup | 8750 | /opt/stacks/tarly-backup/ | Backup monitoring: OPNsense configs + Proxmox |
| hermes-cloud | 8643 | /opt/stacks/hermes-cloud/ | Claude-sonnet brain, Citadel MCP wired |
| hermes-native | VM 108 | native on 172.27.40.30 | Primary conversational agent: OpenRouter, Honcho memory, WhatsApp, dashboard at hermes.nxm.co.za:9119 |

Scheduled/one-shot agents

| Agent | Schedule | Stack Path | Role |
| --- | --- | --- | --- |
| bran-changelog | Daily 06:00 | /opt/stacks/bran-changelog/ | Git changelog generator |
| varys-monitor | Every 15 min | /opt/stacks/varys-monitor/ | HTTP reachability checks for all services |
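
For orientation, an HTTP reachability check of the kind attributed to varys-monitor can be sketched with the standard library alone. This is not Varys's actual code, which is not shown in this document; the function name `check_url` and the result keys are assumptions made for illustration.

```python
import urllib.error
import urllib.request


def check_url(url: str, timeout: float = 5.0) -> dict:
    """Probe one URL; any HTTP response, even 4xx/5xx, counts as reachable."""
    try:
        with urllib.request.urlopen(url, timeout=timeout) as resp:
            return {"url": url, "reachable": True, "status": resp.status}
    except urllib.error.HTTPError as exc:
        # The server answered with an error status: reachable, but worth flagging.
        return {"url": url, "reachable": True, "status": exc.code}
    except (urllib.error.URLError, OSError) as exc:
        # DNS failure, refused connection, timeout, TLS error, and similar.
        return {"url": url, "reachable": False, "status": None, "error": str(exc)}
```

`HTTPError` is caught before `URLError` because it is a subclass of it; reversing the order would report every 4xx/5xx response as unreachable.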

Support agents (via Hermes Native, VM 108)

| Agent | Role |
| --- | --- |
| vexis (workshop profile) | Nexum workshop agent: TRMM script execution on client devices |

Integrations running as cron (not standalone agents)

| Job | Schedule | Script |
| --- | --- | --- |
| ovpn-status.py | Every 1 min | /opt/stacks/monitoring/ovpn-status.py |
| trmm-frappe-sync.py | Every 30 min | /opt/stacks/monitoring/trmm-frappe-sync.py |
| zenarmor-pull.py | Daily 06:00 | /opt/stacks/monitoring/zenarmor-pull.py |
| hub-backup.sh | Daily 02:05 | /opt/stacks/tarly-backup/hub-backup.sh |

Citadel MCP Tool Registry (37 tools)

Citadel is the central tool server that other agents call over MCP. Its tools, grouped by category:

  • File operations: read_file, write_file, list_files, delete_file, propose_file_change
  • Docker: docker_list_containers, docker_container_stats, docker_stack_list, docker_rebuild
  • Plane (project management): plane_add_issue, plane_get_issues, plane_list_projects, plane_create_project, plane_create_page, plane_update_issue
  • TRMM (remote management): trmm_list_agents, trmm_get_agent, trmm_list_scripts, trmm_add_script, trmm_delete_script, trmm_run_script, trmm_confirm_with_user, trmm_sync_now
  • Directus (CRM): directus_list_clients, directus_get_client, directus_get_client_services, directus_get_renewals, directus_upcoming_renewals
  • Other: get_agent_status, get_agent_output, list_agents, qyburn_task, qyburn_status, qyburn_approve, sam_research, web_search, proxmox_backup_status
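
The registry can also be held as plain data, which makes the stated count of 37 easy to verify. The sketch below copies the tool names from the list in this section; the dictionary layout, the short category keys, and the helper `category_of` are illustrative and do not describe how Citadel stores its tools.

```python
# Tool names copied from this document; the grouping keys are shorthand.
CITADEL_TOOLS = {
    "files": ["read_file", "write_file", "list_files", "delete_file",
              "propose_file_change"],
    "docker": ["docker_list_containers", "docker_container_stats",
               "docker_stack_list", "docker_rebuild"],
    "plane": ["plane_add_issue", "plane_get_issues", "plane_list_projects",
              "plane_create_project", "plane_create_page", "plane_update_issue"],
    "trmm": ["trmm_list_agents", "trmm_get_agent", "trmm_list_scripts",
             "trmm_add_script", "trmm_delete_script", "trmm_run_script",
             "trmm_confirm_with_user", "trmm_sync_now"],
    "directus": ["directus_list_clients", "directus_get_client",
                 "directus_get_client_services", "directus_get_renewals",
                 "directus_upcoming_renewals"],
    "other": ["get_agent_status", "get_agent_output", "list_agents",
              "qyburn_task", "qyburn_status", "qyburn_approve",
              "sam_research", "web_search", "proxmox_backup_status"],
}


def category_of(tool: str) -> str:
    """Return the category a tool belongs to, or raise KeyError if unknown."""
    for category, tools in CITADEL_TOOLS.items():
        if tool in tools:
            return category
    raise KeyError(tool)


def total_tools() -> int:
    """Count all registered tool names across categories."""
    return sum(len(tools) for tools in CITADEL_TOOLS.values())
```

The six groups contain 5, 4, 6, 8, 5, and 9 names, which sums to the 37 tools quoted in the heading.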

Agent Web Pages

Static HTML dashboards served at agents.nxm.co.za/<name>/ from /opt/sites/:

  • agents-dashboard, bran, changelog, citadel, hermes, hermes-native, hodor, jon-snow, qyburn, raven, sam, security-review, setup, stock, swarm, tarly, varys, workflow-test

Phase 3 — infra-monitor (NEXT)

Skill scaffold at skills/infra-monitor/skill.md. Spec is stale — needs update before building.

Goal: Docker container state + system resource checks. Complements Varys (HTTP reachability) — do not duplicate.

Before building:

  • Update skills/infra-monitor/skill.md — container list is stale (references Flowise/Netbird, missing 20+ services)
  • Point the spec at the current Ollama URL: http://172.27.40.20:11434
  • Decide: Docker one-shot container (consistent with bran/varys) or host cron + shell script?

Output targets:

  • /opt/sites/infra-monitor/index.html — web dashboard at agents.nxm.co.za/infra-monitor/
  • /opt/agent-os/logs/infra-monitor/last-run.json — machine-readable, read by Varys watchdog
  • Raven alert on critical: http://raven-notify:8400
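
The container-state and resource checks, together with the last-run.json output target, might be sketched as follows. This is a sketch under stated assumptions, not the planned implementation: the report keys (`status`, `containers_not_running`, and so on) are invented for illustration because no schema is defined here, and the Raven alert is left out because its request format is not given in this document. The `docker ps -a --format '{{json .}}'` invocation is standard Docker CLI and prints one JSON object per line.

```python
import json
import os
import shutil
import subprocess
import tempfile
from datetime import datetime, timezone
from pathlib import Path


def parse_docker_ps(output: str) -> list[dict]:
    """Parse the one-JSON-object-per-line output of `docker ps -a --format '{{json .}}'`."""
    return [json.loads(line) for line in output.splitlines() if line.strip()]


def build_report(containers: list[dict], disk_path: str = "/") -> dict:
    """Assemble a report dict; the key names are illustrative, not an agreed schema."""
    not_running = sorted(c["Names"] for c in containers if c.get("State") != "running")
    usage = shutil.disk_usage(disk_path)
    return {
        "timestamp": datetime.now(timezone.utc).isoformat(),
        "status": "critical" if not_running else "ok",
        "containers_total": len(containers),
        "containers_not_running": not_running,
        "disk_used_pct": round(100 * usage.used / usage.total, 1),
        "load_avg_1m": os.getloadavg()[0],
    }


def write_atomic(path: Path, report: dict) -> None:
    """Write JSON via a temp file plus rename so readers never see a partial file."""
    path.parent.mkdir(parents=True, exist_ok=True)
    fd, tmp = tempfile.mkstemp(dir=path.parent, suffix=".tmp")
    with os.fdopen(fd, "w", encoding="utf-8") as fh:
        json.dump(report, fh, indent=2)
    os.replace(tmp, path)


def main() -> None:
    """One run: query Docker, build the report, write the log target from this document."""
    out = subprocess.run(
        ["docker", "ps", "-a", "--format", "{{json .}}"],
        capture_output=True, text=True, check=True,
    ).stdout
    report = build_report(parse_docker_ps(out))
    write_atomic(Path("/opt/agent-os/logs/infra-monitor/last-run.json"), report)
```

As written, every container that is not running counts as critical. One-shot agents such as bran-changelog and varys-monitor are expected to be exited between runs, so a real check would need an allowlist or the criticality tiers from skill.md.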

Directory Structure

/opt/agent-os/
├── CLAUDE.md                   ← this file (project brief)
├── README.md                   ← onboarding for new LLMs
├── identity.md                 ← who the user is, hard limits
├── brain.md                    ← all infra facts, IPs, services, decisions
├── memory/
│   ├── active-projects.md      ← what's in flight right now
│   ├── persistent.md           ← facts that never expire
│   ├── recent-decisions.md     ← decisions from last 30 days
│   ├── constraints.md          ← hard limits agents must respect
│   └── notes-from-last-run.md  ← cleared each session
├── claude-code/
│   └── memory/                 ← Claude Code's persistent memory files (symlinked)
├── skills/
│   └── infra-monitor/          ← Phase 3 target (not yet built)
│       ├── skill.md
│       ├── learnings.md
│       ├── eval.json
│       ├── last-output.md
│       └── context/handoff.md
└── logs/                       ← all agent log outputs
    ├── bran-changelog/
    ├── citadel-mcp/
    ├── jon-snow/
    ├── qyburn-coder/
    ├── raven-notify/
    ├── sam-research/
    ├── tarly-backup/
    ├── trmm-frappe-sync/
    └── varys-monitor/

Architecture

  • LLM inference: Ollama at http://172.27.40.20:11434 (gemma4, llama3.1:8b, phi4) + Anthropic API (Claude Code, Hermes)
  • Agent output pages: /opt/sites/<name>/ served at agents.nxm.co.za
  • Log standard: /opt/agent-os/logs/<agent>/last-run.json
  • Notifications: Raven at http://raven-notify:8400 (Discord + Gmail)
  • Task tracking: Plane at plane.nxm.co.za
  • Client CRM: Directus at directus.nxm.co.za
  • Client devices: Tactical RMM at 172.27.40.4 (45 agents, 13 clients)
  • Helpdesk: Frappe Helpdesk at helpdesk.nxm.co.za (VM 109)
  • Credentials: ~/.nxm-keys (chmod 600) — ONLY place credential values live
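
The chmod 600 requirement on the credentials file can be enforced at load time. The sketch below assumes ~/.nxm-keys holds simple KEY=VALUE lines; that format is not stated in this document, and `load_keys` is a hypothetical helper rather than an existing one.

```python
import stat
from pathlib import Path


def load_keys(path: Path = Path.home() / ".nxm-keys") -> dict[str, str]:
    """Load KEY=VALUE pairs, refusing a file that group or others can access.

    The KEY=VALUE format is an assumption; adjust the parsing to the real file.
    """
    mode = stat.S_IMODE(path.stat().st_mode)
    if mode & 0o077:
        raise PermissionError(f"{path} has mode {oct(mode)}; expected 0o600")
    keys: dict[str, str] = {}
    for line in path.read_text(encoding="utf-8").splitlines():
        line = line.strip()
        if not line or line.startswith("#") or "=" not in line:
            continue  # skip blanks, comments, and lines without an assignment
        name, _, value = line.partition("=")
        keys[name.strip()] = value.strip().strip('"').strip("'")
    return keys
```

Failing loudly on loose permissions keeps the "only place credential values live" rule meaningful: a copy made with default permissions is rejected instead of being used silently.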

Key Gotchas

  • maester-reports restart = Anthropic API cost — cache is in-memory only
  • Open WebUI → Citadel MCP: auth_type must be none (empty bearer key = silent failure)
  • Docker → OPNsense API: Docker proxy network can't reach 172.27.6.1 (HTTP 400) — run from host
  • Headscale v0.28: all write operations require numeric user ID, not username
  • Vaultwarden: requires HTTPS — use vault.nxm.co.za, not LAN IP
  • Tailscale on Windows: overrides DNS — disconnect when testing split DNS
  • NPM forward scheme: use HTTP even when the external URL is HTTPS, because NPM handles SSL termination
  • NocoDB: RvDM personal birthday DB only — never use for Nexum projects