From 6cebab9a4adbfbd044cf815109ffca03bdc92b28 Mon Sep 17 00:00:00 2001
From: Claude Code
Date: Fri, 19 Jun 2026 17:15:11 +0000
Subject: [PATCH] =?UTF-8?q?docs:=20comprehensive=20update=20=E2=80=94=20br?=
 =?UTF-8?q?ing=20all=20Agent=20OS=20docs=20current=20for=20LLM=20onboardin?=
 =?UTF-8?q?g?=
MIME-Version: 1.0
Content-Type: text/plain; charset=UTF-8
Content-Transfer-Encoding: 8bit

All files were 5-7 weeks stale. Updated brain.md (complete service/agent/VPN/cron inventory), identity.md (current expertise + infra context), CLAUDE.md (full agent ecosystem table, Citadel tool registry, gotchas), README.md (LLM quick-start guide), all memory files (current projects, decisions, constraints, persistent facts), and infra-monitor skill.md (current container list with criticality tiers).

Also fixed: git remote switched from HTTP+embedded-token to SSH, removed references to decommissioned services (Netbird, WireGuard, Flowise, Zabbix), corrected Ollama IP (172.27.40.20), TrueNAS IP (172.27.40.220), and added 20+ services/agents that were built since the last commit.

Co-Authored-By: Claude Sonnet 4.6
---
 CLAUDE.md                     | 143 ++++++++++++++++++++++++++--------
 README.md                     |  44 +++++++++--
 brain.md                      | 135 +++++++++++++++++++++++---------
 identity.md                   |  29 ++++---
 memory/active-projects.md     |  32 ++++----
 memory/constraints.md         |  38 +++++++--
 memory/persistent.md          |  38 +++++++--
 memory/recent-decisions.md    |  25 ++++--
 skills/infra-monitor/skill.md |  71 ++++++++++++++---
 9 files changed, 427 insertions(+), 128 deletions(-)

diff --git a/CLAUDE.md b/CLAUDE.md
index 727ee5b..038bf36 100644
--- a/CLAUDE.md
+++ b/CLAUDE.md
@@ -1,28 +1,84 @@
 # Agent OS — Project CLAUDE.md

 ## What This Project Is
-Personal Agentic Operating System. Tool-agnostic AI foundation for scheduled skills, monitoring, and automation.
-- Runtime: `/opt/agent-os/` on 172.27.40.3
-- Gitea: `git.nxm.co.za/admin/agent-os`
-- Edit clone (server): `/home/nxm/Documents/agent-os/` (clone pending)
+Personal Agentic Operating System for NxM / Nexum SA infrastructure. Tool-agnostic AI foundation for scheduled skills, monitoring, and automation. Plain markdown files — no databases, no vendor lock-in.
+
+- **Runtime:** `/opt/agent-os/` on 172.27.40.3
+- **Gitea:** `git.nxm.co.za/admin/agent-os` (SSH: `gitea-local:admin/agent-os.git`)
+- **Owner:** Jaco Bezuidenhout, Nexum SA (PTY) Ltd

 ## Current Phase
+
 | Phase | Status |
 |---|---|
-| 1 — NFS export + Kubuntu mount | ✓ DONE 2026-05-01 (NFS no longer needed — consolidated to server) |
-| 2 — Identity interview → identity.md populated | ✓ DONE 2026-05-01 |
-| **3 — infra-monitor skill** | **NEXT** |
+| 1 — NFS export + mount | DONE 2026-05-01 (NFS no longer needed — consolidated to server) |
+| 2 — Identity interview → identity.md | DONE 2026-05-01 |
+| 3 — infra-monitor skill | NEXT (spec at `skills/infra-monitor/skill.md`, needs update) |
 | 4 — Cron scheduling (hourly heartbeat + daily digest) | Pending Phase 3 |
-| 5 — Future skills (backup monitor, peer health, log digest) | Future |
+| 5 — Future skills (backup monitor, log digest) | Future |
+
+## Live Agent Ecosystem (as of 2026-06-19)
+
+All agents run as Docker containers on 172.27.40.3 unless noted. Every agent writes to `/opt/agent-os/logs//last-run.json`.
+
+### Always-on agents
+| Agent | Port | Stack Path | Role |
+|---|---|---|---|
+| citadel-mcp | 8300 | `/opt/stacks/citadel-mcp/` | MCP tool server (37 tools: Docker, Plane, TRMM, Directus, files, web search) |
+| raven-notify | 8400 | `/opt/stacks/raven-notify/` | Notification hub — Discord webhook + Gmail SMTP |
+| sam-research | 8500 | `/opt/stacks/sam-research/` | SearXNG + Ollama research agent |
+| qyburn-coder | 8700 | `/opt/stacks/qyburn-coder/` | LLM coding agent with approve/reject workflow |
+| maester-reports | 8800 | `/opt/stacks/maester-reports/` | NIST CSF compliance reports (⚠ restart = Anthropic API cost) |
+| jon-snow | 8900 | `/opt/stacks/jon-snow/` | Chief of staff orchestrator, HMAC approval gate |
+| hodor-gateway | 8200 | `/opt/stacks/hodor-gateway/` | Simple Ollama gateway (POST /ask) |
+| tarly-backup | 8750 | `/opt/stacks/tarly-backup/` | Backup monitoring — OPNsense configs + Proxmox |
+| hermes-cloud | 8643 | `/opt/stacks/hermes-cloud/` | Claude-sonnet brain, Citadel MCP wired |
+| hermes-native | VM 108 | native on 172.27.40.30 | Primary conversational agent — OpenRouter, Honcho memory, WhatsApp, dashboard at hermes.nxm.co.za:9119 |
+
+### Scheduled/one-shot agents
+| Agent | Schedule | Stack Path | Role |
+|---|---|---|---|
+| bran-changelog | Daily 06:00 | `/opt/stacks/bran-changelog/` | Git changelog generator |
+| varys-monitor | Every 15 min | `/opt/stacks/varys-monitor/` | HTTP reachability checks for all services |
+
+### Support agents (via Hermes Native, VM 108)
+| Agent | Role |
+|---|---|
+| vexis (workshop profile) | Nexum workshop agent — TRMM script execution on client devices |
+
+### Integrations running as cron (not standalone agents)
+| Job | Schedule | Script |
+|---|---|---|
+| ovpn-status.py | Every 1 min | `/opt/stacks/monitoring/ovpn-status.py` |
+| trmm-frappe-sync.py | Every 30 min | `/opt/stacks/monitoring/trmm-frappe-sync.py` |
+| zenarmor-pull.py | Daily 06:00 | `/opt/stacks/monitoring/zenarmor-pull.py` |
+| hub-backup.sh | Daily 02:05 | `/opt/stacks/tarly-backup/hub-backup.sh` |
+
+## Citadel MCP Tool Registry (37 tools)
+
+The central tool server that other agents call via MCP protocol:
+
+**File operations:** read_file, write_file, list_files, delete_file, propose_file_change
+**Docker:** docker_list_containers, docker_container_stats, docker_stack_list, docker_rebuild
+**Plane (project management):** plane_add_issue, plane_get_issues, plane_list_projects, plane_create_project, plane_create_page, plane_update_issue
+**TRMM (remote management):** trmm_list_agents, trmm_get_agent, trmm_list_scripts, trmm_add_script, trmm_delete_script, trmm_run_script, trmm_confirm_with_user, trmm_sync_now
+**Directus (CRM):** directus_list_clients, directus_get_client, directus_get_client_services, directus_get_renewals, directus_upcoming_renewals
+**Other:** get_agent_status, get_agent_output, list_agents, qyburn_task, qyburn_status, qyburn_approve, sam_research, web_search, proxmox_backup_status
+
+## Agent Web Pages
+
+Static HTML dashboards served at `agents.nxm.co.za//` from `/opt/sites/`:
+- agents-dashboard, bran, changelog, citadel, hermes, hermes-native, hodor, jon-snow, qyburn, raven, sam, security-review, setup, stock, swarm, tarly, varys, workflow-test

 ## Phase 3 — infra-monitor (NEXT)
-Skill scaffold at `skills/infra-monitor/skill.md`. Ready to implement after spec update.
+
+Skill scaffold at `skills/infra-monitor/skill.md`. **Spec is stale — needs update before building.**

 **Goal:** Docker container state + system resource checks. Complements Varys (HTTP reachability) — do not duplicate.

 **Before building:**
-- Update `skills/infra-monitor/skill.md` — container list is stale (has Flowise, missing Open WebUI + all new agents)
-- Correct Ollama URL: now `http://172.27.40.20:11434` (migrated from 172.27.6.139)
+- Update `skills/infra-monitor/skill.md` — container list is stale (references Flowise/Netbird, missing 20+ services)
+- Ollama URL is now `http://172.27.40.20:11434`
 - Decide: Docker one-shot container (consistent with bran/varys) or host cron + shell script?

 **Output targets:**
@@ -30,38 +86,59 @@ Skill scaffold at `skills/infra-monitor/skill.md`. Ready to implement after spec
 - `/opt/agent-os/logs/infra-monitor/last-run.json` — machine-readable, read by Varys watchdog
 - Raven alert on critical: `http://raven-notify:8400`

-**Schedule:** hourly heartbeat (Docker + Ollama only) + daily 07:00 full digest
-
 ## Directory Structure
 ```
 /opt/agent-os/
-├── CLAUDE.md ← this file (project brief, tracked in Gitea)
-├── identity.md ← populated Phase 2
-├── brain.md
+├── CLAUDE.md ← this file (project brief)
+├── README.md ← onboarding for new LLMs
+├── identity.md ← who the user is, hard limits
+├── brain.md ← all infra facts, IPs, services, decisions
 ├── memory/
-│   ├── active-projects.md ← update at end of each session
-│   ├── persistent.md
-│   ├── recent-decisions.md
-│   ├── constraints.md
-│   └── notes-from-last-run.md
-├── context/
+│   ├── active-projects.md ← what's in flight right now
+│   ├── persistent.md ← facts that never expire
+│   ├── recent-decisions.md ← decisions from last 30 days
+│   ├── constraints.md ← hard limits agents must respect
+│   └── notes-from-last-run.md ← cleared each session
+├── claude-code/
+│   └── memory/ ← Claude Code's persistent memory files (symlinked)
 ├── skills/
-│   └── infra-monitor/ ← Phase 3 target
-│       ├── skill.md ← spec (stale container list — update before building)
+│   └── infra-monitor/ ← Phase 3 target (not yet built)
+│       ├── skill.md
 │       ├── learnings.md
 │       ├── eval.json
 │       ├── last-output.md
 │       └── context/handoff.md
-└── logs/
+└── logs/ ← all agent log outputs
+    ├── bran-changelog/
+    ├── citadel-mcp/
+    ├── jon-snow/
+    ├── qyburn-coder/
+    ├── raven-notify/
+    ├── sam-research/
+    ├── tarly-backup/
+    ├── trmm-frappe-sync/
+    └── varys-monitor/
 ```

 ## Architecture
-- LLM inference: Kubuntu Ollama at `http://172.27.40.20:11434`
-- All agent output: `/opt/sites//` served at agents.nxm.co.za
-- Log standard: `/opt/agent-os/logs//last-run.json`
-- Notifications: Raven at `http://raven-notify:8400`
-## Pending — Gitea SSH Key (security debt)
-Server remote uses HTTP with embedded token. Before next token rotation:
-1. Add SSH key for `nxm@172.27.40.3` to Gitea (Admin → Settings → SSH Keys)
-2. `cd /opt/agent-os && git remote set-url origin gitea-local:admin/agent-os.git`
+
+- **LLM inference:** Ollama at `http://172.27.40.20:11434` (gemma4, llama3.1:8b, phi4) + Anthropic API (Claude Code, Hermes)
+- **Agent output pages:** `/opt/sites//` served at agents.nxm.co.za
+- **Log standard:** `/opt/agent-os/logs//last-run.json`
+- **Notifications:** Raven at `http://raven-notify:8400` (Discord + Gmail)
+- **Task tracking:** Plane at plane.nxm.co.za
+- **Client CRM:** Directus at directus.nxm.co.za
+- **Client devices:** Tactical RMM at 172.27.40.4 (45 agents, 13 clients)
+- **Helpdesk:** Frappe Helpdesk at helpdesk.nxm.co.za (VM 109)
+- **Credentials:** `~/.nxm-keys` (chmod 600) — ONLY place credential values live
+
+## Key Gotchas
+
+- **maester-reports restart = Anthropic API cost** — cache is in-memory only
+- **Open WebUI → Citadel MCP:** auth_type must be `none` (empty bearer key = silent failure)
+- **Docker → OPNsense API:** Docker proxy network can't reach 172.27.6.1 (HTTP 400) — run from host
+- **Headscale v0.28:** all write operations require numeric user ID, not username
+- **Vaultwarden:** requires HTTPS — use vault.nxm.co.za, not LAN IP
+- **Tailscale on Windows:** overrides DNS — disconnect when testing split DNS
+- **NPM forward scheme:** HTTP even for HTTPS external — NPM handles SSL termination
+- **NocoDB:** RvDM personal birthday DB only — never use for Nexum projects
diff --git a/README.md b/README.md
index 06986db..f90d1ca 100644
--- a/README.md
+++ b/README.md
@@ -10,14 +10,41 @@ Every agent interaction reads from and writes back to files in this repo. No dat

 | Layer | File(s) | Purpose |
 |---|---|---|
-| Identity | `identity.md` | Who you are, communication style, values |
+| Identity | `identity.md` | Who the user is, communication style, values, hard limits |
 | Context | `context/` | Dated, task-specific working files |
-| Brain | `brain.md` | Persistent facts — infra, people, decisions |
+| Brain | `brain.md` | Persistent facts — infra, services, IPs, standing decisions |
 | Memory | `memory/` | Short and long-term session notes |
 | Skills | `skills/` | Repeatable workflows, each self-improving |
 | Processes | `skills/*/context/handoff.md` | Output passed between chained skills |
 | Automation | cron on 172.27.40.3 | Scheduled skill execution |

+## Quick start for a new LLM
+
+If you are an LLM reading this repo for the first time:
+
+1. **Read `identity.md`** — who you're working for, hard limits, communication style
+2. **Read `brain.md`** — all infrastructure facts: IPs, services, ports, agents, standing decisions
+3. **Read `memory/active-projects.md`** — what's currently in flight
+4. **Read `memory/constraints.md`** — things you must never do
+5. **Read `CLAUDE.md`** — project status and architecture details
+
+Do NOT take any action without reading `identity.md` first. The hard limits there are non-negotiable.
+
+## Live agent ecosystem
+
+The NxM infrastructure runs 12+ named agents across Docker containers and VMs. Every agent writes logs to `/opt/agent-os/logs//last-run.json` and most publish web dashboards to `agents.nxm.co.za//`.
+
+Key agents:
+- **Citadel MCP** (port 8300) — central tool server, 37 tools covering Docker, Plane, TRMM, Directus, file ops, web search
+- **Raven** (port 8400) — notification hub (Discord + Gmail), all alerts route through here
+- **Jon Snow** (port 8900) — chief of staff orchestrator with approval gates
+- **Maester** (port 8800) — NIST CSF compliance reporting
+- **Hermes Native** (VM 108) — primary conversational agent with WhatsApp + Honcho memory
+- **Tarly** (port 8750) — backup monitoring (OPNsense configs + Proxmox)
+- **Vexis** (via Hermes, VM 108) — workshop/TRMM scripting agent for client devices
+
+See `brain.md` for the complete agent table with ports and schedules.
+
 ## Adding a new skill

 1. Create `skills//skill.md` — what the skill does and how
@@ -28,10 +55,11 @@ Every agent interaction reads from and writes back to files in this repo. No dat

 ## Runtime

-- Files live on server: `/opt/agent-os/` (cloned from this repo)
-- LLM inference: Ollama at `http://172.27.6.139:11434`
-- Scheduled jobs: cron on `172.27.40.3`
-- Local editing: `/home/nxm/Documents/agent-os/` on Kubuntu (this machine)
+- **Server:** `/opt/agent-os/` on 172.27.40.3 (Ubuntu, Docker host)
+- **Repo:** `git.nxm.co.za/admin/agent-os` (SSH: `gitea-local:admin/agent-os.git`)
+- **LLM inference:** Ollama at `http://172.27.40.20:11434` (local) or Anthropic API (Claude Code/Hermes)
+- **Scheduled jobs:** cron on 172.27.40.3
+- **Agent web pages:** `/opt/sites//` → agents.nxm.co.za

 ## Infra reference

@@ -39,3 +67,7 @@ Cross-repo links to supporting documentation:
 - [IP & Port Map](https://git.nxm.co.za/admin/nxm-infrastructure/src/branch/main/Quick%20Reference/IP%20%26%20Port%20Map.md)
 - [Docker Stacks](https://git.nxm.co.za/admin/nxm-infrastructure/src/branch/main/Quick%20Reference/Docker%20Stacks.md)
 - [Network Overview](https://git.nxm.co.za/admin/nxm-infrastructure/src/branch/main/Infrastructure/Network%20Overview.md)
+
+## Credential policy
+
+All API keys and passwords live in `~/.nxm-keys` (chmod 600). Never write credential values into code, config files, logs, or documentation. Reference the file location instead.
diff --git a/brain.md b/brain.md
index 03a1015..9a1ed76 100644
--- a/brain.md
+++ b/brain.md
@@ -1,64 +1,127 @@
 # Brain

-Core facts read by all skills. Keep under 1000 words. Update when infrastructure changes.
-Last updated: 2026-04-30
+Core facts read by all skills. Keep under 1500 words. Update when infrastructure changes.
+Last updated: 2026-06-19

 ---

 ## Infrastructure

-**Primary server:** 172.27.40.3 — Ubuntu Server LTS, Docker host
-**Kubuntu desktop:** 172.27.6.139 — NxM-AI, runs Ollama
-**TrueNAS NAS:** 172.27.40.220 (Servers40), management: 172.27.6.221
-**Firewall:** OPNsense at 172.27.6.1
+**Primary server:** 172.27.40.3 — Ubuntu Server LTS, Docker host, all agent runtimes
+**Ollama inference host:** 172.27.40.20 — Windows 11 Pro (NxM-AI), Vulkan GPU, Scheduled Task auto-start
+**TrueNAS NAS:** 172.27.40.220 (data) / 172.27.6.221 (mgmt) — 35.6 TB, NFS shares for ISOs + Proxmox backups
+**Firewall:** OPNsense at 172.27.6.1 (mgmt UI, not routed gateway)
+**Proxmox VE:** 172.27.40.2 — PVE 9.1.1, 2× Xeon Gold 6138 (80 vCPUs), 252 GB RAM
+**Hermes Native VM:** 172.27.40.30 (VM 108) — dedicated agent VM, Honcho memory, WhatsApp connected
+**Tactical RMM:** 172.27.40.4 (VM 101) — remote management for all Nexum clients
+**Home Assistant:** 172.27.10.6 (VM 100) — IoT automation
+**Synology DS423+:** 172.27.40.80 — Coetzee off-site backup NAS, Active Backup via S2S

 **VLANs:**
-| VLAN | Name | Subnet |
-|---|---|---|
-| 40 | Servers40 | 172.27.40.0/24 |
-| 20 | Workshop20 | 172.27.20.0/24 |
-| 10 | IoT10 | 172.27.10.0/24 |
+| VLAN | Name | Subnet | Gateway |
+|---|---|---|---|
+| 40 | Servers40 | 172.27.40.0/24 | 172.27.40.1 |
+| 20 | Workshop20 | 172.27.20.0/24 | 172.27.20.1 |
+| 10 | IoT10 | 172.27.10.0/24 | 172.27.10.1 |

 ## Key Services (172.27.40.3)

-| Service | Port | URL |
+| Service | Port | URL | Role |
+|---|---|---|---|
+| Portainer | 9443 | https://172.27.40.3:9443 | Docker management |
+| Nginx Proxy Manager | 80/81/443 | http://172.27.40.3:81 | Reverse proxy, SSL termination |
+| Uptime Kuma | 3002 | kuma.nxm.co.za | HTTP monitoring |
+| Gitea | 3000 | git.nxm.co.za | Self-hosted git, all docs + code |
+| Headscale | 8080 | headscale.nxm.co.za | VPN (self-hosted Tailscale) |
+| Vaultwarden | 8222 | vault.nxm.co.za | Password manager |
+| Open WebUI | 3010 | chat.nxm.co.za | Chat UI for Ollama + MCP |
+| Plane | 8095 | plane.nxm.co.za | Project/task tracking |
+| Homarr | 7575 | http://172.27.40.3:7575 | Dashboard |
+| Grafana | 3020 | grafana.nxm.co.za | Monitoring dashboards |
+| InfluxDB | 8086 | internal | Time-series DB for monitoring |
+| NetBox | 8100 | netbox.nxm.co.za | IPAM, network documentation |
+| NocoDB | 8150 | rvd.nxm.co.za | RvDM birthday DB (personal, NOT Nexum) |
+| InvenTree | 8160 | inventree.nxm.co.za | IT stock + BOM tracking (testing) |
+| Directus | 8850 | directus.nxm.co.za | Nexum client CRM |
+| Nextcloud | — | — | Phone backup |
+| Wetty | 8450/8451 | terminal.nxm.co.za / term.nxm.co.za | Web SSH terminal |
+| RustDesk | 21115-21119 | internal | Self-hosted remote desktop relay |
+| SearXNG | 8600 | internal | Search backend for sam + citadel |
+| iVentoy | 26000 | internal | PXE boot server |
+
+## AI / Agent Stack
+
+**LLM inference:**
+- **Ollama** on 172.27.40.20:11434 — models: gemma4, llama3.1:8b, phi4
+- **Claude Code** on 172.27.40.3 — primary AI assistant (Anthropic API)
+- **Hermes Native** on 172.27.40.30 — OpenRouter, Honcho memory, WhatsApp
+- **Hermes Cloud** on 172.27.40.3:8643 — claude-sonnet-4-6, Citadel MCP wired
+
+**Named agents (all Docker on 172.27.40.3 unless noted):**
+| Agent | Port | Role | Schedule |
+|---|---|---|---|
+| hodor-gateway | 8200 | Simple Ollama gateway (POST /ask) | On-demand |
+| citadel-mcp | 8300 | MCP SSE+HTTP server, 37 tools | Always-on |
+| raven-notify | 8400 | Discord + Gmail notifications | Always-on |
+| sam-research | 8500 | SearXNG + Ollama research | On-demand |
+| qyburn-coder | 8700 | LLM coding agent (approve/reject) | On-demand |
+| maester-reports | 8800 | NIST CSF compliance reports | On-demand |
+| jon-snow | 8900 | Chief of staff orchestrator | Always-on |
+| bran-changelog | — | Git changelog generator | Daily 06:00 |
+| varys-monitor | — | Service HTTP reachability checks | Cron every 15 min |
+| tarly-backup | 8750 | OPNsense config + Proxmox backup monitor | Daily 04:00 SAST |
+| hermes-cloud | 8643 | Claude-powered conversational agent | Always-on |
+| hermes-native | VM 108 | Primary Hermes agent (WhatsApp) | Always-on |
+| vexis (workshop) | VM 108 | Nexum workshop agent (TRMM scripts) | On-demand via Hermes |
+
+**Citadel MCP tools (37):** file ops, Docker management, Plane issues/projects/pages, TRMM (agents/scripts/confirm), Directus CRM, Proxmox backups, Qyburn task/approve, Sam research, web search, propose_file_change.
+
+## Cron Jobs (172.27.40.3)
+
+| Schedule | Job | Log |
 |---|---|---|
-| Portainer | 9443 | https://172.27.40.3:9443 |
-| Nginx Proxy Manager | 80/81/443 | http://172.27.40.3:81 |
-| Uptime Kuma | 3002 | http://172.27.40.3:3002 |
-| Gitea | 3000 | https://git.nxm.co.za |
-| Headscale | 8080 | https://headscale.nxm.co.za |
-| Netbird | 3479/udp | https://netbird.nxm.co.za |
-| Vaultwarden | 8222 | https://vault.nxm.co.za |
-| Flowise | 3010 | http://172.27.40.3:3010 |
-| Plane | 8095 | https://plane.nxm.co.za |
-| Zabbix | 8091 | https://zabbix.nxm.co.za |
-| Homarr | 7575 | http://172.27.40.3:7575 |
+| Daily 06:00 | bran-changelog/run.sh | logs/bran.log |
+| Daily 06:00 | zenarmor-pull.py | monitoring/logs/zenarmor-pull.log |
+| Daily 02:05 | tarly hub-backup.sh | logs/tarly-backup/hub-backup.log |
+| Every 1 min | ovpn-status.py | logs/ovpn-status.log |
+| Every 30 min | trmm-frappe-sync.py | logs/trmm-frappe-sync.log |

-## AI Stack
+## OpenVPN S2S Sites

-- **Ollama** on 172.27.6.139:11434 (bound to 0.0.0.0)
-- **Models:** gemma4, qwen2.5-coder:7b
-- **Flowise** on 172.27.40.3:3010 — visual agent/flow builder
-- **Claude Code** — primary AI assistant, runs on Kubuntu
+| Site | Tunnel IP | Status | Notes |
+|---|---|---|---|
+| bezhuis | 172.16.17.2 | COMPLETE | NAT + DNS overrides, LAN access live |
+| mwp | 172.16.17.3 | COMPLETE | Monitoring live |
+| coetzee | 172.16.17.4 | COMPLETE | Monitoring-only + Active Backup to Synology |
+| fwlaw | — | PENDING | Awaiting migration |

 ## Agent OS Runtime

 - Files: `/opt/agent-os/` on 172.27.40.3
-- Local edit path: `/home/nxm/Documents/agent-os/` on 172.27.6.139
-- Repo: `https://git.nxm.co.za/admin/agent-os`
+- Repo: `git.nxm.co.za/admin/agent-os` (SSH remote: `gitea-local:admin/agent-os.git`)
 - Scheduled jobs: cron on 172.27.40.3
-- LLM calls: `http://172.27.6.139:11434`
+- LLM calls: `http://172.27.40.20:11434` (Ollama) or Anthropic API (Claude Code / Hermes)
+- Agent web pages: `/opt/sites//` served at agents.nxm.co.za

 ## Key Paths on Server

 - Docker stacks: `/opt/stacks/`
 - Agent OS: `/opt/agent-os/`
+- Agent web pages: `/opt/sites/`
+- Credentials: `~/.nxm-keys` (chmod 600) — NEVER write values elsewhere
+- SSH keys: `~/.ssh/` (ED25519)
+- NxM infrastructure docs: `/home/nxm/Documents/NxM Linux Server/`
+- Nexum project docs: `/home/nxm/Documents/Nexum Projects/`

 ## Standing Decisions

-- TrueNAS will move to a dedicated server — avoid hardcoding 172.27.40.5 in automation
-- NPM handles all SSL termination — internal services use HTTP, NPM adds HTTPS
-- NFS preferred for Linux-to-Linux file sharing
-- Docker Compose only (no Kubernetes)
-- All destructive actions require explicit confirmation before execution
+- NPM handles all SSL termination — internal services use HTTP
+- Docker Compose only (no Kubernetes, no Swarm)
+- All destructive actions require explicit confirmation
+- Credentials only in `~/.nxm-keys` — never in output, logs, or config files
+- Netbird fully removed (2026-05-28) — VPN is Headscale + OpenVPN S2S
+- WireGuard fully removed (2026-05-30) — replaced by OpenVPN S2S
+- Open WebUI → Citadel MCP: auth_type must be `none` (empty bearer = silent failure)
+- Docker → OPNsense API: run from host, never from inside a container (HTTP 400)
+- NocoDB = RvDM personal only — never use for Nexum projects
+- Nexum client data layer = Directus CRM
diff --git a/identity.md b/identity.md
index 29bcf53..6d33d1a 100644
--- a/identity.md
+++ b/identity.md
@@ -1,6 +1,6 @@
 # Identity

-> **Status: COMPLETE** — Interview completed 2026-05-01.
+> **Status: COMPLETE** — Interview completed 2026-05-01, updated 2026-06-19.

 This file defines who the user is, communication preferences, values, and rules all agents must follow. Every skill reads this file before executing.

@@ -11,7 +11,9 @@ This file defines who the user is, communication preferences, values, and rules
 - **Name:** Jaco Bezuidenhout
 - **Company:** Nexum SA (PTY) Ltd — Mossel Bay, South Africa
 - **Role:** Business owner, IT admin, network engineer
-- **Primary focus:** Network monitoring for early problem detection; IT infrastructure management for clients
+- **Primary focus:** Network monitoring, NIST CSF compliance reporting, IT infrastructure management for clients
+- **Domain expertise:** VLANs, inter-VLAN routing, firewall rules (OPNsense), split DNS, VPN (Headscale/OpenVPN S2S), Docker Compose, Ubuntu Server admin, reverse proxy (NPM), IPAM (NetBox), monitoring (Grafana/Uptime Kuma/InfluxDB)
+- **Not expert in:** Kubernetes, cloud platforms (AWS/Azure/GCP), advanced Python (learning), application development

 ---

@@ -19,9 +21,10 @@ This file defines who the user is, communication preferences, values, and rules

 Priority order:
 1. **Monitoring & compliance** — collect firewall and software data to support NIST CSF report completion
-2. **Coding** — scripting, automation, tooling
-3. **Summarising** — distil logs, changelogs, reports into concise output
-4. **General automation** — recurring tasks, scheduled jobs
+2. **Client management** — TRMM remote management, Directus CRM, Frappe Helpdesk ticketing
+3. **Coding** — scripting, automation, tooling
+4. **Summarising** — distil logs, changelogs, reports into concise output
+5. **General automation** — recurring tasks, scheduled jobs, backups

 ---

@@ -48,7 +51,7 @@ Priority order:
 - Send any external message (email, webhook, notification)
 - Push to git or any remote repository
 - Drop, reset, or modify databases
-- **Never use a cloud-hosted LLM** (OpenAI, Anthropic API, Google, etc.) unless explicitly instructed. All inference stays on local Ollama (172.27.6.139:11434).
+- Expose any service publicly without confirming NPM + Cloudflare + firewall implications

 ---

@@ -56,13 +59,17 @@ Priority order:

 - Depends on the task — choose the format that fits the output type.
 - **Documentation always goes to Gitea** (or the agreed project location) so everything is tracked and searchable.
-- **Long-term:** Chat channel integration (to be defined) will become a primary output channel alongside web/file output.
+- **Notifications route through Raven** (Discord + Gmail) at `http://raven-notify:8400`
+- **Agent web output** goes to `/opt/sites//` served at agents.nxm.co.za

 ---

 ## Infrastructure Context

-- Local LLM: Ollama at `http://172.27.6.139:11434` (gemma4, qwen2.5-coder:7b)
-- Server: Ubuntu at `172.27.40.3` — Docker host, all agent runtimes
-- Git: Gitea at `https://git.nxm.co.za` — all code and docs live here
-- Agent OS runtime: `/opt/agent-os/` on 172.27.40.3, mounted at `/mnt/agent-os` on Kubuntu
+- **Ollama:** `http://172.27.40.20:11434` — Windows 11 Pro (NxM-AI), models: gemma4, llama3.1:8b, phi4
+- **Server:** Ubuntu at `172.27.40.3` — Docker host, all agent runtimes
+- **Hermes Native:** VM 108 at `172.27.40.30` — OpenRouter LLM, Honcho memory, WhatsApp connected
+- **Git:** Gitea at `https://git.nxm.co.za` — all code and docs
+- **Agent OS runtime:** `/opt/agent-os/` on 172.27.40.3
+- **Credentials:** `~/.nxm-keys` (chmod 600) — API keys for NPM, OPNsense, Proxmox, TrueNAS, Plane, Gitea, NetBox
+- **Claude Code:** installed on 172.27.40.3, primary AI assistant
diff --git a/memory/active-projects.md b/memory/active-projects.md
index 3bfa314..1adcee9 100644
--- a/memory/active-projects.md
+++ b/memory/active-projects.md
@@ -1,7 +1,7 @@
 # Active Projects

 Current in-flight work. Update at the end of each session.
-Last updated: 2026-05-16
+Last updated: 2026-06-19

 ---

@@ -12,8 +12,8 @@ Phases 1 (NFS + mount) and 2 (identity interview) are complete.
 **Phase 3 goal:** Docker container state monitoring + system resources. Complements Varys (HTTP reachability) — do not duplicate.

 Pre-work before implementing:
-- [ ] Update `skills/infra-monitor/skill.md` — container list is stale (has Flowise, missing Open WebUI + all new agents: citadel, varys, bran, sam, raven, qyburn, hodor, searxng, monitoring, bni-scheduler, nocodb)
-- [ ] Correct Ollama URL in skill.md: now `http://172.27.40.20:11434` (moved from 172.27.6.139)
+- [ ] Update `skills/infra-monitor/skill.md` — container list is stale (references Flowise/Netbird, missing 20+ current services)
+- [ ] Correct Ollama URL in skill.md: now `http://172.27.40.20:11434` (moved from 172.27.6.139 → 172.27.40.20)
 - [ ] Decide implementation: Docker one-shot container (consistent with bran/varys pattern) vs host cron + shell script

 Implementation tasks:
@@ -26,23 +26,27 @@ Implementation tasks:
 - [ ] Hourly heartbeat cron on 172.27.40.3
 - [ ] Daily 07:00 full digest cron
 - [ ] Notification channel: Raven (confirmed live at http://raven-notify:8400)
-- [ ] Home Assistant integration (172.27.10.6) — optional, revisit after Phase 3

 ## Agent OS — Phase 5: Future Skills (Future)
-- backup-monitor: TrueNAS migrated to new hardware (172.27.40.220) — skill ready to build
-- Netbird/Headscale peer health: Netbird API at http://172.22.0.11:80/api/
+- backup-monitor: extend Tarly with deeper TrueNAS integration
 - Daily log digest: summarise /opt/agent-os/logs/ via Ollama

 ---

-## Gitea Documentation Repos
-- [x] nxm-infrastructure repo — Obsidian vault imported, CLAUDE.md added 2026-05-16
-- [x] nexum-projects repo — Obsidian vault imported (on Kubuntu)
-- [x] agent-os repo — scaffolding created, CLAUDE.md is global symlink
+## Active Infrastructure Projects
+
+| Project | Status | Next Step |
+|---|---|---|
+| **Monitoring** | bezhuis+mwp+coetzee alerts live | CPU/mem/WAN/ping Grafana rules pending |
+| **OpenVPN S2S** | bezhuis/mwp/coetzee DONE | fwlaw pending |
+| **Tarly Backup** | Hub working | bezhuis/mwp/coetzee API key fix (backup privilege) |
+| **Directus CRM** | LIVE, 12 clients seeded | Manual data enrichment (contacts, renewals) |
+| **InvenTree** | LIVE (testing) | SSL cert, production use |
+| **Mailcow** | MAIL-1+2 done | Blocked on Mimecast (MAIL-3→9) |
+| **Vexis** | nexum-private-customer-setup + office-install done | ESET/Evolve creds or standard-setup next |
+| **Maester Phase 2** | Phase 1 live | Hermes narrative + .docx generation |

 ---

-## Pending: Gitea SSH Key (security debt)
-Server remote uses HTTP with embedded token. Before rotating:
-1. Add SSH key for `nxm@172.27.40.3` to Gitea (Admin → Settings → SSH Keys)
-2. `cd /opt/agent-os && git remote set-url origin gitea-local:admin/agent-os.git`
+## Gitea SSH Key — DONE
+Server remote switched from HTTP+token to SSH (`gitea-local:admin/agent-os.git`) on 2026-06-19.
diff --git a/memory/constraints.md b/memory/constraints.md
index 7b590c7..c15312d 100644
--- a/memory/constraints.md
+++ b/memory/constraints.md
@@ -1,13 +1,41 @@
 # Constraints

 Hard limits agents must respect. Never work around these without explicit user confirmation.
-Last updated: 2026-04-30
+Last updated: 2026-06-19

 ---

-- Never take destructive or irreversible action without explicit confirmation (delete, overwrite, drop, reset, force push)
-- Never store credentials in output files, logs, or generated markdown — reference their location instead
-- Never skip git hooks or bypass signing
-- TrueNAS is on new hardware — use 172.27.40.220 (Servers40) for services, 172.27.6.221 for management/API
+## Destructive actions
+- Never delete or overwrite files without explicit confirmation
+- Never restart or stop services without explicit confirmation
+- Never drop, reset, or modify databases without explicit confirmation
+- Never force push to git or bypass hooks
+- Never run `pfctl` commands on OPNsense (risk of locking out remote access)
+
+## Credentials
+- All credentials live in `~/.nxm-keys` (chmod 600) — ONLY location
+- Never store credentials in output files, logs, generated markdown, .env files, or code
+- Reference the file location, never the values
+- TrueNAS IPs: 172.27.40.220 (Servers40 data) / 172.27.6.221 (management/API)
+
+## Infrastructure
 - Linux server (172.27.40.3) has no GPU — never schedule LLM inference to run locally there
+- Ollama runs on 172.27.40.20 (Windows 11 Pro) — not on the Docker host
 - Docker Compose only — no Kubernetes, no Swarm
+- Docker proxy network (172.22.0.0/16) cannot reach OPNsense API at 172.27.6.1 — always run OPNsense API scripts from the host
+- NPM handles SSL termination — internal services always use HTTP
+
+## Agent-specific
+- **maester-reports:** restart clears in-memory cache → re-parses all evidence PDFs via Claude Opus vision (Anthropic API cost). Avoid unnecessary restarts.
+- **NocoDB:** RvDM personal birthday DB ONLY — never suggest for any Nexum project. Nexum data layer = Directus.
+- **Open WebUI → Citadel MCP:** auth_type must be `none`. Empty bearer key generates illegal header → silent connection failure.
+- **Qyburn task specs:** never embed code in the description field — use plain English only (14b models explain code instead of writing it)
+
+## External communication
+- Never send any external message (email, webhook, Discord notification) without explicit confirmation
+- Notifications always route through Raven (http://raven-notify:8400)
+- Never expose services publicly without confirming NPM + Cloudflare + firewall implications
+
+## Naming
+- S2S = always suggest Site-to-Site VPN (not Road Warrior) for permanent infrastructure endpoints
+- Use `.50+` IP range for non-firewall infrastructure devices on S2S tunnels
diff --git a/memory/persistent.md b/memory/persistent.md
index e4c5270..f822521 100644
--- a/memory/persistent.md
+++ b/memory/persistent.md
@@ -1,18 +1,44 @@
 # Persistent Memory

 Facts that don't expire. If you'd have to re-explain it to a new agent every time, it belongs here.
-Last updated: 2026-04-30
+Last updated: 2026-06-19

 ---

 ## Infrastructure decisions
 - RustDesk is self-hosted on 172.27.40.3 — clients connect to local server not public relay
-- Netbird signal+management both route through NPM on port 443 — exposedAddress in /opt/stacks/netbird/config.yaml must be https://netbird.nxm.co.za:443 (caddy-netbird on :8443 exists but is not used externally)
+- NPM handles all SSL termination — internal services use HTTP, NPM adds HTTPS
 - Headscale v0.28: all write operations require numeric user ID, not username
 - Tailscale on Windows overrides DNS — disconnect before testing split DNS changes
-- Servers running Tailscale must run `sudo tailscale set --accept-dns=false` before joining Netbird
+- Docker Compose only — no Kubernetes, no Swarm
+- Docker → OPNsense API: HTTP 400 from Docker proxy network — always run OPNsense API scripts from the host
+- All internal subdomains: gray-cloud CNAME → opnsense.nxm.co.za in Cloudflare. Proxied = 523 error.
+- OPNsense split DNS: all subdomains resolve to 172.27.40.3 internally via Unbound host overrides
+
+## Decommissioned services (do not reference)
+- **Netbird:** Fully removed from server 2026-05-28. Orphaned clients on mwp/coetzee/b0qxxx/fwlaw firewalls pending removal.
+- **WireGuard (N2W):** Fully removed 2026-05-30. Replaced by OpenVPN S2S.
+- **Flowise:** Replaced by Open WebUI 2026-05-01.
+- **Zabbix:** No longer running (monitoring moved to Grafana + InfluxDB + Telegraf).

 ## Agent OS build state
-- Phase 1-2 (file structure + NFS + identity interview): not yet started
-- First skill to build: infra-monitor (Docker health + agent watchdog)
-- Notifications target: Home Assistant at 172.27.10.6
+- Phase 1-2 complete (file structure + identity interview)
+- Phase 3 (infra-monitor skill): spec written but stale, not yet implemented
+- Notifications target: Raven at http://raven-notify:8400 (Discord + Gmail)
+- All agent logs write to `/opt/agent-os/logs//last-run.json`
+
+## Credential policy
+- All API keys and passwords: `~/.nxm-keys` (chmod 600)
+- Never write credential values into output, logs, docs, or config files
+- Reference credential location instead
+
+## VPN topology
+- **Headscale** (self-hosted Tailscale): remote access for admin devices
+- **OpenVPN S2S:** site-to-site for client firewalls (bezhuis/mwp/coetzee done, fwlaw pending)
+- Hub tunnel IPs: bezhuis=172.16.17.2, mwp=172.16.17.3, coetzee=172.16.17.4
+
+## Ollama
+- Host: 172.27.40.20 (Windows 11 Pro, NxM-AI), Vulkan GPU
+- Models: gemma4, llama3.1:8b, phi4
+- Auto-starts via Scheduled Task (S4U + AtStartup)
+- Used by: hodor-gateway, sam-research, qyburn-coder, Open WebUI
diff --git a/memory/recent-decisions.md b/memory/recent-decisions.md
index 34091c3..420efb2 100644
--- a/memory/recent-decisions.md
+++ b/memory/recent-decisions.md
@@ -1,11 +1,24 @@
 # Recent Decisions

-Decisions made in the last 30 days that affect current work. Archive when no longer relevant.
-Last updated: 2026-04-30
+Decisions made in the last 60 days that affect current work. Archive when no longer relevant.
+Last updated: 2026-06-19

 ---

-- **2026-04-30:** Chose Gitea (self-hosted git) over Obsidian for documentation — AI-writable, browser-accessible, version controlled
-- **2026-04-30:** Agent OS files to live on 172.27.40.3 at /opt/agent-os/, accessed from Kubuntu via NFS
-- **2026-04-29:** Chose Syncthing-free approach for Obsidian migration — NFS for Linux, SMB for Windows
-- **2026-04-29:** infra-monitor will be first Agent OS skill — covers Docker health and agent watchdog in one skill
+- **2026-06-19:** Agent OS git remote switched from HTTP+token to SSH (gitea-local:admin/agent-os.git) — security debt resolved
+- **2026-06-19:** Comprehensive Agent OS documentation update — brain.md, identity.md, all memory files brought current for LLM onboarding
+- **2026-06-18:** Coetzee OpenVPN S2S complete — monitoring-only + hub-side NAT for Active Backup to Synology DS423+
+- **2026-06-18:** Tarly backup service live — OPNsense config backups to TrueNAS NFS, Proxmox monitoring
+- **2026-06-17:** Directus CRM live — 6 collections, 12 clients seeded from TRMM, 5 Citadel MCP tools
+- **2026-06-17:** MWP Netbird fully removed, WireGuard spoke cleaned
+- **2026-06-12:** NxM-AI (Kubuntu) migrated to Windows 11 Pro — same IP 172.27.40.20, Ollama via Scheduled Task
+- **2026-06-12:** Vexis office-install/uninstall scripts live-tested, windows-update scripts done
+- **2026-06-11:** Workshop20 → Servers40 firewall rules (1677-1680) for TRMM + Vexis access
+- **2026-06-10:** Frappe Helpdesk live — TRMM→HD sync, Citadel tools, Vexis wired
+- **2026-06-10:** trmm_confirm_with_user proven working (incl. response-parsing bug fix)
+- **2026-05-30:** WireGuard fully removed, replaced by OpenVPN S2S
+- **2026-05-29:** Maester reports Phase 1 live — 8 automated CSF controls, Grafana dashboard
+- **2026-05-28:** Netbird fully removed from server
+- **2026-05-28:** ZenArmor → Grafana pipeline all 3 phases complete
+- **2026-05-27:** Jon Snow Phase 3 complete — approval gate, Discord approve/reject
+- **2026-04-30:** Agent OS architecture: plain markdown files at /opt/agent-os/, Gitea-tracked, cron-scheduled
diff --git a/skills/infra-monitor/skill.md b/skills/infra-monitor/skill.md
index 042ae15..8760355 100644
--- a/skills/infra-monitor/skill.md
+++ b/skills/infra-monitor/skill.md
@@ -15,35 +15,84 @@ Reads before executing:
 ### Docker health (on 172.27.40.3)
 - All expected containers are running (not exited/restarting)
 - Flag any container that has restarted more than 3 times in the last hour
-- Expected containers: portainer, nginx-proxy-manager, uptime-kuma, gitea, headscale, netbird, vaultwarden, flowise, plane, zabbix, homarr
+- Expected containers (grouped by criticality):
+
+**Critical (alert immediately if down):**
+- nginx-proxy-manager (reverse proxy — everything depends on this)
+- gitea (all code + docs)
+- citadel-mcp (central tool server)
+- raven-notify (notification hub)
+- open-webui (chat UI)
+- vaultwarden (password manager)
+
+**Important (alert after 15 min down):**
+- headscale (VPN)
+- grafana (monitoring dashboards)
+- influxdb (time-series data)
+- portainer (Docker management)
+- uptime-kuma (HTTP monitoring)
+- maester-reports (CSF compliance)
+- jon-snow (orchestrator)
+- tarly-backup (backup monitoring)
+- directus + directus-db + directus-redis (CRM)
+
+**Normal (report in daily digest only):**
+- hodor-gateway, sam-research, qyburn-coder, searxng
+- homarr, headplane, headscale-ui
+- plane-* (all Plane containers)
+- netbox-* (all NetBox containers)
+- nocodb, bni-scheduler, inventree-*, wetty, term-dash
+- rustdesk-hbbs, rustdesk-hbbr
+- iventoy, agent-sites

 ### Service reachability
 Lightweight HTTP check (curl, timeout 5s) on each internal URL:
 - http://172.27.40.3:9443 (Portainer)
 - http://172.27.40.3:3002 (Uptime Kuma)
 - http://172.27.40.3:3000 (Gitea)
-- http://172.27.40.3:3010 (Flowise)
+- http://172.27.40.3:3010 (Open WebUI)
 - http://172.27.40.3:7575 (Homarr)
-- http://172.27.6.139:11434 (Ollama)
+- http://172.27.40.3:8300 (Citadel MCP)
+- http://172.27.40.3:8400 (Raven)
+- http://172.27.40.3:8800 (Maester)
+- http://172.27.40.3:8900 (Jon Snow)
+- http://172.27.40.3:3020 (Grafana)
+- http://172.27.40.3:8100 (NetBox)
+- http://172.27.40.3:8850 (Directus)
+- http://172.27.40.20:11434 (Ollama on NxM-AI)

 ### Agent watchdog
-For each skill directory under `../../skills/`:
-- Check `last-output.md` modification time — flag if older than expected schedule
-- Check `../../logs//` for ERROR entries in last run
-- Report: healthy / stale / erroring
+For each agent log at `../../logs//last-run.json`:
+- Check modification time — flag if older than expected schedule
+- Check `status` field — flag if not "success"
+- Expected agents and max staleness:
+  - bran-changelog: 25 hours (daily)
+  - varys-monitor: 20 minutes (every 15 min)
+  - trmm-frappe-sync: 35 minutes (every 30 min)
+  - tarly-backup: 25 hours (daily)
+  - raven-notify: 25 hours (event-driven, check status only)
+  - citadel-mcp, sam-research, qyburn-coder, jon-snow: check status only (on-demand)

 ### System resources (on 172.27.40.3)
 - Disk usage on / — warn if >80%, critical if >90%
 - Memory usage — flag if >85%
+- Docker disk usage (`docker system df`) — warn if reclaimable > 10GB
+
+### Remote hosts (optional, best-effort)
+- Ping 172.27.40.20 (Ollama host)
+- Ping 172.27.40.30 (Hermes Native VM)
+- Ping 172.27.40.2 (Proxmox)

 ## Output

 Write a digest to `last-output.md` in this format:
 - Summary line: X healthy, Y warnings, Z critical
-- Section per category: Docker, Services, Agent Watchdog, System
+- Section per category: Docker, Services, Agent Watchdog, System, Remote Hosts
 - Each item: ✓ OK / ⚠ Warning / ✗ Critical + one line detail

-Pass anomalies to `context/handoff.md` for notification skill (future).
+Also write machine-readable output to `../../logs/infra-monitor/last-run.json`.
+
+Pass anomalies to `context/handoff.md` for Raven notification.

 ## Wrap-up

@@ -54,5 +103,5 @@ After writing output:

 ## Schedule

-- **Heartbeat:** every hour — checks Docker + Ollama only (fast, <30s)
-- **Full digest:** daily at 07:00 — all checks
+- **Heartbeat:** every hour — checks Docker + Ollama + critical services only (fast, <30s)
+- **Full digest:** daily at 07:00 — all checks including remote hosts and disk usage
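
The agent watchdog rules in the infra-monitor spec above (modification time checked against a per-agent maximum staleness, plus a `status` field that must equal "success") could be sketched roughly as follows. This is a minimal illustration, not the skill's actual implementation: the function name `check_agent`, the `MAX_STALENESS` table, the extra "missing" label, and the choice to report a bad status ahead of staleness are all assumptions. The healthy / stale / erroring labels are borrowed from the earlier wording of the spec.

```python
import json
import os
import time

# Max staleness in seconds for the scheduled agents named in the spec.
# Hypothetical table: agents not listed here would be checked on status only.
MAX_STALENESS = {
    "bran-changelog": 25 * 3600,
    "varys-monitor": 20 * 60,
    "trmm-frappe-sync": 35 * 60,
    "tarly-backup": 25 * 3600,
}


def check_agent(log_path, max_age_seconds=None, now=None):
    """Classify one last-run.json as 'healthy', 'stale', 'erroring' or 'missing'."""
    now = time.time() if now is None else now
    try:
        mtime = os.path.getmtime(log_path)
        with open(log_path, encoding="utf-8") as fh:
            data = json.load(fh)
    except FileNotFoundError:
        return "missing"
    except (OSError, json.JSONDecodeError):
        # Unreadable or malformed log: treat as an error, not as stale.
        return "erroring"

    # Assumption: the log is a JSON object with a top-level "status" key.
    if not isinstance(data, dict) or data.get("status") != "success":
        return "erroring"

    # Staleness applies only when a limit is given (scheduled agents).
    if max_age_seconds is not None and now - mtime > max_age_seconds:
        return "stale"
    return "healthy"
```

A caller might loop over `MAX_STALENESS` and pass each limit together with that agent's `last-run.json` path. The one-directory-per-agent layout under `logs/` is inferred from the `logs/infra-monitor/last-run.json` path in the spec rather than stated outright.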