docs: comprehensive update — bring all Agent OS docs current for LLM onboarding
All files were 5-7 weeks stale.

Updated:
- brain.md (complete service/agent/VPN/cron inventory)
- identity.md (current expertise + infra context)
- CLAUDE.md (full agent ecosystem table, Citadel tool registry, gotchas)
- README.md (LLM quick-start guide)
- all memory files (current projects, decisions, constraints, persistent facts)
- infra-monitor skill.md (current container list with criticality tiers)

Also fixed:
- git remote switched from HTTP+embedded-token to SSH
- removed references to decommissioned services (Netbird, WireGuard, Flowise, Zabbix)
- corrected Ollama IP (172.27.40.20) and TrueNAS IP (172.27.40.220)
- added 20+ services/agents that were built since the last commit

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
@@ -1,28 +1,84 @@
|
|||||||
# Agent OS — Project CLAUDE.md
|
# Agent OS — Project CLAUDE.md
|
||||||
|
|
||||||
## What This Project Is
|
## What This Project Is
|
||||||
Personal Agentic Operating System. Tool-agnostic AI foundation for scheduled skills, monitoring, and automation.
|
Personal Agentic Operating System for NxM / Nexum SA infrastructure. Tool-agnostic AI foundation for scheduled skills, monitoring, and automation. Plain markdown files — no databases, no vendor lock-in.
|
||||||
- Runtime: `/opt/agent-os/` on 172.27.40.3
|
|
||||||
- Gitea: `git.nxm.co.za/admin/agent-os`
|
- **Runtime:** `/opt/agent-os/` on 172.27.40.3
|
||||||
- Edit clone (server): `/home/nxm/Documents/agent-os/` (clone pending)
|
- **Gitea:** `git.nxm.co.za/admin/agent-os` (SSH: `gitea-local:admin/agent-os.git`)
|
||||||
|
- **Owner:** Jaco Bezuidenhout, Nexum SA (PTY) Ltd
|
||||||
|
|
||||||
## Current Phase
|
## Current Phase
|
||||||
|
|
||||||
| Phase | Status |
|
| Phase | Status |
|
||||||
|---|---|
|
|---|---|
|
||||||
| 1 — NFS export + Kubuntu mount | ✓ DONE 2026-05-01 (NFS no longer needed — consolidated to server) |
|
| 1 — NFS export + mount | DONE 2026-05-01 (NFS no longer needed — consolidated to server) |
|
||||||
| 2 — Identity interview → identity.md populated | ✓ DONE 2026-05-01 |
|
| 2 — Identity interview → identity.md | DONE 2026-05-01 |
|
||||||
| **3 — infra-monitor skill** | **NEXT** |
|
| 3 — infra-monitor skill | NEXT (spec at `skills/infra-monitor/skill.md`, needs update) |
|
||||||
| 4 — Cron scheduling (hourly heartbeat + daily digest) | Pending Phase 3 |
|
| 4 — Cron scheduling (hourly heartbeat + daily digest) | Pending Phase 3 |
|
||||||
| 5 — Future skills (backup monitor, peer health, log digest) | Future |
|
| 5 — Future skills (backup monitor, log digest) | Future |
|
||||||
|
|
||||||
|
## Live Agent Ecosystem (as of 2026-06-19)
|
||||||
|
|
||||||
|
All agents run as Docker containers on 172.27.40.3 unless noted. Every agent writes to `/opt/agent-os/logs/<agent>/last-run.json`.
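Because every agent writes to the same `last-run.json` location, a single pass over the log directory can show which agents have gone quiet. The following is a minimal sketch under stated assumptions: the function name `scan_last_runs`, the one-hour threshold, and the use of file modification time as the freshness signal are all illustrative, since the schema of `last-run.json` is not described here.

```python
import json
import time
from pathlib import Path

# Log root taken from the paragraph above.
LOG_ROOT = Path("/opt/agent-os/logs")


def scan_last_runs(log_root=LOG_ROOT, max_age_seconds=3600, now=None):
    """Return {agent: "fresh" | "stale" | "unreadable"} for each agent log.

    Freshness is judged by file mtime only (an assumption); the JSON is
    parsed just to confirm the file is well formed.
    """
    now = time.time() if now is None else now
    report = {}
    for path in sorted(Path(log_root).glob("*/last-run.json")):
        agent = path.parent.name
        try:
            json.loads(path.read_text(encoding="utf-8"))
        except (OSError, json.JSONDecodeError):
            report[agent] = "unreadable"
            continue
        age = now - path.stat().st_mtime
        report[agent] = "stale" if age > max_age_seconds else "fresh"
    return report
```

Calling `scan_last_runs()` on the server would return one entry per agent directory. A per-agent threshold would probably be needed in practice, since a daily job and a 15-minute job cannot share one staleness limit.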
|
||||||
|
|
||||||
|
### Always-on agents
|
||||||
|
| Agent | Port | Stack Path | Role |
|
||||||
|
|---|---|---|---|
|
||||||
|
| citadel-mcp | 8300 | `/opt/stacks/citadel-mcp/` | MCP tool server (37 tools: Docker, Plane, TRMM, Directus, files, web search) |
|
||||||
|
| raven-notify | 8400 | `/opt/stacks/raven-notify/` | Notification hub — Discord webhook + Gmail SMTP |
|
||||||
|
| sam-research | 8500 | `/opt/stacks/sam-research/` | SearXNG + Ollama research agent |
|
||||||
|
| qyburn-coder | 8700 | `/opt/stacks/qyburn-coder/` | LLM coding agent with approve/reject workflow |
|
||||||
|
| maester-reports | 8800 | `/opt/stacks/maester-reports/` | NIST CSF compliance reports (⚠ restart = Anthropic API cost) |
|
||||||
|
| jon-snow | 8900 | `/opt/stacks/jon-snow/` | Chief of staff orchestrator, HMAC approval gate |
|
||||||
|
| hodor-gateway | 8200 | `/opt/stacks/hodor-gateway/` | Simple Ollama gateway (POST /ask) |
|
||||||
|
| tarly-backup | 8750 | `/opt/stacks/tarly-backup/` | Backup monitoring — OPNsense configs + Proxmox |
|
||||||
|
| hermes-cloud | 8643 | `/opt/stacks/hermes-cloud/` | Claude-sonnet brain, Citadel MCP wired |
|
||||||
|
| hermes-native | VM 108 | native on 172.27.40.30 | Primary conversational agent — OpenRouter, Honcho memory, WhatsApp, dashboard at hermes.nxm.co.za:9119 |
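The jon-snow row above mentions an HMAC approval gate without describing it. As a rough illustration of the general technique only, and not of jon-snow's actual implementation, the sketch below signs an approval decision with a shared secret and verifies it in constant time. The message layout (`action_id:decision`), the function names, and the choice of SHA-256 are assumptions.

```python
import hashlib
import hmac


def sign_approval(secret: bytes, action_id: str, decision: str) -> str:
    """Return a hex HMAC-SHA256 over a hypothetical 'action_id:decision' message."""
    message = f"{action_id}:{decision}".encode("utf-8")
    return hmac.new(secret, message, hashlib.sha256).hexdigest()


def verify_approval(secret: bytes, action_id: str, decision: str, signature: str) -> bool:
    """Recompute the signature and compare in constant time."""
    expected = sign_approval(secret, action_id, decision)
    return hmac.compare_digest(expected, signature)
```

The point of such a gate is that a signature issued for one action and decision does not verify for any other, so an "approve" cannot be replayed as approval of a different task.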
|
||||||
|
|
||||||
|
### Scheduled/one-shot agents
|
||||||
|
| Agent | Schedule | Stack Path | Role |
|
||||||
|
|---|---|---|---|
|
||||||
|
| bran-changelog | Daily 06:00 | `/opt/stacks/bran-changelog/` | Git changelog generator |
|
||||||
|
| varys-monitor | Every 15 min | `/opt/stacks/varys-monitor/` | HTTP reachability checks for all services |
|
||||||
|
|
||||||
|
### Support agents (via Hermes Native, VM 108)
|
||||||
|
| Agent | Role |
|
||||||
|
|---|---|
|
||||||
|
| vexis (workshop profile) | Nexum workshop agent — TRMM script execution on client devices |
|
||||||
|
|
||||||
|
### Integrations running as cron (not standalone agents)
|
||||||
|
| Job | Schedule | Script |
|
||||||
|
|---|---|---|
|
||||||
|
| ovpn-status.py | Every 1 min | `/opt/stacks/monitoring/ovpn-status.py` |
|
||||||
|
| trmm-frappe-sync.py | Every 30 min | `/opt/stacks/monitoring/trmm-frappe-sync.py` |
|
||||||
|
| zenarmor-pull.py | Daily 06:00 | `/opt/stacks/monitoring/zenarmor-pull.py` |
|
||||||
|
| hub-backup.sh | Daily 02:05 | `/opt/stacks/tarly-backup/hub-backup.sh` |
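The schedules in the table above translate directly into five-field cron expressions. The sketch below is illustrative only: the actual crontab on 172.27.40.3 is not shown in this document, and the interpreter path `/usr/bin/python3` and the helper name `render_crontab` are assumptions.

```python
# Hypothetical mapping of the table's schedules to cron expressions.
# Script paths come from the table; the expressions are an interpretation.
JOBS = [
    ("* * * * *", "/opt/stacks/monitoring/ovpn-status.py"),         # every 1 min
    ("*/30 * * * *", "/opt/stacks/monitoring/trmm-frappe-sync.py"),  # every 30 min
    ("0 6 * * *", "/opt/stacks/monitoring/zenarmor-pull.py"),        # daily 06:00
    ("5 2 * * *", "/opt/stacks/tarly-backup/hub-backup.sh"),         # daily 02:05
]


def render_crontab(jobs, python="/usr/bin/python3"):
    """Render crontab lines, prefixing .py scripts with an interpreter."""
    lines = []
    for schedule, script in jobs:
        command = f"{python} {script}" if script.endswith(".py") else script
        lines.append(f"{schedule} {command}")
    return "\n".join(lines)
```

`print(render_crontab(JOBS))` produces text that could be compared against `crontab -l` on the server to spot drift between the documentation and the live schedule.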
|
||||||
|
|
||||||
|
## Citadel MCP Tool Registry (37 tools)
|
||||||
|
|
||||||
|
The central tool server that other agents call via MCP protocol:
|
||||||
|
|
||||||
|
**File operations:** read_file, write_file, list_files, delete_file, propose_file_change
|
||||||
|
**Docker:** docker_list_containers, docker_container_stats, docker_stack_list, docker_rebuild
|
||||||
|
**Plane (project management):** plane_add_issue, plane_get_issues, plane_list_projects, plane_create_project, plane_create_page, plane_update_issue
|
||||||
|
**TRMM (remote management):** trmm_list_agents, trmm_get_agent, trmm_list_scripts, trmm_add_script, trmm_delete_script, trmm_run_script, trmm_confirm_with_user, trmm_sync_now
|
||||||
|
**Directus (CRM):** directus_list_clients, directus_get_client, directus_get_client_services, directus_get_renewals, directus_upcoming_renewals
|
||||||
|
**Other:** get_agent_status, get_agent_output, list_agents, qyburn_task, qyburn_status, qyburn_approve, sam_research, web_search, proxmox_backup_status
|
||||||
|
|
||||||
|
## Agent Web Pages
|
||||||
|
|
||||||
|
Static HTML dashboards served at `agents.nxm.co.za/<name>/` from `/opt/sites/`:
|
||||||
|
- agents-dashboard, bran, changelog, citadel, hermes, hermes-native, hodor, jon-snow, qyburn, raven, sam, security-review, setup, stock, swarm, tarly, varys, workflow-test
|
||||||
|
|
||||||
## Phase 3 — infra-monitor (NEXT)
|
## Phase 3 — infra-monitor (NEXT)
|
||||||
Skill scaffold at `skills/infra-monitor/skill.md`. Ready to implement after spec update.
|
|
||||||
|
Skill scaffold at `skills/infra-monitor/skill.md`. **Spec is stale — needs update before building.**
|
||||||
|
|
||||||
**Goal:** Docker container state + system resource checks. Complements Varys (HTTP reachability) — do not duplicate.
|
**Goal:** Docker container state + system resource checks. Complements Varys (HTTP reachability) — do not duplicate.
|
||||||
|
|
||||||
**Before building:**
|
**Before building:**
|
||||||
- Update `skills/infra-monitor/skill.md` — container list is stale (has Flowise, missing Open WebUI + all new agents)
|
- Update `skills/infra-monitor/skill.md` — container list is stale (references Flowise/Netbird, missing 20+ services)
|
||||||
- Correct Ollama URL: now `http://172.27.40.20:11434` (migrated from 172.27.6.139)
|
- Ollama URL is now `http://172.27.40.20:11434`
|
||||||
- Decide: Docker one-shot container (consistent with bran/varys) or host cron + shell script?
|
- Decide: Docker one-shot container (consistent with bran/varys) or host cron + shell script?
|
||||||
|
|
||||||
**Output targets:**
|
**Output targets:**
|
||||||
@@ -30,38 +86,59 @@ Skill scaffold at `skills/infra-monitor/skill.md`. Ready to implement after spec
|
|||||||
- `/opt/agent-os/logs/infra-monitor/last-run.json` — machine-readable, read by Varys watchdog
|
- `/opt/agent-os/logs/infra-monitor/last-run.json` — machine-readable, read by Varys watchdog
|
||||||
- Raven alert on critical: `http://raven-notify:8400`
|
- Raven alert on critical: `http://raven-notify:8400`
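One possible shape for the container-state half of this skill is sketched below. It assumes the input is the text produced by `docker ps -a --format '{{json .}}'`, which on recent Docker versions prints one JSON object per line with `Names` and `State` keys. The output fields, the `critical` list, and the function name `summarise_containers` are hypothetical; the real `last-run.json` schema is not defined in this document.

```python
import json
from datetime import datetime, timezone


def summarise_containers(docker_ps_output, critical=()):
    """Summarise container state from line-delimited `docker ps` JSON."""
    containers = []
    for line in docker_ps_output.splitlines():
        line = line.strip()
        if not line:
            continue
        entry = json.loads(line)
        containers.append(
            {"name": entry.get("Names", ""), "state": entry.get("State", "unknown")}
        )

    # Anything not in the "running" state counts as down.
    down = sorted(c["name"] for c in containers if c["state"] != "running")
    critical_set = set(critical)
    critical_down = [name for name in down if name in critical_set]

    if critical_down:
        status = "critical"
    elif down:
        status = "warning"
    else:
        status = "ok"

    return {
        "checked_at": datetime.now(timezone.utc).isoformat(),
        "total": len(containers),
        "down": down,
        "critical_down": critical_down,
        "status": status,
    }
```

The returned dict could be written with `json.dump` to the `last-run.json` path listed above, and a `"critical"` status could trigger the Raven alert. The Raven request format is not documented here, so that call is left out.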
|
||||||
|
|
||||||
**Schedule:** hourly heartbeat (Docker + Ollama only) + daily 07:00 full digest
|
|
||||||
|
|
||||||
## Directory Structure
|
## Directory Structure
|
||||||
```
|
```
|
||||||
/opt/agent-os/
|
/opt/agent-os/
|
||||||
├── CLAUDE.md ← this file (project brief, tracked in Gitea)
|
├── CLAUDE.md ← this file (project brief)
|
||||||
├── identity.md ← populated Phase 2
|
├── README.md ← onboarding for new LLMs
|
||||||
├── brain.md
|
├── identity.md ← who the user is, hard limits
|
||||||
|
├── brain.md ← all infra facts, IPs, services, decisions
|
||||||
├── memory/
|
├── memory/
|
||||||
│ ├── active-projects.md ← update at end of each session
|
│ ├── active-projects.md ← what's in flight right now
|
||||||
│ ├── persistent.md
|
│ ├── persistent.md ← facts that never expire
|
||||||
│ ├── recent-decisions.md
|
│ ├── recent-decisions.md ← decisions from last 30 days
|
||||||
│ ├── constraints.md
|
│ ├── constraints.md ← hard limits agents must respect
|
||||||
│ └── notes-from-last-run.md
|
│ └── notes-from-last-run.md ← cleared each session
|
||||||
├── context/
|
├── claude-code/
|
||||||
|
│ └── memory/ ← Claude Code's persistent memory files (symlinked)
|
||||||
├── skills/
|
├── skills/
|
||||||
│ └── infra-monitor/ ← Phase 3 target
|
│ └── infra-monitor/ ← Phase 3 target (not yet built)
|
||||||
│ ├── skill.md ← spec (stale container list — update before building)
|
│ ├── skill.md
|
||||||
│ ├── learnings.md
|
│ ├── learnings.md
|
||||||
│ ├── eval.json
|
│ ├── eval.json
|
||||||
│ ├── last-output.md
|
│ ├── last-output.md
|
||||||
│ └── context/handoff.md
|
│ └── context/handoff.md
|
||||||
└── logs/
|
└── logs/ ← all agent log outputs
|
||||||
|
├── bran-changelog/
|
||||||
|
├── citadel-mcp/
|
||||||
|
├── jon-snow/
|
||||||
|
├── qyburn-coder/
|
||||||
|
├── raven-notify/
|
||||||
|
├── sam-research/
|
||||||
|
├── tarly-backup/
|
||||||
|
├── trmm-frappe-sync/
|
||||||
|
└── varys-monitor/
|
||||||
```
|
```
|
||||||
|
|
||||||
## Architecture
|
## Architecture
|
||||||
- LLM inference: Kubuntu Ollama at `http://172.27.40.20:11434`
|
|
||||||
- All agent output: `/opt/sites/<name>/` served at agents.nxm.co.za
|
|
||||||
- Log standard: `/opt/agent-os/logs/<skill>/last-run.json`
|
|
||||||
- Notifications: Raven at `http://raven-notify:8400`
|
|
||||||
|
|
||||||
## Pending — Gitea SSH Key (security debt)
|
- **LLM inference:** Ollama at `http://172.27.40.20:11434` (gemma4, llama3.1:8b, phi4) + Anthropic API (Claude Code, Hermes)
|
||||||
Server remote uses HTTP with embedded token. Before next token rotation:
|
- **Agent output pages:** `/opt/sites/<name>/` served at agents.nxm.co.za
|
||||||
1. Add SSH key for `nxm@172.27.40.3` to Gitea (Admin → Settings → SSH Keys)
|
- **Log standard:** `/opt/agent-os/logs/<agent>/last-run.json`
|
||||||
2. `cd /opt/agent-os && git remote set-url origin gitea-local:admin/agent-os.git`
|
- **Notifications:** Raven at `http://raven-notify:8400` (Discord + Gmail)
|
||||||
|
- **Task tracking:** Plane at plane.nxm.co.za
|
||||||
|
- **Client CRM:** Directus at directus.nxm.co.za
|
||||||
|
- **Client devices:** Tactical RMM at 172.27.40.4 (45 agents, 13 clients)
|
||||||
|
- **Helpdesk:** Frappe Helpdesk at helpdesk.nxm.co.za (VM 109)
|
||||||
|
- **Credentials:** `~/.nxm-keys` (chmod 600) — ONLY place credential values live
|
||||||
|
|
||||||
|
## Key Gotchas
|
||||||
|
|
||||||
|
- **maester-reports restart = Anthropic API cost** — cache is in-memory only
|
||||||
|
- **Open WebUI → Citadel MCP:** auth_type must be `none` (empty bearer key = silent failure)
|
||||||
|
- **Docker → OPNsense API:** Docker proxy network can't reach 172.27.6.1 (HTTP 400) — run from host
|
||||||
|
- **Headscale v0.28:** all write operations require numeric user ID, not username
|
||||||
|
- **Vaultwarden:** requires HTTPS — use vault.nxm.co.za, not LAN IP
|
||||||
|
- **Tailscale on Windows:** overrides DNS — disconnect when testing split DNS
|
||||||
|
- **NPM forward scheme:** HTTP even for HTTPS external — NPM handles SSL termination
|
||||||
|
- **NocoDB:** RvDM personal birthday DB only — never use for Nexum projects
|
||||||
|
|||||||
@@ -10,14 +10,41 @@ Every agent interaction reads from and writes back to files in this repo. No dat
|
|||||||
|
|
||||||
| Layer | File(s) | Purpose |
|
| Layer | File(s) | Purpose |
|
||||||
|---|---|---|
|
|---|---|---|
|
||||||
| Identity | `identity.md` | Who you are, communication style, values |
|
| Identity | `identity.md` | Who the user is, communication style, values, hard limits |
|
||||||
| Context | `context/` | Dated, task-specific working files |
|
| Context | `context/` | Dated, task-specific working files |
|
||||||
| Brain | `brain.md` | Persistent facts — infra, people, decisions |
|
| Brain | `brain.md` | Persistent facts — infra, services, IPs, standing decisions |
|
||||||
| Memory | `memory/` | Short and long-term session notes |
|
| Memory | `memory/` | Short and long-term session notes |
|
||||||
| Skills | `skills/` | Repeatable workflows, each self-improving |
|
| Skills | `skills/` | Repeatable workflows, each self-improving |
|
||||||
| Processes | `skills/*/context/handoff.md` | Output passed between chained skills |
|
| Processes | `skills/*/context/handoff.md` | Output passed between chained skills |
|
||||||
| Automation | cron on 172.27.40.3 | Scheduled skill execution |
|
| Automation | cron on 172.27.40.3 | Scheduled skill execution |
|
||||||
|
|
||||||
|
## Quick start for a new LLM
|
||||||
|
|
||||||
|
If you are an LLM reading this repo for the first time:
|
||||||
|
|
||||||
|
1. **Read `identity.md`** — who you're working for, hard limits, communication style
|
||||||
|
2. **Read `brain.md`** — all infrastructure facts: IPs, services, ports, agents, standing decisions
|
||||||
|
3. **Read `memory/active-projects.md`** — what's currently in flight
|
||||||
|
4. **Read `memory/constraints.md`** — things you must never do
|
||||||
|
5. **Read `CLAUDE.md`** — project status and architecture details
|
||||||
|
|
||||||
|
Do NOT take any action without reading `identity.md` first. The hard limits there are non-negotiable.
|
||||||
|
|
||||||
|
## Live agent ecosystem
|
||||||
|
|
||||||
|
The NxM infrastructure runs 12+ named agents across Docker containers and VMs. Every agent writes logs to `/opt/agent-os/logs/<agent>/last-run.json` and most publish web dashboards to `agents.nxm.co.za/<agent>/`.
|
||||||
|
|
||||||
|
Key agents:
|
||||||
|
- **Citadel MCP** (port 8300) — central tool server, 37 tools covering Docker, Plane, TRMM, Directus, file ops, web search
|
||||||
|
- **Raven** (port 8400) — notification hub (Discord + Gmail), all alerts route through here
|
||||||
|
- **Jon Snow** (port 8900) — chief of staff orchestrator with approval gates
|
||||||
|
- **Maester** (port 8800) — NIST CSF compliance reporting
|
||||||
|
- **Hermes Native** (VM 108) — primary conversational agent with WhatsApp + Honcho memory
|
||||||
|
- **Tarly** (port 8750) — backup monitoring (OPNsense configs + Proxmox)
|
||||||
|
- **Vexis** (via Hermes, VM 108) — workshop/TRMM scripting agent for client devices
|
||||||
|
|
||||||
|
See `brain.md` for the complete agent table with ports and schedules.
|
||||||
|
|
||||||
## Adding a new skill
|
## Adding a new skill
|
||||||
|
|
||||||
1. Create `skills/<skill-name>/skill.md` — what the skill does and how
|
1. Create `skills/<skill-name>/skill.md` — what the skill does and how
|
||||||
@@ -28,10 +55,11 @@ Every agent interaction reads from and writes back to files in this repo. No dat
|
|||||||
|
|
||||||
## Runtime
|
## Runtime
|
||||||
|
|
||||||
- Files live on server: `/opt/agent-os/` (cloned from this repo)
|
- **Server:** `/opt/agent-os/` on 172.27.40.3 (Ubuntu, Docker host)
|
||||||
- LLM inference: Ollama at `http://172.27.6.139:11434`
|
- **Repo:** `git.nxm.co.za/admin/agent-os` (SSH: `gitea-local:admin/agent-os.git`)
|
||||||
- Scheduled jobs: cron on `172.27.40.3`
|
- **LLM inference:** Ollama at `http://172.27.40.20:11434` (local) or Anthropic API (Claude Code/Hermes)
|
||||||
- Local editing: `/home/nxm/Documents/agent-os/` on Kubuntu (this machine)
|
- **Scheduled jobs:** cron on 172.27.40.3
|
||||||
|
- **Agent web pages:** `/opt/sites/<name>/` → agents.nxm.co.za
|
||||||
|
|
||||||
## Infra reference
|
## Infra reference
|
||||||
|
|
||||||
@@ -39,3 +67,7 @@ Cross-repo links to supporting documentation:
|
|||||||
- [IP & Port Map](https://git.nxm.co.za/admin/nxm-infrastructure/src/branch/main/Quick%20Reference/IP%20%26%20Port%20Map.md)
|
- [IP & Port Map](https://git.nxm.co.za/admin/nxm-infrastructure/src/branch/main/Quick%20Reference/IP%20%26%20Port%20Map.md)
|
||||||
- [Docker Stacks](https://git.nxm.co.za/admin/nxm-infrastructure/src/branch/main/Quick%20Reference/Docker%20Stacks.md)
|
- [Docker Stacks](https://git.nxm.co.za/admin/nxm-infrastructure/src/branch/main/Quick%20Reference/Docker%20Stacks.md)
|
||||||
- [Network Overview](https://git.nxm.co.za/admin/nxm-infrastructure/src/branch/main/Infrastructure/Network%20Overview.md)
|
- [Network Overview](https://git.nxm.co.za/admin/nxm-infrastructure/src/branch/main/Infrastructure/Network%20Overview.md)
|
||||||
|
|
||||||
|
## Credential policy
|
||||||
|
|
||||||
|
All API keys and passwords live in `~/.nxm-keys` (chmod 600). Never write credential values into code, config files, logs, or documentation. Reference the file location instead.
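A script that follows this policy might load its secrets at run time and refuse to start if the file is readable by anyone else. The sketch below assumes `~/.nxm-keys` holds simple `NAME=value` lines, which this document does not confirm; the function name `load_keys` is also hypothetical.

```python
import stat
from pathlib import Path


def load_keys(path="~/.nxm-keys"):
    """Load NAME=value pairs, refusing files with group/other permissions."""
    key_file = Path(path).expanduser()
    mode = stat.S_IMODE(key_file.stat().st_mode)
    if mode & 0o077:
        # Any group or other bit set means the file is looser than 600.
        raise PermissionError(f"{key_file} has mode {oct(mode)}; expected 600")

    keys = {}
    for raw in key_file.read_text(encoding="utf-8").splitlines():
        line = raw.strip()
        if not line or line.startswith("#") or "=" not in line:
            continue
        name, _, value = line.partition("=")
        keys[name.strip()] = value.strip().strip('"').strip("'")
    return keys
```

Values returned this way should stay in memory only. Printing or logging the dict would break the policy stated above.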
|
||||||
|
|||||||
@@ -1,64 +1,127 @@
|
|||||||
# Brain
|
# Brain
|
||||||
|
|
||||||
Core facts read by all skills. Keep under 1000 words. Update when infrastructure changes.
|
Core facts read by all skills. Keep under 1500 words. Update when infrastructure changes.
|
||||||
Last updated: 2026-04-30
|
Last updated: 2026-06-19
|
||||||
|
|
||||||
---
|
---
|
||||||
|
|
||||||
## Infrastructure
|
## Infrastructure
|
||||||
|
|
||||||
**Primary server:** 172.27.40.3 — Ubuntu Server LTS, Docker host
|
**Primary server:** 172.27.40.3 — Ubuntu Server LTS, Docker host, all agent runtimes
|
||||||
**Kubuntu desktop:** 172.27.6.139 — NxM-AI, runs Ollama
|
**Ollama inference host:** 172.27.40.20 — Windows 11 Pro (NxM-AI), Vulkan GPU, Scheduled Task auto-start
|
||||||
**TrueNAS NAS:** 172.27.40.220 (Servers40), management: 172.27.6.221
|
**TrueNAS NAS:** 172.27.40.220 (data) / 172.27.6.221 (mgmt) — 35.6 TB, NFS shares for ISOs + Proxmox backups
|
||||||
**Firewall:** OPNsense at 172.27.6.1
|
**Firewall:** OPNsense at 172.27.6.1 (mgmt UI, not routed gateway)
|
||||||
|
**Proxmox VE:** 172.27.40.2 — PVE 9.1.1, 2× Xeon Gold 6138 (80 vCPUs), 252 GB RAM
|
||||||
|
**Hermes Native VM:** 172.27.40.30 (VM 108) — dedicated agent VM, Honcho memory, WhatsApp connected
|
||||||
|
**Tactical RMM:** 172.27.40.4 (VM 101) — remote management for all Nexum clients
|
||||||
|
**Home Assistant:** 172.27.10.6 (VM 100) — IoT automation
|
||||||
|
**Synology DS423+:** 172.27.40.80 — Coetzee off-site backup NAS, Active Backup via S2S
|
||||||
|
|
||||||
**VLANs:**
|
**VLANs:**
|
||||||
| VLAN | Name | Subnet |
|
| VLAN | Name | Subnet | Gateway |
|
||||||
|---|---|---|
|
|---|---|---|---|
|
||||||
| 40 | Servers40 | 172.27.40.0/24 |
|
| 40 | Servers40 | 172.27.40.0/24 | 172.27.40.1 |
|
||||||
| 20 | Workshop20 | 172.27.20.0/24 |
|
| 20 | Workshop20 | 172.27.20.0/24 | 172.27.20.1 |
|
||||||
| 10 | IoT10 | 172.27.10.0/24 |
|
| 10 | IoT10 | 172.27.10.0/24 | 172.27.10.1 |
|
||||||
|
|
||||||
## Key Services (172.27.40.3)
|
## Key Services (172.27.40.3)
|
||||||
|
|
||||||
| Service | Port | URL |
|
| Service | Port | URL | Role |
|
||||||
|
|---|---|---|---|
|
||||||
|
| Portainer | 9443 | https://172.27.40.3:9443 | Docker management |
|
||||||
|
| Nginx Proxy Manager | 80/81/443 | http://172.27.40.3:81 | Reverse proxy, SSL termination |
|
||||||
|
| Uptime Kuma | 3002 | kuma.nxm.co.za | HTTP monitoring |
|
||||||
|
| Gitea | 3000 | git.nxm.co.za | Self-hosted git, all docs + code |
|
||||||
|
| Headscale | 8080 | headscale.nxm.co.za | VPN (self-hosted Tailscale) |
|
||||||
|
| Vaultwarden | 8222 | vault.nxm.co.za | Password manager |
|
||||||
|
| Open WebUI | 3010 | chat.nxm.co.za | Chat UI for Ollama + MCP |
|
||||||
|
| Plane | 8095 | plane.nxm.co.za | Project/task tracking |
|
||||||
|
| Homarr | 7575 | http://172.27.40.3:7575 | Dashboard |
|
||||||
|
| Grafana | 3020 | grafana.nxm.co.za | Monitoring dashboards |
|
||||||
|
| InfluxDB | 8086 | internal | Time-series DB for monitoring |
|
||||||
|
| NetBox | 8100 | netbox.nxm.co.za | IPAM, network documentation |
|
||||||
|
| NocoDB | 8150 | rvd.nxm.co.za | RvDM birthday DB (personal, NOT Nexum) |
|
||||||
|
| InvenTree | 8160 | inventree.nxm.co.za | IT stock + BOM tracking (testing) |
|
||||||
|
| Directus | 8850 | directus.nxm.co.za | Nexum client CRM |
|
||||||
|
| Nextcloud | — | — | Phone backup |
|
||||||
|
| Wetty | 8450/8451 | terminal.nxm.co.za / term.nxm.co.za | Web SSH terminal |
|
||||||
|
| RustDesk | 21115-21119 | internal | Self-hosted remote desktop relay |
|
||||||
|
| SearXNG | 8600 | internal | Search backend for sam + citadel |
|
||||||
|
| iVentoy | 26000 | internal | PXE boot server |
|
||||||
|
|
||||||
|
## AI / Agent Stack
|
||||||
|
|
||||||
|
**LLM inference:**
|
||||||
|
- **Ollama** on 172.27.40.20:11434 — models: gemma4, llama3.1:8b, phi4
|
||||||
|
- **Claude Code** on 172.27.40.3 — primary AI assistant (Anthropic API)
|
||||||
|
- **Hermes Native** on 172.27.40.30 — OpenRouter, Honcho memory, WhatsApp
|
||||||
|
- **Hermes Cloud** on 172.27.40.3:8643 — claude-sonnet-4-6, Citadel MCP wired
|
||||||
|
|
||||||
|
**Named agents (all Docker on 172.27.40.3 unless noted):**
|
||||||
|
| Agent | Port | Role | Schedule |
|
||||||
|
|---|---|---|---|
|
||||||
|
| hodor-gateway | 8200 | Simple Ollama gateway (POST /ask) | On-demand |
|
||||||
|
| citadel-mcp | 8300 | MCP SSE+HTTP server, 37 tools | Always-on |
|
||||||
|
| raven-notify | 8400 | Discord + Gmail notifications | Always-on |
|
||||||
|
| sam-research | 8500 | SearXNG + Ollama research | On-demand |
|
||||||
|
| qyburn-coder | 8700 | LLM coding agent (approve/reject) | On-demand |
|
||||||
|
| maester-reports | 8800 | NIST CSF compliance reports | On-demand |
|
||||||
|
| jon-snow | 8900 | Chief of staff orchestrator | Always-on |
|
||||||
|
| bran-changelog | — | Git changelog generator | Daily 06:00 |
|
||||||
|
| varys-monitor | — | Service HTTP reachability checks | Cron every 15 min |
|
||||||
|
| tarly-backup | 8750 | OPNsense config + Proxmox backup monitor | Daily 04:00 SAST |
|
||||||
|
| hermes-cloud | 8643 | Claude-powered conversational agent | Always-on |
|
||||||
|
| hermes-native | VM 108 | Primary Hermes agent (WhatsApp) | Always-on |
|
||||||
|
| vexis (workshop) | VM 108 | Nexum workshop agent (TRMM scripts) | On-demand via Hermes |
|
||||||
|
|
||||||
|
**Citadel MCP tools (37):** file ops, Docker management, Plane issues/projects/pages, TRMM (agents/scripts/confirm), Directus CRM, Proxmox backups, Qyburn task/approve, Sam research, web search, propose_file_change.
|
||||||
|
|
||||||
|
## Cron Jobs (172.27.40.3)
|
||||||
|
|
||||||
|
| Schedule | Job | Log |
|
||||||
|---|---|---|
|
|---|---|---|
|
||||||
| Portainer | 9443 | https://172.27.40.3:9443 |
|
| Daily 06:00 | bran-changelog/run.sh | logs/bran.log |
|
||||||
| Nginx Proxy Manager | 80/81/443 | http://172.27.40.3:81 |
|
| Daily 06:00 | zenarmor-pull.py | monitoring/logs/zenarmor-pull.log |
|
||||||
| Uptime Kuma | 3002 | http://172.27.40.3:3002 |
|
| Daily 02:05 | tarly hub-backup.sh | logs/tarly-backup/hub-backup.log |
|
||||||
| Gitea | 3000 | https://git.nxm.co.za |
|
| Every 1 min | ovpn-status.py | logs/ovpn-status.log |
|
||||||
| Headscale | 8080 | https://headscale.nxm.co.za |
|
| Every 30 min | trmm-frappe-sync.py | logs/trmm-frappe-sync.log |
|
||||||
| Netbird | 3479/udp | https://netbird.nxm.co.za |
|
|
||||||
| Vaultwarden | 8222 | https://vault.nxm.co.za |
|
|
||||||
| Flowise | 3010 | http://172.27.40.3:3010 |
|
|
||||||
| Plane | 8095 | https://plane.nxm.co.za |
|
|
||||||
| Zabbix | 8091 | https://zabbix.nxm.co.za |
|
|
||||||
| Homarr | 7575 | http://172.27.40.3:7575 |
|
|
||||||
|
|
||||||
## AI Stack
|
## OpenVPN S2S Sites
|
||||||
|
|
||||||
- **Ollama** on 172.27.6.139:11434 (bound to 0.0.0.0)
|
| Site | Tunnel IP | Status | Notes |
|
||||||
- **Models:** gemma4, qwen2.5-coder:7b
|
|---|---|---|---|
|
||||||
- **Flowise** on 172.27.40.3:3010 — visual agent/flow builder
|
| bezhuis | 172.16.17.2 | COMPLETE | NAT + DNS overrides, LAN access live |
|
||||||
- **Claude Code** — primary AI assistant, runs on Kubuntu
|
| mwp | 172.16.17.3 | COMPLETE | Monitoring live |
|
||||||
|
| coetzee | 172.16.17.4 | COMPLETE | Monitoring-only + Active Backup to Synology |
|
||||||
|
| fwlaw | — | PENDING | Awaiting migration |
|
||||||
|
|
||||||
## Agent OS Runtime
|
## Agent OS Runtime
|
||||||
|
|
||||||
- Files: `/opt/agent-os/` on 172.27.40.3
|
- Files: `/opt/agent-os/` on 172.27.40.3
|
||||||
- Local edit path: `/home/nxm/Documents/agent-os/` on 172.27.6.139
|
- Repo: `git.nxm.co.za/admin/agent-os` (SSH remote: `gitea-local:admin/agent-os.git`)
|
||||||
- Repo: `https://git.nxm.co.za/admin/agent-os`
|
|
||||||
- Scheduled jobs: cron on 172.27.40.3
|
- Scheduled jobs: cron on 172.27.40.3
|
||||||
- LLM calls: `http://172.27.6.139:11434`
|
- LLM calls: `http://172.27.40.20:11434` (Ollama) or Anthropic API (Claude Code / Hermes)
|
||||||
|
- Agent web pages: `/opt/sites/<name>/` served at agents.nxm.co.za
|
||||||
|
|
||||||
## Key Paths on Server
|
## Key Paths on Server
|
||||||
|
|
||||||
- Docker stacks: `/opt/stacks/`
|
- Docker stacks: `/opt/stacks/`
|
||||||
- Agent OS: `/opt/agent-os/`
|
- Agent OS: `/opt/agent-os/`
|
||||||
|
- Agent web pages: `/opt/sites/`
|
||||||
|
- Credentials: `~/.nxm-keys` (chmod 600) — NEVER write values elsewhere
|
||||||
|
- SSH keys: `~/.ssh/` (ED25519)
|
||||||
|
- NxM infrastructure docs: `/home/nxm/Documents/NxM Linux Server/`
|
||||||
|
- Nexum project docs: `/home/nxm/Documents/Nexum Projects/`
|
||||||
|
|
||||||
## Standing Decisions
|
## Standing Decisions
|
||||||
|
|
||||||
- TrueNAS will move to a dedicated server — avoid hardcoding 172.27.40.5 in automation
|
- NPM handles all SSL termination — internal services use HTTP
|
||||||
- NPM handles all SSL termination — internal services use HTTP, NPM adds HTTPS
|
- Docker Compose only (no Kubernetes, no Swarm)
|
||||||
- NFS preferred for Linux-to-Linux file sharing
|
- All destructive actions require explicit confirmation
|
||||||
- Docker Compose only (no Kubernetes)
|
- Credentials only in `~/.nxm-keys` — never in output, logs, or config files
|
||||||
- All destructive actions require explicit confirmation before execution
|
- Netbird fully removed (2026-05-28) — VPN is Headscale + OpenVPN S2S
|
||||||
|
- WireGuard fully removed (2026-05-30) — replaced by OpenVPN S2S
|
||||||
|
- Open WebUI → Citadel MCP: auth_type must be `none` (empty bearer = silent failure)
|
||||||
|
- Docker → OPNsense API: run from host, never from inside a container (HTTP 400)
|
||||||
|
- NocoDB = RvDM personal only — never use for Nexum projects
|
||||||
|
- Nexum client data layer = Directus CRM
|
||||||
|
|||||||
+18
-11
@@ -1,6 +1,6 @@
|
|||||||
# Identity
|
# Identity
|
||||||
|
|
||||||
> **Status: COMPLETE** — Interview completed 2026-05-01.
|
> **Status: COMPLETE** — Interview completed 2026-05-01, updated 2026-06-19.
|
||||||
|
|
||||||
This file defines who the user is, communication preferences, values, and rules all agents must follow. Every skill reads this file before executing.
|
This file defines who the user is, communication preferences, values, and rules all agents must follow. Every skill reads this file before executing.
|
||||||
|
|
||||||
@@ -11,7 +11,9 @@ This file defines who the user is, communication preferences, values, and rules
|
|||||||
- **Name:** Jaco Bezuidenhout
|
- **Name:** Jaco Bezuidenhout
|
||||||
- **Company:** Nexum SA (PTY) Ltd — Mossel Bay, South Africa
|
- **Company:** Nexum SA (PTY) Ltd — Mossel Bay, South Africa
|
||||||
- **Role:** Business owner, IT admin, network engineer
|
- **Role:** Business owner, IT admin, network engineer
|
||||||
- **Primary focus:** Network monitoring for early problem detection; IT infrastructure management for clients
|
- **Primary focus:** Network monitoring, NIST CSF compliance reporting, IT infrastructure management for clients
|
||||||
|
- **Domain expertise:** VLANs, inter-VLAN routing, firewall rules (OPNsense), split DNS, VPN (Headscale/OpenVPN S2S), Docker Compose, Ubuntu Server admin, reverse proxy (NPM), IPAM (NetBox), monitoring (Grafana/Uptime Kuma/InfluxDB)
|
||||||
|
- **Not expert in:** Kubernetes, cloud platforms (AWS/Azure/GCP), advanced Python (learning), application development
|
||||||
|
|
||||||
---
|
---
|
||||||
|
|
||||||
@@ -19,9 +21,10 @@ This file defines who the user is, communication preferences, values, and rules
|
|||||||
|
|
||||||
Priority order:
|
Priority order:
|
||||||
1. **Monitoring & compliance** — collect firewall and software data to support NIST CSF report completion
|
1. **Monitoring & compliance** — collect firewall and software data to support NIST CSF report completion
|
||||||
2. **Coding** — scripting, automation, tooling
|
2. **Client management** — TRMM remote management, Directus CRM, Frappe Helpdesk ticketing
|
||||||
3. **Summarising** — distil logs, changelogs, reports into concise output
|
3. **Coding** — scripting, automation, tooling
|
||||||
4. **General automation** — recurring tasks, scheduled jobs
|
4. **Summarising** — distil logs, changelogs, reports into concise output
|
||||||
|
5. **General automation** — recurring tasks, scheduled jobs, backups
|
||||||
|
|
||||||
---
|
---
|
||||||
|
|
||||||
@@ -48,7 +51,7 @@ Priority order:
|
|||||||
- Send any external message (email, webhook, notification)
|
- Send any external message (email, webhook, notification)
|
||||||
- Push to git or any remote repository
|
- Push to git or any remote repository
|
||||||
- Drop, reset, or modify databases
|
- Drop, reset, or modify databases
|
||||||
- **Never use a cloud-hosted LLM** (OpenAI, Anthropic API, Google, etc.) unless explicitly instructed. All inference stays on local Ollama (172.27.6.139:11434).
|
- Expose any service publicly without confirming NPM + Cloudflare + firewall implications
|
||||||
|
|
||||||
---
|
---
|
||||||
|
|
||||||
@@ -56,13 +59,17 @@ Priority order:
|
|||||||
|
|
||||||
- Depends on the task — choose the format that fits the output type.
|
- Depends on the task — choose the format that fits the output type.
|
||||||
- **Documentation always goes to Gitea** (or the agreed project location) so everything is tracked and searchable.
|
- **Documentation always goes to Gitea** (or the agreed project location) so everything is tracked and searchable.
|
||||||
- **Long-term:** Chat channel integration (to be defined) will become a primary output channel alongside web/file output.
|
- **Notifications route through Raven** (Discord + Gmail) at `http://raven-notify:8400`
|
||||||
|
- **Agent web output** goes to `/opt/sites/<name>/` served at agents.nxm.co.za
|
||||||
|
|
||||||
---
|
---
|
||||||
|
|
||||||
## Infrastructure Context
|
## Infrastructure Context
|
||||||
|
|
||||||
- Local LLM: Ollama at `http://172.27.6.139:11434` (gemma4, qwen2.5-coder:7b)
|
- **Ollama:** `http://172.27.40.20:11434` — Windows 11 Pro (NxM-AI), models: gemma4, llama3.1:8b, phi4
|
||||||
- Server: Ubuntu at `172.27.40.3` — Docker host, all agent runtimes
|
- **Server:** Ubuntu at `172.27.40.3` — Docker host, all agent runtimes
|
||||||
- Git: Gitea at `https://git.nxm.co.za` — all code and docs live here
|
- **Hermes Native:** VM 108 at `172.27.40.30` — OpenRouter LLM, Honcho memory, WhatsApp connected
|
||||||
- Agent OS runtime: `/opt/agent-os/` on 172.27.40.3, mounted at `/mnt/agent-os` on Kubuntu
|
- **Git:** Gitea at `https://git.nxm.co.za` — all code and docs
|
||||||
|
- **Agent OS runtime:** `/opt/agent-os/` on 172.27.40.3
|
||||||
|
- **Credentials:** `~/.nxm-keys` (chmod 600) — API keys for NPM, OPNsense, Proxmox, TrueNAS, Plane, Gitea, NetBox
|
||||||
|
- **Claude Code:** installed on 172.27.40.3, primary AI assistant
|
||||||
|
|||||||
+18
-14
@@ -1,7 +1,7 @@
|
|||||||
# Active Projects
|
# Active Projects
|
||||||
|
|
||||||
Current in-flight work. Update at the end of each session.
|
Current in-flight work. Update at the end of each session.
|
||||||
Last updated: 2026-05-16
|
Last updated: 2026-06-19
|
||||||
|
|
||||||
---
|
---
|
||||||
|
|
||||||
@@ -12,8 +12,8 @@ Phases 1 (NFS + mount) and 2 (identity interview) are complete.
|
|||||||
**Phase 3 goal:** Docker container state monitoring + system resources. Complements Varys (HTTP reachability) — do not duplicate.
|
**Phase 3 goal:** Docker container state monitoring + system resources. Complements Varys (HTTP reachability) — do not duplicate.
|
||||||
|
|
||||||
Pre-work before implementing:
|
Pre-work before implementing:
|
||||||
- [ ] Update `skills/infra-monitor/skill.md` — container list is stale (has Flowise, missing Open WebUI + all new agents: citadel, varys, bran, sam, raven, qyburn, hodor, searxng, monitoring, bni-scheduler, nocodb)
|
- [ ] Update `skills/infra-monitor/skill.md` — container list is stale (references Flowise/Netbird, missing 20+ current services)
|
||||||
- [ ] Correct Ollama URL in skill.md: now `http://172.27.40.20:11434` (moved from 172.27.6.139)
|
- [ ] Correct Ollama URL in skill.md: now `http://172.27.40.20:11434` (moved from 172.27.6.139 → 172.27.40.20)
|
||||||
- [ ] Decide implementation: Docker one-shot container (consistent with bran/varys pattern) vs host cron + shell script
|
- [ ] Decide implementation: Docker one-shot container (consistent with bran/varys pattern) vs host cron + shell script
|
||||||
|
|
||||||
Implementation tasks:
|
Implementation tasks:
|
||||||
@@ -26,23 +26,27 @@ Implementation tasks:
|
|||||||
- [ ] Hourly heartbeat cron on 172.27.40.3
|
- [ ] Hourly heartbeat cron on 172.27.40.3
|
||||||
- [ ] Daily 07:00 full digest cron
|
- [ ] Daily 07:00 full digest cron
|
||||||
- [ ] Notification channel: Raven (confirmed live at http://raven-notify:8400)
|
- [ ] Notification channel: Raven (confirmed live at http://raven-notify:8400)
|
||||||
- [ ] Home Assistant integration (172.27.10.6) — optional, revisit after Phase 3
|
|
||||||
|
|
||||||
## Agent OS — Phase 5: Future Skills (Future)
|
## Agent OS — Phase 5: Future Skills (Future)
|
||||||
- backup-monitor: TrueNAS migrated to new hardware (172.27.40.220) — skill ready to build
|
- backup-monitor: extend Tarly with deeper TrueNAS integration
|
||||||
- Netbird/Headscale peer health: Netbird API at http://172.22.0.11:80/api/
|
|
||||||
- Daily log digest: summarise /opt/agent-os/logs/ via Ollama
|
- Daily log digest: summarise /opt/agent-os/logs/ via Ollama
|
||||||
|
|
||||||
---
|
---
|
||||||
|
|
||||||
## Gitea Documentation Repos
|
## Active Infrastructure Projects
|
||||||
- [x] nxm-infrastructure repo — Obsidian vault imported, CLAUDE.md added 2026-05-16
|
|
||||||
- [x] nexum-projects repo — Obsidian vault imported (on Kubuntu)
|
| Project | Status | Next Step |
|
||||||
- [x] agent-os repo — scaffolding created, CLAUDE.md is global symlink
|
|---|---|---|
|
||||||
|
| **Monitoring** | bezhuis+mwp+coetzee alerts live | CPU/mem/WAN/ping Grafana rules pending |
|
||||||
|
| **OpenVPN S2S** | bezhuis/mwp/coetzee DONE | fwlaw pending |
|
||||||
|
| **Tarly Backup** | Hub working | bezhuis/mwp/coetzee API key fix (backup privilege) |
|
||||||
|
| **Directus CRM** | LIVE, 12 clients seeded | Manual data enrichment (contacts, renewals) |
|
||||||
|
| **InvenTree** | LIVE (testing) | SSL cert, production use |
|
||||||
|
| **Mailcow** | MAIL-1+2 done | Blocked on Mimecast (MAIL-3→9) |
|
||||||
|
| **Vexis** | nexum-private-customer-setup + office-install done | ESET/Evolve creds or standard-setup next |
|
||||||
|
| **Maester Phase 2** | Phase 1 live | Hermes narrative + .docx generation |
|
||||||
|
|
||||||
---
|
---
|
||||||
|
|
||||||
## Pending: Gitea SSH Key (security debt)
|
## Gitea SSH Key — DONE
|
||||||
Server remote uses HTTP with embedded token. Before rotating:
|
Server remote switched from HTTP+token to SSH (`gitea-local:admin/agent-os.git`) on 2026-06-19.
|
||||||
1. Add SSH key for `nxm@172.27.40.3` to Gitea (Admin → Settings → SSH Keys)
|
|
||||||
2. `cd /opt/agent-os && git remote set-url origin gitea-local:admin/agent-os.git`
|
|
||||||
|
|||||||
+33
-5
@@ -1,13 +1,41 @@
|
|||||||
# Constraints
|
# Constraints
|
||||||
|
|
||||||
Hard limits agents must respect. Never work around these without explicit user confirmation.
|
Hard limits agents must respect. Never work around these without explicit user confirmation.
|
||||||
Last updated: 2026-04-30
|
Last updated: 2026-06-19
|
||||||
|
|
||||||
---
|
---
|
||||||
|
|
||||||
- Never take destructive or irreversible action without explicit confirmation (delete, overwrite, drop, reset, force push)
|
## Destructive actions
|
||||||
- Never store credentials in output files, logs, or generated markdown — reference their location instead
|
- Never delete or overwrite files without explicit confirmation
|
||||||
- Never skip git hooks or bypass signing
|
- Never restart or stop services without explicit confirmation
|
||||||
- TrueNAS is on new hardware — use 172.27.40.220 (Servers40) for services, 172.27.6.221 for management/API
|
- Never drop, reset, or modify databases without explicit confirmation
|
||||||
|
- Never force push to git or bypass hooks
|
||||||
|
- Never run `pfctl` commands on OPNsense (risk of locking out remote access)
|
||||||
|
|
||||||
|
## Credentials
|
||||||
|
- All credentials live in `~/.nxm-keys` (chmod 600) — ONLY location
|
||||||
|
- Never store credentials in output files, logs, generated markdown, .env files, or code
|
||||||
|
- Reference the file location, never the values
|
||||||
|
- TrueNAS IPs: 172.27.40.220 (Servers40 data) / 172.27.6.221 (management/API)
|
||||||
|
|
||||||
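The `chmod 600` requirement on `~/.nxm-keys` can be checked before an agent reads the file. The following is a minimal sketch, not part of the repository: the function name `is_owner_only` is hypothetical, and it only inspects permission bits, never the file contents.

```python
import stat
from pathlib import Path


def is_owner_only(path: Path) -> bool:
    """Return True if the permission bits are exactly 0o600 (owner read/write only)."""
    # S_IMODE strips the file-type bits and keeps only the permission bits.
    return stat.S_IMODE(path.stat().st_mode) == 0o600
```

An agent could call `is_owner_only(Path.home() / ".nxm-keys")` at startup and refuse to continue if it returns `False`.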
|
## Infrastructure
|
||||||
- Linux server (172.27.40.3) has no GPU — never schedule LLM inference to run locally there
|
- Linux server (172.27.40.3) has no GPU — never schedule LLM inference to run locally there
|
||||||
|
- Ollama runs on 172.27.40.20 (Windows 11 Pro) — not on the Docker host
|
||||||
- Docker Compose only — no Kubernetes, no Swarm
|
- Docker Compose only — no Kubernetes, no Swarm
|
||||||
|
- Docker proxy network (172.22.0.0/16) cannot reach OPNsense API at 172.27.6.1 — always run OPNsense API scripts from the host
|
||||||
|
- NPM handles SSL termination — internal services always use HTTP
|
||||||
|
|
||||||
|
## Agent-specific
|
||||||
|
- **maester-reports:** restart clears in-memory cache → re-parses all evidence PDFs via Claude Opus vision (Anthropic API cost). Avoid unnecessary restarts.
|
||||||
|
- **NocoDB:** RvDM personal birthday DB ONLY — never suggest for any Nexum project. Nexum data layer = Directus.
|
||||||
|
- **Open WebUI → Citadel MCP:** auth_type must be `none`. Empty bearer key generates illegal header → silent connection failure.
|
||||||
|
- **Qyburn task specs:** never embed code in the description field — use plain English only (14b models explain code instead of writing it)
|
||||||
|
|
||||||
|
## External communication
|
||||||
|
- Never send any external message (email, webhook, Discord notification) without explicit confirmation
|
||||||
|
- Notifications always route through Raven (http://raven-notify:8400)
|
||||||
|
- Never expose services publicly without confirming NPM + Cloudflare + firewall implications
|
||||||
|
|
||||||
|
## Naming
|
||||||
|
- S2S = always suggest Site-to-Site VPN (not Road Warrior) for permanent infrastructure endpoints
|
||||||
|
- Use `.50+` IP range for non-firewall infrastructure devices on S2S tunnels
|
||||||
|
|||||||
+32
-6
@@ -1,18 +1,44 @@
|
|||||||
# Persistent Memory
|
# Persistent Memory
|
||||||
|
|
||||||
Facts that don't expire. If you'd have to re-explain it to a new agent every time, it belongs here.
|
Facts that don't expire. If you'd have to re-explain it to a new agent every time, it belongs here.
|
||||||
Last updated: 2026-04-30
|
Last updated: 2026-06-19
|
||||||
|
|
||||||
---
|
---
|
||||||
|
|
||||||
## Infrastructure decisions
|
## Infrastructure decisions
|
||||||
- RustDesk is self-hosted on 172.27.40.3 — clients connect to local server not public relay
|
- RustDesk is self-hosted on 172.27.40.3 — clients connect to local server not public relay
|
||||||
- Netbird signal+management both route through NPM on port 443 — exposedAddress in /opt/stacks/netbird/config.yaml must be https://netbird.nxm.co.za:443 (caddy-netbird on :8443 exists but is not used externally)
|
- NPM handles all SSL termination — internal services use HTTP, NPM adds HTTPS
|
||||||
- Headscale v0.28: all write operations require numeric user ID, not username
|
- Headscale v0.28: all write operations require numeric user ID, not username
|
||||||
- Tailscale on Windows overrides DNS — disconnect before testing split DNS changes
|
- Tailscale on Windows overrides DNS — disconnect before testing split DNS changes
|
||||||
- Servers running Tailscale must run `sudo tailscale set --accept-dns=false` before joining Netbird
|
- Docker Compose only — no Kubernetes, no Swarm
|
||||||
|
- Docker → OPNsense API: HTTP 400 from Docker proxy network — always run OPNsense API scripts from the host
|
||||||
|
- All internal subdomains: gray-cloud CNAME → opnsense.nxm.co.za in Cloudflare. Proxied = 523 error.
|
||||||
|
- OPNsense split DNS: all subdomains resolve to 172.27.40.3 internally via Unbound host overrides
|
||||||
|
|
||||||
|
## Decommissioned services (do not reference)
|
||||||
|
- **Netbird:** Fully removed from server 2026-05-28. Orphaned clients on mwp/coetzee/b0qxxx/fwlaw firewalls pending removal.
|
||||||
|
- **WireGuard (N2W):** Fully removed 2026-05-30. Replaced by OpenVPN S2S.
|
||||||
|
- **Flowise:** Replaced by Open WebUI 2026-05-01.
|
||||||
|
- **Zabbix:** No longer running (monitoring moved to Grafana + InfluxDB + Telegraf).
|
||||||
|
|
||||||
## Agent OS build state
|
## Agent OS build state
|
||||||
- Phase 1-2 (file structure + NFS + identity interview): not yet started
|
- Phase 1-2 complete (file structure + identity interview)
|
||||||
- First skill to build: infra-monitor (Docker health + agent watchdog)
|
- Phase 3 (infra-monitor skill): spec written but stale, not yet implemented
|
||||||
- Notifications target: Home Assistant at 172.27.10.6
|
- Notifications target: Raven at http://raven-notify:8400 (Discord + Gmail)
|
||||||
|
- All agent logs write to `/opt/agent-os/logs/<agent>/last-run.json`
|
||||||
|
|
||||||
|
## Credential policy
|
||||||
|
- All API keys and passwords: `~/.nxm-keys` (chmod 600)
|
||||||
|
- Never write credential values into output, logs, docs, or config files
|
||||||
|
- Reference credential location instead
|
||||||
|
|
||||||
|
## VPN topology
|
||||||
|
- **Headscale** (self-hosted Tailscale): remote access for admin devices
|
||||||
|
- **OpenVPN S2S:** site-to-site for client firewalls (bezhuis/mwp/coetzee done, fwlaw pending)
|
||||||
|
- Hub tunnel IPs: bezhuis=172.16.17.2, mwp=172.16.17.3, coetzee=172.16.17.4
|
||||||
|
|
||||||
|
## Ollama
|
||||||
|
- Host: 172.27.40.20 (Windows 11 Pro, NxM-AI), Vulkan GPU
|
||||||
|
- Models: gemma4, llama3.1:8b, phi4
|
||||||
|
- Auto-starts via Scheduled Task (S4U + AtStartup)
|
||||||
|
- Used by: hodor-gateway, sam-research, qyburn-coder, Open WebUI
|
||||||
|
|||||||
@@ -1,11 +1,24 @@
|
|||||||
# Recent Decisions
|
# Recent Decisions
|
||||||
|
|
||||||
Decisions made in the last 30 days that affect current work. Archive when no longer relevant.
|
Decisions made in the last 60 days that affect current work. Archive when no longer relevant.
|
||||||
Last updated: 2026-04-30
|
Last updated: 2026-06-19
|
||||||
|
|
||||||
---
|
---
|
||||||
|
|
||||||
- **2026-04-30:** Chose Gitea (self-hosted git) over Obsidian for documentation — AI-writable, browser-accessible, version controlled
|
- **2026-06-19:** Agent OS git remote switched from HTTP+token to SSH (gitea-local:admin/agent-os.git) — security debt resolved
|
||||||
- **2026-04-30:** Agent OS files to live on 172.27.40.3 at /opt/agent-os/, accessed from Kubuntu via NFS
|
- **2026-06-19:** Comprehensive Agent OS documentation update — brain.md, identity.md, all memory files brought current for LLM onboarding
|
||||||
- **2026-04-29:** Chose Syncthing-free approach for Obsidian migration — NFS for Linux, SMB for Windows
|
- **2026-06-18:** Coetzee OpenVPN S2S complete — monitoring-only + hub-side NAT for Active Backup to Synology DS423+
|
||||||
- **2026-04-29:** infra-monitor will be first Agent OS skill — covers Docker health and agent watchdog in one skill
|
- **2026-06-18:** Tarly backup service live — OPNsense config backups to TrueNAS NFS, Proxmox monitoring
|
||||||
|
- **2026-06-17:** Directus CRM live — 6 collections, 12 clients seeded from TRMM, 5 Citadel MCP tools
|
||||||
|
- **2026-06-17:** MWP Netbird fully removed, WireGuard spoke cleaned
|
||||||
|
- **2026-06-12:** NxM-AI (Kubuntu) migrated to Windows 11 Pro — same IP 172.27.40.20, Ollama via Scheduled Task
|
||||||
|
- **2026-06-12:** Vexis office-install/uninstall scripts live-tested, windows-update scripts done
|
||||||
|
- **2026-06-11:** Workshop20 → Servers40 firewall rules (1677-1680) for TRMM + Vexis access
|
||||||
|
- **2026-06-10:** Frappe Helpdesk live — TRMM→HD sync, Citadel tools, Vexis wired
|
||||||
|
- **2026-06-10:** trmm_confirm_with_user proven working (incl. response-parsing bug fix)
|
||||||
|
- **2026-05-30:** WireGuard fully removed, replaced by OpenVPN S2S
|
||||||
|
- **2026-05-29:** Maester reports Phase 1 live — 8 automated CSF controls, Grafana dashboard
|
||||||
|
- **2026-05-28:** Netbird fully removed from server
|
||||||
|
- **2026-05-28:** ZenArmor → Grafana pipeline all 3 phases complete
|
||||||
|
- **2026-05-27:** Jon Snow Phase 3 complete — approval gate, Discord approve/reject
|
||||||
|
- **2026-04-30:** Agent OS architecture: plain markdown files at /opt/agent-os/, Gitea-tracked, cron-scheduled
|
||||||
|
|||||||
@@ -15,35 +15,84 @@ Reads before executing:
|
|||||||
### Docker health (on 172.27.40.3)
|
### Docker health (on 172.27.40.3)
|
||||||
- All expected containers are running (not exited/restarting)
|
- All expected containers are running (not exited/restarting)
|
||||||
- Flag any container that has restarted more than 3 times in the last hour
|
- Flag any container that has restarted more than 3 times in the last hour
|
||||||
- Expected containers: portainer, nginx-proxy-manager, uptime-kuma, gitea, headscale, netbird, vaultwarden, flowise, plane, zabbix, homarr
|
- Expected containers (grouped by criticality):
|
||||||
|
|
||||||
|
**Critical (alert immediately if down):**
|
||||||
|
- nginx-proxy-manager (reverse proxy — everything depends on this)
|
||||||
|
- gitea (all code + docs)
|
||||||
|
- citadel-mcp (central tool server)
|
||||||
|
- raven-notify (notification hub)
|
||||||
|
- open-webui (chat UI)
|
||||||
|
- vaultwarden (password manager)
|
||||||
|
|
||||||
|
**Important (alert after 15 min down):**
|
||||||
|
- headscale (VPN)
|
||||||
|
- grafana (monitoring dashboards)
|
||||||
|
- influxdb (time-series data)
|
||||||
|
- portainer (Docker management)
|
||||||
|
- uptime-kuma (HTTP monitoring)
|
||||||
|
- maester-reports (CSF compliance)
|
||||||
|
- jon-snow (orchestrator)
|
||||||
|
- tarly-backup (backup monitoring)
|
||||||
|
- directus + directus-db + directus-redis (CRM)
|
||||||
|
|
||||||
|
**Normal (report in daily digest only):**
|
||||||
|
- hodor-gateway, sam-research, qyburn-coder, searxng
|
||||||
|
- homarr, headplane, headscale-ui
|
||||||
|
- plane-* (all Plane containers)
|
||||||
|
- netbox-* (all NetBox containers)
|
||||||
|
- nocodb, bni-scheduler, inventree-*, wetty, term-dash
|
||||||
|
- rustdesk-hbbs, rustdesk-hbbr
|
||||||
|
- iventoy, agent-sites
|
||||||
|
|
||||||
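The three criticality tiers above amount to a small decision rule. This sketch shows one possible encoding; the tier names and the 15-minute grace period come from the spec, while the function name `alert_action` and its return strings are assumptions made for illustration.

```python
# Grace period in minutes before an alert fires, per tier (values from the spec).
GRACE_MINUTES = {"critical": 0, "important": 15}


def alert_action(tier: str, minutes_down: float) -> str:
    """Return 'alert', 'wait', or 'digest' for a container that is currently down."""
    if tier not in ("critical", "important", "normal"):
        raise ValueError(f"unknown tier: {tier}")
    if tier == "normal":
        # Normal-tier containers are only reported in the daily digest.
        return "digest"
    # Critical alerts immediately; important alerts once the grace period has passed.
    return "alert" if minutes_down >= GRACE_MINUTES[tier] else "wait"
```

Keeping the rule in one function means the hourly heartbeat and the daily digest can share it.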
### Service reachability
|
### Service reachability
|
||||||
Lightweight HTTP check (curl, timeout 5s) on each internal URL:
|
Lightweight HTTP check (curl, timeout 5s) on each internal URL:
|
||||||
- http://172.27.40.3:9443 (Portainer)
|
- http://172.27.40.3:9443 (Portainer)
|
||||||
- http://172.27.40.3:3002 (Uptime Kuma)
|
- http://172.27.40.3:3002 (Uptime Kuma)
|
||||||
- http://172.27.40.3:3000 (Gitea)
|
- http://172.27.40.3:3000 (Gitea)
|
||||||
- http://172.27.40.3:3010 (Flowise)
|
- http://172.27.40.3:3010 (Open WebUI)
|
||||||
- http://172.27.40.3:7575 (Homarr)
|
- http://172.27.40.3:7575 (Homarr)
|
||||||
- http://172.27.6.139:11434 (Ollama)
|
- http://172.27.40.3:8300 (Citadel MCP)
|
||||||
|
- http://172.27.40.3:8400 (Raven)
|
||||||
|
- http://172.27.40.3:8800 (Maester)
|
||||||
|
- http://172.27.40.3:8900 (Jon Snow)
|
||||||
|
- http://172.27.40.3:3020 (Grafana)
|
||||||
|
- http://172.27.40.3:8100 (NetBox)
|
||||||
|
- http://172.27.40.3:8850 (Directus)
|
||||||
|
- http://172.27.40.20:11434 (Ollama on NxM-AI)
|
||||||
|
|
||||||
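The reachability check above (a plain HTTP request with a 5 second timeout) could look roughly like this using only the Python standard library instead of curl. Two choices here are assumptions rather than part of the spec: any HTTP response, including 4xx/5xx, counts as "reachable", and proxies are bypassed because the targets are internal addresses. The name `is_reachable` is hypothetical.

```python
import http.client
import urllib.error
import urllib.request

# An opener with an empty ProxyHandler ignores any proxy environment variables.
_opener = urllib.request.build_opener(urllib.request.ProxyHandler({}))


def is_reachable(url: str, timeout: float = 5.0) -> bool:
    """Return True if the server sends any HTTP response within the timeout."""
    try:
        with _opener.open(url, timeout=timeout):
            return True
    except urllib.error.HTTPError:
        # The service answered with 4xx/5xx; treated as reachable (assumption).
        return True
    except (urllib.error.URLError, http.client.HTTPException, OSError):
        # Connection refused, DNS failure, timeout, or a malformed response.
        return False
```

Looping this over the URL list gives one boolean per service, which maps directly onto the OK / Critical items in the digest.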
### Agent watchdog
|
### Agent watchdog
|
||||||
For each skill directory under `../../skills/`:
|
For each agent log at `../../logs/<agent>/last-run.json`:
|
||||||
- Check `last-output.md` modification time — flag if older than expected schedule
|
- Check modification time — flag if older than expected schedule
|
||||||
- Check `../../logs/<skill-name>/` for ERROR entries in last run
|
- Check `status` field — flag if not "success"
|
||||||
- Report: healthy / stale / erroring
|
- Expected agents and max staleness:
|
||||||
|
- bran-changelog: 25 hours (daily)
|
||||||
|
- varys-monitor: 20 minutes (every 15 min)
|
||||||
|
- trmm-frappe-sync: 35 minutes (every 30 min)
|
||||||
|
- tarly-backup: 25 hours (daily)
|
||||||
|
- raven-notify: 25 hours (event-driven, check status only)
|
||||||
|
- citadel-mcp, sam-research, qyburn-coder, jon-snow: check status only (on-demand)
|
||||||
|
|
||||||
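The watchdog rules above (file age against a per-agent limit, plus the `status` field) can be sketched as follows. Only the `status` field and the staleness limits are taken from the spec. The function name, the `healthy` / `stale` / `erroring` labels (borrowed from the previous revision of this skill), and the choice to treat a missing or unparsable file as `erroring` are assumptions.

```python
import json
import time
from pathlib import Path
from typing import Optional

# Maximum staleness in minutes, from the spec; None means "check status only".
MAX_STALENESS = {
    "bran-changelog": 25 * 60,
    "varys-monitor": 20,
    "trmm-frappe-sync": 35,
    "tarly-backup": 25 * 60,
    "citadel-mcp": None,
}


def agent_state(log_file: Path, max_age_minutes: Optional[float],
                now: Optional[float] = None) -> str:
    """Classify an agent as 'healthy', 'stale', or 'erroring' from its last-run.json."""
    now = time.time() if now is None else now
    try:
        age_minutes = (now - log_file.stat().st_mtime) / 60
        status = json.loads(log_file.read_text()).get("status")
    except (OSError, ValueError, AttributeError):
        # Missing file, invalid JSON, or JSON that is not an object.
        return "erroring"
    if status != "success":
        return "erroring"
    if max_age_minutes is not None and age_minutes > max_age_minutes:
        return "stale"
    return "healthy"
```

The `now` parameter exists so the age check can be exercised without waiting for real time to pass.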
### System resources (on 172.27.40.3)
|
### System resources (on 172.27.40.3)
|
||||||
- Disk usage on / — warn if >80%, critical if >90%
|
- Disk usage on / — warn if >80%, critical if >90%
|
||||||
- Memory usage — flag if >85%
|
- Memory usage — flag if >85%
|
||||||
|
- Docker disk usage (`docker system df`) — warn if reclaimable > 10GB
|
||||||
|
|
||||||
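The disk thresholds above translate into a short classifier. This is a sketch under stated assumptions: the function names are hypothetical, and `shutil.disk_usage` computes used/total, which can differ slightly from the percentage `df` prints because of reserved blocks.

```python
import shutil


def disk_level(used_percent: float) -> str:
    """Map a usage percentage to 'ok', 'warning' (>80%), or 'critical' (>90%)."""
    if used_percent > 90:
        return "critical"
    if used_percent > 80:
        return "warning"
    return "ok"


def disk_used_percent(path: str = "/") -> float:
    """Return the percentage of the filesystem containing `path` that is in use."""
    usage = shutil.disk_usage(path)
    return usage.used / usage.total * 100
```

`disk_level(disk_used_percent("/"))` then yields the level to report for the root filesystem.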
|
### Remote hosts (optional, best-effort)
|
||||||
|
- Ping 172.27.40.20 (Ollama host)
|
||||||
|
- Ping 172.27.40.30 (Hermes Native VM)
|
||||||
|
- Ping 172.27.40.2 (Proxmox)
|
||||||
|
|
||||||
## Output
|
## Output
|
||||||
|
|
||||||
Write a digest to `last-output.md` in this format:
|
Write a digest to `last-output.md` in this format:
|
||||||
- Summary line: X healthy, Y warnings, Z critical
|
- Summary line: X healthy, Y warnings, Z critical
|
||||||
- Section per category: Docker, Services, Agent Watchdog, System
|
- Section per category: Docker, Services, Agent Watchdog, System, Remote Hosts
|
||||||
- Each item: ✓ OK / ⚠ Warning / ✗ Critical + one line detail
|
- Each item: ✓ OK / ⚠ Warning / ✗ Critical + one line detail
|
||||||
|
|
||||||
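The digest format above (a summary line, one section per category, one status line per item) could be rendered like this. The input shape, the function name `render_digest`, and the use of `##` headings for sections are assumptions; the spec fixes only the summary wording and the three status markers.

```python
from typing import Dict, List, Tuple

SYMBOLS = {"ok": "✓ OK", "warning": "⚠ Warning", "critical": "✗ Critical"}


def render_digest(results: Dict[str, List[Tuple[str, str, str]]]) -> str:
    """Render {category: [(name, level, detail), ...]} as a markdown digest."""
    counts = {"ok": 0, "warning": 0, "critical": 0}
    for items in results.values():
        for _name, level, _detail in items:
            counts[level] += 1
    lines = [
        f"{counts['ok']} healthy, {counts['warning']} warnings, "
        f"{counts['critical']} critical",
        "",
    ]
    for category, items in results.items():
        lines.append(f"## {category}")
        for name, level, detail in items:
            lines.append(f"- {SYMBOLS[level]} {name}: {detail}")
        lines.append("")
    return "\n".join(lines)
```

The same `results` structure could be dumped as JSON for the machine-readable `last-run.json` output.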
Pass anomalies to `context/handoff.md` for notification skill (future).
|
Also write machine-readable output to `../../logs/infra-monitor/last-run.json`.
|
||||||
|
|
||||||
|
Pass anomalies to `context/handoff.md` for Raven notification.
|
||||||
|
|
||||||
## Wrap-up
|
## Wrap-up
|
||||||
|
|
||||||
@@ -54,5 +103,5 @@ After writing output:
|
|||||||
|
|
||||||
## Schedule
|
## Schedule
|
||||||
|
|
||||||
- **Heartbeat:** every hour — checks Docker + Ollama only (fast, <30s)
|
- **Heartbeat:** every hour — checks Docker + Ollama + critical services only (fast, <30s)
|
||||||
- **Full digest:** daily at 07:00 — all checks
|
- **Full digest:** daily at 07:00 — all checks including remote hosts and disk usage
|
||||||
|
|||||||