Initial Agent OS scaffolding — identity, brain, memory, infra-monitor skill

2026-04-30 13:40:45 +02:00
commit e8bd571a77
15 changed files with 283 additions and 0 deletions
@@ -0,0 +1,41 @@
+# NxM Agent OS
+
+A personal agentic operating system built on plain markdown files. Tool-agnostic — works with Claude Code, Ollama, or any LLM harness. Based on the framework from the AI Daily Brief episode "How to Build a Personal Agentic Operating System" (Nufar Gaspar, 2026-04-25).
+
+## How it works
+
+Every agent interaction reads from and writes back to files in this repo. No databases, no APIs, no vendor lock-in. The files ARE the system.
+
+## The seven layers
+
+| Layer | File(s) | Purpose |
+|---|---|---|
+| Identity | `identity.md` | Who you are, communication style, values |
+| Context | `context/` | Dated, task-specific working files |
+| Brain | `brain.md` | Persistent facts — infra, people, decisions |
+| Memory | `memory/` | Short- and long-term session notes |
+| Skills | `skills/` | Repeatable workflows, each self-improving |
+| Processes | `skills/*/context/handoff.md` | Output passed between chained skills |
+| Automation | cron on 172.27.40.3 | Scheduled skill execution |
+
+## Adding a new skill
+
+1. Create `skills/<skill-name>/skill.md` — what the skill does and how
+2. Create `skills/<skill-name>/learnings.md` — starts empty, fills over time
+3. Create `skills/<skill-name>/eval.json` — scoring criteria
+4. Add a cron job on 172.27.40.3 calling the skill
+5. The infra-monitor watchdog picks up the new skill automatically, because it scans every directory under `skills/`
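Steps 1 to 3 of this list could be scripted. The following is a minimal sketch, not part of this commit: the helper name `scaffold_skill` and the placeholder file contents are assumptions, and the cron job from step 4 still has to be added by hand.

```python
from pathlib import Path


def scaffold_skill(repo_root: Path, name: str) -> Path:
    """Create skills/<name>/ with the three files named in steps 1-3.

    Hypothetical helper; the placeholder contents are illustrative only.
    """
    skill_dir = repo_root / "skills" / name
    # exist_ok=False: refuse to overwrite a skill that already exists
    skill_dir.mkdir(parents=True, exist_ok=False)
    (skill_dir / "skill.md").write_text(f"# Skill: {name}\n", encoding="utf-8")
    # learnings.md starts empty and fills over time
    (skill_dir / "learnings.md").write_text("", encoding="utf-8")
    # eval.json starts with an empty criteria list to be filled in by hand
    (skill_dir / "eval.json").write_text('{\n  "criteria": []\n}\n', encoding="utf-8")
    return skill_dir
```

Calling `scaffold_skill(Path("/opt/agent-os"), "my-skill")` would create the directory and raise `FileExistsError` if the skill is already there, which keeps the helper from clobbering existing learnings.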
+
+## Runtime
+
+- Files live on server: `/opt/agent-os/` (cloned from this repo)
+- LLM inference: Ollama at `http://172.27.6.139:11434`
+- Scheduled jobs: cron on `172.27.40.3`
+- Local editing: `/home/nxm/Documents/agent-os/` on Kubuntu (this machine)
+
+## Infra reference
+
+Cross-repo links to supporting documentation:
+- [IP & Port Map](https://git.nxm.co.za/admin/nxm-infrastructure/src/branch/main/Quick%20Reference/IP%20%26%20Port%20Map.md)
+- [Docker Stacks](https://git.nxm.co.za/admin/nxm-infrastructure/src/branch/main/Quick%20Reference/Docker%20Stacks.md)
+- [Network Overview](https://git.nxm.co.za/admin/nxm-infrastructure/src/branch/main/Infrastructure/Network%20Overview.md)
@@ -0,0 +1,64 @@
+# Brain
+
+Core facts read by all skills. Keep under 1000 words. Update when infrastructure changes.
+Last updated: 2026-04-30
+
+---
+
+## Infrastructure
+
+**Primary server:** 172.27.40.3 — Ubuntu Server LTS, Docker host
+**Kubuntu desktop:** 172.27.6.139 — NxM-AI, runs Ollama
+**TrueNAS NAS:** 172.27.40.5
+**Firewall:** OPNsense at 172.27.6.1
+
+**VLANs:**
+| VLAN | Name | Subnet |
+|---|---|---|
+| 40 | Servers40 | 172.27.40.0/24 |
+| 20 | Workshop20 | 172.27.20.0/24 |
+| 10 | IoT10 | 172.27.10.0/24 |
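A quick way to sanity-check addresses against this VLAN table is a subnet lookup. This is an illustrative sketch using the standard-library `ipaddress` module; the function name `vlan_for` is an assumption, and the subnets are copied from the table as written.

```python
import ipaddress

# Subnets copied from the VLAN table above
VLANS = {
    "Servers40": ipaddress.ip_network("172.27.40.0/24"),
    "Workshop20": ipaddress.ip_network("172.27.20.0/24"),
    "IoT10": ipaddress.ip_network("172.27.10.0/24"),
}


def vlan_for(host: str):
    """Return the VLAN name whose subnet contains host, or None if no row matches."""
    addr = ipaddress.ip_address(host)
    for name, network in VLANS.items():
        if addr in network:
            return name
    return None
```

With these rows, `vlan_for("172.27.40.3")` resolves to `Servers40`, while `vlan_for("172.27.6.139")` and `vlan_for("172.27.6.1")` return `None`. That may mean the table is missing a row for the 172.27.6.x network, which is worth confirming.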
+
+## Key Services (172.27.40.3)
+
+| Service | Port | URL |
+|---|---|---|
+| Portainer | 9443 | https://172.27.40.3:9443 |
+| Nginx Proxy Manager | 80/81/443 | http://172.27.40.3:81 |
+| Uptime Kuma | 3002 | http://172.27.40.3:3002 |
+| Gitea | 3000 | https://git.nxm.co.za |
+| Headscale | 8080 | https://headscale.nxm.co.za |
+| Netbird | 3479/udp | https://netbird.nxm.co.za |
+| Vaultwarden | 8222 | https://vault.nxm.co.za |
+| Flowise | 3010 | http://172.27.40.3:3010 |
+| Plane | 8095 | https://plane.nxm.co.za |
+| Zabbix | 8091 | https://zabbix.nxm.co.za |
+| Homarr | 7575 | http://172.27.40.3:7575 |
+
+## AI Stack
+
+- **Ollama** on 172.27.6.139:11434 (bound to 0.0.0.0)
+- **Models:** gemma4, qwen2.5-coder:7b
+- **Flowise** on 172.27.40.3:3010 — visual agent/flow builder
+- **Claude Code** — primary AI assistant, runs on Kubuntu
+
+## Agent OS Runtime
+
+- Files: `/opt/agent-os/` on 172.27.40.3
+- Local edit path: `/home/nxm/Documents/agent-os/` on 172.27.6.139
+- Repo: `https://git.nxm.co.za/admin/agent-os`
+- Scheduled jobs: cron on 172.27.40.3
+- LLM calls: `http://172.27.6.139:11434`
+
+## Key Paths on Server
+
+- Docker stacks: `/opt/stacks/`
+- Agent OS: `/opt/agent-os/`
+
+## Standing Decisions
+
+- TrueNAS will move to a dedicated server — avoid hardcoding 172.27.40.5 in automation
+- NPM handles all SSL termination — internal services use HTTP, NPM adds HTTPS
+- NFS preferred for Linux-to-Linux file sharing
+- Docker Compose only (no Kubernetes)
+- All destructive actions require explicit confirmation before execution
@@ -0,0 +1,19 @@
+# Identity
+
+> **Status: PENDING** — To be completed via Claude interview session.
+> Run the interview by saying: "Let's complete the Agent OS identity interview."
+
+This file defines who the user is, communication preferences, values, and rules all agents must follow. Every skill reads this file before executing.
+
+## What the interview will capture
+
+- Professional role and responsibilities
+- Communication style preferences
+- Core values and priorities
+- Things agents should never do
+- How decisions should be escalated vs handled autonomously
+- Preferred output formats
+
+---
+
+*This section will be replaced with the completed identity profile after the interview.*
@@ -0,0 +1,23 @@
+# Active Projects
+
+Current in-flight work. Update at the end of each session.
+Last updated: 2026-04-30
+
+---
+
+## Agent OS — Phase 1 (NEXT)
+Complete the foundation before building skills.
+- [ ] Set up NFS export on 172.27.40.3 + mount on Kubuntu at /mnt/agent-os
+- [ ] Run identity interview with Claude → populate identity.md
+- [ ] Review the seeded brain.md and confirm accuracy
+- [ ] Clone this repo to /opt/agent-os/ on server
+
+## Agent OS — Phase 3 (PENDING Phase 1)
+- [ ] Build infra-monitor skill
+- [ ] Set up cron schedule (hourly heartbeat, daily digest)
+- [ ] Wire up Home Assistant notifications
+
+## Gitea documentation
+- [x] nxm-infrastructure repo — Obsidian vault imported
+- [x] nexum-projects repo — Obsidian vault imported
+- [x] agent-os repo — scaffolding created
@@ -0,0 +1,13 @@
+# Constraints
+
+Hard limits agents must respect. Never work around these without explicit user confirmation.
+Last updated: 2026-04-30
+
+---
+
+- Never take destructive or irreversible action without explicit confirmation (delete, overwrite, drop, reset, force push)
+- Never store credentials in output files, logs, or generated markdown — reference their location instead
+- Never skip git hooks or bypass signing
+- TrueNAS (172.27.40.5) is being migrated to a new server — do not create hard dependencies on that IP
+- Linux server (172.27.40.3) has no GPU — never schedule LLM inference to run locally there
+- Docker Compose only — no Kubernetes, no Swarm
@@ -0,0 +1,8 @@
+# Notes from Last Run
+
+Populated automatically at the end of each skill run. Cleared at the start of each new session.
+Last updated: —
+
+---
+
+*No runs yet — Agent OS not yet deployed.*
@@ -0,0 +1,18 @@
+# Persistent Memory
+
+Facts that don't expire. If you'd have to re-explain it to a new agent every time, it belongs here.
+Last updated: 2026-04-30
+
+---
+
+## Infrastructure decisions
+- RustDesk is self-hosted on 172.27.40.3 — clients connect to local server not public relay
+- Netbird client management is on port 8443 via Caddy sidecar, NOT port 443
+- Headscale v0.28: all write operations require numeric user ID, not username
+- Tailscale on Windows overrides DNS — disconnect before testing split DNS changes
+- Servers running Tailscale must run `sudo tailscale set --accept-dns=false` before joining Netbird
+
+## Agent OS build state
+- Phase 1-2 (file structure + NFS + identity interview): not yet started
+- First skill to build: infra-monitor (Docker health + agent watchdog)
+- Notifications target: Home Assistant at 172.27.10.6
@@ -0,0 +1,11 @@
+# Recent Decisions
+
+Decisions made in the last 30 days that affect current work. Archive when no longer relevant.
+Last updated: 2026-04-30
+
+---
+
+- **2026-04-30:** Chose Gitea (self-hosted git) over Obsidian for documentation — AI-writable, browser-accessible, version controlled
+- **2026-04-30:** Agent OS files to live on 172.27.40.3 at /opt/agent-os/, accessed from Kubuntu via NFS
+- **2026-04-29:** Chose Syncthing-free approach for Obsidian migration — NFS for Linux, SMB for Windows
+- **2026-04-29:** infra-monitor will be first Agent OS skill — covers Docker health and agent watchdog in one skill
@@ -0,0 +1,5 @@
+# Handoff: infra-monitor → notification
+
+Populated by infra-monitor when anomalies are found. Read by the notification skill (future).
+
+*Empty — no anomalies from last run, or skill has not run yet.*
@@ -0,0 +1,8 @@
+{
+  "criteria": [
+    { "name": "all_services_checked", "weight": 0.3, "description": "Every expected container and service was checked, none skipped" },
+    { "name": "clear_status_summary", "weight": 0.3, "description": "Output leads with a plain-English summary line before detail" },
+    { "name": "actionable_findings", "weight": 0.2, "description": "Any warning or critical item includes enough detail to act on immediately" },
+    { "name": "agent_watchdog_complete", "weight": 0.2, "description": "All skills in /skills/ were checked for staleness and errors" }
+  ]
+}
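The weights in this file sum to 1.0, which suggests a weighted average. One possible way to turn per-criterion scores into a single number is sketched here; it assumes each criterion is scored on a 0 to 1 scale, which the file does not specify, and the function name `weighted_score` is hypothetical.

```python
import math

# Names and weights copied from eval.json; descriptions omitted for brevity
CRITERIA = [
    {"name": "all_services_checked", "weight": 0.3},
    {"name": "clear_status_summary", "weight": 0.3},
    {"name": "actionable_findings", "weight": 0.2},
    {"name": "agent_watchdog_complete", "weight": 0.2},
]


def weighted_score(criteria, scores):
    """Combine per-criterion scores (assumed 0..1) into one weighted total."""
    total_weight = sum(c["weight"] for c in criteria)
    if not math.isclose(total_weight, 1.0):
        raise ValueError(f"weights sum to {total_weight}, expected 1.0")
    return sum(c["weight"] * scores[c["name"]] for c in criteria)
```

Checking that the weights sum to 1.0 catches the common mistake of adding a criterion without rebalancing the others.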
@@ -0,0 +1,3 @@
+# Last Output: infra-monitor
+
+*Not yet populated — skill has not run.*
@@ -0,0 +1,12 @@
+# Learnings: infra-monitor
+
+Updated automatically after each run. The skill reads this before executing to improve its next output.
+
+## What has worked well
+*Not yet populated — skill has not run.*
+
+## What missed the mark
+*Not yet populated — skill has not run.*
+
+## Adjustments for next run
+*Not yet populated — skill has not run.*
@@ -0,0 +1,58 @@
+# Skill: infra-monitor
+
+Monitors server health and watches all Agent OS skills for staleness or errors. Runs on a cron schedule on 172.27.40.3.
+
+## Inputs
+
+Reads before executing:
+- `../../identity.md`
+- `../../brain.md`
+- `../../memory/persistent.md`
+- `learnings.md` (this skill's improvement notes)
+
+## What to check
+
+### Docker health (on 172.27.40.3)
+- All expected containers are running (not exited/restarting)
+- Flag any container that has restarted more than 3 times in the last hour
+- Expected containers: portainer, nginx-proxy-manager, uptime-kuma, gitea, headscale, netbird, vaultwarden, flowise, plane, zabbix, homarr
+
+### Service reachability
+Lightweight HTTP check (curl, timeout 5s) on each internal URL:
+- https://172.27.40.3:9443 (Portainer)
+- http://172.27.40.3:3002 (Uptime Kuma)
+- http://172.27.40.3:3000 (Gitea)
+- http://172.27.40.3:3010 (Flowise)
+- http://172.27.40.3:7575 (Homarr)
+- http://172.27.6.139:11434 (Ollama)
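The skill describes this check in terms of curl. An equivalent probe in Python might look like the sketch here, which uses only the standard library; the function name `check_url` and the injectable `opener` parameter (there to make the probe testable without a network) are assumptions.

```python
import urllib.error
import urllib.request


def check_url(url, timeout=5.0, opener=urllib.request.urlopen):
    """Probe one URL and return (reachable, detail).

    Any HTTP response, including an error status, counts as reachable,
    because the service answered. Only connection failures and timeouts
    count as unreachable.
    """
    try:
        with opener(url, timeout=timeout) as response:
            return True, f"HTTP {response.status}"
    except urllib.error.HTTPError as exc:
        # The server responded with 4xx/5xx, so it is up
        return True, f"HTTP {exc.code}"
    except (urllib.error.URLError, OSError) as exc:
        # DNS failure, refused connection, TLS failure, or timeout
        return False, str(exc)
```

Treating a 4xx or 5xx response as reachable is a design choice; a stricter check could flag those as warnings. An HTTPS endpoint with a self-signed certificate would fail verification here unless an explicit SSL context is supplied.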
+
+### Agent watchdog
+For each skill directory under `../../skills/`:
+- Check the modification time of `last-output.md`; flag the skill as stale if the file is older than the skill's expected run interval
+- Check `../../logs/<skill-name>.log` for ERROR entries from the last run
+- Report: healthy / stale / erroring
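The staleness half of this watchdog can be sketched with file modification times. This is an illustration under assumptions: the names `skill_status` and `watchdog` are hypothetical, a single `max_age_seconds` is applied to every skill, and the ERROR scan of the logs is left out.

```python
import time
from pathlib import Path


def skill_status(skill_dir: Path, max_age_seconds: float, now=None) -> str:
    """Return 'healthy' or 'stale' based on the mtime of last-output.md."""
    now = time.time() if now is None else now
    output = skill_dir / "last-output.md"
    if not output.exists():
        # A skill that has never written output is treated as stale
        return "stale"
    age = now - output.stat().st_mtime
    return "stale" if age > max_age_seconds else "healthy"


def watchdog(skills_root: Path, max_age_seconds: float, now=None) -> dict:
    """Map each skill directory name under skills_root to its status."""
    return {
        entry.name: skill_status(entry, max_age_seconds, now)
        for entry in sorted(skills_root.iterdir())
        if entry.is_dir()
    }
```

In practice each skill would probably need its own threshold, for example a little over one hour for an hourly job and a little over one day for a daily one.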
+
+### System resources (on 172.27.40.3)
+- Disk usage on / — warn if >80%, critical if >90%
+- Memory usage — flag if >85%
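The disk thresholds map onto a small classifier. This sketch uses `shutil.disk_usage` from the standard library; the function names are assumptions, and the memory check is not covered.

```python
import shutil


def classify_disk(percent_used: float) -> str:
    """Apply the thresholds above: >90% critical, >80% warning, otherwise ok."""
    if percent_used > 90:
        return "critical"
    if percent_used > 80:
        return "warning"
    return "ok"


def disk_percent_used(path: str = "/") -> float:
    """Percentage of the filesystem containing path that is in use."""
    usage = shutil.disk_usage(path)
    return usage.used / usage.total * 100
```

`classify_disk(disk_percent_used("/"))` gives the status for the root filesystem. The percentage can differ slightly from what `df` prints, because `df` excludes blocks reserved for root from its calculation.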
+
+## Output
+
+Write a digest to `last-output.md` in this format:
+- Summary line: X healthy, Y warnings, Z critical
+- Section per category: Docker, Services, Agent Watchdog, System
+- Each item: ✓ OK / ⚠ Warning / ✗ Critical + one line detail
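To make this format concrete, a renderer could look like the following sketch. The input shape (a dict of category to `(item, level, detail)` tuples), the `##` section headings, and the name `render_digest` are all assumptions; only the summary line and the status symbols come from the spec.

```python
# Status labels taken from the output format above
SYMBOLS = {"ok": "✓ OK", "warning": "⚠ Warning", "critical": "✗ Critical"}


def render_digest(results: dict) -> str:
    """Render check results as a summary line followed by one section per category.

    results maps a category name to a list of (item, level, detail) tuples,
    where level is 'ok', 'warning', or 'critical'.
    """
    levels = [level for items in results.values() for _, level, _ in items]
    summary = (
        f"{levels.count('ok')} healthy, "
        f"{levels.count('warning')} warnings, "
        f"{levels.count('critical')} critical"
    )
    lines = [summary, ""]
    for category, items in results.items():
        lines.append(f"## {category}")
        for item, level, detail in items:
            lines.append(f"- {SYMBOLS[level]} {item}: {detail}")
        lines.append("")
    return "\n".join(lines)
```

Putting the summary on the first line matches the `clear_status_summary` criterion in `eval.json`, which asks for a plain-English summary before any detail.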
+
+Pass anomalies to `context/handoff.md` for notification skill (future).
+
+## Wrap-up
+
+After writing output:
+1. Update `learnings.md` with anything that went wrong or could be improved
+2. Append a one-line log entry to `../../logs/infra-monitor.log`: `YYYY-MM-DD HH:MM | status | summary`
+3. Update `../../memory/notes-from-last-run.md`
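The log entry format from step 2 can be produced with a format string. A minimal sketch, with hypothetical function names and the log path supplied by the caller:

```python
from datetime import datetime
from pathlib import Path


def log_entry(status: str, summary: str, when: datetime) -> str:
    """Format one line as 'YYYY-MM-DD HH:MM | status | summary'."""
    return f"{when:%Y-%m-%d %H:%M} | {status} | {summary}"


def append_log(log_path: Path, status: str, summary: str, when: datetime) -> None:
    """Append a single entry to the log file, creating the file if needed."""
    with log_path.open("a", encoding="utf-8") as handle:
        handle.write(log_entry(status, summary, when) + "\n")
```

Opening the file in append mode keeps earlier entries intact, so the log accumulates one line per run.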
+
+## Schedule
+
+- **Heartbeat:** every hour — checks Docker + Ollama only (fast, <30s)
+- **Full digest:** daily at 07:00 — all checks