Initial Agent OS scaffolding — identity, brain, memory, infra-monitor skill

2026-04-30 13:40:45 +02:00
commit e8bd571a77
15 changed files with 283 additions and 0 deletions
@@ -0,0 +1,5 @@
+# Handoff: infra-monitor → notification
+
+Populated by infra-monitor when anomalies are found. Read by the notification skill (future).
+
+*Empty — no anomalies from last run, or skill has not run yet.*
@@ -0,0 +1,8 @@
+{
+  "criteria": [
+    { "name": "all_services_checked", "weight": 0.3, "description": "Every expected container and service was checked, none skipped" },
+    { "name": "clear_status_summary", "weight": 0.3, "description": "Output leads with a plain-English summary line before detail" },
+    { "name": "actionable_findings", "weight": 0.2, "description": "Any warning or critical item includes enough detail to act on immediately" },
+    { "name": "agent_watchdog_complete", "weight": 0.2, "description": "All skills in /skills/ were checked for staleness and errors" }
+  ]
+}
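The weights in the criteria above sum to 1.0, which suggests a weighted-average score per run. How this file is actually consumed is not shown in this commit, so the following is only a minimal sketch: it assumes each criterion is rated between 0 and 1, and `weighted_score` and the `ratings` mapping are hypothetical names, not part of the scaffolding.

```python
import json

# Names and weights copied from the criteria hunk above; descriptions omitted.
CRITERIA_JSON = """
{
  "criteria": [
    { "name": "all_services_checked", "weight": 0.3 },
    { "name": "clear_status_summary", "weight": 0.3 },
    { "name": "actionable_findings", "weight": 0.2 },
    { "name": "agent_watchdog_complete", "weight": 0.2 }
  ]
}
"""


def weighted_score(criteria: list, ratings: dict) -> float:
    """Combine per-criterion ratings (assumed to be in 0..1) into one score."""
    total_weight = sum(c["weight"] for c in criteria)
    if abs(total_weight - 1.0) > 1e-9:
        raise ValueError(f"weights sum to {total_weight}, expected 1.0")
    missing = [c["name"] for c in criteria if c["name"] not in ratings]
    if missing:
        raise ValueError(f"no rating for: {', '.join(missing)}")
    return sum(c["weight"] * ratings[c["name"]] for c in criteria)


criteria = json.loads(CRITERIA_JSON)["criteria"]

# Hypothetical ratings for one run.
example_ratings = {
    "all_services_checked": 1.0,
    "clear_status_summary": 0.5,
    "actionable_findings": 1.0,
    "agent_watchdog_complete": 0.0,
}
example_score = weighted_score(criteria, example_ratings)
```

Checking the weight sum up front means an edit that adds a criterion without rebalancing the others fails loudly instead of skewing the score.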
@@ -0,0 +1,3 @@
+# Last Output: infra-monitor
+
+*Not yet populated — skill has not run.*
@@ -0,0 +1,12 @@
+# Learnings: infra-monitor
+
+Updated automatically after each run. The skill reads this before executing to improve its next output.
+
+## What has worked well
+*Not yet populated — skill has not run.*
+
+## What missed the mark
+*Not yet populated — skill has not run.*
+
+## Adjustments for next run
+*Not yet populated — skill has not run.*
@@ -0,0 +1,58 @@
+# Skill: infra-monitor
+
+Monitors server health and watches all Agent OS skills for staleness or errors. Runs on a cron schedule on 172.27.40.3.
+
+## Inputs
+
+Reads before executing:
+- `../../identity.md`
+- `../../brain.md`
+- `../../memory/persistent.md`
+- `learnings.md` (this skill's improvement notes)
+
+## What to check
+
+### Docker health (on 172.27.40.3)
+- All expected containers are running (not exited/restarting)
+- Flag any container that has restarted more than 3 times in the last hour
+- Expected containers: portainer, nginx-proxy-manager, uptime-kuma, gitea, headscale, netbird, vaultwarden, flowise, plane, zabbix, homarr
+
+### Service reachability
+Lightweight HTTP check (curl, timeout 5s) on each internal URL:
+- https://172.27.40.3:9443 (Portainer; 9443 is its HTTPS port, so use `curl -k` if the certificate is self-signed)
+- http://172.27.40.3:3002 (Uptime Kuma)
+- http://172.27.40.3:3000 (Gitea)
+- http://172.27.40.3:3010 (Flowise)
+- http://172.27.40.3:7575 (Homarr)
+- http://172.27.6.139:11434 (Ollama)
+
+### Agent watchdog
+For each skill directory under `../../skills/`:
+- Check `last-output.md` modification time and flag the skill as stale if the file is older than that skill's expected run interval
+- Check `../../logs/<skill-name>.log` for ERROR entries from the last run
+- Report: healthy / stale / erroring
+
+### System resources (on 172.27.40.3)
+- Disk usage on / — warn if >80%, critical if >90%
+- Memory usage — flag if >85%
+
+## Output
+
+Write a digest to `last-output.md` in this format:
+- Summary line: X healthy, Y warnings, Z critical
+- Section per category: Docker, Services, Agent Watchdog, System
+- Each item: ✓ OK / ⚠ Warning / ✗ Critical, plus a one-line detail
+
+Write any anomalies to `context/handoff.md` for the future notification skill to read.
+
+## Wrap-up
+
+After writing output:
+1. Update `learnings.md` with anything that went wrong or could be improved
+2. Append a one-line log entry to `../../logs/infra-monitor.log`: `YYYY-MM-DD HH:MM | status | summary`
+3. Update `../../memory/notes-from-last-run.md`
+
+## Schedule
+
+- **Heartbeat:** every hour — checks Docker + Ollama only (fast, <30s)
+- **Full digest:** daily at 07:00 — all checks
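The disk thresholds under "System resources" and the digest layout under "Output" could be sketched roughly as follows. This is an illustration, not the skill's implementation: the `Finding` type and the `classify_disk`, `disk_finding`, and `render_digest` helpers are hypothetical names, and the `## ` category headings and `name, detail` item layout are assumptions about how the "section per category" would be rendered.

```python
import shutil
from dataclasses import dataclass

# Status labels taken from the digest format in the skill file.
SYMBOLS = {"ok": "✓ OK", "warning": "⚠ Warning", "critical": "✗ Critical"}
CATEGORIES = ("Docker", "Services", "Agent Watchdog", "System")


@dataclass
class Finding:
    category: str  # one of CATEGORIES
    name: str
    level: str     # "ok", "warning", or "critical"
    detail: str


def classify_disk(percent_used: float) -> str:
    """Apply the skill file's disk thresholds: warn above 80%, critical above 90%."""
    if percent_used > 90:
        return "critical"
    if percent_used > 80:
        return "warning"
    return "ok"


def disk_finding(path: str = "/") -> Finding:
    """Measure disk usage at `path` and wrap it in a Finding."""
    usage = shutil.disk_usage(path)
    percent = usage.used / usage.total * 100
    return Finding("System", f"disk {path}", classify_disk(percent), f"{percent:.0f}% used")


def render_digest(findings: list) -> str:
    """Render the summary line first, then one section per non-empty category."""
    counts = {level: sum(f.level == level for f in findings) for level in SYMBOLS}
    lines = [
        f"{counts['ok']} healthy, {counts['warning']} warnings, {counts['critical']} critical"
    ]
    for category in CATEGORIES:
        items = [f for f in findings if f.category == category]
        if not items:
            continue
        lines.append("")
        lines.append(f"## {category}")
        for f in items:
            lines.append(f"- {SYMBOLS[f.level]}: {f.name}, {f.detail}")
    return "\n".join(lines)
```

A call such as `render_digest([disk_finding("/")])` would produce a one-item digest for the local machine. Keeping classification separate from rendering lets the hourly heartbeat and the daily full digest share the same output code while collecting different findings.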