Render node fleet management.
Release-line note: v2.0.30 exists on releases.experiencenet.com but not in git. It was a phantom build accidentally produced from a misplaced v2.0.30 tag (intended for hydraheadflatscreen) that was deleted seconds later, after CI had already cut a binary. The v2.0.30 binary is byte-identical to v2.0.26. v2.0.31 is the next clean release-line version on master and supersedes both.
| Resource | Value |
|---|---|
| Server | hydracluster.experiencenet.com (46.224.29.125) |
| Config | /root/.hydracluster/config.yaml |
| Data | /root/.hydracluster/nodes.yaml |
| Service | systemctl status hydracluster |
| Logs | journalctl -u hydracluster -f |
HydraCluster manages render node enrollment, provisioning, and monitoring. Render nodes are Windows machines running LarkXR Standalone (cloud rendering). They are enrolled via a web UI and managed through the node agent (hydranode).
/enroll, admin assigns roles and venues. The form accepts optional ?name= and ?ip= URL params (pre-filled by HydraNeck's Enrol button for admin-initiated remote enrolment; the submitted ip field overrides the request's remote address so the correct device IP is stored).

ssh root@hydracluster.experiencenet.com
# or
ssh root@46.224.29.125
curl -s https://hydracluster.experiencenet.com/api/v1/health
ssh root@46.224.29.125
systemctl status hydracluster
journalctl -u hydracluster --since '10 min ago' --no-pager
systemctl restart hydracluster

If POST endpoints (e.g. /api/v1/nodes/{id}/exec) start hanging with 0 bytes received while GETs still respond, and the journal shows lots of GETs but no recent POSTs to the slow endpoint, suspect an auth-middleware self-deadlock.
This was the v2.0.32 → v2.0.33 regression: requireAdminOrNodeToken used defer s.mu.Unlock() and then called next(). The downstream heads handlers (handleListHeads et al.) also acquired s.mu → non-reentrant mutex → goroutine deadlocked on its own lock → mutex pinned → every other lock acquirer queued forever. hydranode's network-recovery routine on the same box then misread the wedged API as a network failure and rebooted the cluster.
Rule for any new server middleware:
- Taking s.mu.Lock() briefly inside a middleware to look something up is fine.
- Release s.mu BEFORE calling next(). Do NOT hold it via defer across a handler invocation. Downstream handlers do their own locking.

requireNodeToken survives this rule only because every handler it wraps (handleBodyHeartbeat, handleBodyShellCheck, etc.) reads node state via the nodeContext value rather than re-locking s.mu. Don't break that invariant.

If you suspect this is happening live, the fingerprint in journalctl -u hydracluster is many GET /api/v1/body/shell/check lines but no recent matching POST log line for the slow request: the POST is queued waiting for the lock and never reaches its log statement.
HydraCluster supports two exec modes for running commands on body machines.
Commands are sent as JSON via the API, executed directly by the node agent (PowerShell on Windows, bash on Linux), with clean stdout/stderr separation and proper exit codes. No PTY involved.
# Basic command
hydracluster exec <nodeId> "Get-Process | Select-Object -First 5"
# With timeout
hydracluster exec <nodeId> "choco list --local-only" --timeout 60s
# From a local file (avoids all shell escaping)
hydracluster exec <nodeId> --file ./setup.ps1
# JSON output (for scripting)
hydracluster exec <nodeId> "hostname" --json
The command is queued on the server and picked up by the node agent on its next heartbeat (up to 30 seconds). The CLI polls for the result until it arrives or the timeout expires.
hydracluster exec --shell <nodeId> "top -bn1 | head -5" --timeout 10s
These are two distinct transports. Understanding the difference matters when the node is slow to respond or the queue is backed up.
| | Async exec queue (default) | WebSocket shell (--shell) |
|---|---|---|
| Transport | HTTP poll: hydranode fetches commands on its tick (up to 30 s delay) | WebSocket PTY: direct connection, responds immediately |
| Queue | Commands pile up in-memory on hydracluster; processed FIFO | No queue — each --shell call opens a fresh PTY session |
| Use when | Scripting, JSON output, fire-and-forget, parallel commands | Queue is backed up; need immediate response; binary data (base64) |
| Failure mode | Times out if queue is long or hydranode stopped polling | Fails fast if WebSocket connection can't be established |
When the async queue backs up: this happens when many exec requests accumulate (e.g. after a debugging session with many rapid execs, or after a node reboot where hydranode was offline and the queue grew). The node heartbeat still arrives (node shows online) but exec results stay pending indefinitely. Switch to --shell until the queue drains.
Detecting a backed-up queue: send a simple echo with a short timeout. If it stays {"status":"pending"} for more than 30 s, the queue is blocked.
The queue is in-memory on hydracluster — it does not survive a hydracluster restart. A server restart is the nuclear option to clear a stuck queue.
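The probe for a backed-up queue described above can be scripted. The following is a minimal sketch, not part of the hydracluster CLI: `classify_exec_probe` and `probe_exec_queue` are hypothetical helper names, `HC_BIN` is assumed to point at the hydracluster binary, and the sketch assumes that a timed-out `--json` exec prints a JSON object whose `status` is `pending`, as in the example above.

```shell
# Classify the JSON printed by a probe exec (hypothetical helper).
# Prints "blocked" when the result is still pending, "unknown" when there is
# no output at all, and "ok" otherwise.
classify_exec_probe() {
  if [ -z "$1" ]; then
    echo "unknown"
  elif printf '%s' "$1" | grep -Eq '"status"[[:space:]]*:[[:space:]]*"pending"'; then
    echo "blocked"
  else
    echo "ok"
  fi
}

# Send a trivial echo through the async queue and classify the outcome.
# Usage: probe_exec_queue <nodeId>
probe_exec_queue() {
  classify_exec_probe "$("${HC_BIN:-hydracluster}" exec "$1" "echo ok" --json --timeout 30s 2>/dev/null)"
}
```

If `probe_exec_queue <nodeId>` prints `blocked`, switch to `--shell` for that node until the queue drains.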
Exec is for ad-hoc, one-off operations. When a workflow is used regularly by operators (stream control, status checks, etc.), the receiver binary must expose a proper named Cobra subcommand for it. Exec then invokes that named command — no curl+JSON quoting, no piped commands, no inline shell scripting.
Bad (fragile):
hydracluster exec node-X "curl -s -X POST -H 'Content-Type: application/json' -d '{\"experience\":\"mercator-talks\"}' http://127.0.0.1:9740/api/v1/stream/start"
Good (named command on the receiver):
hydracluster exec node-X "hydraheadflatscreen stream-start mercator-talks"
A named command on the binary is testable, readable, and unambiguous. If an exec command starts looking like a shell script (quoting, pipes, flags), that is a signal to add a Cobra subcommand to the receiver instead.
The exec shell does not include ~/.hydranode/bin in PATH. Always use the full binary path when calling receiver-side commands:
# macOS kiosk heads (hydraheadflatscreen)
hydracluster exec node-X "~/.hydranode/bin/hydraheadflatscreen stream-start mercator-talks"
hydracluster exec node-X "~/.hydranode/bin/hydraheadflatscreen stream-stop"
| Endpoint | Auth | Purpose |
|---|---|---|
| POST /api/v1/nodes/{id}/exec | Admin | Submit command, returns {"id": "exec-..."} |
| GET /api/v1/nodes/{id}/exec/{execId}/result | Admin | Poll for result (200=done, 202=pending) |
| GET /api/v1/body/exec | Node token | Agent fetches next queued command |
| POST /api/v1/body/exec/result | Node token | Agent reports result |
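The result endpoint in the table above can be polled by hand when the CLI is not available. A minimal sketch, assuming an exec id has already been obtained from the submit endpoint: `wait_for_exec_result`, `HC_SERVER`, `HC_TOKEN`, and `HC_CURL` (an override for the curl binary, handy for testing) are names chosen for this example. Treating any status other than 200 or 202 as an error is also an assumption, since the table documents only those two codes.

```shell
# Poll GET /api/v1/nodes/{id}/exec/{execId}/result until the command is done.
# Usage: wait_for_exec_result <nodeId> <execId> [timeoutSeconds] [intervalSeconds]
# Prints the result body on success. Returns 1 on timeout, 2 on an unexpected status.
# The body runs in a subshell, so its variables do not leak into the caller.
wait_for_exec_result() (
  node_id=$1
  exec_id=$2
  timeout_s=${3:-60}
  interval_s=${4:-2}
  server=${HC_SERVER:-https://hydracluster.experiencenet.com}
  url="$server/api/v1/nodes/$node_id/exec/$exec_id/result"
  deadline=$(( $(date +%s) + timeout_s ))
  body_file=$(mktemp) || return 2

  while :; do
    # -o writes the body to a temp file, -w prints only the HTTP status code.
    code=$("${HC_CURL:-curl}" -s -o "$body_file" -w '%{http_code}' \
      -H "Authorization: Bearer ${HC_TOKEN:-}" "$url")
    case "$code" in
      200) cat "$body_file"; rm -f "$body_file"; return 0 ;;
      202) ;; # still pending, keep polling
      *) rm -f "$body_file"; echo "unexpected HTTP status: $code" >&2; return 2 ;;
    esac
    if [ "$(date +%s)" -ge "$deadline" ]; then
      rm -f "$body_file"
      echo "timed out waiting for $exec_id" >&2
      return 1
    fi
    sleep "$interval_s"
  done
)
```

A timeout here (return code 1) while the node still shows online is the same symptom as a backed-up queue.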
Assign roles to a node via API (roles must be in the RoleCatalog):
curl -s -X POST https://hydracluster.experiencenet.com/api/v1/nodes/<ID>/roles \
-H "Authorization: Bearer <ADMIN_TOKEN>" \
-H "Content-Type: application/json" \
-d '{"roles": ["hydraguard-air", "hydraheadflatscreen"]}'
The node agent picks up new roles on its next provision poll and executes the matching recipe.
Head devices (hydraheadflatscreen, hydraheadwindows) stream from body nodes via Moonlight/Sunshine. In production, heads pick their body through the eligibility discovery procedure (GET /api/v1/bodies/eligible). See body-selection.md for the selection discipline and head-identity design principles.
The endpoint below is an admin override for exceptional situations. For normal operations, do not manually pin a body to a head. Instead, change the inputs to selection (body district/venue/owner, eligibility rules, drain flag). See body-selection.md.
curl -s -X POST https://hydracluster.experiencenet.com/api/v1/nodes/<HEAD_ID>/head-assignment \
-H "Authorization: Bearer <ADMIN_TOKEN>" \
-H "Content-Type: application/json" \
-d '{"body_id": "<BODY_NODE_ID>", "app_id": "Desktop"}'
curl -s https://hydracluster.experiencenet.com/api/v1/heads/<HEAD_ID> \
-H "Authorization: Bearer <ADMIN_TOKEN>"
Returns stream config with both WireGuard (stream_url) and LAN (stream_url_lan) addresses resolved from the body node. The hydraheadflatscreen agent probes LAN first (TCP 47990, 1s timeout) and falls back to WireGuard.
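When debugging which route a head would take, the LAN-first probe can be approximated from a shell. This is a rough sketch of the behaviour described above, not the agent's actual code: `tcp_probe`, `pick_stream_route`, and `PROBE_CMD` are hypothetical names, the LAN host is passed in by hand rather than parsed from stream_url_lan, and a netcat that supports `-z` and `-w` is assumed to be installed.

```shell
# Default TCP probe: succeed if <host>:<port> accepts a connection within 1 s.
# Assumes a netcat that supports -z (connect only) and -w (timeout in seconds).
tcp_probe() {
  nc -z -w 1 "$1" "$2" >/dev/null 2>&1
}

# Print "lan" if the body's LAN address answers on port 47990,
# otherwise print "wireguard" (the fallback route).
# Usage: pick_stream_route <lanHost>
pick_stream_route() {
  if "${PROBE_CMD:-tcp_probe}" "$1" 47990; then
    echo "lan"
  else
    echo "wireguard"
  fi
}
```

Comparing the result with the `routing` diagnostic reported by the head can show whether the kiosk and the operator's machine see the same network path.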
curl -s https://hydracluster.experiencenet.com/api/v1/heads \
-H "Authorization: Bearer <ADMIN_TOKEN>"
Returns an array of head objects. Each entry includes node_status ("online"/"offline" — node heartbeat state), last_seen (RFC3339 timestamp of the last hydranode heartbeat), assigned_body_name (the name of the assigned body node, if any), and diagnostics (a map of agent-reported key/value pairs, omitted when empty) in addition to the head-level status ("idle"/"streaming"/"error" — reported by the kiosk agent). Use node_status to distinguish a kiosk that is genuinely offline from one that is online but idle. Use assigned_body_name to correlate heads to bodies without resolving IP addresses. Diagnostic keys set by hydraheadflatscreen include version, wireguard ("up"/"down"), app ("kiosk"/"moonlight"/"none"), routing ("lan"/"wireguard"/"unknown"), and latency_ms (TCP RTT in milliseconds to Sunshine port 47990, present only when streaming).
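As an illustration of the node_status versus status distinction, the list can be filtered with jq. A small sketch, assuming jq is installed and the response is a JSON array carrying the fields named above; the function names are made up for this example, and whole objects are printed because this section does not name the identifier fields of a head entry.

```shell
# Read the GET /api/v1/heads response on stdin and print, one per line,
# the heads whose node is online but whose kiosk is idle.
heads_online_idle() {
  jq -c '.[] | select(.node_status == "online" and .status == "idle")'
}

# Same input, but only the heads whose node is genuinely offline.
heads_offline() {
  jq -c '.[] | select(.node_status == "offline")'
}
```

Pipe the curl call above into either function, for example `curl -s ... | heads_offline`.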
iPad enrollment is a two-step self-registration flow. No head entry needs to exist before scanning.
Step 1: display the QR code (admin web UI, recommended):
https://hydracluster.experiencenet.com/enroll

Step 1 (alternative): get the raw payload via API:
curl -s https://hydracluster.experiencenet.com/api/v1/enroll-qr \
-H "Authorization: Bearer <ADMIN_TOKEN>"
Returns {"server_url":"...","enrollment_token":"..."}. Encode this JSON as a QR code and print/display it — one QR serves all iPads in the fleet.
Step 2: the iPad scans the QR code and the app self-registers.
The app sends a POST to /api/v1/heads using the enrollment token:
curl -s -X POST https://hydracluster.experiencenet.com/api/v1/heads \
-H "Authorization: Bearer <FLEET_ENROLLMENT_TOKEN>" \
-H "Content-Type: application/json" \
-d '{"name":"ipad-lobby","district":"bxl1","venue":"cloud-seven"}'
Returns {"head_id":"node-...","token":"...","server_url":"..."}. The iPad saves these as its permanent identity. The head is auto-approved and immediately starts heartbeating.
Config: fleet_enrollment_token must be set in config.yaml under server:. The token is the same for all iPads — treat it as a moderately sensitive credential (anyone with it can enroll new heads).
curl -s https://hydracluster.experiencenet.com/api/v1/heads/<HEAD_ID>/experiences \
-H "Authorization: Bearer <ADMIN_TOKEN>"
Returns the catalog of experiences a kiosk should display, sourced from hydraexperiencelibrary's /api/v1/experiences/live endpoint keyed by the head's district + venue. If the head has a non-empty AllowedExperiences whitelist, the result is intersected with it.
This endpoint is the source of truth for kiosk grids and is intentionally independent of body availability — a venue with no body online still returns its catalog (or [] if nothing is planted there yet). Body discovery only runs at stream-start time (POST /api/v1/stream/start on the agent's local API).
Status codes:
- 200: catalog (possibly [])
- 400: node exists but is not a hydraheadflatscreen head
- 404: no node with that id
- 502: hydraexperiencelibrary upstream returned an error
- 503: experienceLibrary.url is not configured on hydracluster

Requires the Terminal-based screencapture loop to be running on the head (see hydraheadflatscreen runbook → Screenshots). Returns image/jpeg directly, so no exec gymnastics are needed:
TOKEN="<admin-token>"
curl -sf -H "Authorization: Bearer $TOKEN" \
"https://hydracluster.experiencenet.com/api/v1/heads/<HEAD_ID>/screenshot" \
> /tmp/kiosk.jpg
Status codes: 200 JPEG on success; 502 if screenshot file is stale or absent (Terminal loop not running); 504 if the head exec channel doesn't respond within 20 s.
If 504 (exec queue backed up): use the WebSocket shell path instead — it bypasses the async exec queue:
"$HC_BIN" exec <node-id> "base64 -i /tmp/hydra-live-screenshot.jpg" \
--shell --server "$HC_SERVER" --admin-token "$HC_TOKEN" --timeout 30s \
> /tmp/raw.txt
grep -E '^[A-Za-z0-9+/]+=*$' /tmp/raw.txt | tr -d '\n' | base64 -d > /tmp/kiosk.jpg
Note: macOS base64 requires -i <file> not a positional argument. The grep strips shell prompts before decoding.
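The screenshot status codes above can be turned into operator hints with a thin wrapper around the curl call. A minimal sketch: `screenshot_status_hint` and `fetch_screenshot` are hypothetical helpers, and `TOKEN` is assumed to hold the admin token as in the example above.

```shell
# Map a screenshot endpoint status code to an operator hint (hypothetical helper).
screenshot_status_hint() {
  case "$1" in
    200) echo "ok" ;;
    502) echo "screenshot stale or absent: check the Terminal screencapture loop on the head" ;;
    504) echo "exec channel timed out: retry via the --shell path" ;;
    *)   echo "unexpected status $1" ;;
  esac
}

# Fetch a kiosk screenshot and explain any failure on stderr.
# Usage: fetch_screenshot <headId> <outputFile>
# On failure the output file may contain the error body rather than a JPEG.
fetch_screenshot() (
  code=$(curl -s -o "$2" -w '%{http_code}' \
    -H "Authorization: Bearer ${TOKEN:-}" \
    "https://hydracluster.experiencenet.com/api/v1/heads/$1/screenshot")
  [ "$code" = "200" ] && return 0
  echo "screenshot failed: $(screenshot_status_hint "$code")" >&2
  return 1
)
```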
Returns the last 200 lines of the head agent log as JSON ({"log_path":"...","lines":[...]}):
curl -sf -H "Authorization: Bearer $TOKEN" \
"https://hydracluster.experiencenet.com/api/v1/heads/<HEAD_ID>/logs" | jq '.lines[-20:]'
Status codes: 200 JSON on success; 504 if the head exec channel doesn't respond within 15 s.
| Endpoint | Auth | Purpose |
|---|---|---|
| POST /api/v1/nodes/{id}/roles | Admin | Set roles on a node |
| POST /api/v1/nodes/{id}/update-node | Admin | Flag node for immediate hydranode self-update |
| POST /api/v1/nodes/{id}/update-services | Admin | Flag node to update all provisioned service binaries |
| POST /api/v1/nodes/{id}/head-assignment | Admin | Assign body to head |
| POST /api/v1/nodes/{id}/allowed-experiences | Admin | Set experience filter for kiosk head |
| GET /api/v1/heads | Admin | List all head nodes with stream config |
| GET /api/v1/heads/{id} | Admin | Get head config (includes experience_library_url, allowed_experiences) |
| GET /api/v1/heads/{id}/experiences | Admin | Get the kiosk experience catalog for a head (proxied from hydraexperiencelibrary) |
| GET /api/v1/heads/{id}/screenshot | Admin | Fetch kiosk desktop as JPEG via exec channel (requires Terminal screenshot loop on the head) |
| GET /api/v1/heads/{id}/logs | Admin | Fetch last 200 lines of head agent log as JSON via exec channel |
| PUT /api/v1/heads/{id} | Admin | Update head status |
Render nodes are Windows machines that run LarkXR Standalone (cloud rendering). They are enrolled in HydraCluster, which provisions them via the node agent.
- hydracluster.experiencenet.com and releases.experiencenet.com
- https://hydracluster.experiencenet.com/enroll on the machine
- render-node role and set district/venue

The node agent will:
- larkxr-standalone-windows.zip from the release server
- C:\LarkXR\larkxr-standalone\install/ folder)
- larkxr_center MySQL database

After provisioning, verify these ports are listening:
Check via remote shell:
Get-Process *lark*,*java*,*mysql*,*redis*,*nginx* | Format-Table Name,Id
netstat -an | Select-String '8181|8282|13306'
The dashboard should show provider_status: running.
When LarkXR needs to be fully reinstalled:
Stop all LarkXR processes:
Stop-Process -Name LarkXRLauncher,LarkXRServer,CloudLarkRenderServer,java,mysqld,nginx,redis-server -Force -ErrorAction SilentlyContinue
Remove old installation and cached state:
Remove-Item C:\LarkXR -Recurse -Force
Remove-Item C:\Windows\System32\config\systemprofile\.hydranode\provider_version.txt -Force
Remove-Item C:\Windows\System32\config\systemprofile\.hydranode\config_cache.yaml -Force
Remove-Item C:\Windows\System32\config\systemprofile\.hydranode\downloads -Recurse -Force
The node agent will detect the missing installation on its next heartbeat (30s) and reprovision automatically.
Alternatively, trigger reprovision from the dashboard (node detail page > Reprovision button). From v0.23.10 onward, the Reprovision button triggers a forced reinstall: the node agent stops the provider and re-downloads regardless of existing files.
If the node is offline (node agent not heartbeating):
Windows:
# Check if the scheduled task exists
schtasks /query /tn HydraNode
# Start it
schtasks /run /tn HydraNode
# If missing, reinstall
C:\hydranode\hydranode.exe install
schtasks /run /tn HydraNode
Recovery instructions are also available at https://hydracluster.experiencenet.com/enroll (no admin login needed).
The node agent auto-updates from the release server. As of v0.23.9:
ensureInstall)

To force an update on a remote machine via API:
curl -s -X POST https://hydracluster.experiencenet.com/api/v1/nodes/<ID>/update-node \
-H "Authorization: Bearer <ADMIN_TOKEN>"
The node picks up the flag on its next heartbeat and triggers the self-updater immediately. To update service binaries (not hydranode itself):
curl -s -X POST https://hydracluster.experiencenet.com/api/v1/nodes/<ID>/update-services \
-H "Authorization: Bearer <ADMIN_TOKEN>"
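To flag several nodes in one go, the update-node call can be wrapped in a loop. A minimal sketch: `flag_nodes_for_update` and the `HC_CURL` override are names invented here, `ADMIN_TOKEN` is assumed to hold the admin token, and success is judged only by curl's exit status under `-f`, since the response body of this endpoint is not described in this runbook.

```shell
# Flag each given node for an immediate hydranode self-update.
# Usage: flag_nodes_for_update <nodeId> [<nodeId> ...]
# Keeps going after a failure and returns non-zero if any request failed.
flag_nodes_for_update() (
  failed=0
  for node_id in "$@"; do
    # -f makes curl exit non-zero on HTTP errors (4xx/5xx).
    if "${HC_CURL:-curl}" -sf -o /dev/null -X POST \
        "https://hydracluster.experiencenet.com/api/v1/nodes/$node_id/update-node" \
        -H "Authorization: Bearer ${ADMIN_TOKEN:-}"; then
      echo "flagged $node_id"
    else
      echo "failed  $node_id" >&2
      failed=1
    fi
  done
  return "$failed"
)
```

Each flagged node still waits for its next heartbeat before the self-updater runs, so the effect is not instantaneous.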
Manual update on Windows (last resort):
# Download new binary
Invoke-WebRequest -Uri 'https://releases.experiencenet.com/hydranode/production/latest/hydranode-windows-amd64.exe' -OutFile C:\hydranode\hydranode-new.exe
# Stop, replace, reinstall, start
schtasks /End /TN HydraNode
Start-Sleep 3
Copy-Item C:\hydranode\hydranode-new.exe C:\hydranode\hydranode.exe -Force
Remove-Item C:\hydranode\hydranode-new.exe
C:\hydranode\hydranode.exe install
schtasks /Run /TN HydraNode
Warning: Stopping the node agent on a remote-only machine (private LAN) means you lose remote access until it restarts. The task's repetition interval (1 minute) should auto-restart it, but if the binary is locked, the replacement will fail silently.
| Path | Purpose |
|---|---|
| C:\hydranode\hydranode.exe | Node agent binary |
| C:\hydranode\enroll.yaml | Enrollment token |
| C:\Windows\System32\config\systemprofile\.hydranode\ | SYSTEM profile data dir |
| ...\.hydranode\config.yaml | Node config (server URL, token) |
| ...\.hydranode\config_cache.yaml | Cached config from server |
| ...\.hydranode\provider_version.txt | Installed provider version |
| ...\.hydranode\hydranode.log | Node agent log |
| C:\LarkXR\larkxr-standalone\ | LarkXR installation |
| C:\LarkXR\larkxr-standalone\log\ | LarkXR Launcher logs |
- CreateProcessAsUser to launch it in the logged-in user's session. If no user is logged in, the launcher won't start.
- C:\LarkXR and provider_version.txt first on those.

Hetzner automated daily server snapshots are enabled on hydracluster (46.224.29.125), context hydraexperiencenet. Backup window: 14:00–18:00 UTC, 7-day retention.
The snapshot covers the entire server disk, including:
- ~/.hydracluster/nodes.yaml: full render node fleet config
- ~/.hydracluster/config.yaml: server config and tokens

hydraexperiencenet project), open Servers → hydracluster → Backups.

curl https://hydracluster.experiencenet.com/api/v1/health
curl -H "Authorization: Bearer $TOKEN" https://hydracluster.experiencenet.com/api/v1/nodes | jq length