HydraCluster Runbook

Render node fleet management.

Release-line note: v2.0.30 exists on releases.experiencenet.com but not in git — it was a phantom build accidentally produced from a misplaced v2.0.30 tag (intended for hydraheadflatscreen) that was deleted seconds later, after CI had already cut a binary. The v2.0.30 binary is byte-identical to v2.0.26. v2.0.31 is the next clean release-line version on master and supersedes both.

Infrastructure

Resource	Value
Server	`hydracluster.experiencenet.com` (`46.224.29.125`)
Config	`/root/.hydracluster/config.yaml`
Data	`/root/.hydracluster/nodes.yaml`
Service	`systemctl status hydracluster`
Logs	`journalctl -u hydracluster -f`

Overview

HydraCluster manages render node enrollment, provisioning, and monitoring. Render nodes are Windows machines running LarkXR Standalone (cloud rendering). They are enrolled via a web UI and managed through the node agent (hydranode).

Key Concepts

Enrollment flow: Machines self-enroll via /enroll; an admin then assigns roles and venues. The form accepts optional ?name= and ?ip= URL parameters, which HydraNeck's Enrol button pre-fills for admin-initiated remote enrollment. A submitted ip field overrides the request's remote address so that the correct device IP is stored.
Two-wave provisioning: Bodies (base OS + agent) then heads (LarkXR provider)
Node agent: hydranode runs on each machine, heartbeats to the server, executes provisioning
Provider: Pluggable provider interface (LarkXR is the current implementation)

SSH Access

ssh root@hydracluster.experiencenet.com
# or
ssh root@46.224.29.125

Health Check

curl -s https://hydracluster.experiencenet.com/api/v1/health

Troubleshooting

Service not responding

SSH to the server: ssh root@46.224.29.125
Check service status: systemctl status hydracluster
Check logs: journalctl -u hydracluster --since '10 min ago' --no-pager
Restart if needed: systemctl restart hydracluster

Symptoms of a self-deadlock from new auth middleware

If POST endpoints (e.g. /api/v1/nodes/{id}/exec) start hanging with 0 bytes received while GETs still respond, and the journal shows lots of GETs but no recent POSTs to the slow endpoint, suspect an auth-middleware self-deadlock.

This was the v2.0.32 → v2.0.33 regression: requireAdminOrNodeToken locked s.mu, deferred s.mu.Unlock(), and then called next(). The downstream heads handlers (handleListHeads et al.) also acquire s.mu. Go mutexes are not reentrant, so the goroutine deadlocked on its own lock, the mutex stayed pinned, and every other lock acquirer queued forever. hydranode's network-recovery routine on the same box then misread the wedged API as a network failure and rebooted the cluster.

Rule for any new server middleware:

It is fine to s.mu.Lock() briefly inside a middleware to look something up.
Release s.mu BEFORE calling next(). Do NOT hold it via defer across a handler invocation. Downstream handlers do their own locking.
The pre-existing requireNodeToken survives this rule only because every handler it wraps (handleBodyHeartbeat, handleBodyShellCheck, etc.) reads node state via the nodeContext value rather than re-locking s.mu. Don't break that invariant.

If you suspect this is happening live, the fingerprint in journalctl -u hydracluster is many GET /api/v1/body/shell/check lines but no recent matching POST log line for the slow request — the POST is queued waiting for the lock and never reaches its log statement.

Remote Command Execution

HydraCluster supports two exec modes for running commands on body machines.

Structured Exec (default)

Commands are sent as JSON via the API, executed directly by the node agent (PowerShell on Windows, bash on Linux), with clean stdout/stderr separation and proper exit codes. No PTY involved.

# Basic command
hydracluster exec <nodeId> "Get-Process | Select-Object -First 5"

# With timeout
hydracluster exec <nodeId> "choco list --local-only" --timeout 60s

# From a local file (avoids all shell escaping)
hydracluster exec <nodeId> --file ./setup.ps1

# JSON output (for scripting)
hydracluster exec <nodeId> "hostname" --json

The command is queued on the server and picked up by the node agent on its next heartbeat (up to 30 seconds). The CLI polls for the result until it arrives or the timeout expires.

Shell Exec (WebSocket PTY — always responsive)

hydracluster exec --shell <nodeId> "top -bn1 | head -5" --timeout 10s

Async queue vs WebSocket shell — when to use which

These are two distinct transports. Understanding the difference matters when the node is slow to respond or the queue is backed up.

	Async exec queue (default)	WebSocket shell (`--shell`)
Transport	HTTP poll: hydranode fetches commands on its tick (up to 30 s delay)	WebSocket PTY: direct connection, responds immediately
Queue	Commands pile up in-memory on hydracluster; processed FIFO	No queue — each `--shell` call opens a fresh PTY session
Use when	Scripting, JSON output, fire-and-forget, parallel commands	Queue is backed up; need immediate response; binary data (base64)
Failure mode	Times out if queue is long or hydranode stopped polling	Fails fast if WebSocket connection can't be established

When the async queue backs up: many exec requests have accumulated (e.g. after a debugging session with many rapid execs, or after a node reboot during which hydranode was offline and the queue grew). The node heartbeat still arrives, so the node shows online, but exec results stay pending indefinitely. Switch to --shell until the queue drains.

Detecting a backed-up queue: send a simple echo with a short timeout. If it stays {"status":"pending"} for more than 30 s, the queue is blocked.
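
The probe described above could be wrapped in a small shell function. This is only a sketch: queue_probe and the probe string are made-up names, and it assumes the CLI prints the pending JSON when the timeout expires, which may not match the real output exactly.

```shell
# Hypothetical helper: classify the async exec queue for one node.
# Assumption: `hydracluster exec ... --json` prints JSON containing
# "status":"pending" when the command was not picked up before the timeout.
queue_probe() {
  node_id="$1"
  # 35s is slightly above the 30 s threshold mentioned in the text.
  out="$(hydracluster exec "$node_id" "echo queue-probe" --timeout 35s --json 2>&1)"
  case "$out" in
    *'"status"'*'"pending"'*) echo "blocked"; return 1 ;;  # still queued
    *queue-probe*)            echo "ok";      return 0 ;;  # echo came back
    *)                        echo "unknown"; return 2 ;;  # CLI or network error
  esac
}
```

If the function prints blocked, switching to --shell for that node is probably the quickest workaround.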

The queue is in-memory on hydracluster — it does not survive a hydracluster restart. A server restart is the nuclear option to clear a stuck queue.

Design principle: exec invokes named commands, not raw shell magic

Exec is for ad-hoc, one-off operations. When a workflow is used regularly by operators (stream control, status checks, etc.), the receiver binary must expose a proper named Cobra subcommand for it. Exec then invokes that named command — no curl+JSON quoting, no piped commands, no inline shell scripting.

Bad (fragile):

hydracluster exec node-X "curl -s -X POST -H 'Content-Type: application/json' -d '{\"experience\":\"mercator-talks\"}' http://127.0.0.1:9740/api/v1/stream/start"

Good (named command on the receiver):

hydracluster exec node-X "~/.hydranode/bin/hydraheadflatscreen stream-start mercator-talks"

A named command on the binary is testable, readable, and unambiguous. If an exec command starts looking like a shell script (quoting, pipes, flags), that is a signal to add a Cobra subcommand to the receiver instead.

The exec shell does not include ~/.hydranode/bin in PATH. Always use the full binary path when calling receiver-side commands:

# macOS kiosk heads (hydraheadflatscreen)
hydracluster exec node-X "~/.hydranode/bin/hydraheadflatscreen stream-start mercator-talks"
hydracluster exec node-X "~/.hydranode/bin/hydraheadflatscreen stream-stop"

API Endpoints

Endpoint	Auth	Purpose
`POST /api/v1/nodes/{id}/exec`	Admin	Submit command, returns `{"id": "exec-..."}`
`GET /api/v1/nodes/{id}/exec/{execId}/result`	Admin	Poll for result (200=done, 202=pending)
`GET /api/v1/body/exec`	Node token	Agent fetches next queued command
`POST /api/v1/body/exec/result`	Node token	Agent reports result
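
For scripting against the admin endpoints without the CLI, the 200/202 contract of the result endpoint can be polled roughly as follows. This is a sketch under assumptions: poll_exec_result is a hypothetical name, HC_SERVER and HC_TOKEN are environment variables holding the server URL and admin token, and the result body is printed as-is because its schema is not described here.

```shell
# Hypothetical helper: poll an exec result until it is done (200) or we give up.
# Usage: poll_exec_result <nodeId> <execId> [max_tries] [delay_seconds]
poll_exec_result() {
  node_id="$1"; exec_id="$2"; max_tries="${3:-30}"; delay="${4:-2}"
  base="${HC_SERVER:-https://hydracluster.experiencenet.com}"
  url="$base/api/v1/nodes/$node_id/exec/$exec_id/result"
  body="$(mktemp)"
  i=0
  while [ "$i" -lt "$max_tries" ]; do
    # -o writes the response body to a file, -w prints only the status code.
    code="$(curl -s -o "$body" -w '%{http_code}' \
      -H "Authorization: Bearer $HC_TOKEN" "$url")"
    case "$code" in
      200) cat "$body"; rm -f "$body"; return 0 ;;   # done
      202) sleep "$delay" ;;                         # still pending
      *)   echo "unexpected HTTP $code" >&2; rm -f "$body"; return 2 ;;
    esac
    i=$((i + 1))
  done
  rm -f "$body"
  echo "still pending after $max_tries polls" >&2
  return 1
}
```

The exec id comes from the {"id": "exec-..."} response of the submit endpoint. A return code of 1 after many polls is consistent with a backed-up queue.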

Node Management API

Role Assignment

Assign roles to a node via API (roles must be in the RoleCatalog):

curl -s -X POST https://hydracluster.experiencenet.com/api/v1/nodes/<ID>/roles \
  -H "Authorization: Bearer <ADMIN_TOKEN>" \
  -H "Content-Type: application/json" \
  -d '{"roles": ["hydraguard-air", "hydraheadflatscreen"]}'

The node agent picks up new roles on its next provision poll and executes the matching recipe.

Head Management

Head devices (hydraheadflatscreen, hydraheadwindows) stream from body nodes via Moonlight/Sunshine. In production, heads pick their body through the eligibility discovery procedure (GET /api/v1/bodies/eligible). See body-selection.md for the selection discipline and head-identity design principles.

Assign a body to a head (admin override, not the normal path)

The endpoint below is an admin override for exceptional situations. For normal operations, do not manually pin a body to a head. Instead, change the inputs to selection (body district/venue/owner, eligibility rules, drain flag). See body-selection.md.

curl -s -X POST https://hydracluster.experiencenet.com/api/v1/nodes/<HEAD_ID>/head-assignment \
  -H "Authorization: Bearer <ADMIN_TOKEN>" \
  -H "Content-Type: application/json" \
  -d '{"body_id": "<BODY_NODE_ID>", "app_id": "Desktop"}'

Get head config (polled by agent)

curl -s https://hydracluster.experiencenet.com/api/v1/heads/<HEAD_ID> \
  -H "Authorization: Bearer <ADMIN_TOKEN>"

Returns stream config with both WireGuard (stream_url) and LAN (stream_url_lan) addresses resolved from the body node. The hydraheadflatscreen agent probes LAN first (TCP 47990, 1s timeout) and falls back to WireGuard.
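
When debugging which route a head is likely to take, the LAN-first probe can be approximated by hand. This sketch is not the agent's actual code: probe_route is a hypothetical name, the LAN host is passed in manually, and it assumes an nc build where -z does a connect-only scan and -w 1 bounds the attempt to about one second (timeout flags differ between netcat variants).

```shell
# Hypothetical helper: mimic the LAN-first, WireGuard-fallback decision.
# Usage: probe_route <body LAN host or IP>
probe_route() {
  lan_host="$1"
  # Port 47990 is the Sunshine port the agent probes, per the text above.
  if nc -z -w 1 "$lan_host" 47990 >/dev/null 2>&1; then
    echo "lan"
  else
    echo "wireguard"
  fi
}
```

Run on the head itself (for example via exec), the output can be compared with the routing diagnostic the head reports.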

List all heads

curl -s https://hydracluster.experiencenet.com/api/v1/heads \
  -H "Authorization: Bearer <ADMIN_TOKEN>"

Returns an array of head objects. In addition to the head-level status ("idle"/"streaming"/"error", reported by the kiosk agent), each entry includes:

node_status: "online" or "offline" (node heartbeat state)
last_seen: RFC3339 timestamp of the last hydranode heartbeat
assigned_body_name: name of the assigned body node, if any
diagnostics: map of agent-reported key/value pairs, omitted when empty

Use node_status to distinguish a kiosk that is genuinely offline from one that is online but idle. Use assigned_body_name to correlate heads to bodies without resolving IP addresses.

Diagnostic keys set by hydraheadflatscreen:

version
wireguard: "up" or "down"
app: "kiosk", "moonlight", or "none"
routing: "lan", "wireguard", or "unknown"
latency_ms: TCP RTT in milliseconds to Sunshine port 47990, present only when streaming
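
The node_status and status fields can be combined with jq to list kiosks that are heartbeating but not streaming. A minimal sketch: online_idle_heads is a hypothetical name, and only fields named in this section are used, since the full response schema is not spelled out here.

```shell
# Hypothetical helper: reads the GET /api/v1/heads array on stdin and prints
# one tab-separated line (last_seen, assigned body) per online-but-idle head.
online_idle_heads() {
  jq -r '.[]
    | select(.node_status == "online" and .status == "idle")
    | [.last_seen, (.assigned_body_name // "-")]
    | @tsv'
}

# Usage (not run here):
#   curl -s https://hydracluster.experiencenet.com/api/v1/heads \
#     -H "Authorization: Bearer <ADMIN_TOKEN>" | online_idle_heads
```

Swapping the select condition for .node_status == "offline" gives the list of kiosks that have stopped heartbeating.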

iPad fleet enrollment QR (HydraHeadiPad)

iPad enrollment is a two-step self-registration flow. No head entry needs to exist before scanning.

Step 1 — display the QR code (admin web UI, recommended):

Open https://hydracluster.experiencenet.com/enroll
Click the iPad tab
The QR code is shown inline — point the iPad at the screen

Step 1 (alternative) — get the raw payload via API:

curl -s https://hydracluster.experiencenet.com/api/v1/enroll-qr \
  -H "Authorization: Bearer <ADMIN_TOKEN>"

Returns {"server_url":"...","enrollment_token":"..."}. Encode this JSON as a QR code and print/display it — one QR serves all iPads in the fleet.

Step 2 — iPad scans the QR, app self-registers:

The app sends POST /api/v1/heads, authenticated with the enrollment token:

curl -s -X POST https://hydracluster.experiencenet.com/api/v1/heads \
  -H "Authorization: Bearer <FLEET_ENROLLMENT_TOKEN>" \
  -H "Content-Type: application/json" \
  -d '{"name":"ipad-lobby","district":"bxl1","venue":"cloud-seven"}'

Returns {"head_id":"node-...","token":"...","server_url":"..."}. The iPad saves these as its permanent identity. The head is auto-approved and immediately starts heartbeating.

Config: fleet_enrollment_token must be set in config.yaml under server:. The token is the same for all iPads — treat it as a moderately sensitive credential (anyone with it can enroll new heads).

Get experience catalog for a head

curl -s https://hydracluster.experiencenet.com/api/v1/heads/<HEAD_ID>/experiences \
  -H "Authorization: Bearer <ADMIN_TOKEN>"

Returns the catalog of experiences a kiosk should display, sourced from hydraexperiencelibrary's /api/v1/experiences/live endpoint keyed by the head's district + venue. If the head has a non-empty AllowedExperiences whitelist, the result is intersected with it.

This endpoint is the source of truth for kiosk grids and is intentionally independent of body availability — a venue with no body online still returns its catalog (or [] if nothing is planted there yet). Body discovery only runs at stream-start time (POST /api/v1/stream/start on the agent's local API).

Status codes:

200 — catalog (possibly [])
400 — node exists but is not a hydraheadflatscreen head
404 — no node with that id
502 — hydraexperiencelibrary upstream returned an error
503 — experienceLibrary.url is not configured on hydracluster
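
In a script, the status codes above can be turned into operator hints. The sketch below is illustrative only: explain_experiences_status is a made-up name, and the messages simply restate the list above.

```shell
# Hypothetical helper: map an HTTP status from
# GET /api/v1/heads/<HEAD_ID>/experiences to a short hint.
explain_experiences_status() {
  case "$1" in
    200) echo "ok: catalog returned (possibly empty)" ;;
    400) echo "node exists but is not a hydraheadflatscreen head" ;;
    404) echo "no node with that id" ;;
    502) echo "hydraexperiencelibrary upstream returned an error" ;;
    503) echo "experienceLibrary.url is not configured on hydracluster" ;;
    *)   echo "unexpected status $1" ;;
  esac
}

# Usage (not run here):
#   code="$(curl -s -o /tmp/experiences.json -w '%{http_code}' \
#     -H "Authorization: Bearer $TOKEN" \
#     "https://hydracluster.experiencenet.com/api/v1/heads/<HEAD_ID>/experiences")"
#   explain_experiences_status "$code"
```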

Fetch kiosk desktop screenshot

Requires the Terminal-based screencapture loop to be running on the head (see hydraheadflatscreen runbook → Screenshots). Returns image/jpeg directly — no exec gymnastics needed:

TOKEN="<admin-token>"
curl -sf -H "Authorization: Bearer $TOKEN" \
  "https://hydracluster.experiencenet.com/api/v1/heads/<HEAD_ID>/screenshot" \
  > /tmp/kiosk.jpg

Status codes: 200 JPEG on success; 502 if screenshot file is stale or absent (Terminal loop not running); 504 if the head exec channel doesn't respond within 20 s.

If 504 (exec queue backed up): use the WebSocket shell path instead — it bypasses the async exec queue:

"$HC_BIN" exec <node-id> "base64 -i /tmp/hydra-live-screenshot.jpg" \
  --shell --server "$HC_SERVER" --admin-token "$HC_TOKEN" --timeout 30s \
  > /tmp/raw.txt
grep -E '^[A-Za-z0-9+/]+=*$' /tmp/raw.txt | tr -d '\n' | base64 -d > /tmp/kiosk.jpg

Note: macOS base64 requires -i <file>, not a positional argument. The grep keeps only lines made up entirely of base64 characters, which strips shell prompts before decoding.

Fetch kiosk agent logs

Returns the last 200 lines of the head agent log as JSON ({"log_path":"...","lines":[...]}):

curl -sf -H "Authorization: Bearer $TOKEN" \
  "https://hydracluster.experiencenet.com/api/v1/heads/<HEAD_ID>/logs" | jq '.lines[-20:]'

Status codes: 200 JSON on success; 504 if the head exec channel doesn't respond within 15 s.

API Endpoints (Node Management)

Endpoint	Auth	Purpose
`POST /api/v1/nodes/{id}/roles`	Admin	Set roles on a node
`POST /api/v1/nodes/{id}/update-node`	Admin	Flag node for immediate hydranode self-update
`POST /api/v1/nodes/{id}/update-services`	Admin	Flag node to update all provisioned service binaries
`POST /api/v1/nodes/{id}/head-assignment`	Admin	Assign body to head
`POST /api/v1/nodes/{id}/allowed-experiences`	Admin	Set experience filter for kiosk head
`GET /api/v1/heads`	Admin	List all head nodes with stream config
`GET /api/v1/heads/{id}`	Admin	Get head config (includes experience_library_url, allowed_experiences)
`GET /api/v1/heads/{id}/experiences`	Admin	Get the kiosk experience catalog for a head (proxied from hydraexperiencelibrary)
`GET /api/v1/heads/{id}/screenshot`	Admin	Fetch kiosk desktop as JPEG via exec channel (requires Terminal screenshot loop on the head)
`GET /api/v1/heads/{id}/logs`	Admin	Fetch last 200 lines of head agent log as JSON via exec channel
`PUT /api/v1/heads/{id}`	Admin	Update head status

Render Node Provisioning

Render nodes are Windows machines that run LarkXR Standalone (cloud rendering). They are enrolled in HydraCluster, which provisions them via the node agent.

Prerequisites

Machine has Windows 10/11 with an NVIDIA GPU
Machine has network access to hydracluster.experiencenet.com and releases.experiencenet.com
A user account is logged in (LarkXR Launcher is a GUI app that needs a desktop session)

Fresh Enrollment

Open https://hydracluster.experiencenet.com/enroll on the machine
Select the Windows / Linux tab (default), fill in a name, set owner, submit
Copy the PowerShell install command and run it as Admin
The node agent installs, starts, and the node goes online in the dashboard
In the admin dashboard, assign the render-node role and set district/venue

The node agent will:

Fetch config from the server (district, provider settings, download URL)
Download larkxr-standalone-windows.zip from the release server
Extract to C:\LarkXR\larkxr-standalone\
Patch MySQL config (absolute datadir path, service binPath)
Install VC++ runtime and DirectX (from the zip's install/ folder)
Create the larkxr_center MySQL database
Start the LarkXR Launcher in the user's desktop session
Report provider status via heartbeat

Verification

After provisioning, verify these ports are listening:

8181 -- LarkXR Admin (center/management)
8282 -- LarkXR Agent
13306 -- MySQL

Check via remote shell:

Get-Process *lark*,*java*,*mysql*,*redis*,*nginx* | Format-Table Name,Id
netstat -an | Select-String '8181|8282|13306'

The dashboard should show provider_status: running.

Reprovision (Clean Reinstall)

When LarkXR needs to be fully reinstalled:

Stop all LarkXR processes:

Stop-Process -Name LarkXRLauncher,LarkXRServer,CloudLarkRenderServer,java,mysqld,nginx,redis-server -Force -ErrorAction SilentlyContinue

Remove old installation and cached state:

Remove-Item C:\LarkXR -Recurse -Force
Remove-Item C:\Windows\System32\config\systemprofile\.hydranode\provider_version.txt -Force
Remove-Item C:\Windows\System32\config\systemprofile\.hydranode\config_cache.yaml -Force
Remove-Item C:\Windows\System32\config\systemprofile\.hydranode\downloads -Recurse -Force

The node agent will detect the missing installation on its next heartbeat (30s) and reprovision automatically.

Alternatively, trigger reprovision from the dashboard (node detail page > Reprovision button). As of v0.23.10, the Reprovision button triggers a forced reinstall -- the node agent stops the provider and re-downloads regardless of existing files.

Recovery (Node Agent Down)

If the node is offline (node agent not heartbeating):

Windows:

# Check if the scheduled task exists
schtasks /query /tn HydraNode

# Start it
schtasks /run /tn HydraNode

# If missing, reinstall
C:\hydranode\hydranode.exe install
schtasks /run /tn HydraNode

Recovery instructions are also available at https://hydracluster.experiencenet.com/enroll (no admin login needed).

Node Agent Update

The node agent auto-updates from the release server. As of v0.23.9:

Checks for updates immediately on start
Then checks every 5 minutes
On Windows, re-registers the scheduled task on every start (ensureInstall)

To force an update on a remote machine via API:

curl -s -X POST https://hydracluster.experiencenet.com/api/v1/nodes/<ID>/update-node \
  -H "Authorization: Bearer <ADMIN_TOKEN>"

The node picks up the flag on its next heartbeat and triggers the self-updater immediately. To update service binaries (not hydranode itself):

curl -s -X POST https://hydracluster.experiencenet.com/api/v1/nodes/<ID>/update-services \
  -H "Authorization: Bearer <ADMIN_TOKEN>"

Manual update on Windows (last resort):

# Download new binary
Invoke-WebRequest -Uri 'https://releases.experiencenet.com/hydranode/production/latest/hydranode-windows-amd64.exe' -OutFile C:\hydranode\hydranode-new.exe

# Stop, replace, reinstall, start
schtasks /End /TN HydraNode
Start-Sleep 3
Copy-Item C:\hydranode\hydranode-new.exe C:\hydranode\hydranode.exe -Force
Remove-Item C:\hydranode\hydranode-new.exe
C:\hydranode\hydranode.exe install
schtasks /Run /TN HydraNode

Warning: Stopping the node agent on a remote-only machine (private LAN) means you lose remote access until it restarts. The task's repetition interval (1 minute) should auto-restart it, but if the binary is locked, the replacement will fail silently.

Key Paths

Path	Purpose
`C:\hydranode\hydranode.exe`	Node agent binary
`C:\hydranode\enroll.yaml`	Enrollment token
`C:\Windows\System32\config\systemprofile\.hydranode\`	SYSTEM profile data dir
`...\.hydranode\config.yaml`	Node config (server URL, token)
`...\.hydranode\config_cache.yaml`	Cached config from server
`...\.hydranode\provider_version.txt`	Installed provider version
`...\.hydranode\hydranode.log`	Node agent log
`C:\LarkXR\larkxr-standalone\`	LarkXR installation
`C:\LarkXR\larkxr-standalone\log\`	LarkXR Launcher logs

Known Issues

LarkXR Launcher requires a desktop session. The node agent uses CreateProcessAsUser to launch it in the logged-in user's session. If no user is logged in, the launcher won't start.
Reprovision now forces re-download (v0.23.10+). Older agent versions skip re-download when files exist -- clean out C:\LarkXR and provider_version.txt first on those.
Node agent log files are only available from v0.23.9 onward. Older versions log to stdout, which is lost when running as a scheduled task.

Backup & Recovery

What is backed up

Hetzner automated daily server snapshots are enabled on hydracluster (46.224.29.125), context hydraexperiencenet. Backup window: 14:00–18:00 UTC, 7-day retention.

The snapshot covers the entire server disk, including:

~/.hydracluster/nodes.yaml — full render node fleet config
~/.hydracluster/config.yaml — server config and tokens

Restore procedure

In the Hetzner Cloud Console (hydraexperiencenet project), open Servers → hydracluster → Backups.
Identify the snapshot to restore from. Restoring overwrites the current disk — all changes since the snapshot are lost.
Power off the server, restore the snapshot, then power on.

Verify the service came back up:

curl https://hydracluster.experiencenet.com/api/v1/health

Check node count matches expectations:

curl -H "Authorization: Bearer $TOKEN" https://hydracluster.experiencenet.com/api/v1/nodes | jq length

HydraPipeline