HydraCluster Runbook

Render node fleet management.

Release-line note: v2.0.30 exists on releases.experiencenet.com but not in git — it was a phantom build accidentally produced from a misplaced v2.0.30 tag (intended for hydraheadflatscreen) that was deleted seconds later, after CI had already cut a binary. The v2.0.30 binary is byte-identical to v2.0.26. v2.0.31 is the next clean release-line version on master and supersedes both.

Infrastructure

| Resource | Value |
|---|---|
| Server | hydracluster.experiencenet.com (46.224.29.125) |
| Config | /root/.hydracluster/config.yaml |
| Data | /root/.hydracluster/nodes.yaml |
| Service | systemctl status hydracluster |
| Logs | journalctl -u hydracluster -f |

Overview

HydraCluster manages render node enrollment, provisioning, and monitoring. Render nodes are Windows machines running LarkXR Standalone (cloud rendering). They are enrolled via a web UI and managed through the node agent (hydranode).

Key Concepts

SSH Access

ssh root@hydracluster.experiencenet.com
# or
ssh root@46.224.29.125

Health Check

curl -s https://hydracluster.experiencenet.com/api/v1/health

Troubleshooting

Service not responding

  1. SSH to the server: ssh root@46.224.29.125
  2. Check service status: systemctl status hydracluster
  3. Check logs: journalctl -u hydracluster --since '10 min ago' --no-pager
  4. Restart if needed: systemctl restart hydracluster

Symptoms of a self-deadlock from new auth middleware

If POST endpoints (e.g. /api/v1/nodes/{id}/exec) start hanging with 0 bytes received while GETs still respond, and the journal shows lots of GETs but no recent POSTs to the slow endpoint, suspect an auth-middleware self-deadlock.

This was the v2.0.32 → v2.0.33 regression: requireAdminOrNodeToken locked s.mu, deferred s.mu.Unlock(), and then called next() with the lock still held. The downstream heads handlers (handleListHeads et al.) also acquire s.mu, and the mutex is not reentrant, so the goroutine deadlocked on its own lock. With the mutex pinned, every other lock acquirer queued forever. hydranode's network-recovery routine on the same box then misread the wedged API as a network failure and rebooted the cluster.

Rule for any new server middleware:

If you suspect this is happening live, the fingerprint in journalctl -u hydracluster is many GET /api/v1/body/shell/check lines but no recent matching POST log line for the slow request — the POST is queued waiting for the lock and never reaches its log statement.


Remote Command Execution

HydraCluster supports two exec modes for running commands on body machines.

Structured Exec (default)

Commands are sent as JSON via the API, executed directly by the node agent (PowerShell on Windows, bash on Linux), with clean stdout/stderr separation and proper exit codes. No PTY involved.

# Basic command
hydracluster exec <nodeId> "Get-Process | Select-Object -First 5"

# With timeout
hydracluster exec <nodeId> "choco list --local-only" --timeout 60s

# From a local file (avoids all shell escaping)
hydracluster exec <nodeId> --file ./setup.ps1

# JSON output (for scripting)
hydracluster exec <nodeId> "hostname" --json

The command is queued on the server and picked up by the node agent on its next heartbeat (up to 30 seconds). The CLI polls for the result until it arrives or the timeout expires.

Shell Exec (WebSocket PTY — always responsive)

hydracluster exec --shell <nodeId> "top -bn1 | head -5" --timeout 10s

Async queue vs WebSocket shell — when to use which

These are two distinct transports. Understanding the difference matters when the node is slow to respond or the queue is backed up.

| | Async exec queue (default) | WebSocket shell (--shell) |
|---|---|---|
| Transport | HTTP poll: hydranode fetches commands on its tick (up to 30 s delay) | WebSocket PTY: direct connection, responds immediately |
| Queue | Commands pile up in-memory on hydracluster; processed FIFO | No queue; each --shell call opens a fresh PTY session |
| Use when | Scripting, JSON output, fire-and-forget, parallel commands | Queue is backed up; need immediate response; binary data (base64) |
| Failure mode | Times out if queue is long or hydranode stopped polling | Fails fast if WebSocket connection can't be established |

When the async queue backs up: this happens when many exec requests accumulate (e.g. after a debugging session with many rapid execs, or after a node reboot where hydranode was offline and the queue grew). The node heartbeat still arrives (node shows online) but exec results stay pending indefinitely. Switch to --shell until the queue drains.

Detecting a backed-up queue: send a simple echo with a short timeout. If it stays {"status":"pending"} for more than 30 s, the queue is blocked.

The queue is in-memory on hydracluster — it does not survive a hydracluster restart. A server restart is the nuclear option to clear a stuck queue.

Design principle: exec invokes named commands, not raw shell magic

Exec is for ad-hoc, one-off operations. When a workflow is used regularly by operators (stream control, status checks, etc.), the receiver binary must expose a proper named Cobra subcommand for it. Exec then invokes that named command — no curl+JSON quoting, no piped commands, no inline shell scripting.

Bad (fragile):

hydracluster exec node-X "curl -s -X POST -H 'Content-Type: application/json' -d '{\"experience\":\"mercator-talks\"}' http://127.0.0.1:9740/api/v1/stream/start"

Good (named command on the receiver):

hydracluster exec node-X "hydraheadflatscreen stream-start mercator-talks"

A named command on the binary is testable, readable, and unambiguous. If an exec command starts looking like a shell script (quoting, pipes, flags), that is a signal to add a Cobra subcommand to the receiver instead.

The exec shell does not include ~/.hydranode/bin in PATH. Always use the full binary path when calling receiver-side commands:

# macOS kiosk heads (hydraheadflatscreen)
hydracluster exec node-X "~/.hydranode/bin/hydraheadflatscreen stream-start mercator-talks"
hydracluster exec node-X "~/.hydranode/bin/hydraheadflatscreen stream-stop"

API Endpoints

| Endpoint | Auth | Purpose |
|---|---|---|
| POST /api/v1/nodes/{id}/exec | Admin | Submit command, returns {"id": "exec-..."} |
| GET /api/v1/nodes/{id}/exec/{execId}/result | Admin | Poll for result (200=done, 202=pending) |
| GET /api/v1/body/exec | Node token | Agent fetches next queued command |
| POST /api/v1/body/exec/result | Node token | Agent reports result |

Node Management API

Role Assignment

Assign roles to a node via API (roles must be in the RoleCatalog):

curl -s -X POST https://hydracluster.experiencenet.com/api/v1/nodes/<ID>/roles \
  -H "Authorization: Bearer <ADMIN_TOKEN>" \
  -H "Content-Type: application/json" \
  -d '{"roles": ["hydraguard-air", "hydraheadflatscreen"]}'

The node agent picks up new roles on its next provision poll and executes the matching recipe.

Head Management

Head devices (hydraheadflatscreen, hydraheadwindows) stream from body nodes via Moonlight/Sunshine. In production, heads pick their body through the eligibility discovery procedure (GET /api/v1/bodies/eligible). See body-selection.md for the selection discipline and head-identity design principles.

Assign a body to a head (admin override, not the normal path)

The endpoint below is an admin override for exceptional situations. For normal operations, do not manually pin a body to a head. Instead, change the inputs to selection (body district/venue/owner, eligibility rules, drain flag). See body-selection.md.

curl -s -X POST https://hydracluster.experiencenet.com/api/v1/nodes/<HEAD_ID>/head-assignment \
  -H "Authorization: Bearer <ADMIN_TOKEN>" \
  -H "Content-Type: application/json" \
  -d '{"body_id": "<BODY_NODE_ID>", "app_id": "Desktop"}'

Get head config (polled by agent)

curl -s https://hydracluster.experiencenet.com/api/v1/heads/<HEAD_ID> \
  -H "Authorization: Bearer <ADMIN_TOKEN>"

Returns stream config with both WireGuard (stream_url) and LAN (stream_url_lan) addresses resolved from the body node. The hydraheadflatscreen agent probes LAN first (TCP 47990, 1s timeout) and falls back to WireGuard.

List all heads

curl -s https://hydracluster.experiencenet.com/api/v1/heads \
  -H "Authorization: Bearer <ADMIN_TOKEN>"

Returns an array of head objects. Each entry includes node_status ("online"/"offline" — node heartbeat state), last_seen (RFC3339 timestamp of the last hydranode heartbeat), assigned_body_name (the name of the assigned body node, if any), and diagnostics (a map of agent-reported key/value pairs, omitted when empty) in addition to the head-level status ("idle"/"streaming"/"error" — reported by the kiosk agent). Use node_status to distinguish a kiosk that is genuinely offline from one that is online but idle. Use assigned_body_name to correlate heads to bodies without resolving IP addresses. Diagnostic keys set by hydraheadflatscreen include version, wireguard ("up"/"down"), app ("kiosk"/"moonlight"/"none"), routing ("lan"/"wireguard"/"unknown"), and latency_ms (TCP RTT in milliseconds to Sunshine port 47990, present only when streaming).

iPad fleet enrollment QR (HydraHeadiPad)

iPad enrollment is a two-step self-registration flow. No head entry needs to exist before scanning.

Step 1 — display the QR code (admin web UI, recommended):

  1. Open https://hydracluster.experiencenet.com/enroll
  2. Click the iPad tab
  3. The QR code is shown inline — point the iPad at the screen

Step 1 (alternative) — get the raw payload via API:

curl -s https://hydracluster.experiencenet.com/api/v1/enroll-qr \
  -H "Authorization: Bearer <ADMIN_TOKEN>"

Returns {"server_url":"...","enrollment_token":"..."}. Encode this JSON as a QR code and print/display it — one QR serves all iPads in the fleet.

Step 2 — iPad scans the QR, app self-registers:

The app sends POST /api/v1/heads, authenticating with the enrollment token:

curl -s -X POST https://hydracluster.experiencenet.com/api/v1/heads \
  -H "Authorization: Bearer <FLEET_ENROLLMENT_TOKEN>" \
  -H "Content-Type: application/json" \
  -d '{"name":"ipad-lobby","district":"bxl1","venue":"cloud-seven"}'

Returns {"head_id":"node-...","token":"...","server_url":"..."}. The iPad saves these as its permanent identity. The head is auto-approved and immediately starts heartbeating.

Config: fleet_enrollment_token must be set in config.yaml under server:. The token is the same for all iPads — treat it as a moderately sensitive credential (anyone with it can enroll new heads).

Get experience catalog for a head

curl -s https://hydracluster.experiencenet.com/api/v1/heads/<HEAD_ID>/experiences \
  -H "Authorization: Bearer <ADMIN_TOKEN>"

Returns the catalog of experiences a kiosk should display, sourced from hydraexperiencelibrary's /api/v1/experiences/live endpoint keyed by the head's district + venue. If the head has a non-empty AllowedExperiences whitelist, the result is intersected with it.

This endpoint is the source of truth for kiosk grids and is intentionally independent of body availability — a venue with no body online still returns its catalog (or [] if nothing is planted there yet). Body discovery only runs at stream-start time (POST /api/v1/stream/start on the agent's local API).

Status codes:

Fetch kiosk desktop screenshot

Requires the Terminal-based screencapture loop to be running on the head (see hydraheadflatscreen runbook → Screenshots). Returns image/jpeg directly — no exec gymnastics needed:

TOKEN="<admin-token>"
curl -sf -H "Authorization: Bearer $TOKEN" \
  "https://hydracluster.experiencenet.com/api/v1/heads/<HEAD_ID>/screenshot" \
  > /tmp/kiosk.jpg

Status codes: 200 JPEG on success; 502 if screenshot file is stale or absent (Terminal loop not running); 504 if the head exec channel doesn't respond within 20 s.

If 504 (exec queue backed up): use the WebSocket shell path instead — it bypasses the async exec queue:

"$HC_BIN" exec <node-id> "base64 -i /tmp/hydra-live-screenshot.jpg" \
  --shell --server "$HC_SERVER" --admin-token "$HC_TOKEN" --timeout 30s \
  > /tmp/raw.txt
grep -E '^[A-Za-z0-9+/]+=*$' /tmp/raw.txt | tr -d '\n' | base64 -d > /tmp/kiosk.jpg

Note: macOS base64 requires -i <file> not a positional argument. The grep strips shell prompts before decoding.

Fetch kiosk agent logs

Returns the last 200 lines of the head agent log as JSON ({"log_path":"...","lines":[...]}):

curl -sf -H "Authorization: Bearer $TOKEN" \
  "https://hydracluster.experiencenet.com/api/v1/heads/<HEAD_ID>/logs" | jq '.lines[-20:]'

Status codes: 200 JSON on success; 504 if the head exec channel doesn't respond within 15 s.

API Endpoints (Node Management)

| Endpoint | Auth | Purpose |
|---|---|---|
| POST /api/v1/nodes/{id}/roles | Admin | Set roles on a node |
| POST /api/v1/nodes/{id}/update-node | Admin | Flag node for immediate hydranode self-update |
| POST /api/v1/nodes/{id}/update-services | Admin | Flag node to update all provisioned service binaries |
| POST /api/v1/nodes/{id}/head-assignment | Admin | Assign body to head |
| POST /api/v1/nodes/{id}/allowed-experiences | Admin | Set experience filter for kiosk head |
| GET /api/v1/heads | Admin | List all head nodes with stream config |
| GET /api/v1/heads/{id} | Admin | Get head config (includes experience_library_url, allowed_experiences) |
| GET /api/v1/heads/{id}/experiences | Admin | Get the kiosk experience catalog for a head (proxied from hydraexperiencelibrary) |
| GET /api/v1/heads/{id}/screenshot | Admin | Fetch kiosk desktop as JPEG via exec channel (requires Terminal screenshot loop on the head) |
| GET /api/v1/heads/{id}/logs | Admin | Fetch last 200 lines of head agent log as JSON via exec channel |
| PUT /api/v1/heads/{id} | Admin | Update head status |

Render Node Provisioning

Render nodes are Windows machines that run LarkXR Standalone (cloud rendering). They are enrolled in HydraCluster, which provisions them via the node agent.

Prerequisites

Fresh Enrollment

  1. Open https://hydracluster.experiencenet.com/enroll on the machine
  2. Select the Windows / Linux tab (default), fill in a name, set owner, submit
  3. Copy the PowerShell install command and run it as Admin
  4. The node agent installs, starts, and the node goes online in the dashboard
  5. In the admin dashboard, assign the render-node role and set district/venue

The node agent will:

Verification

After provisioning, verify that the provider's ports are listening (the netstat check in this section filters for 8181, 8282, and 13306).

Check via remote shell:

Get-Process *lark*,*java*,*mysql*,*redis*,*nginx* | Format-Table Name,Id
netstat -an | Select-String '8181|8282|13306'

The dashboard should show provider_status: running.

Reprovision (Clean Reinstall)

When LarkXR needs to be fully reinstalled:

  1. Stop all LarkXR processes:

    Stop-Process -Name LarkXRLauncher,LarkXRServer,CloudLarkRenderServer,java,mysqld,nginx,redis-server -Force -ErrorAction SilentlyContinue
    
  2. Remove old installation and cached state:

    Remove-Item C:\LarkXR -Recurse -Force
    Remove-Item C:\Windows\System32\config\systemprofile\.hydranode\provider_version.txt -Force
    Remove-Item C:\Windows\System32\config\systemprofile\.hydranode\config_cache.yaml -Force
    Remove-Item C:\Windows\System32\config\systemprofile\.hydranode\downloads -Recurse -Force
    
  3. The node agent will detect the missing installation on its next heartbeat (30s) and reprovision automatically.

Alternatively, trigger reprovision from the dashboard (node detail page > Reprovision button). As of v0.23.10, the Reprovision button triggers a forced reinstall: the node agent stops the provider and re-downloads regardless of existing files.

Recovery (Node Agent Down)

If the node is offline (node agent not heartbeating):

Windows:

# Check if the scheduled task exists
schtasks /query /tn HydraNode

# Start it
schtasks /run /tn HydraNode

# If missing, reinstall
C:\hydranode\hydranode.exe install
schtasks /run /tn HydraNode

Recovery instructions are also available at https://hydracluster.experiencenet.com/enroll (no admin login needed).

Node Agent Update

The node agent auto-updates from the release server. As of v0.23.9:

To force an update on a remote machine via API:

curl -s -X POST https://hydracluster.experiencenet.com/api/v1/nodes/<ID>/update-node \
  -H "Authorization: Bearer <ADMIN_TOKEN>"

The node picks up the flag on its next heartbeat and triggers the self-updater immediately. To update service binaries (not hydranode itself):

curl -s -X POST https://hydracluster.experiencenet.com/api/v1/nodes/<ID>/update-services \
  -H "Authorization: Bearer <ADMIN_TOKEN>"

Manual update on Windows (last resort):

# Download new binary
Invoke-WebRequest -Uri 'https://releases.experiencenet.com/hydranode/production/latest/hydranode-windows-amd64.exe' -OutFile C:\hydranode\hydranode-new.exe

# Stop, replace, reinstall, start
schtasks /End /TN HydraNode
Start-Sleep 3
Copy-Item C:\hydranode\hydranode-new.exe C:\hydranode\hydranode.exe -Force
Remove-Item C:\hydranode\hydranode-new.exe
C:\hydranode\hydranode.exe install
schtasks /Run /TN HydraNode

Warning: Stopping the node agent on a remote-only machine (private LAN) means you lose remote access until it restarts. The task's repetition interval (1 minute) should auto-restart it, but if the binary is locked, the replacement will fail silently.

Key Paths

| Path | Purpose |
|---|---|
| C:\hydranode\hydranode.exe | Node agent binary |
| C:\hydranode\enroll.yaml | Enrollment token |
| C:\Windows\System32\config\systemprofile\.hydranode\ | SYSTEM profile data dir |
| ...\.hydranode\config.yaml | Node config (server URL, token) |
| ...\.hydranode\config_cache.yaml | Cached config from server |
| ...\.hydranode\provider_version.txt | Installed provider version |
| ...\.hydranode\hydranode.log | Node agent log |
| C:\LarkXR\larkxr-standalone\ | LarkXR installation |
| C:\LarkXR\larkxr-standalone\log\ | LarkXR Launcher logs |

Known Issues

Backup & Recovery

What is backed up

Hetzner automated daily server snapshots are enabled on hydracluster (46.224.29.125), context hydraexperiencenet. Backup window: 14:00–18:00 UTC, 7-day retention.

The snapshot covers the entire server disk, including:

Restore procedure

  1. In the Hetzner Cloud Console (hydraexperiencenet project), open Servers → hydracluster → Backups.
  2. Identify the snapshot to restore from. Restoring overwrites the current disk — all changes since the snapshot are lost.
  3. Power off the server, restore the snapshot, then power on.
  4. Verify the service came back up:
    curl https://hydracluster.experiencenet.com/api/v1/health
    
  5. Check node count matches expectations:
    curl -H "Authorization: Bearer $TOKEN" https://hydracluster.experiencenet.com/api/v1/nodes | jq length