Connect in one command.
No proxy, no code change, nothing in your request path. The agent attaches to the /metrics endpoint of your existing vLLM, SGLang, or TGI server, measures your cost frontier, watches it for drift, and reports to your dashboard.
curl -fsSL https://squiz.sh/install.sh | bash -s -- --key YOUR_API_KEY
(… --gpu-cost). Finds vLLM / SGLang / TGI on the standard ports (:8000 / :30000 / :8080); for a non-standard setup add --vllm http://your-engine:8000 --name prod. Builds and starts the agent (~1 min) and connects to your account.
Open http://127.0.0.1:8800 (from your laptop: ssh -L 8800:127.0.0.1:8800 user@gpu-box) and hit Run frontier sweep.
How it works under the hood.
Engines
vLLM and SGLang are first-class (validated on real GPUs); TGI is supported via its /metrics (beta). Engine is auto-detected — never gated by plan.
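The page does not say how auto-detection works. One plausible approach, sketched here purely as an assumption, is to fetch /metrics on each standard port and look at the metric-name prefix the engine emits (vllm:, sglang:, tgi_). The function names and the prefix table are illustrative and are not taken from the agent.

```python
from urllib.request import urlopen

# Standard ports named on this page.
DEFAULT_PORTS = {"vllm": 8000, "sglang": 30000, "tgi": 8080}

# Metric-name prefixes each engine uses in its Prometheus output.
# Using them for detection is an assumption, not the agent's documented logic.
PREFIXES = {"vllm": "vllm:", "sglang": "sglang:", "tgi": "tgi_"}


def classify_engine(metrics_text):
    """Return the engine whose metric prefix appears in the text, or None."""
    for line in metrics_text.splitlines():
        if line.startswith("#") or not line.strip():
            continue  # skip HELP/TYPE comments and blank lines
        for engine, prefix in PREFIXES.items():
            if line.startswith(prefix):
                return engine
    return None


def probe(host="127.0.0.1", timeout=1.0):
    """Try each standard port; return (engine, url) for the first match."""
    for port in DEFAULT_PORTS.values():
        url = f"http://{host}:{port}/metrics"
        try:
            with urlopen(url, timeout=timeout) as resp:
                text = resp.read().decode("utf-8", "replace")
        except OSError:
            continue  # nothing listening, timeout, or HTTP error
        engine = classify_engine(text)
        if engine:
            return engine, url
    return None
```

Classifying by metric prefix rather than by port means an engine on a non-standard port is still identified once its URL is known.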
What it reads
Only your engine's /metrics (Prometheus) — running/waiting requests, KV-cache %, token counters. Optionally a DCGM exporter for true GPU utilization. It never sits in your request path.
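To make the signals above concrete, here is a minimal sketch of reading a Prometheus text scrape into a name-to-value map. The metric names in the sample follow vLLM's naming but vary by engine and version, so treat them as assumptions. The parser is simplified: it ignores labels and sums all series that share a name.

```python
import re

# One sample line: metric name, optional {labels}, then a numeric value.
SAMPLE = re.compile(r'^([A-Za-z_:][A-Za-z0-9_:]*)(?:\{[^}]*\})?\s+(\S+)')


def parse_metrics(text):
    """Sum sample values per metric name, ignoring comments and labels."""
    totals = {}
    for line in text.splitlines():
        if line.startswith("#"):
            continue  # HELP / TYPE lines
        m = SAMPLE.match(line)
        if not m:
            continue
        try:
            value = float(m.group(2))
        except ValueError:
            continue  # not a numeric sample
        totals[m.group(1)] = totals.get(m.group(1), 0.0) + value
    return totals


# Illustrative scrape; names and values are made up for the example.
example = """# TYPE vllm:num_requests_running gauge
vllm:num_requests_running{model_name="m"} 4.0
vllm:num_requests_waiting{model_name="m"} 1.0
vllm:gpu_cache_usage_perc{model_name="m"} 0.62
vllm:generation_tokens_total{model_name="m"} 120000.0
"""
snapshot = parse_metrics(example)
```

Summing across series is reasonable for counters; for a gauge such as KV-cache usage it only makes sense when a single series is present, as in this sample.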
Drift monitor
Beyond one-shot floors, the agent watches your live $/token and flags a cost regression — even a +20% creep that never trips a hard threshold — and warns you before it lands on the bill. Surfaced on the History view (Pro continuous).
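The drift check can be pictured with a little arithmetic. The sketch below assumes $/token is the hourly GPU price divided by tokens generated per hour, and that a regression is a relative increase over a stored baseline; the formula, the 20% tolerance, and all function names are illustrative assumptions rather than the agent's actual method.

```python
def cost_per_token(gpu_cost_per_hour, tokens_delta, seconds):
    """Dollars per generated token over one observation window.

    tokens_delta is the increase of a token counter between two scrapes
    taken `seconds` apart. Returns None when the window has no traffic.
    """
    if tokens_delta <= 0 or seconds <= 0:
        return None
    tokens_per_hour = tokens_delta / seconds * 3600.0
    return gpu_cost_per_hour / tokens_per_hour


def drift_ratio(baseline, current):
    """Relative change of current $/token versus the baseline."""
    return (current - baseline) / baseline


def is_regression(baseline, current, tolerance=0.20):
    """True when $/token has risen by at least `tolerance` (assumed 20%)."""
    return drift_ratio(baseline, current) >= tolerance


# Hypothetical numbers: a $2.00/hour GPU, 60,000 tokens in 60 s at baseline,
# then 45,000 tokens in 60 s later on.
baseline = cost_per_token(2.0, 60_000, 60)
current = cost_per_token(2.0, 45_000, 60)
flagged = is_regression(baseline, current)
```

In this example throughput fell by a quarter, so $/token rose by about a third and the window is flagged even though no absolute price limit was involved.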
Privacy & keys
Telemetry is opt-in and anonymized: GPU / model / engine / quant / regime / knee $/token + a coarse volume bucket — never prompts, request contents, IPs, or identity. Keys are stored hashed; the full key is shown once at creation.
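As a rough illustration of the two ideas above, the sketch below stores only a SHA-256 digest of a randomly generated key and reduces traffic volume to a coarse label. The hash choice, key length, and bucket edges are all hypothetical; the page does not specify them.

```python
import hashlib
import hmac
import secrets


def create_key():
    """Return (full_key, stored_hash); only the hash would be persisted."""
    full_key = secrets.token_urlsafe(32)
    stored_hash = hashlib.sha256(full_key.encode()).hexdigest()
    return full_key, stored_hash


def verify_key(presented, stored_hash):
    """Check a presented key against the stored digest in constant time."""
    digest = hashlib.sha256(presented.encode()).hexdigest()
    return hmac.compare_digest(digest, stored_hash)


# Hypothetical bucket edges, in tokens per day.
BUCKETS = [(1_000_000, "<1M"), (10_000_000, "1M-10M"), (100_000_000, "10M-100M")]


def volume_bucket(tokens_per_day):
    """Map an exact volume to a coarse label so the exact figure is not sent."""
    for upper, label in BUCKETS:
        if tokens_per_day < upper:
            return label
    return ">=100M"
```

Because only the digest is kept, the full key cannot be shown again after creation, which matches the show-once behavior described above.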
Kubernetes
A Helm chart ships in live/deploy/helm — set cloudUrl, cloudKey, and upstream in values to run the agent as a sidecar/deployment beside your serving pods. The apply-kit also emits a ready-to-paste Helm args patch set to your measured knee.