Docs

Connect in one command.

No proxy, no code change, nothing in your request path. The agent attaches to your existing vLLM / SGLang / TGI /metrics, measures your cost frontier, watches it for drift, and reports to your dashboard.

Grab your key

Sign up → your dashboard shows the full command with your key pre-filled. Rotate or revoke keys anytime under API keys.

Run it on the box next to your model server

curl -fsSL https://squiz.sh/install.sh | bash -s -- --key YOUR_API_KEY

Requires Docker. Auto-detects your GPU (+ count), model, engine, and a representative $/GPU-hr (override with --gpu-cost). Finds vLLM / SGLang / TGI on the standard ports (:8000 / :30000 / :8080); for a non-standard setup add --vllm http://your-engine:8000 --name prod. Builds + starts the agent (~1 min) and connects to your account.

That’s it — it self-calibrates

squiz auto-runs the first calibration at the next idle moment, then re-checks on a schedule (and the instant it detects cost drift). On an always-busy box it even learns your floor from live traffic — no load injected. Your floors, recover view, drift monitor, and (Pro) floor-over-time history populate on your dashboard automatically — no click. Want it now? On the GPU box open http://127.0.0.1:8800 (from your laptop: ssh -L 8800:127.0.0.1:8800 user@gpu-box) and hit Run frontier sweep.

REFReference

How it works under the hood.

Engines

vLLM and SGLang are first-class (validated on real GPUs); TGI is supported via its /metrics (beta). Engine is auto-detected — never gated by plan.

What it reads

Only your engine's /metrics (Prometheus) — running/waiting requests, KV-cache %, token counters. Optionally a DCGM exporter for true GPU utilization. It never sits in your request path.

Drift monitor

Beyond one-shot floors, the agent watches your live $/token and flags a cost regression — even a +20% creep that never trips a hard threshold — and warns you before it lands on the bill. Surfaced on the History view (Pro continuous).

Privacy & keys

Telemetry is opt-in and anonymized: GPU / model / engine / quant / regime / knee $/token + a coarse volume bucket — never prompts, request contents, IPs, or identity. Keys are stored hashed; the full key is shown once at creation.

Kubernetes

A Helm chart ships in live/deploy/helm — set cloudUrl, cloudKey, and upstream in values to run the agent as a sidecar/deployment beside your serving pods. The apply-kit also emits a ready-to-paste Helm args patch set to your measured knee.