Your GPU reads 100% busy.
You’re overpaying 2–9×.
squiz attaches to your running vLLM, SGLang or TGI — reading /metrics,
nothing in your request path — surfaces your true $/1M-token floor on your own GPUs,
flags it the moment it drifts, and recovers the waste with the one change to make, proven by A/B.
One command · no proxy · every engine free forever · measured on your real GPUs, never estimated.
From blind spend to squeezed savings.
Attach
Run the agent beside your vLLM, SGLang or TGI server with one command. It reads /metrics. No proxy, no code change, nothing in your request path.
Measure
It sweeps the cost-vs-concurrency frontier on your real hardware and finds your efficient knee — cold & cached.
Recover
See $/mo recoverable at your volume, the exact flags to set, then verify the change actually moved your floor.
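Once a sweep exists, the Measure and Recover steps come down to simple arithmetic. The following is a minimal Python sketch, not squiz's implementation: the sweep points, helper names, GPU price and monthly volume are all hypothetical, and it assumes $/1M tokens is the hourly GPU price divided by the tokens produced per hour.

```python
from dataclasses import dataclass


@dataclass
class SweepPoint:
    concurrency: int   # concurrent requests during the probe
    tok_per_s: float   # throughput measured at that concurrency


def cost_per_million(price_per_hr: float, tok_per_s: float) -> float:
    """$ per 1M tokens = hourly price / tokens per hour, scaled to 1M."""
    return price_per_hr / (tok_per_s * 3600) * 1_000_000


def find_floor(points, price_per_hr):
    """Return (point, cost) with the lowest $/1M among the swept points."""
    costed = [(p, cost_per_million(price_per_hr, p.tok_per_s)) for p in points]
    return min(costed, key=lambda pc: pc[1])


def recoverable_per_month(current_cost, floor_cost, tokens_per_month):
    """Monthly dollars saved by moving from the current $/1M to the floor."""
    return max(current_cost - floor_cost, 0.0) * tokens_per_month / 1_000_000


# Hypothetical sweep on a $1.00/hr GPU (illustrative numbers, not measurements).
sweep = [SweepPoint(1, 50.0), SweepPoint(8, 300.0), SweepPoint(32, 500.0)]
floor_point, floor_cost = find_floor(sweep, price_per_hr=1.00)

# Assume the deployment runs at C=8 today and serves 2B tokens per month.
current_cost = cost_per_million(1.00, 300.0)
saved = recoverable_per_month(current_cost, floor_cost, tokens_per_month=2_000_000_000)

print(f"floor: C={floor_point.concurrency}, ${floor_cost:.2f}/1M")
print(f"recoverable: ${saved:.2f}/mo")
```

This sketch simply picks the cheapest swept point. A real sweep would also respect latency targets, which this example ignores.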
A one-off sweep is a snapshot. Your cost is a moving target.
You can run vLLM's auto_tune script once, but it's offline, vLLM-only, and speaks throughput, not dollars. The day after, your traffic shifts, you bump a flag, you add a model, and you're blind again.
vllm auto_tune:
- Offline: spins up its own server, injects synthetic load
- vLLM only · you hand-author the config grid
- Outputs throughput/latency, never dollars
- True the day you ran it; silent when it drifts

squiz:
- Live on your real deployment: reads /metrics, no synthetic load
- vLLM · SGLang · TGI, auto-detected · one command
- Speaks $/1M-token + recoverable $/mo
- Catches drift, alerts, and A/B-proves every fix
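To make "reads /metrics" and "catches drift" concrete, here is a rough Python sketch of the idea, not squiz's actual code. It assumes the engine exposes a Prometheus-format token counter (vLLM publishes one named `vllm:generation_tokens_total`; other engines use different names). The baseline floor, the 15% tolerance and the $1.00/hr price are hypothetical. The two scrapes are inlined as strings so the example runs without a server.

```python
def counter_total(metrics_text: str, metric_name: str) -> float:
    """Sum every sample of one metric in Prometheus text format."""
    total = 0.0
    for line in metrics_text.splitlines():
        line = line.strip()
        if not line or line.startswith("#"):   # skip blanks, HELP and TYPE lines
            continue
        if "{" in line:                        # sample with labels
            name = line[: line.index("{")]
            value_part = line[line.rindex("}") + 1:]
        else:                                  # sample without labels
            name, _, value_part = line.partition(" ")
        if name == metric_name:
            total += float(value_part.split()[0])
    return total


def tok_per_s(scrape_a: str, scrape_b: str, seconds_apart: float, metric_name: str) -> float:
    """Throughput from the counter delta between two scrapes.

    A server restart resets the counter; this sketch does not handle that.
    """
    delta = counter_total(scrape_b, metric_name) - counter_total(scrape_a, metric_name)
    return delta / seconds_apart


def has_drifted(current_cost: float, baseline_floor: float, tolerance: float = 0.15) -> bool:
    """Flag when live $/1M exceeds the measured floor by more than the tolerance."""
    return current_cost > baseline_floor * (1 + tolerance)


METRIC = "vllm:generation_tokens_total"
SCRAPE_T0 = """# TYPE vllm:generation_tokens_total counter
vllm:generation_tokens_total{model_name="demo"} 100000.0
"""
SCRAPE_T1 = """# TYPE vllm:generation_tokens_total counter
vllm:generation_tokens_total{model_name="demo"} 160000.0
"""

rate = tok_per_s(SCRAPE_T0, SCRAPE_T1, seconds_apart=60, metric_name=METRIC)
live_cost = 1.00 / (rate * 3600) * 1_000_000   # $/1M output tokens at $1.00/hr
drifted = has_drifted(live_cost, baseline_floor=0.20)
print(f"{rate:.0f} tok/s, ${live_cost:.3f}/1M, drifted={drifted}")
```

In a live setup the two strings would come from fetching the engine's /metrics endpoint at two points in time. Nothing here sits in the request path.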
Don’t benchmark 50 configs. Predict the curve from 2.
Tuning a serving engine usually means sweeping dozens of configs by hand. squiz Predict learns your engine’s behavior from just 2–3 cheap probes, then reconstructs the whole cost / throughput / latency frontier and answers “what if I batch more, quantize, or change GPU?” — without extra runs.
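The model squiz Predict uses is not described here, so the following Python sketch only illustrates the general idea of fitting a curve from two probes. It assumes, purely for illustration, that throughput saturates as T(C) = t_max · C / (C + k). The function names, probe values and $1.00/hr price are hypothetical.

```python
def fit_saturation(probe_1, probe_2):
    """Fit T(C) = t_max * C / (C + k) through two (concurrency, tok/s) probes.

    The model is linear in reciprocals: 1/T = 1/t_max + (k/t_max) * (1/C).
    """
    (c1, t1), (c2, t2) = probe_1, probe_2
    x1, y1 = 1.0 / c1, 1.0 / t1
    x2, y2 = 1.0 / c2, 1.0 / t2
    slope = (y1 - y2) / (x1 - x2)    # equals k / t_max
    intercept = y1 - slope * x1      # equals 1 / t_max
    if intercept <= 0:
        raise ValueError("probes are not consistent with a saturating curve")
    t_max = 1.0 / intercept
    return t_max, slope * t_max      # (t_max, k)


def predict_tok_per_s(concurrency, t_max, k):
    """Predicted throughput at an unprobed concurrency."""
    return t_max * concurrency / (concurrency + k)


def predict_cost(concurrency, price_per_hr, t_max, k):
    """Predicted $ per 1M tokens at an unprobed concurrency."""
    return price_per_hr / (predict_tok_per_s(concurrency, t_max, k) * 3600) * 1_000_000


# Two hypothetical probes: 500 tok/s at C=10 and 800 tok/s at C=40.
t_max, k = fit_saturation((10, 500.0), (40, 800.0))
for c in (10, 40, 90):
    print(c, round(predict_tok_per_s(c, t_max, k)), round(predict_cost(c, 1.00, t_max, k), 3))
```

A pure saturation curve keeps getting cheaper as concurrency rises, so on its own it has no knee. Real frontiers bend under latency targets and memory pressure, which is why a prediction like this still needs checking against a measured run.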
What a million tokens actually costs — measured on real GPUs.
Reference runs on Qwen2.5-7B (vLLM · bf16). Cold = unique prompts (the comparable floor); cached = shared-prefix (the prefix-cache lever). Your model + traffic differ — the only number that matters is yours, in one command.
| GPU | instance | $/hr | cold floor | cached floor | prefix-cache lever |
|---|---|---|---|---|---|
| H100 | p5.48xlarge (÷8) | $6.88/hr | $0.99 /1M · C=128 | $0.85 /1M · C=96 | 1.2× |
| L40S | g6e.2xlarge | $2.24/hr | $1.25 /1M · C=64 | $0.27 /1M · C=128 | 4.6× |
| L4 | g6.2xlarge | $0.98/hr | $2.55 /1M · C=16 | $0.69 /1M · C=32 | 3.7× |
| A10G | g5.2xlarge | $1.21/hr | $2.24 /1M · C=16 | $0.26 /1M · C=96 | 8.8× |
Measured by squiz on real GPUs · Qwen2.5-7B-Instruct · vLLM bf16 · cold = unique prompts (no prefix-cache hit), cached = shared prefix · $/1M computed from measured tok/s at the per-GPU $/hr shown (the rate each row was measured at) · C = concurrent requests.
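The note above states that $/1M is computed from measured tok/s at each row's $/hr. Assuming that means hourly price divided by tokens per hour, the relationship can be inverted to recover the throughput each floor implies, and the lever column can be recomputed as cold ÷ cached. This is a short Python sketch using only the table's rounded values; the function names are illustrative.

```python
# (gpu, $/hr, cold $/1M, cached $/1M) copied from the table above
ROWS = [
    ("H100", 6.88, 0.99, 0.85),
    ("L40S", 2.24, 1.25, 0.27),
    ("L4",   0.98, 2.55, 0.69),
    ("A10G", 1.21, 2.24, 0.26),
]


def implied_tok_per_s(price_per_hr: float, dollars_per_million: float) -> float:
    """Invert $/1M = price_per_hr / (tok_per_s * 3600) * 1e6 for tok_per_s."""
    return price_per_hr / 3600 / dollars_per_million * 1_000_000


def lever(cold: float, cached: float) -> float:
    """How many times cheaper the cached floor is than the cold floor."""
    return cold / cached


results = {}
for gpu, price, cold, cached in ROWS:
    results[gpu] = (
        implied_tok_per_s(price, cold),
        implied_tok_per_s(price, cached),
        lever(cold, cached),
    )
    cold_tps, cached_tps, ratio = results[gpu]
    print(f"{gpu}: cold ~{cold_tps:.0f} tok/s, cached ~{cached_tps:.0f} tok/s, lever {ratio:.1f}x")
```

Ratios recomputed from the rounded floors can differ slightly from the lever column. The A10G row gives about 8.6× here against the 8.8× shown, presumably because the column was derived from unrounded floors.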
A fixed $/GPU-hr hides a 2–9× swing in $/token depending on batching + caching — the hourly bill is not your unit cost.
Shared system or RAG prefixes cut the floor by 1.2× to 8.8× in these runs (the lever column), but only if your prompts actually share a prefix.
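An H100 costs far more per hour than an A10G, yet in the cold runs above it wins on $/token ($0.99 vs $2.24 per 1M) because it sustains a much larger batch (C=128 vs C=16). With a shared prefix the A10G comes out ahead ($0.26 vs $0.85 per 1M). Measure, don't assume.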
See your real $/token — free.
Connect a deployment in one command. Free forever, up to 3 deployments — upgrade when you want your whole fleet, alerting, and history.