vLLM · SGLang · TGI

Your GPU reads 100% busy.
You’re overpaying 2–9×.

squiz attaches to your running vLLM, SGLang or TGI by reading /metrics, with nothing in your request path. It surfaces your true $/1M-token floor on your own GPUs, flags the moment that floor drifts, and recovers the waste with the one change to make, proven by A/B.

Measured: $1,984/mo recovered on one A10G (cold $2.24 → cached $0.26 per 1M, at ~1B output tokens/mo).

One command · no proxy · every engine free forever · measured on your real GPUs, never estimated.

[Chart: $/1M tokens vs. concurrency · no cache: 2.24 · your floor: 0.26]
Typical overpay
2–9×
same GPU, same hour — just a mis-sized config. The prefix-cache lever alone hit 8.8× in our measured A10G run (cold vs cached).
The number no dashboard shows
$/1M
your true cost per million output tokens, cold & cached.
Every fix, proven
A/B
apply, re-measure, confirm it helped — or roll back.
01–03 · How it works

From blind spend to squeezed savings.

01 · ATTACH

Attach

Run the agent beside your vLLM, SGLang or TGI server with one command. It reads /metrics. No proxy, no code change, nothing in your request path.
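As a rough illustration of what reading /metrics involves (this is not squiz's actual implementation), the sketch below parses two Prometheus-format scrapes and derives output tokens per second from a counter. The counter name `vllm:generation_tokens_total` is the one recent vLLM versions expose, as far as we know; SGLang and TGI use different names. The helper names and sample scrapes are hypothetical.

```python
import re

# Assumed counter name (recent vLLM); other engines expose different names.
COUNTER = "vllm:generation_tokens_total"

def read_counter(metrics_text: str, name: str = COUNTER) -> float:
    """Sum every sample of one counter in a Prometheus text-format scrape."""
    total = 0.0
    for line in metrics_text.splitlines():
        if not line.strip() or line.startswith("#"):
            continue  # skip blanks and HELP/TYPE comments
        # Matches 'name{labels} value' as well as 'name value'.
        m = re.match(r"^([^\s{]+)(?:\{.*\})?\s+(\S+)", line)
        if m and m.group(1) == name:
            total += float(m.group(2))
    return total

def tokens_per_second(scrape_a: str, scrape_b: str, seconds_apart: float) -> float:
    """Output-token rate between two scrapes taken `seconds_apart` seconds apart."""
    return (read_counter(scrape_b) - read_counter(scrape_a)) / seconds_apart

# Hypothetical scrapes taken 10 seconds apart.
first = '# TYPE vllm:generation_tokens_total counter\nvllm:generation_tokens_total{model_name="m"} 1000.0\n'
second = '# TYPE vllm:generation_tokens_total counter\nvllm:generation_tokens_total{model_name="m"} 3500.0\n'
rate = tokens_per_second(first, second, 10.0)
```

Because the rate comes from counters the engine already publishes, nothing has to sit between clients and the server.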

02 · MEASURE

Measure

It sweeps the cost-vs-concurrency frontier on your real hardware and finds your efficient knee — cold & cached.
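One way to make "efficient knee" concrete is the lowest concurrency whose $/1M sits within a small tolerance of the best point in the sweep. That definition, the 5% tolerance, and the sample numbers below are assumptions for illustration, not squiz's published method.

```python
def efficient_knee(cost_by_concurrency: dict[int, float], tolerance: float = 0.05) -> int:
    """Lowest concurrency whose $/1M is within `tolerance` of the sweep minimum."""
    if not cost_by_concurrency:
        raise ValueError("empty sweep")
    threshold = min(cost_by_concurrency.values()) * (1 + tolerance)
    return min(c for c, cost in cost_by_concurrency.items() if cost <= threshold)

# Hypothetical sweep: $/1M output tokens measured at each concurrency level.
sweep = {1: 9.80, 4: 3.10, 16: 1.40, 64: 0.95, 96: 0.92, 128: 0.93}
knee = efficient_knee(sweep)
```

Picking the lowest qualifying concurrency, rather than the absolute minimum, avoids paying extra latency for a negligible cost gain.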

03 · RECOVER

Recover

See $/mo recoverable at your volume, the exact flags to set, then verify the change actually moved your floor.
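The recoverable figure is simple arithmetic: the gap between what you pay now and your floor, multiplied by monthly volume. A minimal sketch with a hypothetical function name, using the rounded A10G numbers quoted on this page:

```python
def recoverable_per_month(current_per_1m: float, floor_per_1m: float,
                          tokens_per_month: float) -> float:
    """Dollars per month saved by moving from the current $/1M to the floor."""
    gap = max(current_per_1m - floor_per_1m, 0.0)  # never report negative savings
    return gap * tokens_per_month / 1_000_000

# Cold $2.24 vs cached $0.26 per 1M, at about 1B output tokens per month.
savings = recoverable_per_month(2.24, 0.26, 1_000_000_000)
```

With these rounded inputs the result is about $1,980/mo; the $1,984 headline presumably reflects the unrounded measurements.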

Why not just run a benchmark?

A one-off sweep is a snapshot. Your cost is a moving target.

You can run vllm auto_tune once — but it’s offline, vLLM-only, and speaks throughput, not dollars. The day after, your traffic shifts, you bump a flag, you add a model, and you’re blind again.

A one-off benchmark (auto_tune, bench sweep)
  • Offline — spins up its own server, injects synthetic load
  • vLLM only · you hand-author the config grid
  • Outputs throughput/latency — never dollars
  • True the day you ran it; silent when it drifts
squiz
  • Live on your real deployment — reads /metrics, no synthetic load
  • vLLM · SGLang · TGI, auto-detected · one command
  • Speaks $/1M-token + recoverable $/mo
  • Catches drift, alerts, and A/B-proves every fix
Predict, don't sweep

Don’t benchmark 50 configs. Predict the curve from 2.

Tuning a serving engine usually means sweeping dozens of configs by hand. squiz Predict learns your engine’s behavior from just 2–3 cheap probes, then reconstructs the whole cost / throughput / latency frontier and answers “what if I batch more, quantize, or change GPU?” — without extra runs.

Probes needed
2
to reconstruct the whole frontier — not 50 benchmark runs.
Predict from 2 probes
2-probe
From a couple of cheap probes, squiz reconstructs the whole frontier. In our A10G runs, held-out $/token landed within roughly 10–20% of measured (MAPE, regime-dependent). Then you verify on your own hardware.
Finds waste for you
auto
squiz auto-locates your efficient knee and where throughput leaks — no manual config-hunting.
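To show how two probes can pin down a whole curve, here is a sketch that fits a two-parameter saturation model, tok/s = t_max · C / (C + k), and scores predictions with MAPE. The model form is our assumption for illustration; squiz Predict's actual model is not described here, and the probe values are hypothetical.

```python
def fit_saturation(probe_1: tuple[float, float], probe_2: tuple[float, float]) -> tuple[float, float]:
    """Fit tok/s = t_max * C / (C + k) exactly through two (concurrency, tok/s) probes."""
    (c1, t1), (c2, t2) = probe_1, probe_2
    # The model is linear in reciprocals: 1/T = 1/t_max + (k/t_max) * (1/C).
    x1, y1, x2, y2 = 1 / c1, 1 / t1, 1 / c2, 1 / t2
    slope = (y1 - y2) / (x1 - x2)
    intercept = y1 - slope * x1
    if intercept <= 0:
        raise ValueError("probes show no saturation; the model does not apply")
    return 1 / intercept, slope / intercept  # t_max, k

def predict_tps(concurrency: float, t_max: float, k: float) -> float:
    """Predicted tok/s at a given concurrency."""
    return t_max * concurrency / (concurrency + k)

def mape(predicted: list[float], measured: list[float]) -> float:
    """Mean absolute percentage error, in percent."""
    return 100 * sum(abs(p - m) / abs(m) for p, m in zip(predicted, measured)) / len(measured)

# Hypothetical probes at C=4 and C=32, then a prediction at an unmeasured C=96.
t_max, k = fit_saturation((4, 166.7), (32, 615.4))
predicted_at_96 = predict_tps(96, t_max, k)
```

Comparing predictions against held-out measurements with `mape` is the same check the 10–20% figure above refers to, applied there to $/token.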
Cross-GPU benchmark

What a million tokens actually costs — measured on real GPUs.

Reference runs on Qwen2.5-7B (vLLM · bf16). Cold = unique prompts (the comparable floor); cached = shared-prefix (the prefix-cache lever). Your model + traffic differ — the only number that matters is yours, in one command.

GPU  | instance         | $/hr     | cold floor        | cached floor      | prefix-cache lever
H100 | p5.48xlarge (÷8) | $6.88/hr | $0.99 /1M · C=128 | $0.85 /1M · C=96  | 1.2×
L40S | g6e.2xlarge      | $2.24/hr | $1.25 /1M · C=64  | $0.27 /1M · C=128 | 4.6×
L4   | g6.2xlarge       | $0.98/hr | $2.55 /1M · C=16  | $0.69 /1M · C=32  | 3.7×
A10G | g5.2xlarge       | $1.21/hr | $2.24 /1M · C=16  | $0.26 /1M · C=96  | 8.8×

Measured by squiz on real GPUs · Qwen2.5-7B-Instruct · vLLM bf16 · cold = unique prompts (no prefix-cache hit), cached = shared prefix · $/1M computed from measured tok/s at the per-GPU $/hr shown (the rate each row was measured at) · C = concurrent requests.
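The conversion from measured tok/s to $/1M is a unit change: dollars per hour divided by tokens per hour, scaled to a million tokens. A minimal sketch with hypothetical function names; the tok/s value is back-solved from the rounded A10G cold row, not a published measurement.

```python
def cost_per_million(dollars_per_hour: float, tokens_per_second: float) -> float:
    """$/1M output tokens at a given hourly rate and sustained tok/s."""
    tokens_per_hour = tokens_per_second * 3600
    return dollars_per_hour / tokens_per_hour * 1_000_000

def implied_tokens_per_second(dollars_per_hour: float, dollars_per_million: float) -> float:
    """Inverse: the tok/s needed to hit a given $/1M at a given hourly rate."""
    return dollars_per_hour * 1_000_000 / (3600 * dollars_per_million)

# A10G at $1.21/hr: the tok/s implied by a $2.24 /1M cold floor.
a10g_cold_tps = implied_tokens_per_second(1.21, 2.24)
a10g_cold_cost = cost_per_million(1.21, a10g_cold_tps)
```

The same hourly rate gives very different unit costs as tok/s changes, which is why the table reports $/1M next to $/hr.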

Same hour, wildly different $

A fixed $/GPU-hr hides a 2–9× swing in $/token depending on batching + caching — the hourly bill is not your unit cost.

Prefix caching is the biggest lever

Shared system / RAG prefixes cut the floor by 1.2× to 8.8× in the runs above (the lever column), but only if your prompts actually share a prefix.

Cheapest $/hr ≠ cheapest $/token

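In the cold runs above, the H100 costs about 5.7× more per hour than the A10G ($6.88 vs $1.21) yet has the lowest floor ($0.99 vs $2.24 /1M), reached at far higher concurrency (C=128 vs C=16). On cached traffic the A10G wins ($0.26 vs $0.85 /1M). Measure, don't assume.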

See your real $/token — free.

Connect a deployment in one command. Free forever, up to 3 deployments — upgrade when you want your whole fleet, alerting, and history.

See your $/token — free →