vLLM · SGLang · TGI

Your GPU reads 100% busy.
You’re overpaying 2–9×.

squiz attaches to your running vLLM, SGLang or TGI — reading /metrics, nothing in your request path — surfaces your true $/1M-token floor on your own GPUs, flags it the moment it drifts, and recovers the waste with the one change to make, proven by A/B.

See your $/token, free → · How it works

Measured: $1,984/mo recovered on one A10G (cold $2.24 → cached $0.26 per 1M, ~1B output tokens/mo).

One command · no proxy · every engine free forever · measured on your real GPUs, never estimated.

Typical overpay

2–9×

same GPU, same hour — just a mis-sized config. The prefix-cache lever alone hit 8.8× in our measured A10G run (cold vs cached).

The number no dashboard shows

$/1M

your true cost per million output tokens, cold & cached.

Every fix, proven

A/B

apply, re-measure, confirm it helped — or roll back.

01–03 · How it works

From blind spend to squeezed savings.

01 · ATTACH

Attach

Run the agent beside your vLLM, SGLang or TGI server with one command. It reads /metrics: no proxy, no code change, nothing in your request path.
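To give a feel for what reading /metrics involves, here is a minimal sketch (not squiz's actual agent). It assumes the Prometheus text format that vLLM serves and its vllm:generation_tokens_total counter; the helper names are hypothetical, and SGLang and TGI expose different metric names.

```python
import re

# One Prometheus sample line: name, optional {labels}, value, optional timestamp.
_SAMPLE = re.compile(r'^([A-Za-z_:][A-Za-z0-9_:]*)(\{.*\})?\s+(\S+)(?:\s+-?\d+)?$')

def counter_total(metrics_text, name):
    """Sum one metric across all of its label sets in a /metrics scrape."""
    total = 0.0
    for line in metrics_text.splitlines():
        line = line.strip()
        if not line or line.startswith("#"):   # skip blanks and HELP/TYPE comments
            continue
        match = _SAMPLE.match(line)
        if match and match.group(1) == name:
            total += float(match.group(3))
    return total

def tokens_per_second(scrape_a, scrape_b, seconds_apart,
                      name="vllm:generation_tokens_total"):
    """Output tok/s between two scrapes of a monotonically increasing counter."""
    delta = counter_total(scrape_b, name) - counter_total(scrape_a, name)
    return delta / seconds_apart
```

Two scrapes taken 10 seconds apart with the counter at 1,000 and then 4,000 would give 300 tok/s. Nothing here touches the request path; it only reads text the engine already publishes.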

02 · MEASURE

Measure

It sweeps the cost-vs-concurrency frontier on your real hardware and finds your efficient knee — cold & cached.
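One way to picture the knee: given a swept frontier of (concurrency, $/1M) points, take the lowest concurrency whose cost sits within a few percent of the floor. This is a sketch under that assumed definition; squiz's own criterion is not documented here, and the sweep values are illustrative rather than measured.

```python
def cheapest_point(frontier):
    """frontier: list of (concurrency, usd_per_million) pairs."""
    return min(frontier, key=lambda point: point[1])

def efficient_knee(frontier, tolerance=0.05):
    """Lowest concurrency whose cost is within `tolerance` of the floor.

    The 5% default is an assumption chosen for illustration.
    """
    floor_cost = cheapest_point(frontier)[1]
    near_floor = [p for p in frontier if p[1] <= floor_cost * (1 + tolerance)]
    return min(near_floor, key=lambda point: point[0])

# Illustrative sweep, not measured data.
sweep = [(1, 9.80), (4, 4.10), (16, 2.30), (32, 2.21), (64, 2.20), (128, 2.45)]
```

On this sweep the absolute floor is at C=64, but C=16 already lands within 5% of it, so the knee is C=16: most of the saving at a quarter of the queue depth.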

03 · RECOVER

Recover

See $/mo recoverable at your volume, the exact flags to set, then verify the change actually moved your floor.
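The verify step can be sketched as a before/after comparison of $/1M samples. The function name and the 3% minimum gain are assumptions for illustration; a production A/B would also control for traffic mix and sample noise.

```python
from statistics import mean

def ab_verdict(before, after, min_gain=0.03):
    """Compare $/1M samples taken before and after a config change.

    Returns (verdict, relative_change); a negative change means cheaper.
    """
    baseline, trial = mean(before), mean(after)
    change = (trial - baseline) / baseline
    verdict = "keep" if change <= -min_gain else "roll back"
    return verdict, change
```

A change that halves the floor is kept; one that moves it by a fraction of a percent is treated as noise and rolled back.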

≠ Why not just run a benchmark?

A one-off sweep is a snapshot. Your cost is a moving target.

You can run vLLM’s auto_tune script once, but it’s offline, vLLM-only, and speaks throughput, not dollars. The day after, your traffic shifts, you bump a flag, you add a model, and you’re blind again.

A one-off benchmark (auto_tune, bench sweep)

Offline — spins up its own server, injects synthetic load
vLLM only · you hand-author the config grid
Outputs throughput/latency — never dollars
True the day you ran it; silent when it drifts

squiz

Live on your real deployment — reads /metrics, no synthetic load
vLLM · SGLang · TGI, auto-detected · one command
Speaks $/1M-token + recoverable $/mo
Catches drift, alerts, and A/B-proves every fix
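Drift detection can be as simple as comparing recent $/1M samples against a recorded baseline. The sketch below assumes a median-over-window rule and a 15% threshold; both are illustrative choices, not squiz's documented behavior.

```python
from statistics import median

def has_drifted(baseline_usd_per_m, recent_samples, threshold=0.15):
    """True when the median recent $/1M exceeds the baseline by more than `threshold`."""
    return median(recent_samples) > baseline_usd_per_m * (1 + threshold)
```

Using the median rather than the latest sample keeps one slow scrape from firing an alert.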

◢ Predict, don't sweep

Don’t benchmark 50 configs. Predict the curve from 2.

Tuning a serving engine usually means sweeping dozens of configs by hand. squiz Predict learns your engine’s behavior from just 2–3 cheap probes, then reconstructs the whole cost / throughput / latency frontier and answers “what if I batch more, quantize, or change GPU?” — without extra runs.

Probes needed

2–3

to reconstruct the whole frontier, not 50 benchmark runs.

Predict from 2 probes

2-probe

from a couple of cheap probes squiz reconstructs the whole frontier; in our A10G runs held-out $/token landed within roughly 10–20% of measured values (MAPE, varying by regime). You then verify on your own deployment.
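To illustrate the idea of predicting a curve from two probes, here is a sketch with a stand-in model, not squiz Predict's actual method. It assumes throughput saturates as tok/s = C / (a + b·C), which makes C / (tok/s) linear in C, so two probes pin down both parameters. MAPE is then computed against held-out measurements.

```python
def fit_two_probes(probe_1, probe_2):
    """Fit tok/s = C / (a + b*C) from two (concurrency, tok_per_s) probes."""
    (c1, t1), (c2, t2) = probe_1, probe_2
    y1, y2 = c1 / t1, c2 / t2        # under the assumed model, C / tok_per_s = a + b*C
    b = (y2 - y1) / (c2 - c1)
    a = y1 - b * c1
    return a, b

def predict_tok_per_s(a, b, concurrency):
    """Predicted throughput at an unprobed concurrency."""
    return concurrency / (a + b * concurrency)

def mape(predicted, measured):
    """Mean absolute percentage error, returned as a fraction (0.10 = 10%)."""
    errors = [abs(p - m) / m for p, m in zip(predicted, measured)]
    return sum(errors) / len(errors)
```

Probing at C=1 and C=64 and then checking the prediction at C=16 against a real measurement is the kind of held-out test the 10–20% figure describes. A two-parameter curve cannot capture every regime, which is why the verification step matters.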

Finds waste for you

auto

squiz auto-locates your efficient knee and where throughput leaks — no manual config-hunting.

◇ Cross-GPU benchmark

What a million tokens actually costs — measured on real GPUs.

Reference runs on Qwen2.5-7B (vLLM · bf16). Cold = unique prompts (the comparable floor); cached = shared-prefix (the prefix-cache lever). Your model + traffic differ — the only number that matters is yours, in one command.

GPU	instance	$/hr	cold floor	cached floor	prefix-cache lever
H100	p5.48xlarge (÷8)	$6.88/hr	$0.99 /1M · C=128	$0.85 /1M · C=96	1.2×
L40S	g6e.2xlarge	$2.24/hr	$1.25 /1M · C=64	$0.27 /1M · C=128	4.6×
L4	g6.2xlarge	$0.98/hr	$2.55 /1M · C=16	$0.69 /1M · C=32	3.7×
A10G	g5.2xlarge	$1.21/hr	$2.24 /1M · C=16	$0.26 /1M · C=96	8.8×

Measured by squiz on real GPUs · Qwen2.5-7B-Instruct · vLLM bf16 · cold = unique prompts (no prefix-cache hit), cached = shared prefix · $/1M computed from measured tok/s at the per-GPU $/hr shown (the rate each row was measured at) · C = concurrent requests.
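The arithmetic behind the table can be sketched as follows: $/1M is the hourly rate divided by tokens produced per hour, the lever is cold over cached, and recoverable $/mo is the gap between the two floors times monthly volume. The function names are hypothetical.

```python
def usd_per_million(usd_per_hour, tok_per_s):
    """$/1M output tokens from an hourly GPU rate and measured throughput."""
    tokens_per_hour = tok_per_s * 3600
    return usd_per_hour / tokens_per_hour * 1_000_000

def prefix_cache_lever(cold_usd_per_m, cached_usd_per_m):
    """How many times cheaper the cached floor is than the cold floor."""
    return cold_usd_per_m / cached_usd_per_m

def recoverable_per_month(cold_usd_per_m, cached_usd_per_m, tokens_per_month):
    """$/mo saved by moving from the cold floor to the cached floor."""
    return (cold_usd_per_m - cached_usd_per_m) * tokens_per_month / 1_000_000
```

For example, a $3.60/hr GPU sustaining 1,000 tok/s costs $1.00 per 1M tokens. Plugging in the rounded A10G floors ($2.24 and $0.26) at 1B tokens/mo gives about $1,980/mo; the published $1,984 presumably comes from the unrounded measurements.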

Same hour, wildly different $

A fixed $/GPU-hr hides a 2–9× swing in $/token depending on batching + caching — the hourly bill is not your unit cost.

Prefix caching is the biggest lever

Shared system / RAG prefixes cut the floor by 1.2× to 8.8× in the runs above (the lever column), but only if your prompts actually share a prefix.

Cheapest $/hr ≠ cheapest $/token

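An H100 costs far more per hour than an A10G ($6.88 vs $1.21 above), yet it often wins on $/token at scale because it sustains far larger batches: its cold floor here is $0.99 vs $2.24 per 1M. With a shared prefix the A10G wins instead ($0.26 vs $0.85). Measure, don't assume.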

See your real $/token — free.

Connect a deployment in one command. Free forever, up to 3 deployments — upgrade when you want your whole fleet, alerting, and history.

See your $/token — free →
