Running GLM-5.2 on half the GPUs: a W4A16 + MTP quantization

TL;DR — We quantized GLM-5.2 (744B-parameter MoE) to 4-bit weights while keeping its multi-token-prediction (MTP) draft head in BF16. The result matches the FP8 release on quality, fits on four H200s instead of eight, and — because of MTP — is the fastest of the popular 4-bit GLM-5.2 quants in the interactive serving regime. It's public, MIT-licensed: canada-quant/GLM-5.2-W4A16-MTP.

Why

GLM-5.2 is one of the strongest open models you can run today, but it's big: a 744B mixture-of-experts network with roughly 1.49 TB of weights in BF16. That is more than eight 141 GB H200s can hold (about 1.13 TB combined), and even the FP8 release we compare against below occupies a full eight-H200 node to serve a single replica. That footprint is the thing standing between "great open model" and "great open model we can actually afford to serve."
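
As a back-of-the-envelope check on those footprint numbers, the sketch below counts weight bytes only. It assumes a uniform bytes-per-parameter figure for each format (a simplification: real FP8 and INT4 checkpoints keep some tensors in higher precision) and ignores KV cache and activations, so the GPU counts are lower bounds rather than serving recommendations. The helper names are ours, for illustration.

```python
import math

PARAMS = 744e9   # parameter count quoted in the post
H200_GB = 141    # HBM per H200, in GB

def weight_gb(params: float, bytes_per_param: float) -> float:
    """Weight storage in GB, assuming one uniform precision."""
    return params * bytes_per_param / 1e9

def min_gpus(total_gb: float, gpu_gb: float = H200_GB) -> int:
    """Smallest GPU count whose combined HBM holds the weights alone."""
    return math.ceil(total_gb / gpu_gb)

bf16_gb = weight_gb(PARAMS, 2)   # two bytes per parameter
fp8_gb = weight_gb(PARAMS, 1)    # one byte per parameter (simplified)
w4a16_gb = 405                   # checkpoint size reported in the post

for name, gb in [("BF16", bf16_gb), ("FP8", fp8_gb), ("W4A16", w4a16_gb)]:
    print(f"{name:6s} {gb:7.0f} GB  -> at least {min_gpus(gb)} x H200")
```

The gap between these lower bounds and the GPU counts used in practice is the headroom needed for KV cache, activations, and parallelism layouts that prefer power-of-two device counts.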

The goal of this project was narrow and practical: shrink the serving footprint without giving up quality, and without giving up speed. Not a new model — every bit of capability here comes from the base GLM-5.2. Just a cheaper way to run it.

What we did

The artifact is a W4A16 quantization: routed-expert weights compressed to INT4 (group-size 128, GPTQ, via llm-compressor), with attention, the dense prefix layers, shared experts, the router, embeddings, and the LM head left in BF16. That drops the checkpoint from ~1.49 TB to ~405 GB — small enough for 4×H200.
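
To make "INT4, group-size 128" concrete, here is a minimal sketch of group-wise asymmetric 4-bit quantization. It uses plain round-to-nearest with a per-group scale and minimum offset, which is not GPTQ: GPTQ additionally uses calibration data to compensate rounding error, and llm-compressor's on-disk format differs. The function names are hypothetical.

```python
import random

GROUP_SIZE = 128   # weights sharing one scale/offset pair
QMAX = 15          # unsigned 4-bit codes span 0..15

def quantize_group(weights):
    """Map one group of floats to 4-bit codes plus (scale, offset)."""
    lo, hi = min(weights), max(weights)
    scale = (hi - lo) / QMAX or 1.0   # fall back to 1.0 for a constant group
    codes = [min(QMAX, max(0, round((w - lo) / scale))) for w in weights]
    return codes, scale, lo

def dequantize_group(codes, scale, lo):
    """Reconstruct approximate floats from codes."""
    return [c * scale + lo for c in codes]

def quantize_row(row):
    """Quantize a weight row group by group."""
    return [quantize_group(row[i:i + GROUP_SIZE])
            for i in range(0, len(row), GROUP_SIZE)]

random.seed(0)
row = [random.gauss(0.0, 0.02) for _ in range(512)]   # toy weight row
groups = quantize_row(row)
restored = [w for g in groups for w in dequantize_group(*g)]
max_err = max(abs(a - b) for a, b in zip(row, restored))
max_scale = max(scale for _, scale, _ in groups)
print(f"{len(groups)} groups, max error {max_err:.5f}, half-step {max_scale / 2:.5f}")
```

Each weight lands within half a quantization step of its original value, and the step is set per group of 128, so one outlier only coarsens its own group. Storing a 16-bit scale per group would add 16/128 = 0.125 bits per weight on top of the 4-bit codes.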

The piece we were careful to keep is the MTP (multi-token-prediction) layer, preserved in BF16. MTP is GLM-5.2's built-in draft head for speculative decoding: the small head proposes several future tokens, the full model verifies them in one pass, and accepted tokens are exactly what the model would have produced anyway. It's a lossless speedup — it changes latency, not answers — and most 4-bit quants drop it. Preserving it (we inject it back at BF16 after quantization) is what makes this artifact fast where it counts.
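
The "lossless" claim is easiest to see in a toy draft-then-verify loop. The sketch below assumes greedy decoding and uses two stand-in next-token functions (hypothetical, not GLM-5.2's MTP head or the full model). A real server verifies all drafted tokens in one batched forward pass and handles sampling with a rejection rule; here verification is written as a loop for clarity.

```python
def target_next(ctx):
    """Stand-in for the full model: counts upward modulo 10."""
    return (ctx[-1] + 1) % 10

def draft_next(ctx):
    """Stand-in for the draft head: agrees with the target except after a 4."""
    return 0 if ctx[-1] == 4 else (ctx[-1] + 1) % 10

def speculative_step(target, draft, context, k):
    """Draft k tokens, keep the prefix the target agrees with."""
    proposal = []
    for _ in range(k):
        proposal.append(draft(context + proposal))
    accepted = []
    for tok in proposal:
        expected = target(context + accepted)
        if tok != expected:
            accepted.append(expected)   # first mismatch: take the target's token
            return accepted
        accepted.append(tok)
    accepted.append(target(context + accepted))   # all accepted: one bonus token
    return accepted

def generate_greedy(context, n_new):
    out = list(context)
    for _ in range(n_new):
        out.append(target_next(out))
    return out

def generate_speculative(context, n_new, k):
    out = list(context)
    goal = len(out) + n_new
    while len(out) < goal:
        out.extend(speculative_step(target_next, draft_next, out, k))
    return out[:goal]

print(generate_speculative([0], 12, k=3) == generate_greedy([0], 12))
```

Every emitted token is either a drafted token the target agreed with or the target's own choice, so the output sequence matches plain greedy decoding; only the number of target passes changes.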

A couple of details that mattered: we calibrate every expert (skipping rare experts quietly produces a coherent-looking but degraded model), and we calibrate on an in-distribution code/instruction mix rather than generic chat text.
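
One way to catch under-calibrated experts is to count how many calibration tokens each expert actually received. The sketch below is illustrative only: it assumes you have already collected the router's per-token expert choices (for example with forward hooks), and the function name and threshold are hypothetical rather than part of our pipeline.

```python
from collections import Counter

def expert_coverage(routed_expert_ids, num_experts, min_tokens):
    """Count tokens per expert and list experts below the threshold.

    routed_expert_ids: iterable of per-token lists of chosen expert ids.
    """
    counts = Counter(e for token in routed_expert_ids for e in token)
    starved = [e for e in range(num_experts) if counts[e] < min_tokens]
    return counts, starved

# Toy trace: 4 tokens, top-2 routing over 4 experts; expert 3 is never chosen.
trace = [[0, 1], [0, 2], [0, 1], [1, 2]]
counts, starved = expert_coverage(trace, num_experts=4, min_tokens=1)
print("tokens per expert:", dict(counts))
print("experts needing more calibration data:", starved)
```

Any expert in the starved list would be quantized from little or no activation data, which is the failure mode described above.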

Does it hold up?

We measured it against the official zai-org/GLM-5.2-FP8 release on the same harness, 8×H200:

Quality — W4A16+MTP vs FP8, same harness, 8×H200.
Task	W4A16+MTP	FP8
GSM8K (strict)	0.960	0.955
IFEval (prompt / inst strict)	0.909 / 0.911	0.891 / 0.903
MATH-500	0.954	0.958
RULER @ 32K / 64K	0.832 / 0.841	0.831 / 0.813
SWE-bench Verified	82.0%	82.2%

Within run-to-run noise on reasoning, instruction-following, long-context retrieval, and agentic coding. 4-bit weight quantization, done carefully, costs essentially nothing in quality here. It also still serves the full 1M-token context on 8×H200 (we retrieved a needle from a ~936K-token prompt), and runs at up to 128K context on just 4×H200.
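
A rough yardstick for "within noise" is the sampling error implied by benchmark size. The sketch below treats each score as a binomial proportion and expresses the W4A16 vs FP8 gap in units of the standard error of the difference. The problem counts are the standard sizes of these benchmarks, and assuming the full sets were run is our assumption here. Both models see the same items, so this is not a paired significance test, only a sanity check.

```python
import math

def gap_in_standard_errors(p_a: float, p_b: float, n: int) -> float:
    """Score gap divided by the standard error of a difference of two proportions."""
    pooled = (p_a + p_b) / 2
    se_diff = math.sqrt(2 * pooled * (1 - pooled) / n)
    return abs(p_a - p_b) / se_diff

# (W4A16+MTP score, FP8 score, number of problems)
TASKS = {
    "GSM8K": (0.960, 0.955, 1319),
    "MATH-500": (0.954, 0.958, 500),
    "SWE-bench Verified": (0.820, 0.822, 500),
}

gaps = {name: gap_in_standard_errors(a, b, n) for name, (a, b, n) in TASKS.items()}
for name, z in gaps.items():
    print(f"{name:20s} gap = {z:.2f} standard errors")
```

All three gaps come out below one standard error, which is consistent with reading the table as a tie rather than a win for either format.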

Is it fast?

Against FP8, throughput is 48% higher at concurrency 1 and 32% higher at concurrency 8, the low-concurrency range where MTP helps most, and about 13% lower at full saturation. It's an honest trade-off, and we show it in both directions.

We also put it head-to-head with the most-downloaded H200-servable 4-bit GLM-5.2 quants (output tokens/s, same harness):

Output tokens/s by concurrency, same harness. The first data column is ours. Higher is better.
Concurrency	W4A16+MTP (ours)	AWQ-INT4	NVFP4
1	132	78	74
8	466	465	410
32	825	960	944

The story is clear and, we think, useful: in the interactive/agentic regime, where latency matters, MTP puts us well ahead at concurrency 1 (+69% over AWQ-INT4, +78% over NVFP4). At concurrency 8 we are level with AWQ-INT4 and about 14% ahead of NVFP4. At full saturation (concurrency 32 in the table) the simpler no-MTP quants pull ahead by ~15%, because speculative decoding's overhead stops paying off once the batch is already full. We measured throughput only here and did not evaluate the other quants' quality. We expect it to be close across 4-bit quants of the same base model, in which case the real deployment differentiator is speed. (NVFP4 is a Blackwell-native FP4 format; on Hopper it runs without FP4 tensor cores. We did not measure it on Blackwell, where the picture may differ.)
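
The percentages can be recomputed directly from the table. The short sketch below does that; the only inputs are the published tokens/s figures, and the helper name is ours.

```python
OURS = "W4A16+MTP"

# Output tokens/s from the table above, keyed by concurrency.
THROUGHPUT = {
    1: {"W4A16+MTP": 132, "AWQ-INT4": 78, "NVFP4": 74},
    8: {"W4A16+MTP": 466, "AWQ-INT4": 465, "NVFP4": 410},
    32: {"W4A16+MTP": 825, "AWQ-INT4": 960, "NVFP4": 944},
}

def relative_gain(ours: float, other: float) -> float:
    """Percent by which `ours` exceeds `other` (negative means slower)."""
    return (ours / other - 1) * 100

for concurrency, row in THROUGHPUT.items():
    for name, tps in row.items():
        if name != OURS:
            gain = relative_gain(row[OURS], tps)
            print(f"concurrency {concurrency:>2}  vs {name:<8}  {gain:+.1f}%")
```

Note that the direction of the ratio matters at saturation: we are about 14% slower than AWQ-INT4 at concurrency 32, which is the same gap as AWQ-INT4 being about 16% faster than us.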

The hard parts

Quantizing a 744B MoE and serving it isn't a one-command job, and we documented every wall we hit (F010–F029 in the repo) so the dead-ends don't have to be re-walked:

Calibration on a single box is bounded by where the model + its expert linearization + activations physically fit. The fix was per-Linear sequential targets — but GLM-5.2's sparse-attention indexer uses data-dependent control flow that can't be traced that way, which capped calibration context. We filed the traceability gap upstream.
The MTP head wasn't there to quantize — from_pretrained doesn't instantiate it — so it never got saved. We inject it back at BF16 post-quantization.
Serving the asymmetric INT4 MoE required expert-parallelism to dodge a tensor-parallel scale-sharding bug, and the sparse-attention kernels needed a specific CUDA toolchain.
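
The MTP re-injection step can be pictured as a merge of checkpoint indexes. The sketch below works on the `weight_map` dictionaries found in a sharded checkpoint's `model.safetensors.index.json` and registers the missing tensors under a new shard. The tensor prefix `mtp_head.` and the shard filename are hypothetical placeholders, not GLM-5.2's real names, and copying the tensor data itself and updating the size metadata are left out.

```python
def inject_tensors(quant_index, source_index, prefix, shard_name):
    """Return a new index with source tensors under `prefix` added.

    Only tensors missing from the quantized index are added; they are
    mapped to `shard_name`, where their BF16 data would be written.
    """
    merged = dict(quant_index["weight_map"])
    added = [name for name in source_index["weight_map"]
             if name.startswith(prefix) and name not in merged]
    for name in added:
        merged[name] = shard_name
    new_index = {"metadata": dict(quant_index.get("metadata", {})),
                 "weight_map": merged}
    return new_index, added

# Toy indexes: the quantized checkpoint is missing the draft-head tensor.
quant_index = {"metadata": {},
               "weight_map": {"model.embed.weight": "q-00001.safetensors"}}
source_index = {"weight_map": {"model.embed.weight": "s-00001.safetensors",
                               "mtp_head.proj.weight": "s-00042.safetensors"}}

new_index, added = inject_tensors(quant_index, source_index,
                                  prefix="mtp_head.",
                                  shard_name="mtp-bf16.safetensors")
print("injected:", added)
```

Building a new index rather than mutating the old one keeps the quantized checkpoint untouched until the extra shard has been written.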

None of these are glamorous, but they're the difference between a checkpoint and something that actually serves.

Limitations (honestly)

Fastest at concurrency 1 and level with or ahead of the other quants at concurrency 8; at full saturation the no-MTP quants are ~15% faster.
1M context needs all 8 H200s; 4×H200 covers up to ~128K.
The asymmetric weights require --enable-expert-parallel to serve correctly.
Validated on Hopper (H200). Blackwell serving needs additional kernel flags.
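
Putting the serving constraints together, a launch command for the 4×H200 case might be assembled as below. This is a sketch that assumes vLLM as the server (`--enable-expert-parallel`, `--tensor-parallel-size`, and `--max-model-len` are vLLM flags) and reads "128K" as 131072 tokens. It deliberately omits the speculative-decoding settings that turn on MTP and any toolchain-specific options, which this post does not spell out.

```python
import shlex

def build_serve_command(model: str, num_gpus: int, max_model_len: int) -> list:
    """Assemble a vLLM launch command as an argv list (assumed layout)."""
    return [
        "vllm", "serve", model,
        "--tensor-parallel-size", str(num_gpus),
        "--enable-expert-parallel",   # needed for the asymmetric INT4 experts
        "--max-model-len", str(max_model_len),
    ]

cmd = build_serve_command("canada-quant/GLM-5.2-W4A16-MTP",
                          num_gpus=4, max_model_len=131072)
print(shlex.join(cmd))
```

Keeping the command as a list makes it easy to pass to `subprocess.run` without shell quoting issues; check the flags against the repo's own serving instructions before relying on them.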

Get it