Running GLM-5.2 on half the GPUs: a W4A16 + MTP quantization

FP8-level quality, four H200s instead of eight, and the fastest popular 4-bit GLM-5.2 quant where latency matters.

TL;DR — We quantized GLM-5.2 (744B-parameter MoE) to 4-bit weights while keeping its multi-token-prediction (MTP) draft head in BF16. The result matches the FP8 release on quality, fits on four H200s instead of eight, and — because of MTP — is the fastest of the popular 4-bit GLM-5.2 quants in the interactive serving regime. It's public, MIT-licensed: canada-quant/GLM-5.2-W4A16-MTP.

Why

GLM-5.2 is one of the strongest open models you can run today, but it's big: a 744B mixture-of-experts network with roughly 1.49 TB of weights in BF16. That is more than eight 141 GB H200s (about 1.13 TB combined) can hold, and even the official FP8 release takes a full eight-H200 node to serve a single replica. That footprint is the thing standing between "great open model" and "great open model we can actually afford to serve."
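A back-of-the-envelope check of the footprint arithmetic, as a minimal sketch. It assumes decimal gigabytes, roughly 1 byte per parameter for FP8, and the ~405 GB checkpoint size reported for the quantized artifact. The GPU counts are lower bounds from weights alone: KV cache, activations, and tensor-parallel layouts push the practical numbers higher.

```python
import math

PARAMS = 744e9   # parameter count quoted in the post
H200_GB = 141    # HBM capacity per H200, in GB


def weight_gb(params: float, bytes_per_param: float) -> float:
    """Weight footprint in decimal GB (1 GB = 1e9 bytes)."""
    return params * bytes_per_param / 1e9


def min_gpus(footprint_gb: float, gpu_gb: float = H200_GB) -> int:
    """Lower bound on GPU count from weights alone (no KV cache, no activations)."""
    return math.ceil(footprint_gb / gpu_gb)


bf16_gb = weight_gb(PARAMS, 2.0)  # 2 bytes per BF16 weight
fp8_gb = weight_gb(PARAMS, 1.0)   # assumption: about 1 byte per weight everywhere
w4a16_gb = 405.0                  # checkpoint size reported in the post

for name, gb in [("BF16", bf16_gb), ("FP8", fp8_gb), ("W4A16+MTP", w4a16_gb)]:
    print(f"{name:10s} {gb:7.0f} GB  >= {min_gpus(gb)} x H200 (weights only)")
```

The weights-only bound for the 4-bit checkpoint is three H200s, which leaves the fourth card's worth of memory for KV cache and activations.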

The goal of this project was narrow and practical: shrink the serving footprint without giving up quality, and without giving up speed. Not a new model — every bit of capability here comes from the base GLM-5.2. Just a cheaper way to run it.

What we did

The artifact is a W4A16 quantization: routed-expert weights compressed to INT4 (group-size 128, GPTQ, via llm-compressor), with attention, the dense prefix layers, shared experts, the router, embeddings, and the LM head left in BF16. That drops the checkpoint from ~1.49 TB to ~405 GB — small enough for 4×H200.
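To make "INT4, group-size 128" concrete, here is a minimal sketch of group-wise 4-bit weight quantization. It uses plain round-to-nearest, not GPTQ: GPTQ additionally uses calibration data to compensate for rounding error, which this toy omits. The symmetric scheme and the 16-bit scale per group are assumptions for illustration, and all function names are hypothetical.

```python
import random

GROUP_SIZE = 128     # group size quoted in the post
QMIN, QMAX = -8, 7   # signed 4-bit integer range


def quantize_group(weights):
    """Symmetric round-to-nearest INT4 for one group; returns (codes, scale)."""
    max_abs = max(abs(w) for w in weights)
    scale = max_abs / QMAX if max_abs > 0 else 1.0
    codes = [min(QMAX, max(QMIN, round(w / scale))) for w in weights]
    return codes, scale


def dequantize_group(codes, scale):
    """Map 4-bit codes back to floating-point weights."""
    return [c * scale for c in codes]


def quantize_row(row, group_size=GROUP_SIZE):
    """Split one weight row into groups and quantize each independently."""
    groups = [row[i:i + group_size] for i in range(0, len(row), group_size)]
    return [quantize_group(g) for g in groups]


def bits_per_weight(group_size=GROUP_SIZE, weight_bits=4, scale_bits=16):
    """Storage cost per weight: the 4-bit code plus its share of one scale."""
    return weight_bits + scale_bits / group_size


random.seed(0)
row = [random.gauss(0.0, 0.02) for _ in range(256)]  # toy weight row
packed = quantize_row(row)
restored = [w for codes, scale in packed for w in dequantize_group(codes, scale)]
max_err = max(abs(a - b) for a, b in zip(row, restored))
print(len(packed), "groups,", bits_per_weight(), "bits/weight, max error", max_err)
```

Each group gets its own scale, so one outlier weight only degrades the resolution of its 128 neighbours rather than the whole row. Under these assumptions the storage cost is 4.125 bits per quantized weight instead of 16.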

The piece we were careful to keep is the MTP (multi-token-prediction) layer, preserved in BF16. MTP is GLM-5.2's built-in draft head for speculative decoding: the small head proposes several future tokens, the full model verifies them in one pass, and accepted tokens are exactly what the model would have produced anyway. It's a lossless speedup — it changes latency, not answers — and most 4-bit quants drop it. Preserving it (we inject it back at BF16 after quantization) is what makes this artifact fast where it counts.
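The propose-then-verify loop can be sketched with toy stand-ins for the two models. This is an illustration of greedy speculative decoding in general, not GLM-5.2's MTP implementation: `target_next` and `draft_next` are hypothetical deterministic functions, and a real MTP head reuses the main model's hidden states rather than running as an independent model.

```python
def target_next(ctx):
    """Toy stand-in for the full model's greedy next token."""
    return (sum(ctx) * 31 + len(ctx) * 7) % 50


def draft_next(ctx):
    """Toy draft head: agrees with the target except when len(ctx) % 4 == 0."""
    tok = target_next(ctx)
    return tok if len(ctx) % 4 else (tok + 1) % 50


def greedy_decode(prompt, n_new):
    """Baseline: one target call per generated token."""
    out = list(prompt)
    for _ in range(n_new):
        out.append(target_next(out))
    return out[len(prompt):]


def speculative_decode(prompt, n_new, k=3):
    """Draft proposes k tokens; the target keeps the longest matching prefix."""
    out = list(prompt)
    target_passes = 0
    while len(out) - len(prompt) < n_new:
        # 1) The draft proposes k tokens autoregressively.
        proposal, ctx = [], list(out)
        for _ in range(k):
            tok = draft_next(ctx)
            proposal.append(tok)
            ctx.append(tok)
        # 2) The target checks every proposed position. A real system does this
        #    in one batched forward pass, so it is counted as a single pass here.
        target_passes += 1
        ctx = list(out)
        for tok in proposal:
            expected = target_next(ctx)
            if tok != expected:
                out.append(expected)  # first mismatch: take the target's token
                break
            out.append(tok)
            ctx.append(tok)
        else:
            out.append(target_next(ctx))  # all accepted: one bonus token
    return out[len(prompt):][:n_new], target_passes


prompt = [1, 2, 3]
baseline = greedy_decode(prompt, 20)
spec_tokens, passes = speculative_decode(prompt, 20)
print(spec_tokens == baseline, passes)
```

The speculative path emits the same tokens as plain greedy decoding while using fewer target passes than tokens generated, which is the sense in which the speedup changes latency and not answers.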

A couple of details that mattered: we calibrate every expert (skipping rare experts quietly produces a coherent-looking but degraded model), and we calibrate on an in-distribution code/instruction mix rather than generic chat text.
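One way to catch under-sampled experts before quantizing is to count how many calibration tokens the router sends to each one. The sketch below is a hypothetical check, not the project's actual tooling: `expert_coverage`, its arguments, and the threshold are illustrative names and values, and the routing decisions would have to be logged from the model's router during a calibration pass.

```python
from collections import Counter


def expert_coverage(routing, num_experts, min_tokens):
    """Return {expert_id: token_count} for experts below min_tokens.

    routing: iterable of per-token lists of selected expert ids
             (for example, the top-k choices logged from the router).
    """
    counts = Counter(e for token_experts in routing for e in token_experts)
    return {
        e: counts.get(e, 0)
        for e in range(num_experts)
        if counts.get(e, 0) < min_tokens
    }


# Toy example: 4 tokens, top-2 routing over 4 experts.
routing = [[0, 1], [0, 2], [1, 2], [0, 1]]
starved = expert_coverage(routing, num_experts=4, min_tokens=2)
print(starved)
```

An expert that appears in the result would be quantized from little or no activation data, so the calibration set needs more samples that route to it before the run is trusted.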

Does it hold up?

We measured it against the official zai-org/GLM-5.2-FP8 release on the same harness, 8×H200:

Quality: W4A16+MTP vs FP8, same harness, 8×H200.

| Task | W4A16+MTP | FP8 |
|---|---|---|
| GSM8K (strict) | 0.960 | 0.955 |
| IFEval (prompt / inst strict) | 0.909 / 0.911 | 0.891 / 0.903 |
| MATH-500 | 0.954 | 0.958 |
| RULER @ 32K / 64K | 0.832 / 0.841 | 0.831 / 0.813 |
| SWE-bench Verified | 82.0% | 82.2% |

Within run-to-run noise on reasoning, instruction-following, long-context retrieval, and agentic coding. 4-bit weight quantization, done carefully, costs essentially nothing in quality here. It also still serves the full 1M-token context on 8×H200 (we retrieved a needle from a ~936K-token prompt), and runs at up to 128K context on just 4×H200.

Is it fast?

Against FP8, throughput is +48% at concurrency 1 and +32% at concurrency 8, the low-concurrency range where MTP helps most, and ~13% lower at full saturation. It is an honest trade-off, and we report it in both directions.

We also put it head-to-head with the most-downloaded H200-servable 4-bit GLM-5.2 quants (output tokens/s, same harness):

Output tokens/s across concurrency. Our quant is the W4A16+MTP column. Higher is better.

| Concurrency | W4A16+MTP (ours) | AWQ-INT4 | NVFP4 |
|---|---|---|---|
| 1 | 132 | 78 | 74 |
| 8 | 466 | 465 | 410 |
| 32 | 825 | 960 | 944 |

The story is clear and, we think, useful. In the interactive/agentic regime, where latency matters, MTP puts us well ahead: +69–79% at concurrency 1. At concurrency 8 we are level with AWQ-INT4 and about 14% ahead of NVFP4. At full saturation (concurrency 32) the simpler no-MTP quants pull ahead by ~15%, because speculative decoding's overhead stops paying off once the batch is already full. We measured only throughput in this comparison, not quality. We expect quality to be close across 4-bit quants of the same base model, which would make speed the real deployment differentiator. (NVFP4 is a Blackwell-native FP4 format; on Hopper it runs without FP4 tensor cores. We did not measure it on Blackwell, where the picture may differ.)
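The relative differences can be recomputed from the table as a quick sanity check. This sketch uses the rounded table values, so its results may differ by a point from percentages computed on unrounded measurements; `relative_gain` is a hypothetical helper name.

```python
# Output tokens/s, copied from the comparison table above.
THROUGHPUT = {
    1: {"W4A16+MTP": 132, "AWQ-INT4": 78, "NVFP4": 74},
    8: {"W4A16+MTP": 466, "AWQ-INT4": 465, "NVFP4": 410},
    32: {"W4A16+MTP": 825, "AWQ-INT4": 960, "NVFP4": 944},
}


def relative_gain(ours: float, other: float) -> float:
    """Percent difference of our throughput relative to another quant."""
    return (ours / other - 1.0) * 100.0


gains = {}
for concurrency, row in THROUGHPUT.items():
    ours = row["W4A16+MTP"]
    for name, value in row.items():
        if name == "W4A16+MTP":
            continue
        gains[(concurrency, name)] = relative_gain(ours, value)
        print(f"concurrency {concurrency:2d} vs {name:8s}: "
              f"{gains[(concurrency, name)]:+.1f}%")
```

Negative values at concurrency 32 are the saturation regime, where the no-MTP quants are ahead. Note that the sign convention here is "ours relative to theirs", so those figures come out slightly smaller in magnitude than the same gap expressed as "theirs relative to ours".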

The hard parts

Quantizing a 744B MoE and serving it isn't a one-command job. We documented every wall we hit (F010–F029 in the repo) so the dead ends don't have to be re-walked.

None of these are glamorous, but they're the difference between a checkpoint and something that actually serves.

Limitations (honestly)

Get it

canada-quant/GLM-5.2-W4A16-MTP — MIT, with the full recipe, evaluation methodology, and engineering log in the repo. Built on zai-org/GLM-5.2, quantized with llm-compressor, served with vLLM.