Chinese (中文) on Feng's Blog

Pruning Qwen3.6-35B-A3B for RTX 5090: what I learned pushing MoE compression to its limit on a single GPU

Mon, 18 May 2026 00:00:00 +0000

Why this exists

I had a 35B-parameter MoE model that needed to run on a single RTX 5090. The model in FP8 needed 34.4 GiB. The GPU had 31.84 GiB. The gap was 2.6 GiB — small enough to feel tantalizing, large enough to break every standard deployment pipeline.

Six days later, I had a pruned model that scored 73.2% on HumanEval+, 51.0% on Toolcall, and 33.6% on MMLU. That model (v3) shipped as the best result available at the time.

Twenty-four hours after that, I had evidence that everything I thought I understood about calibration was specific to one recipe and didn’t generalize. Another twelve hours produced v7b-fp8, a model that matches or beats v3 on every pack I measured, with BugFind +17, DataExtract +17, and InstructFollow +20.

The path between those states was not linear. Some of my most confident conclusions turned out to be wrong. This article is what I wish I had known on day one — updated after the sessions that overturned my earlier priors.

Qwen3.6-35B-A3B is a 256-expert MoE model optimized for agentic coding, and my only available hardware was a single RTX 5090. This covers pruning, quantization, evaluation, and attempted recovery fine-tuning on a single consumer GPU across seven experimental sessions. It does not cover multi-GPU setups, cloud inference, or comparisons with other pruning algorithms.

Why a 2.6 GiB deficit ate my week

The standard view is that compressing a model by 8% to fit a memory budget is a routine calibration exercise — run AWQ or GPTQ, adjust the quantization config, ship it. A 2.6 GiB gap on a 34.4 GiB model is 7.5% compression. Well within what quantization alone should handle.

It was not routine.

The contradiction hit immediately: BitsAndBytes, GPTQ, and AWQ all quantize only 2D nn.Linear weights. MoE models store experts as batched 3D tensors — [n_experts, in_dim, out_dim] — for efficient grouped matrix multiplication. These 3D tensors contain roughly 90% of the model’s total parameters, and every standard quantization tool simply skips them. No error. No warning. They just pass through at full precision.

So quantization alone could not close the gap. I needed expert pruning.

REAP (Router-Weighted Expert Activation Pruning) scores each expert $j$ by averaging, over the calibration tokens $X_j$ routed to it, the product of its gating weight $g_j(x)$ and the norm of its output $f_j(x)$:

$$S_j = \mathbb{E}_{x \in X_j}\left[\, g_j(x) \cdot \lVert f_j(x) \rVert_2 \,\right]$$

The intuition: an expert that gets low gating weight and produces small activations contributes little to the output and can be removed with minimal reconstruction error. REAP then removes the lowest-scoring experts and propagates residuals to keep the functional manifold topology intact.
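
To make the scoring rule concrete, here is a minimal pure-Python sketch of the score above. It is an illustration under assumptions, not the pruner used in this project: the `reap_scores` and `experts_to_prune` names and the toy `(expert_id, gate_weight, output_vector)` records are hypothetical, the norm is taken to be the L2 norm, and residual propagation is left out entirely.

```python
import math
from collections import defaultdict


def reap_scores(routed_tokens):
    """Average gate * ||output||_2 per expert.

    routed_tokens: iterable of (expert_id, gate_weight, output_vector),
    one record per (token, selected expert) pair. Hypothetical layout.
    """
    totals = defaultdict(float)
    counts = defaultdict(int)
    for expert_id, gate, output in routed_tokens:
        norm = math.sqrt(sum(v * v for v in output))  # L2 norm of f_j(x)
        totals[expert_id] += gate * norm
        counts[expert_id] += 1
    return {j: totals[j] / counts[j] for j in totals}


def experts_to_prune(scores, n_prune):
    """Return the n_prune expert ids with the lowest scores."""
    return sorted(scores, key=scores.get)[:n_prune]


# Toy calibration records: expert 1 has a low gate weight and a small output.
toy = [
    (0, 0.9, [3.0, 4.0]),
    (0, 0.7, [0.0, 5.0]),
    (1, 0.1, [0.3, 0.4]),
    (2, 0.5, [1.0, 0.0]),
]
scores = reap_scores(toy)
pruned = experts_to_prune(scores, 1)
```

On the toy data, expert 1 is the first to go. Experts that never receive a calibration token get no score at all in this sketch; how to treat them is a separate decision.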

The pruner works block-by-block: load one 750 MiB decoder layer to GPU, score all surviving experts, prune the lowest, propagate residuals, save. About 25 minutes per experiment across 40 decoder layers. I ran seven experiments over seven sessions on one GPU.

The constraint that shaped everything: the RTX 5090’s 32 GiB was simultaneously my inference platform, evaluation framework, and training environment. Pruning, eval, and SFT all competed for the same VRAM, and no two of them could run at the same time. At BF16, the model occupied 30.6 GiB — 97% of GPU capacity, leaving zero room for gradients or activations during training.

This total-resource competition, not the pruning algorithm, turned out to be the hard problem.

The calibration mix that rewrote my priors

I came to this expecting the pruning algorithm or the compression ratio to dominate quality. That is how the literature frames it: better importance scores, better pruning decisions.

The evidence says otherwise.

Here is what happened.

My first pruning experiment (v2) used four code-focused calibration datasets: evol-codealpaca, BigCodeBench, SWE-bench, and xlam. Pure code data. The result: HumanEval+ at 72.0%, Toolcall at 44.0%, and two MMLU categories at 0.0%.

Dead categories. The model could generate code, but it could not answer a general-knowledge question. The pruner had never seen general-knowledge tokens during scoring, so it had no way to know which experts mattered for those domains. Any expert carrying general-knowledge information was pruned or severely weakened.

For the v3 experiment, I switched to a 70/30 code-to-general mix. Same REAP algorithm. Same compression ratio. Just two general datasets added — 600 samples of MMLU and 600 samples of C4 — to a pool of 700 samples each from four code datasets.

Category	v2 (pure code cal)	v3 (70/30 cal)	Change
MMLU Social Sciences	0.0%	33.3%	+33.3pp
MMLU Other	0.0%	34.3%	+34.3pp
HumanEval+	72.0%	73.2%	+1.2pp
Toolcall	44.0%	51.0%	+7.0pp

Every single metric improved. MMLU recovered from dead to 33%+. Code benchmarks improved too.
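
The 70/30 label follows directly from the sample counts quoted above. The sketch below reproduces that arithmetic and, as a purely hypothetical extension, computes how many general samples a different target split would need if the code pool stayed fixed at 2,800 samples. The function names are illustrative, and holding the code pool fixed is an assumption of this sketch, not something the experiments tested.

```python
# Sample counts as stated in the text for the v3 calibration mix.
CODE_SETS = {"evol-codealpaca": 700, "BigCodeBench": 700, "SWE-bench": 700, "xlam": 700}
GENERAL_SETS = {"MMLU": 600, "C4": 600}


def mix_fractions(code_sets, general_sets):
    """Return (code_fraction, general_fraction) of the combined pool."""
    code = sum(code_sets.values())
    general = sum(general_sets.values())
    total = code + general
    return code / total, general / total


def general_samples_for(code_total, general_fraction):
    """General samples needed to reach general_fraction with code_total fixed."""
    return round(code_total * general_fraction / (1.0 - general_fraction))


code_frac, general_frac = mix_fractions(CODE_SETS, GENERAL_SETS)
needed_for_30 = general_samples_for(2800, 0.3)  # sanity check against the v3 mix
needed_for_50 = general_samples_for(2800, 0.5)  # the 50/50 mix that was never run
```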

I think of calibration data as the lens through which the pruner sees the model. A narrow lens (pure code) gives a sharp but myopic view — the pruner keeps only what it sees, and blinds the model to everything else. A wider lens (70/30 mix) lets the pruner see the full functional space, so it preserves the structures that serve the whole distribution, not just one mode.

The analogy breaks in one direction: you cannot just keep adding calibration domains forever. More data means longer scoring passes. But within a practical budget, the evidence is clear — calibration composition matters more than any algorithmic tweak I could have made to the pruner itself.

I shipped v3 and moved on. That was where the story got interesting.

The discovery that my conclusions were calibration-specific

After shipping v3, I went back to test a hypothesis that seemed obvious: if I replaced the code-heavy calibration with agentic traces, the pruned model would perform better on agentic benchmarks. Tool calling, bug finding, multi-step reasoning — these were what the model was designed for.

I built a proper BenchLocal evaluation harness — 8 packs covering tool-call, hermes-agent, bug-find, data-extract, instruct-follow, reason-math, struct-output, and cli — and established a v3 baseline with the new v19 chat template. The baseline was sobering: ToolCall-15 at 90, HermesAgent-20 at 16, BugFind-15 at 8. The agentic gates were low to begin with.

I ran two candidate experiments at a deeper compression (0.40, keeping 154 of 256 experts) with two calibration strategies:

Mix-A: full replacement — agentic data (glm47-reap + hermes-agent-traces) instead of code
Mix-B: additive — agentic data layered on top of v3’s base

Both produced the same result. ToolCall-15 dropped from 97 (the v3 baseline with the old reasoning parser) to 90. HermesAgent-20 stayed at exactly 16. BugFind-15 hovered in the 3-10 range. No agentic movement. Both failed all three gates.

I stopped and ran a control experiment. I took Mix-A’s exact calibration list and ran it at v3’s exact compression ratio (0.289, 183 experts). If compression depth was the cause of the toolcall regression, the control would recover toolcall. If calibration content was the cause, the control would still show the regression.

The control was decisive:

Candidate	Compression	Calibration	ToolCall-15	HermesAgent-20	BugFind-15
v3	0.289	70/30 balanced	97	16	8
Mix-A	0.40	Agentic replacement	90	16	10
Mix-B	0.40	Additive layering	90	16	3
v3ratio	0.289	Mix-A calibration	90	16	0

ToolCall-15 at 90 across all three candidates, including the control at v3’s compression. The regression came from the calibration content, not the compression depth.
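
The isolation logic behind that reading can be written down in a few lines. This sketch uses only the ToolCall-15 numbers from the table; the `delta_with_fixed` helper is hypothetical and simply refuses to compare two runs unless the named factor is identical in both.

```python
# ToolCall-15 results copied from the table above.
RUNS = [
    {"name": "v3", "compression": 0.289, "calibration": "70/30 balanced", "toolcall": 97},
    {"name": "Mix-A", "compression": 0.40, "calibration": "Mix-A", "toolcall": 90},
    {"name": "Mix-B", "compression": 0.40, "calibration": "Mix-B", "toolcall": 90},
    {"name": "v3ratio", "compression": 0.289, "calibration": "Mix-A", "toolcall": 90},
]


def delta_with_fixed(runs, a, b, fixed_key):
    """ToolCall change from run a to run b, valid only if fixed_key matches."""
    by_name = {r["name"]: r for r in runs}
    if by_name[a][fixed_key] != by_name[b][fixed_key]:
        raise ValueError(f"{a} and {b} differ in {fixed_key}; not an isolated comparison")
    return by_name[b]["toolcall"] - by_name[a]["toolcall"]


# Same compression, different calibration: the whole regression shows up.
calibration_effect = delta_with_fixed(RUNS, "v3", "v3ratio", "compression")
# Same calibration, different compression: nothing moves.
depth_effect = delta_with_fixed(RUNS, "v3ratio", "Mix-A", "calibration")
```

Comparing v3 directly against Mix-A would raise on either key, which is the point: that pair changes two things at once and cannot attribute the drop to either.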

The mechanism was surprisingly specific. The regression was concentrated entirely in a single sub-dimension: Parameter Precision dropped from 100 to 67. The model still picked the right tools with the right structure — it just generated wrong-typed or wrong-formatted arguments more often. Dropping code corpora (evol-codealpaca, bigcodebench, swe-bench) and substituting agentic traces had cost the model its tight argument-formatting discipline.

The other finding was harsher. HermesAgent-20 scored 16 out of 20 across all four configurations — literally identical, including per-category breakdown. A 25B pruned MoE cannot handle these multi-step browser-automation scenarios, regardless of what you feed it during pruning calibration. The gate is capacity-bound.

I also discovered that the vLLM --reasoning-parser qwen3 flag was essential for correct evaluation. Without it, the model’s <think> reasoning block leaked into all plain-text responses, breaking every non-tool-call scorer. The flag lifted ToolCall-15 from 90 to 97 and recovered instruct-follow and data-extract from flat zero. The lesson: validate your eval infrastructure before you trust a single number.
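
A cheap guard against this failure mode is to check scorer inputs for leaked reasoning before scoring anything. The sketch below is a harness-side illustration in plain Python, not a description of how vLLM’s parser works internally; the function names are hypothetical, and it assumes the reasoning is wrapped in well-formed <think>...</think> pairs.

```python
import re

# Non-greedy match so several reasoning blocks in one response are each removed.
THINK_BLOCK = re.compile(r"<think>.*?</think>\s*", re.DOTALL)


def has_leaked_reasoning(text):
    """True if a response still contains reasoning tags."""
    return "<think>" in text or "</think>" in text


def strip_reasoning(text):
    """Remove well-formed <think>...</think> blocks from a response."""
    return THINK_BLOCK.sub("", text).strip()


raw = "<think>\nThe user wants JSON.\n</think>\n{\"city\": \"Paris\"}"
clean = strip_reasoning(raw)
```

Running `has_leaked_reasoning` over a sample of responses before a full eval pass is the kind of preflight that would have flagged the flat-zero packs early.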

I closed Session 6 with a clean negative result and the conviction that calibration content could not move agentic gates. That conviction lasted about twelve hours.

The recipe that broke through

Session 7 adopted a fundamentally different calibration recipe: the REAP-26B 6-dataset mix. The six datasets are SWE-bench/SWE-smith-trajectories (tool split), xlam-function-calling-60k, evol-codealpaca, and the three Mixture-of-Thoughts subsets (code, math, science), sampled at a much higher token count: 1024 samples × 16384 tokens of sequence length, about 16.8M tokens in total. Router renormalization was disabled, per the REAP-26B README.

I planned three runs. Two ran first; the third, which I had initially skipped, became the follow-up:

Plan-A (compression 0.40, fresh prune). ToolCall-15 collapsed from 90 to 63, a catastrophic 27-point regression. But BugFind rose by 15 points (8 to 23) and InstructFollow by 33 (20 to 53). The recipe was clearly powerful. Too powerful at this depth.

Plan-C (stacked prune on top of the upstream REAP-26B-VL). Recovered toolcall to 90 but lost ~90% of the recipe’s other gains. Stacked pruning does not inherit upstream calibration signal.

Plan-B (compression 0.289, v3’s depth, the experiment I initially skipped). This was the follow-up after both Plan-A and Plan-C failed.

Candidate	Compression	ToolCall-15	BugFind-15	DataExtract-15	InstructFollow-15	Verdict
v3+v19	0.289	90	8	5	20	Baseline
v7a	0.40	63	23	24	53	FailToolcall
v7c	stacked	90	0	4	16	FailAgentic
v7b	0.289	93	25	22	40	Pass

v7b-fp8 scored equal or better than v3 on all 7 measured packs. No regressions. BugFind +17, DataExtract +17, InstructFollow +20, ToolCall +3. The trigger verdict was Pass.
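
The deltas quoted above can be recomputed from the table. The sketch below does that for the four packs shown and applies a simple "no pack regresses" check. That check is an assumed simplification: the actual trigger logic behind the FailToolcall and FailAgentic verdicts is not spelled out here.

```python
PACKS = ["ToolCall-15", "BugFind-15", "DataExtract-15", "InstructFollow-15"]

# Scores copied from the table above, in PACKS order.
BASELINE = dict(zip(PACKS, [90, 8, 5, 20]))  # v3+v19
CANDIDATES = {
    "v7a": dict(zip(PACKS, [63, 23, 24, 53])),
    "v7c": dict(zip(PACKS, [90, 0, 4, 16])),
    "v7b": dict(zip(PACKS, [93, 25, 22, 40])),
}


def deltas(candidate, baseline):
    """Per-pack score change relative to the baseline."""
    return {pack: candidate[pack] - baseline[pack] for pack in baseline}


def no_regressions(candidate, baseline):
    """Assumed gate: every pack must be at least as good as the baseline."""
    return all(change >= 0 for change in deltas(candidate, baseline).values())


v7b_deltas = deltas(CANDIDATES["v7b"], BASELINE)
passing = [name for name, scores in CANDIDATES.items() if no_regressions(scores, BASELINE)]
```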

This is the result I keep coming back to: the same recipe at 0.40 collapsed toolcall to 63; at 0.289 it improved toolcall to 93. The recipe drives the agentic gains. The compression depth modulates the toolcall trade-off. Session 6’s conclusion that “calibration content cannot move agentic benchmarks” was specific to Mix-A’s content, not a universal property of pruning calibration.

The pipeline upgrades that made this possible are worth noting. The 16K sequence length calibration required a chunked REAP scoring accumulator — the single-pass approach would have materialized 67 GiB on a 32 GiB GPU. The custom FP8 quantizer (scripts/quantize_fp8.py) bypasses llmcompressor’s broken Qwen3.6 compatibility with a 274-line direct cast from BF16 to torch.float8_e4m3fn. Schema adapters with a 60-second preflight caught dataset drift before any GPU allocation.
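
The chunked accumulator idea is simple enough to sketch. Because the score is an average, it can be built from running per-expert sums and counts, so only one chunk of tokens needs to be in memory at a time. The class below is a toy illustration in pure Python with hypothetical names and record layout; the project’s actual accumulator is not shown in this post.

```python
import math

# Token budget of the recipe as stated above: 1024 samples x 16384 tokens.
CALIBRATION_TOKENS = 1024 * 16384


class ReapAccumulator:
    """Running per-expert sum of gate * ||output||_2, plus token counts."""

    def __init__(self, n_experts):
        self.totals = [0.0] * n_experts
        self.counts = [0] * n_experts

    def update(self, chunk):
        # chunk: list of (expert_id, gate_weight, output_vector) records.
        for expert_id, gate, output in chunk:
            self.totals[expert_id] += gate * math.sqrt(sum(v * v for v in output))
            self.counts[expert_id] += 1

    def scores(self):
        # Experts with no routed tokens score 0.0 (an assumption of this sketch).
        return [t / c if c else 0.0 for t, c in zip(self.totals, self.counts)]


def chunks(records, size):
    """Yield consecutive slices of at most `size` records."""
    for start in range(0, len(records), size):
        yield records[start:start + size]


tokens = [(i % 3, 0.1 * (i % 5 + 1), [float(i), 1.0]) for i in range(20)]

single_pass = ReapAccumulator(3)
single_pass.update(tokens)

chunked = ReapAccumulator(3)
for chunk in chunks(tokens, 4):
    chunked.update(chunk)
```

Both accumulators visit the records in the same order, so the chunked scores match the single-pass scores while peak memory depends on the chunk size rather than the full token count.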

How 3D tensors broke every quantization framework

The calibration experiment got me v3 — a model that almost passed its quality gates. HumanEval+ at 73.2% was close to the 75% threshold. MMLU at 33.6% was still short of the 40% target. The obvious next step was recovery fine-tuning.

This is where the second assumption broke.

I assumed that standard quantization tools would handle model compression for training. Load the model in 4-bit, apply LoRA adapters, train. This is the default workflow for QLoRA on every Hugging Face tutorial. It works on LLaMA, it works on Mistral — it should work on Qwen.

It does not. Because the model’s 3D expert tensors are invisible to BitsAndBytes.

I spent the evening of day two systematically eliminating every standard training approach. Seven attempts, all failures:

Attempt	Approach	Result
1	BnB 4-bit QLoRA	Can’t quantize 3D expert tensors [183, 1024, 2048]
2	BF16 model.to('cuda')	30.6 GiB, 0 bytes left for activations
3	accelerate device_map='auto'	Keeps all layers on GPU for backward
4	DeepSpeed ZeRO-3 (single GPU)	Trainer moves model to GPU before partitioning
5	DeepSpeed zero.Init + from_pretrained	Weight loading conflicts with meta-device tensors
6	FP8 frozen weights + monkey-patched ops	grouped_mm upcast creates 768 MiB BF16 temp per layer
7	FP8 + dispatch_model with 10 GiB budget	Offloaded layers accumulate on GPU during backward

I don’t fully understand why every framework eventually calls model.to(device) for the backward pass on single GPU. The documentation promises CPU offloading. The reality is that DeepSpeed ZeRO-3, accelerate’s dispatch_model, and FSDP all converge on the same behavior: put the full model on GPU when gradients need to flow.

The resolution came from a workaround I had not considered: unbatch the 3D expert tensors into individual bnb.nn.Linear4bit layers. BnB can quantize standard 2D linear layers. A 3D tensor [183, 1024, 2048] becomes 183 separate Linear(2048, 1024) objects, each quantizable in 4-bit.
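
The reshaping step itself is mostly bookkeeping. The sketch below shows it on nested lists rather than real tensors: a stacked expert weight is split into one 2D layer per expert, and the per-expert forward pass gives the same result as indexing the stacked weight. `Linear2D` is a hypothetical stand-in for the quantizable layer (bnb.nn.Linear4bit in the text), and the [n_experts, out_features, in_features] layout is an assumption inferred from the [183, 1024, 2048] to Linear(2048, 1024) example.

```python
class Linear2D:
    """Stand-in for a 2D linear layer; weight is [out_features][in_features]."""

    def __init__(self, weight):
        self.weight = weight
        self.out_features = len(weight)
        self.in_features = len(weight[0])

    def __call__(self, x):
        return [sum(w * v for w, v in zip(row, x)) for row in self.weight]


def unbatch_experts(stacked):
    """Split a [n_experts][out][in] weight into one Linear2D per expert."""
    return [Linear2D([list(row) for row in expert]) for expert in stacked]


def batched_forward(stacked, expert_id, x):
    """Reference path: index the stacked weight directly."""
    return [sum(w * v for w, v in zip(row, x)) for row in stacked[expert_id]]


# Toy stacked weight: 2 experts, out_features=2, in_features=3.
stacked = [
    [[1.0, 0.0, 2.0], [0.0, 1.0, 0.0]],
    [[0.5, 0.5, 0.5], [2.0, 0.0, 1.0]],
]
experts = unbatch_experts(stacked)
x = [1.0, 2.0, 3.0]
y_unbatched = experts[1](x)
y_batched = batched_forward(stacked, 1, x)
```

The trade-off is that each expert is now a separate module, so routing becomes a Python-level loop over experts instead of one grouped matrix multiplication.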

The result: model on GPU dropped from 30.6 GiB to 16.8 GiB, leaving 16.9 GiB for activations and gradients. The SFT ran — 311 steps, 9,934 samples, 11.5 hours, loss from 1.058 to 0.975, token accuracy from 85% to 96%. By every training metric, it worked.

It did not work.

The SFT trap: when training makes everything worse

I expected SFT with quantized frozen weights to improve the model. The training curves were healthy. Loss decreasing. Token accuracy climbing. All the signals that normally say “keep training, it is converging.”

Post-SFT evaluation told a different story:

Benchmark	Pre-SFT	Post-SFT	Delta
HumanEval+	73.2%	67.7%	-5.5pp
Toolcall	51.0%	50.5%	-0.5pp
MMLU	33.6%	9.4%	-24.2pp

Everything regressed. MMLU collapsed back to v2-level. HumanEval+ lost 5.5 points.

The mechanism is specific and instructive — and worth pausing on because it explains an entire class of pipeline failures:

4-bit quantization injects noise into the forward pass of every frozen expert layer. That noise is deterministic — same input, same 4-bit weights, same quantization error — but it shifts the activation distribution that the trainable router and shared expert layers see. The trainable parameters adapt to this shifted distribution during SFT. They learn to work with the noise profile of the 4-bit experts.

When you remove the 4-bit quantization (merge the fine-tuned weights back into the original BF16 model for inference), the noise profile disappears. The trainable parameters are now running on clean activations, and they have overfit to a distribution that no longer exists.

This is why the training metrics looked great while the benchmarks collapsed. The model was not learning to generate better code or answer knowledge questions. It was learning to compensate for quantization noise in the frozen pathway. When the noise went away, the compensation became output distortion.
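
A one-parameter toy makes the mechanism easier to see. Everything here is a deliberately simplified assumption: one frozen weight stands in for the expert pathway, a crude 4-level rounding function stands in for 4-bit quantization, and a single trainable scale stands in for the router and shared-expert layers. It is not a model of the actual SFT run.

```python
def quantize(w, levels=4, w_max=1.0):
    """Round w to the nearest of `levels` evenly spaced values in [-w_max, w_max]."""
    step = 2 * w_max / (levels - 1)
    return round((w + w_max) / step) * step - w_max


w_clean = 0.37               # frozen "expert" weight at full precision
w_quant = quantize(w_clean)  # what the trainable part sees during training

xs = [0.5, 1.0, 1.5, 2.0]
targets = [w_clean * x for x in xs]  # desired outputs


def fit_scale(frozen_w):
    """Least-squares fit of one trainable scale on top of a frozen weight."""
    feats = [frozen_w * x for x in xs]
    return sum(t * f for t, f in zip(targets, feats)) / sum(f * f for f in feats)


def mse(scale, frozen_w):
    return sum((scale * frozen_w * x - t) ** 2 for x, t in zip(xs, targets)) / len(xs)


scale = fit_scale(w_quant)           # "SFT" against the quantized pathway
train_error = mse(scale, w_quant)    # looks converged
deploy_error = mse(scale, w_clean)   # same scale, clean weight swapped back in
untouched_error = mse(1.0, w_clean)  # the model before any training
```

In the toy, the fitted scale drives training error to essentially zero on the quantized weight, then produces a nonzero error once the clean weight is swapped back in, while the untouched scale of 1.0 would have been exact.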

The result that I keep returning to: the pre-SFT v3 model was the project’s best output at that point. The calibration strategy was the lever. Fine-tuning was a trap.

What I would do differently

If I started this project again, I would change several things. The later sessions taught me that some of my early conclusions were incomplete — so these recommendations are updated with everything I know after seven sessions.

What I did	What I would do instead	Why
Jumped straight to production-scale SFT	Validate on a 2-layer toy model first	Seven failed approaches, ~4 hours of debugging, caught in minutes
Let eval and training share one venv	Isolate venvs from the start	huggingface_hub 1.5 vs 1.14 broke vLLM weight loading
Ran SFT before exhausting calibration experiments	Run all calibration experiments before SFT	The 50/50 experiment was never attempted, and calibration is the primary lever
Shallow pruning (183/256) + 4-bit SFT	Deeper pruning (154 experts) + clean BF16 SFT	Avoids the noise-overfitting trap entirely
Assumed agentic calibration lifts agentic gates	Test the REAP-26B recipe at the original depth first	Session 6’s clean negative result was Mix-A-specific; Plan-B at v3’s depth passed everything
Changed calibration and compression depth together (Mix-A, Mix-B)	Run “new calibration at the same compression” as the default isolation step	The v3ratio control flipped the interpretation; always isolate the variable
Championed “capacity-bound benchmarks” as a universal conclusion	Measure with multiple recipes before declaring a ceiling	BugFind moved +17 with the right recipe at no parameter count change

The most painful lesson is also the most transferable: validation loops catch pipeline bugs, but they do not catch strategy bugs. The SFT pipeline ran correctly — no crashes, no OOM, healthy training curves — and produced a worse model.

The one experiment I most regret not running is the 50/50 calibration mix. If 30% general data pushed MMLU from 0% to 33%, 50% might push it past 40%. That experiment would have taken 25 minutes. The SFT that replaced it took 11.5 hours and made everything worse.

Boundary conditions

The calibration-composition result is established for REAP on one model family (Qwen3.6-35B-A3B). It likely transfers to other MoE models and other pruning algorithms, but I have not tested this.
The REAP-26B recipe finding (v7b-fp8 beating v3) is specific to this calibration mix at this compression depth. Whether it generalizes to other MoE scales is an open question.
The Session 6 “calibration content drives toolcall regression” finding is specific to Mix-A’s content (glm47-reap + hermes-agent-traces). The REAP-26B recipe at the same depth showed a positive toolcall delta. The finding is recipe-specific, not universal.
HermesAgent-20 remained stuck at 16/20 across all seven sessions and all configurations tested. It is genuinely capacity-bound at this model size.
The 4-bit expert unbatching technique works at the cost of inference speed — per-expert sequential Linear4bit forward passes are slower than native grouped matrix multiplication.
The SFT degradation result applies specifically to training with 4-bit frozen experts on this model architecture. FP8 frozen experts or full BF16 SFT may behave differently.
Single-GPU constraints shaped every conclusion. With multi-GPU hardware, the trade-offs shift substantially.
The v19 chat template costs about 7 toolcall points compared to v18 (90 vs 97 on the same model). All Session 7 comparisons are within the same template, but direct comparability with Session 6 numbers requires accounting for the template shift.

Open questions

Can SFT on v7b-fp8 lift HermesAgent-20? It is the only pack that v7b didn’t improve, stuck at 16/20. Session 6’s capacity-bound conclusion was wrong for BugFind (the right recipe moved it +17). It may also be wrong for HermesAgent once the right recipe is found. But SFT on actual agent traces is a qualitatively different approach, and the infrastructure exists but hasn’t been validated against v7b.
Does 50/50 calibration push MMLU past 40%? The experiment takes 25 minutes and was never scheduled.
Can Transformer Engine FP8 training enable quality SFT without the noise-overfitting trap? The tools are installed on sm_120. Untested.
Does the REAP-26B recipe replicate on other MoE families? The recipe drove +17 across multiple benchmarks on Qwen3.6. Would it produce similar gains on DeepSeek, Mixtral, or OLMoE?
Should stacked pruning ever be used? Session 7 showed it destroys upstream calibration signal. But if the upstream calibration is expensive (24K samples on a 96 GB GPU), stacking a cheap re-prune on top seems like it should work in theory. The empirical result was negative. I don’t fully understand why.

A note to myself six months from now: run the 50/50 calibration experiment first. If it pushes MMLU past 40%, the entire SFT effort was wasted. And promote the v7b symlink; it is the best model you have.

References

Lasby et al., “REAP the Experts: Why Pruning Prevails for One-Shot MoE Compression”, arXiv:2510.13999, 2025.
Dery et al., “Finding Fantastic Experts in MoE Models”, arXiv:2504.15447, 2025.
Zhang et al., “Efficient Expert Pruning in MoE LLMs”, arXiv:2505.12345, 2025.
BitsAndBytes, Hugging Face quantization library, https://github.com/bitsandbytes-foundation/bitsandbytes.
TRL: Transformer Reinforcement Learning, Hugging Face, https://github.com/huggingface/trl.

Pruning Qwen3.6-35B-A3B in practice: seven rounds of experiments pushing MoE compression to its limit on a single RTX 5090

Mon, 18 May 2026 00:00:00 +0000

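This post is the Chinese edition of Pruning Qwen3.6-35B-A3B for RTX 5090, updated into a complete record through the seventh round of experiments.

Why this article exists

I had a 35B-parameter MoE model that needed to run on a single RTX 5090. In FP8 the model needed 34.4 GiB; the GPU had only 31.84 GiB. The gap was just 2.6 GiB: small enough to feel within reach, large enough to break every standard deployment pipeline.

Six days later I had my first usable pruned model: v3, with HumanEval+ at 73.2%, Toolcall at 51.0%, and MMLU at 33.6%, at only 26.3 GiB in FP8. It was the best output available at the time.

Twenty-four hours later, I discovered that everything I had understood about calibration was confined to one particular recipe and did not generalize at all. Another twelve hours after that, v7b-fp8 was born. It matched or beat v3 on every evaluation pack, with BugFind up 17 points, DataExtract up 17, and InstructFollow up 20.

The path from start to finish was far from linear. Some of the conclusions I had been most sure of were completely overturned by later experiments. This article is what I wish I had known on day one, in the version that survived seven rounds of revision.

Qwen3.6-35B-A3B is a 256-expert MoE model optimized for agentic coding. My entire hardware budget was one RTX 5090. This article covers pruning, quantization, evaluation, and attempted recovery fine-tuning, all done on a single consumer GPU. It does not cover multi-GPU training, cloud inference, or side-by-side comparisons with other pruning algorithms.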

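Why a 2.6 GiB gap ate my week

The usual view is that compressing a model by 8% to fit a memory budget is routine: run AWQ or GPTQ, adjust the quantization config, and ship. 2.6 GiB out of 34.4 GiB is a 7.5% compression, which quantization alone should handle easily.

Reality was different.

The problem showed up immediately: BitsAndBytes, GPTQ, and AWQ all quantize only 2D nn.Linear weights. MoE models store expert parameters as batched 3D tensors, [n_experts, in_dim, out_dim], for efficient grouped matrix multiplication (grouped MM). These 3D tensors hold roughly 90% of the model’s total parameters, and every standard quantization tool skips them outright. No error, no warning. They are simply left at full precision.

So quantization alone could not close the gap. I needed expert pruning.

REAP (Router-Weighted Expert Activation Pruning) scores each expert with the following formula:

$$S_j = \mathbb{E}_{x \in X_j}\left[\, g_j(x) \cdot \lVert f_j(x) \rVert_2 \,\right]$$

Intuitively, an expert that receives low gating weight and also produces small activations contributes little to the model’s output, and can be removed without significant reconstruction error. REAP then removes the lowest-scoring experts and propagates residuals to keep the topology of the model’s functional manifold intact.

The pruner works block by block: load one 750 MiB decoder layer onto the GPU, score all surviving experts, prune the lowest-scoring ones, propagate residuals, and save. One experiment covers 40 decoder layers and takes about 25 minutes. I ran seven rounds of experiments on this GPU over seven days.

The most constraining fact throughout was that the RTX 5090’s 32 GiB was simultaneously my inference platform, evaluation framework, and training environment. Pruning, evaluation, and fine-tuning competed for the same VRAM, and only one of them could run at any given time. In BF16 the model occupied 30.6 GiB, 97% of GPU capacity, leaving no room even for gradients and activations during training.

In the end, the real difficulty was not the pruning algorithm itself but this all-round competition for resources.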

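The calibration experiment that rewrote my understanding

At the start, I expected the pruning algorithm or the compression ratio to dominate model quality. The literature frames it that way too: better importance scores, better pruning decisions.

The evidence pointed elsewhere.

The first experiment (v2) used four code-focused calibration datasets: evol-codealpaca, BigCodeBench, SWE-bench, and xlam. Pure code data. The result: HumanEval+ at 72.0%, Toolcall at 44.0%, and on MMLU, two categories dropped straight to zero.

Dead categories. The model could write code but could not answer a general-knowledge question. The pruner never saw general-knowledge tokens during scoring, so it had no way to know which experts mattered for general domains. Experts carrying general knowledge were either pruned or severely weakened.

For the second experiment (v3), I switched to a 70/30 code/general mix. Same REAP algorithm, same compression ratio. I simply added two general datasets, 600 samples of MMLU and 600 samples of C4, on top of the four code datasets (700 samples each).

Category	v2 (pure code calibration)	v3 (70/30 calibration)	Change
MMLU Social Sciences	0.0%	33.3%	+33.3pp
MMLU Other	0.0%	34.3%	+34.3pp
HumanEval+	72.0%	73.2%	+1.2pp
Toolcall	44.0%	51.0%	+7.0pp

Every metric improved. MMLU recovered from dead to 33%+. The code metrics improved as well.

I tend to think of it this way: calibration data is the lens through which the pruner observes the model. A narrow lens (pure code) gives a sharp but narrow view: the pruner keeps only what it sees, and the model goes blind everywhere else. A wide lens (the 70/30 mix) lets the pruner see the full functional space, so it preserves the structures that serve the whole distribution rather than a single mode.

The analogy has one limitation: you cannot keep adding calibration domains forever. More data means a longer scoring pass. But within a practical budget the evidence is clear: the composition of the calibration data matters more than any algorithmic tuning of the pruner itself.

v3 shipped and I moved on. This is where the story really started to get interesting.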

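My conclusions were a special case of one recipe

After v3 shipped, I went back to test a hypothesis that seemed obvious: if I replaced the code calibration data with agentic traces, the pruned model should do better on agentic benchmarks. Tool calling, bug finding, multi-step reasoning: this is what the model was designed for.

I built a full BenchLocal evaluation harness, with 8 packs covering tool calling, agent capability, bug finding, data extraction, instruction following, math reasoning, structured output, and CLI, and established a v3 baseline under the new v19 chat template. The baseline was sobering: ToolCall-15 at 90, HermesAgent-20 at 16, BugFind-15 at 8. The agentic gates were low to begin with.

I ran two candidate experiments at a deeper compression ratio (0.40, keeping 154 of 256 experts), with two calibration strategies:

Mix-A: full replacement, using agentic data (glm47-reap + hermes-agent-traces) in place of the code data
Mix-B: additive, layering agentic data on top of v3’s original calibration

Both gave the same result. ToolCall-15 fell from 97 (the v3 baseline under the old reasoning parser) to 90. HermesAgent-20 stayed at exactly 16. BugFind-15 fluctuated between 3 and 10. Not one agentic metric moved. All three gates failed.

I stopped and ran a control experiment: take Mix-A’s calibration list and re-prune at v3’s compression ratio (0.289, keeping 183 experts). If compression depth caused the toolcall regression, the control should recover toolcall. If calibration content was the cause, the regression should persist.

The control result was decisive:

Candidate	Compression	Calibration	ToolCall-15	HermesAgent-20	BugFind-15
v3	0.289	70/30 balanced	97	16	8
Mix-A	0.40	Agentic replacement	90	16	10
Mix-B	0.40	Additive layering	90	16	3
v3ratio	0.289	Mix-A calibration	90	16	0

ToolCall-15 was 90 for all three candidates, including the control at v3’s compression ratio. The regression came from the calibration content, not the compression depth.

Further analysis showed the regression was concentrated in one very specific sub-dimension: Parameter Precision fell from 100 to 67. The model still picked the right tools and kept the right structure; it just generated wrongly typed or wrongly formatted arguments more often. Dropping the code corpora (evol-codealpaca, bigcodebench, swe-bench) in favor of agentic traces cost the model its strict argument-formatting ability.

The other finding was harsher. HermesAgent-20 scored 16 in all four configurations, literally identical down to the per-category scores. A 25B pruned MoE cannot handle these multi-step browser-automation scenarios, no matter what data you feed into pruning calibration. This gate is capacity-bound.

I also found that vLLM’s --reasoning-parser qwen3 flag is essential for correct evaluation. Without it, the model’s <think> reasoning block leaks into all plain-text responses and breaks every non-tool-call scorer. With the flag, ToolCall-15 rose from 90 to 97, and instruct-follow and data-extract recovered from zero to normal levels. The lesson: validate your evaluation infrastructure before you trust any number.

Session 6 ended with a clean negative result, and my conclusion was that calibration content cannot move agentic benchmarks. That conclusion lasted about twelve hours.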

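The recipe that actually broke through

Session 7 adopted a completely different calibration recipe: the REAP-26B six-dataset mix. The six datasets are SWE-bench/SWE-smith-trajectories (tool split), xlam-function-calling-60k, evol-codealpaca, and Mixture-of-Thoughts (code/math/science), at a much higher token count (1024 samples × 16384 sequence length, 16.8 million tokens in total). Router renormalization was also disabled, as the REAP-26B README recommends.

I planned three runs. Two ran first; the third, which I had initially skipped, became the follow-up:

Plan A (compression ratio 0.40, pruned from scratch). ToolCall-15 collapsed to 63, a catastrophic 27-point regression. But BugFind rose by 15 points and InstructFollow by 33. The recipe was clearly strong, too strong at this depth.

Plan C (stacked pruning on top of the upstream REAP-26B-VL). Toolcall recovered to 90, but about 90% of the recipe’s other gains were lost. Stacked pruning does not inherit the upstream calibration signal.

Plan B (compression ratio 0.289, v3’s depth, the experiment I initially skipped). This was the follow-up after both Plan A and Plan C failed.

Candidate	Compression	ToolCall-15	BugFind-15	DataExtract-15	InstructFollow-15	Verdict
v3+v19	0.289	90	8	5	20	Baseline
v7a	0.40	63	23	24	53	FailToolcall
v7c	stacked	90	0	4	16	FailAgentic
v7b	0.289	93	25	22	40	Pass

v7b-fp8 was equal to or better than v3 on all 7 measured packs. No regressions at all. BugFind +17, DataExtract +17, InstructFollow +20, ToolCall +3. Trigger verdict: Pass.

I keep coming back to this result: the same recipe collapsed toolcall to 63 at 0.40, yet lifted it to 93 at 0.289. The recipe drives the agentic gains; the compression depth modulates the toolcall trade-off. Session 6’s conclusion that “calibration content cannot move agentic benchmarks” was a special case of Mix-A’s content, not a universal property of pruning calibration.

The pipeline upgrades needed to make all this possible also deserve a mention. Calibrating at 16K sequence length required a chunked REAP scoring accumulator; the single-pass approach would have produced 67 GiB of tensors on a 32 GiB GPU. The custom FP8 quantizer (scripts/quantize_fp8.py) uses 274 lines of code to bypass llmcompressor’s compatibility problem with Qwen3.6, casting directly from BF16 to torch.float8_e4m3fn. Schema adapters paired with a 60-second preflight script catch dataset schema drift before any GPU allocation.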

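How 3D tensors defeated every quantization framework

The calibration experiments gave me v3, a model that came within a hair of passing its quality gates. HumanEval+ at 73.2% was close to the 75% threshold. MMLU at 33.6% was still short of the 40% target. The natural next step was recovery fine-tuning: SFT on the pruned model to lift the remaining benchmarks.

This is where the second assumption collapsed.

I had assumed that standard quantization tools could compress the model for training. Load the model in 4-bit, attach LoRA adapters, train. This is the default flow in every QLoRA tutorial on Hugging Face. It works on LLaMA, it works on Mistral, so it should work on Qwen too.

It does not, because the model’s 3D expert tensors are invisible to BitsAndBytes.

I spent the whole evening of day two systematically ruling out every standard training approach. Seven attempts, all failures:

Attempt	Approach	Result
1	BnB 4-bit QLoRA	Cannot quantize 3D expert tensors [183, 1024, 2048]
2	BF16 model.to('cuda')	30.6 GiB, 0 bytes left for activations
3	accelerate device_map='auto'	Keeps all layers on the GPU during backward
4	DeepSpeed ZeRO-3 (single GPU)	Trainer moves the full model to the GPU before partitioning
5	DeepSpeed zero.Init + from_pretrained	Weight loading conflicts with meta-device tensors
6	FP8 frozen weights + monkey-patched ops	grouped_mm dequantization creates a 768 MiB BF16 temporary per layer
7	FP8 + dispatch_model with a 10 GiB budget	Offloaded layers all return to the GPU during backward

I do not really understand why every framework ends up calling model.to(device) for the backward pass on a single GPU. The documentation promises CPU offloading, yet in practice DeepSpeed ZeRO-3, accelerate’s dispatch_model, and FSDP all converge on the same behavior: when gradients need to flow, put the full model on the GPU.

The solution came from a workaround I had never considered: unbatch the 3D expert tensors into independent bnb.nn.Linear4bit layers. BnB can quantize standard 2D linear layers. A 3D tensor of shape [183, 1024, 2048] becomes 183 separate Linear(2048, 1024) objects, each of which can be quantized to 4-bit.

The result: the model’s footprint on the GPU fell from 30.6 GiB to 16.8 GiB, leaving 16.9 GiB for activations and gradients. The SFT ran: 311 steps, 9,934 samples, 11.5 hours, loss down from 1.058 to 0.975, token accuracy up from 85% to 96%. By every training metric, it succeeded.

But it did not succeed.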

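The SFT trap: when training makes everything worse

I expected SFT on quantized frozen weights to improve the model. The training curves were healthy. Loss was falling. Token accuracy was climbing. Every signal said “keep training, it is converging.”

The post-fine-tuning evaluation told a different story:

Benchmark	Before fine-tuning	After fine-tuning	Change
HumanEval+	73.2%	67.7%	-5.5pp
Toolcall	51.0%	50.5%	-0.5pp
MMLU	33.6%	9.4%	-24.2pp

Every one regressed. MMLU collapsed back to v2’s level. HumanEval+ lost 5.5 points.

The mechanism is very specific and instructive. It is worth stopping to understand it carefully, because it explains a whole class of pipeline failures:

4-bit quantization injects noise into the forward pass of every frozen expert layer. The noise is deterministic (same input, same 4-bit weights, same quantization error), but it changes the activation distribution seen by the trainable router and shared-expert layers. During SFT, the trainable parameters adapt to this shifted distribution. They learn to work with the noise signature of the 4-bit experts.

When you remove the 4-bit quantization at inference time (merging the fine-tuned weights back into the original BF16 model), the noise signature disappears. The trainable parameters now run on clean activations, but they have already overfit to a distribution that no longer exists.

That is why the training metrics looked good while the benchmarks collapsed. The model had not learned to generate better code or answer knowledge questions. It had learned to compensate for quantization noise in the frozen pathway. When the noise disappeared, the compensation turned into output distortion.

One conclusion I keep coming back to: the pre-fine-tuning v3 model was the project’s best output at that time. Calibration strategy was the lever. Fine-tuning was a trap.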

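What I would do if I started over

If this project started from scratch, I would change many things. The later experiments made me realize that some early conclusions were incomplete, so the recommendations below reflect the full picture after seven rounds of experiments.

What I did	What I should do	Why
Ran SFT directly at production scale	Validate on a 2-layer toy model first	Seven failed approaches and about 4 hours of debugging would surface in minutes on a toy model
Shared one venv between evaluation and training	Isolate venvs from the start	Upgrading huggingface_hub from 1.5 to 1.14 broke vLLM’s weight loading
Ran SFT before exhausting the calibration experiments	Finish all calibration experiments first	The 50/50 experiment was never attempted, and calibration is the main lever
Shallow pruning (183/256) + 4-bit SFT	Deep pruning (154 experts) + clean BF16 SFT	Avoids the noise-overfitting trap
Assumed agentic calibration would lift the agentic gates	Test the REAP-26B recipe at the original depth first	Session 6’s clean negative result was specific to Mix-A; Plan-B passed across the board at v3’s depth
Changed calibration and compression depth together (Mix-A, Mix-B)	Run “new calibration at the same compression” as the default isolation step	The v3ratio control flipped the whole interpretation; always isolate the variable
Treated “capacity-bound benchmarks” as a universal conclusion	Measure with multiple recipes before declaring a ceiling	With the right recipe, BugFind rose 17 points with no change in parameter count

The most painful lesson is also the most transferable: validation loops catch pipeline bugs, but they cannot catch strategy bugs. The SFT pipeline ran correctly (no crashes, no OOM, healthy training curves) and still produced a worse model.

The experiment I most regret not running is the 50/50 calibration mix. If 30% general data pulled MMLU from 0% to 33%, 50% might take it past 40%. That experiment needs only 25 minutes. The SFT that took its place cost 11.5 hours and made everything worse.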

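Boundary conditions

The finding on the importance of calibration composition was obtained with the REAP algorithm on the Qwen3.6-35B-A3B model. It very likely transfers to other MoE models and pruning algorithms, but I have not verified that.
The REAP-26B recipe finding (v7b-fp8 beating v3) is limited to this particular calibration mix and compression depth. Whether it generalizes to other MoE scales is an open question.
Session 6’s finding that “calibration content drives the toolcall regression” is specific to Mix-A’s content (glm47-reap + hermes-agent-traces). The REAP-26B recipe showed a positive toolcall change at the same depth. The finding is recipe-specific, not universal.
HermesAgent-20 stayed at 16/20 across all seven sessions and every configuration tested. At this model scale it really is capacity-bound.
The 4-bit expert unbatching technique comes at the cost of inference speed: a separate Linear4bit forward pass per expert is slower than native grouped matrix multiplication.
The SFT degradation result refers specifically to training with 4-bit frozen experts on this model architecture. SFT with FP8 frozen experts or full BF16 may behave differently.
The single-GPU constraint shaped every conclusion. On multi-GPU hardware the trade-offs change significantly.
The v19 chat template costs about 7 toolcall points relative to v18 (90 versus 97 on the same model). All Session 7 comparisons are within the same template, but comparing directly against Session 6 numbers requires accounting for the template change.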

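Open questions

Can SFT on v7b-fp8 move HermesAgent-20? It is the only pack v7b did not improve, stuck at 16/20. Session 6’s “capacity-bound” conclusion was wrong for BugFind (the right recipe moved it 17 points with no change in parameter count). The same may be true for HermesAgent, with the right recipe simply not found yet. But SFT on real agent traces is a different kind of path, and the infrastructure for it exists but has not been validated on v7b.
Can 50/50 calibration push MMLU above 40%? The experiment needs only 25 minutes and was never scheduled.
Can Transformer Engine FP8 training deliver quality SFT without falling into the noise-overfitting trap? The tooling is already installed on sm_120. Not yet tested.
Does the REAP-26B recipe replicate on other MoE model families? On Qwen3.6 it drove +17 gains on multiple benchmarks. Would it be similar on DeepSeek, Mixtral, or OLMoE?
Should stacked pruning be used at all? Session 7 showed that it destroys the upstream calibration signal. But if the upstream calibration is itself expensive (24K samples on a 96 GB GPU), stacking a cheap re-prune on top seems like it should work in theory. The empirical result was negative. I do not really understand why.

A note to myself six months from now: run the 50/50 calibration experiment first. If it pushes MMLU above 40%, the whole SFT effort was wasted. Also, set up the v7b symlink; it is the best model you have ever had.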

References

Lasby et al., “REAP the Experts: Why Pruning Prevails for One-Shot MoE Compression”, arXiv:2510.13999, 2025.
Dery et al., “Finding Fantastic Experts in MoE Models”, arXiv:2504.15447, 2025.
Zhang et al., “Efficient Expert Pruning in MoE LLMs”, arXiv:2505.12345, 2025.
BitsAndBytes, Hugging Face quantization library, https://github.com/bitsandbytes-foundation/bitsandbytes.
TRL: Transformer Reinforcement Learning, Hugging Face, https://github.com/huggingface/trl.

MoE Expert Pruning: What Works, What Doesn't, and What We Still Don't Know

Mon, 11 May 2026 00:00:00 +0000

I spent the last week reading seven papers on expert compression for Mixture-of-Experts models. I went in assuming the landscape was settled: expert pruning was a useful technique for shrinking models, expert merging was a promising alternative, and the choice between them was mostly a matter of taste. I came out with a very different picture. The pruning-vs-merging debate flipped in late 2025, and the reason it flipped tells you something fundamental about how these models actually work.

The core tension is simple. Sparse Mixture-of-Experts models activate only a fraction of their parameters per token, but they still occupy all of them in memory. Mixtral 8×7B activates 2 of 8 experts per layer — yet all 8 experts (45B of 47B total parameters) must sit on the GPU. NLLB-200 has 1,536 experts; you need four 32GB GPUs just to load the thing. Expert compression asks: can we drop or combine the experts that rarely get used, and how much does it cost?
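
A back-of-envelope calculation shows the size of the mismatch. Using the rounded Mixtral figures above, and assuming expert parameters are split evenly across the 8 experts (an assumption of this sketch), the parameters touched per token come to a bit over a quarter of what must stay resident:

```python
def active_params_b(total_b, expert_b, n_experts, top_k):
    """Parameters (in billions) used per token in a top-k MoE.

    Assumes expert parameters are spread evenly across experts and that
    all non-expert parameters are always active.
    """
    shared_b = total_b - expert_b
    return shared_b + expert_b * top_k / n_experts


# Rounded figures from the paragraph above: 47B total, 45B in experts.
resident_b = 47.0
active_b = active_params_b(total_b=47.0, expert_b=45.0, n_experts=8, top_k=2)
active_share = active_b / resident_b
```

Because the inputs are rounded, the result is only an estimate, but the ratio is what motivates expert compression: memory scales with all experts, compute with the few that fire.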

The answer, across all seven papers, is surprising in two ways. First, you can drop far more experts than I expected, and the costs are concentrated in specific capabilities rather than spread evenly. Second, and more unexpectedly, when it comes to actual generative tasks — code, math, creative writing — expert pruning is decisively better than expert merging. The merging methods that looked good on multiple-choice benchmarks collapse on tasks that require the model to actually generate tokens. The reason is structural, not empirical: merging removes the router’s fine-grained control over experts, and on generative tasks, that control matters.

Why expert pruning works at all

Here’s the one-sentence model: expert utilization in MoE models is long-tailed — a handful of experts do most of the work, and the rest are along for the ride.

Think of it like a restaurant kitchen during a dinner rush. You have eight chefs at eight stations. On a given night, two chefs handle 80% of the orders. The other six are standing by, occasionally contributing a garnish, mostly drawing salary. If you fire four of them and redistribute their stations, dinner still gets served — maybe even faster, because the remaining chefs spend less time coordinating. That is expert pruning.

The numbers back this up. Heatmap analysis of Mixtral 8×7B on MMLU shows stark unevenness: Expert #2 in Layers 26 and 30 is heavily activated while Expert #7 in Layers 22 and 23 is barely touched [1]. The same pre-trained MoE model produces substantially different expert contribution patterns when fine-tuned on different tasks [2]. This task-specificity cuts both ways — it means pruning must be calibrated to the deployment domain, but it also means aggressive pruning is possible for narrow use cases.

The distribution is not just long-tailed — it’s task-dependent. An expert that dominates on MNLI might be silent on CoLA. This is the core insight behind every pruning strategy: you are not removing universal knowledge. You are removing specialists that your specific task does not call.

Where the analogy breaks: experts are not independent chefs. The router learns to distribute tokens across experts during pre-training, and removing experts changes the routing distribution for the survivors. This is why naive pruning based on activation frequency alone performs worse than random pruning [3]. The router expects a full kitchen.

Knowledge redundancy: the surprising overcapacity result

Here is the finding that made me stop and re-read: pruning 4 of 8 experts in Mixtral 8×7B-Instruct improves SQuAD accuracy from 53.4% to 75.4%, without updating any remaining expert parameters [4]. This is not a typo. Removing half the experts makes the model better at question answering.

The mechanism: pruning simplifies the routing problem. With 8 experts, the router must learn to partition the hidden space across many specialists — a hard optimization problem. With 4, the routing is easier, and the remaining experts each get a cleaner slice of the input distribution. The router stops sending ambiguous tokens to the wrong specialist.

This overcapacity effect is not universal, but it recurs: on data-limited downstream tasks, a single-expert model can outperform the full multi-expert counterpart. After fine-tuning a pruned Mixtral 8×7B on MetaMathQA, the 7-expert model slightly exceeds the original 8-expert model on GSM8K (81.50 vs. 81.43) [3]. A single expert in Mixtral 8×7B-Instruct operates without model collapse [4].

Pruning vs. merging: the debate that flipped in 2025

Until mid-2025, the story seemed clear. Expert merging — clustering and averaging experts rather than discarding them — was winning. M-SMoE and HC-SMoE showed that merging outperformed pruning when measured by perplexity and multiple-choice question answering benchmarks. If you only looked at those numbers, merging was the smarter choice. Retain information from all experts. Avoid the binary brutality of pruning.

Then REAP showed up and asked: what happens when you actually make these models generate tokens?

The answer is a head-on collision between the two approaches. On code generation, REAP achieves a mean accuracy decrease of only 1.9% at 25% compression and 6.9% at 50% compression. Merging methods? HC-SMoE and M-SMoE degrade more than 5% at 25% and more than 20% at 50% [7]. On creative writing and mathematical reasoning, the same pattern holds. Merging is not slightly worse — it is qualitatively broken on generative tasks at 50% compression.

Lasby et al. didn’t just report the numbers. They derived why this has to be the case. When a router selects two experts f_i and f_j for a token, it produces a dynamic mixture r(x)·f_i(x) + (1−r(x))·f_j(x), where the mixing ratio r(x) depends on the input. After merging, the router must apply the summed gate to a constant convex combination — a static merged expert. The merged model must approximate a dynamic, input-dependent target with a static one. The resulting irreducible error is proportional to the router’s policy variability Var[r(x)] and the functional gap between the merged experts ∥Δ_ij∥ [7].
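To convince myself of the argument, I find a toy version helpful. The sketch below is not the paper's derivation: it assumes two experts with constant scalar outputs (so the functional gap does not depend on the input), a gate that sums to one, and a uniformly random mixing ratio. All names are hypothetical. Under those assumptions the best static merge still leaves an error floor equal to Var[r(x)] times the squared gap.

```python
import random
import statistics

random.seed(0)

# Toy assumption: two experts with constant scalar outputs.
f_i, f_j = 3.0, -1.0
gap = f_i - f_j

# Input-dependent mixing ratio r(x), one sample per token.
r = [random.random() for _ in range(2000)]


def merged_mse(alpha):
    """MSE of the static merge alpha*f_i + (1-alpha)*f_j against
    the dynamic target r(x)*f_i + (1-r(x))*f_j."""
    merged = alpha * f_i + (1.0 - alpha) * f_j
    errors = []
    for r_x in r:
        target = r_x * f_i + (1.0 - r_x) * f_j
        errors.append((target - merged) ** 2)
    return statistics.fmean(errors)


# The best static ratio is the mean of r(x); no alpha can do better.
best_alpha = statistics.fmean(r)
floor = merged_mse(best_alpha)
predicted = statistics.pvariance(r) * gap ** 2

print(f"error floor of the best static merge: {floor:.6f}")
print(f"Var[r] * gap^2:                       {predicted:.6f}")
```

The two printed numbers agree, and the floor only vanishes if the router never varies its mixing ratio or the two experts compute the same function.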

Pruning doesn’t have this problem. When you prune expert j, the router still controls each surviving expert independently. Pruning only incurs error when the pruned expert was in the top-k set, and that error is proportional to its gate-value g_j; it does not penalize policy variability at all [7]. The mathematical difference is clean: pruning is a coordinate subspace operation that preserves the functional manifold’s topology. Merging introduces novel functions and collapses the manifold toward its center, shrinking its spread by up to 100× in late layers of high-granularity models [7].

Here’s what this means in practice. I can now look at a compressed model and predict failure modes based on the operation, not just the sparsity level. Merged model outputs have significantly lower N-gram diversity and their logits diverge from the original model more rapidly during auto-regressive generation [7]. The tokens drift. The model stops sounding like itself. MC benchmarks missed this entirely because they never asked the model to string tokens together — they only asked it to rank answer choices in a single forward pass.

One more uncomfortable finding: when merging does work well, look closer. HC-SMoE produces a high prevalence of singleton clusters — single-expert clusters that are functionally indistinguishable from keeping the expert unmerged [7]. The “merging” that succeeds is pruning plus a few mega-clusters of the truly redundant experts. And those mega-clusters are fragile: restricting the maximum cluster size to 32 experts causes large accuracy drops [7].

A separate problem compounds this. The L2-distance between clustered expert weights, even after weight-matching permutation, greatly exceeds the distance between pretrained and instruction-fine-tuned checkpoints. Singular-vector alignment remains poor [7]. Merging experts is fundamentally harder than the widely successful technique of model merging, and we should stop assuming the two are similar problems.

How to score expert importance

The choice of importance criterion is the single biggest lever in expert pruning. I organize the criteria by what information they use.

Criterion	Source	What it measures	Best result
Alpha score (accumulated gating weight)	Chen et al. 2022 [2]	Weighted contribution to output	Single expert preserves 99.3% of full MoE
Soft counting (accumulated softmax)	Muzio et al. 2024 [1]	Confidence margin of selection	25% sparsity: 3.85 pp MMLU drop
Min-EAN (activation norm)	Jaiswal et al. 2025 [5]	Minimum activation magnitude	14.02 PPL at 75% sparsity
REAP (conditional g_j∥f_j∥)	Lasby et al. 2025 [7]	Gate × activation, conditional	Near-lossless at 50%, up to 1T params
Importance product (top1 × exp(conf))	Koishekenov et al. 2023 [6]	Combined activity and confidence	80% pruning, chrF++ Δ = 0.29
Activation frequency alone	Lu et al. 2024 [3]	Simple token count	Worse than random

REAP deserves special attention because it’s the first criterion explicitly designed to minimize the reconstruction error bound. Its saliency score computes the conditional average of g_j(x)·∥f_j(x)∥ over only those tokens where expert j is active [7]. This decouples functional impact from usage frequency — a specialist expert that activates rarely but contributes heavily when it does won’t be pruned just because it’s infrequent. Min-EAN held the previous crown among 16 criteria benchmarked by MC-Suite [5]. REAP now looks like the new baseline for generative tasks, especially at scale.
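A minimal sketch of that conditional average, next to plain activation frequency. It assumes "active" means "among the top-k gate values for the token" and that gate values and output norms were recorded during calibration; the function name and the toy numbers are hypothetical, and a real implementation would work on batched tensors.

```python
def reap_scores(gates, norms, top_k):
    """gates[t][j]: router gate value of expert j on token t.
    norms[t][j]: L2 norm of expert j's output on token t.
    Returns (saliency, frequency), one entry per expert."""
    n_experts = len(gates[0])
    total = [0.0] * n_experts
    count = [0] * n_experts
    for g_t, n_t in zip(gates, norms):
        # Assumption: an expert is active on a token if it is in the top-k gates.
        active = sorted(range(n_experts), key=lambda j: g_t[j], reverse=True)[:top_k]
        for j in active:
            total[j] += g_t[j] * n_t[j]
            count[j] += 1
    # Average only over the tokens where the expert was active.
    saliency = [total[j] / count[j] if count[j] else 0.0 for j in range(n_experts)]
    return saliency, count


# Toy calibration data: expert 2 fires once, but with a large output norm.
gates = [
    [0.6, 0.4, 0.0],
    [0.7, 0.3, 0.0],
    [0.5, 0.5, 0.0],
    [0.1, 0.0, 0.9],
]
norms = [[1.0, 0.5, 4.0]] * 4

saliency, frequency = reap_scores(gates, norms, top_k=2)
print("frequency:", frequency)
print("saliency: ", [round(s, 3) for s in saliency])
```

On this toy data, frequency would prune expert 2 (the rare specialist), while the conditional score ranks it highest and marks expert 1 as the least important.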

The easy heuristic is still wrong. Simple activation frequency — counting how many tokens each expert processes — does worse than random selection [3]. The router’s assignment frequency is not the same as contribution.

Domain-specific calibration delivers the biggest gap I’ve seen in any compression result. When REAP calibrates on C4 (general pre-training data) instead of domain-specific data (evol-codealpaca for code), code generation accuracy collapses: some compressed models score 0%, failing to output coherent code at all [7]. This is not a matter of degree. The calibration dataset determines whether the pruned model works or is completely useless on the target task. And this was already visible in earlier work: using MATH instead of C4 for calibration shifts expert selections in 28 of 32 layers of Mixtral 8×7B [3].

Pruning strategies: the choices that matter

Once you have an importance score, you need to decide how to use it. Three choices define your strategy.

Choice	Options	Trade-off
Scope	Global vs. layer-wise	Global: better quality but variable per-layer counts. Layer-wise: fixed memory layout but lower ceiling
Schedule	One-shot vs. iterative	One-shot: fast but importance rankings are stale post-pruning. Iterative: ~2× better PPL but needs re-estimation
Timing	Eager vs. staged	Eager: more optimization steps for survivors. Staged: better importance estimates from longer observation

Global vs. layer-wise. Global pruning — sorting all experts across all layers by a single importance ranking — outperforms layer-wise on quality because it avoids the constraint of keeping a fixed number per layer [1]. But it creates deployment headaches: variable per-layer expert counts mean variable memory usage across tasks, requiring model recreation for each configuration [6]. Layer-wise pruning gives predictable memory layouts at the cost of some quality.
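The difference is easy to see on a toy score table. This sketch assumes importance scores are directly comparable across layers, which real criteria may need normalization to satisfy; the function names and numbers are hypothetical.

```python
def prune_layerwise(scores, n_drop_per_layer):
    """Drop the n lowest-scoring experts in every layer."""
    dropped = []
    for layer, layer_scores in enumerate(scores):
        order = sorted(range(len(layer_scores)), key=lambda e: layer_scores[e])
        dropped += [(layer, e) for e in order[:n_drop_per_layer]]
    return sorted(dropped)


def prune_global(scores, n_drop_total):
    """Drop the n lowest-scoring experts across all layers at once."""
    flat = [(s, layer, e) for layer, row in enumerate(scores) for e, s in enumerate(row)]
    flat.sort()
    return sorted((layer, e) for _, layer, e in flat[:n_drop_total])


# Two layers, four experts each. Layer 1 has one dominant expert.
scores = [
    [0.90, 0.80, 0.70, 0.60],
    [0.95, 0.10, 0.20, 0.05],
]

print("layer-wise:", prune_layerwise(scores, 2))
print("global:    ", prune_global(scores, 4))
```

Both calls remove four experts. Layer-wise leaves two per layer; global leaves three in layer 0 and one in layer 1, which is the variable memory layout described above. Without a per-layer floor, a global ranking could also empty a layer entirely.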

One-shot vs. iterative. One-shot pruning drops experts in a single pass. The problem: after you remove experts, the importance rankings of the survivors change. Iterative pruning re-estimates importance after each round, achieving ~2× better perplexity. Add task-agnostic finetuning between rounds and you get ~3× better [5]. One-shot and iterative pruning identify substantially different subsets of experts at the same sparsity level — they produce effectively different subnetworks [5]. REAP demonstrates that with the right criterion, one-shot pruning can be remarkably effective even at 50% compression on models up to 1T parameters [7], but the iterative advantage likely still holds.

Eager vs. staged. Eager (progressive) pruning drops experts early using a dynamic threshold T = β / Z where Z is the number of surviving experts [2]. The earlier you drop, the more training steps you can dedicate to the selected expert. Eager consistently wins [2].

The NLLB-200 special case: language-specific pruning

The NLLB-200 translation model surfaces a phenomenon that the Mixtral papers miss: language-specific expert emergence. In the decoder, Jaccard similarity of selected experts is 68–87% for the same target language versus only 13–39% for different target languages [6]. Per-language pruning (source language for encoder, target language for decoder) performs as well as per-language-pair pruning while requiring only L configurations instead of L² [6]. An unbalanced 3:1 encoder-to-decoder ratio yields the best quality [6].
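For reference, the Jaccard similarity behind those percentages is just intersection over union of the selected expert sets. The expert indices below are made up for illustration and are not taken from the paper.

```python
def jaccard(a, b):
    """Jaccard similarity of two collections of expert indices."""
    a, b = set(a), set(b)
    union = a | b
    return len(a & b) / len(union) if union else 1.0


# Hypothetical top-5 decoder experts for three translation directions.
same_target_a = {3, 7, 12, 19, 25}   # e.g. one source language into target X
same_target_b = {3, 7, 12, 19, 30}   # e.g. another source language into target X
other_target = {1, 7, 14, 22, 30}    # a different target language

print(f"same target:      {jaccard(same_target_a, same_target_b):.2f}")
print(f"different target: {jaccard(same_target_a, other_target):.2f}")
```

High overlap for a shared target language is what makes a single per-language expert set reusable across language pairs.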

Beyond pruning: complementary techniques

Static expert pruning rarely stands alone. Four complementary techniques compound its gains, and there’s now a clearer distinction between approaches that help and approaches that hurt.

Expert merging (the post-pruning variant — and why it’s different)

EEP’s expert merging is not the same thing as HC-SMoE or M-SMoE. EEP merges pruned expert knowledge into survivors after pruning, using learned Router Mapping and Expert Merging matrices, adding 5–7% accuracy improvement [4]. This is a knowledge transfer operation — the pruned experts are already gone and their useful information is folded into the survivors. It’s fundamentally different from the HC-SMoE/M-SMoE approach of replacing entire expert groups with merged averages, which removes router independence and causes the collapse described above. The EEP variant is a net positive. The HC-SMoE variant is not, unless you’re only evaluating on multiple-choice.

Dynamic expert skipping

Static pruning removes experts permanently. Dynamic skipping removes them conditionally — dropping the second-ranked expert for a token when its routing weight is below a threshold β times the top expert’s weight, yielding ~50% skipping probability [3]. The key finding: skipping is complementary to pruning. A model pruned to 6 experts with skipping achieves the same speedup as pruning alone to 4 experts, but with higher accuracy [3]. You get the speedup without the full accuracy cost.
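A minimal sketch of the skipping rule for top-2 routing. Renormalizing the surviving weights is my assumption, beta is left as a free parameter, and the function name is hypothetical.

```python
def route_with_skipping(gate_weights, beta):
    """Top-2 routing that skips the second expert when its weight is
    below beta times the top expert's weight.
    Returns a list of (expert_index, weight) pairs."""
    ranked = sorted(range(len(gate_weights)), key=lambda e: gate_weights[e], reverse=True)
    top1, top2 = ranked[0], ranked[1]
    w1, w2 = gate_weights[top1], gate_weights[top2]
    if w2 < beta * w1:
        # Second expert contributes little: run only the top expert.
        return [(top1, 1.0)]
    # Assumption: renormalize the two kept weights so they sum to one.
    total = w1 + w2
    return [(top1, w1 / total), (top2, w2 / total)]


confident = route_with_skipping([0.70, 0.10, 0.15, 0.05], beta=0.5)
ambiguous = route_with_skipping([0.45, 0.40, 0.10, 0.05], beta=0.5)
print("confident token:", confident)
print("ambiguous token:", ambiguous)
```

The confident token runs one expert and the ambiguous token runs two, so the compute saving depends on how often the router is confident.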

Active expert reduction and finetuning

Switching from top-2 to top-1 expert activation reduces forward-pass FLOPs by ~27% in Mixtral [1], but zero-shot top-1 routing drops SST5 accuracy from 50.8% to 42.6%. Recovery via entropy-based gating regularization plus annealing top-k reduction closes most of this gap (51.8% vs. 53.6% top-2) [1].

Task-agnostic finetuning (~1M tokens; benefits saturate) corrects the skewed load distribution caused by removing router entries. It doesn’t change which experts are selected — it mitigates impact through load rebalancing. This finetuning is central enough that iterative prune-estimate-finetune cycles produce what Jaiswal et al. call MoE Lottery Subnetworks [5].

Quantization after pruning

Pruning combines naturally with quantization without additional steps, unlike merging which requires block-scale reconciliation for block quantization formats [7]. Combining REAP with 4-bit quantization on Kimi-K2 achieves 87.5% total size reduction — a compression rate neither technique can reach alone [7].
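One way to arrive at a figure like 87.5% is simple multiplication. The sketch assumes a 16-bit baseline, treats all weights as prunable expert weights, and ignores quantization metadata such as scales; these are my simplifications, not the paper's accounting.

```python
def remaining_fraction(keep_ratio, bits_after, bits_before=16):
    """Fraction of the original size left after keeping `keep_ratio` of the
    weights and storing them at `bits_after` instead of `bits_before` bits."""
    return keep_ratio * bits_after / bits_before


prune_only = 1 - remaining_fraction(0.5, 16)   # drop half the experts
quant_only = 1 - remaining_fraction(1.0, 4)    # 16-bit to 4-bit
combined = 1 - remaining_fraction(0.5, 4)      # both

print(f"pruning alone:      {prune_only:.1%} smaller")
print(f"quantization alone: {quant_only:.1%} smaller")
print(f"combined:           {combined:.1%} smaller")
```

The reductions multiply on the remaining size rather than add, which is why the combination reaches a rate neither step gets to alone.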

What the numbers actually say

Across all seven papers, the efficiency-performance trade-off is more favorable than I expected.

At moderate sparsity (25–50% experts removed), the accuracy cost on generative tasks is remarkably low — provided you prune, not merge. REAP achieves a 1.9% mean accuracy decrease at 25% compression and 6.9% at 50% on coding benchmarks [7]. On Qwen3-Coder-480B and Kimi-K2, pruning 50% of experts drops code generation accuracy by only 1.2% [7]. On SWE-Bench (agentic software engineering), REAP-pruned Kimi-K2 at 50% compression actually slightly exceeds the baseline (0.576 vs. 0.554) [7].

Compare with merging at the same compression: HC-SMoE and M-SMoE see >5% accuracy decrease at 25% and >20% at 50% on the same coding benchmarks [7]. Merging looks reasonable on MC benchmarks (~4% decrease at 25%) but the MC numbers don’t predict generative performance. This gap — between discriminative and generative evaluation — is what the pre-REAP literature missed.

At high sparsity (75–80% experts removed), the numbers depend heavily on task type and recovery technique. At 75% sparsity, Min-EAN achieves 14.02 PPL versus 34.47 random [5]. NLLB-200 at 80% pruning achieves chrF++ 36.61 versus 36.81 full — a delta of −0.2 [6]. Expert dropping predominantly degrades instruction-following, not pretraining knowledge or reasoning; these capabilities can be substantially restored through K-shot examples or fine-tuning [5].

The fastest path to deployment, based on the evidence, is: Base model → expert pruning → finetuning → instruction tuning. Expert dropping yields greater benefits before instruction tuning than after [5]. With SFT after pruning, high-sparsity models can outperform full counterparts on easier tasks like BoolQ and ARC-easy.

Where the standard story breaks

The standard story: expert utilization is long-tailed. You prune the tail. Light finetuning recovers the loss. Any compression method that reduces the expert count should work about as well.

This story is wrong in ways that matter, and REAP is the paper that forced the correction.

Pruning and merging are not interchangeable. They produce qualitatively different models with different failure modes. Merging loses the router’s input-dependent control — an irreducible error proportional to the router’s policy variability. Pruning preserves it. On discriminative tasks, the difference is hidden because ranking answers in a single forward pass doesn’t require the model to maintain coherent generation. On generative tasks, the difference is dramatic [7].

Discriminative metrics like perplexity and MC accuracy are poor proxies for generative quality. This sounds obvious in retrospect, but the field relied on these metrics to claim merging > pruning. Jaiswal et al. had already warned that perplexity can be misleading for compressed LLMs [5]. REAP proved it with a clean experiment: merging methods that looked competitive on MC benchmarks collapsed on code generation to the point of producing 0% accuracy outputs [7]. If you evaluate a compressed model only on MC, you haven’t evaluated it at all for real-world use.

Expert-level sparsification still beats weight pruning, and the argument is now stronger. Across equivalent sparsity levels, dropping whole experts outperforms Wanda by ~3.6% average accuracy and ~16.2% on ARC-c [5]. And expert pruning preserves manifold topology while weight pruning may not — the geometric analysis from REAP [7] provides a structural argument for why whole-expert removal is the more principled approach.

High-vocabulary-coverage experts hurt when dropped — the specialist-generalist tension is real. If an expert handles many distinct tokens, removing it does outsized damage [5]. This suggests that pre-training methods that push experts toward specialization may make pruning easier in one sense (more experts are “dispensable”) but harder in another (the remaining generalist experts carry structural load that can’t be removed).

Dominant experts have lower stable-rank — a clean signal for identification but not yet exploited for additional compression [5].

The second-pass degradation puzzle. Two-pass eager-drop pruning degrades performance compared to a single pass, with average GLUE dropping by 0.58 points [2]. More iteration is not always better. But the REAP paper shows that a strong criterion in a single pass can go remarkably far — one-shot REAP on a 1T-parameter model at 50% compression is near-lossless on code [7]. The lesson isn’t “one-shot > iterative.” It’s that criterion quality and scale-appropriate calibration dominate the one-shot vs. iterative trade-off.

Boundary conditions

The enumeration-based pruning approach in Lu et al. [3] works for 4–8 experts per layer but becomes computationally intractable at 32+ experts. The combinatorial explosion is unresolved.
Gradient-free methods like EEP’s evolutionary strategy [4] have been studied only on the Mixtral family. Whether they generalize to architectures with many more experts is unknown.
HC-SMoE’s mega-clusters containing tens of experts are fragile — restricting maximum cluster size to 32 causes large accuracy drops [7]. Coherently merging many experts remains an open problem.
Hallucination and over-generation have been observed in pruned translation models, with global threshold methods more sensitive than fixed-per-layer pruning [6].
All seven papers study static expert counts per layer. None address dynamic architectures where expert count varies by input complexity.
Qwen2-MoE experts are notably homogeneous — the “expert specialization” narrative is architecture-dependent [4].
Merging methods require recording activations from every expert for every token during calibration, making them more expensive at scale than pruning methods [7].
The pruning-vs-merging analysis from REAP [7] applies to one-shot, no-fine-tuning compression. Whether fine-tuning after merging can recover the policy variability loss is not addressed.

Open questions

The REAP criterion’s derivation minimizes a reconstruction error bound assuming one-shot pruning. Can the same router-gate × activation-norm logic be extended to iterative pruning, and does it produce even better results?
Merging fails on generative tasks because it removes router independence. But could you train a model to be merge-friendly — by regularizing expert functional similarity or router policy smoothness — and get the memory savings of merging without the generative collapse?
What is the interaction between expert pruning and quantization at scale? REAP showed the combination works [7], but only on one model family (Kimi-K2 at 4-bit). Do pruned experts tolerate lower-bit quantization better or worse than full experts?
The “MoE Lottery Subnetworks” framing [5] has only been studied up to Mixtral 8×22B. Does it hold at the scale REAP demonstrated (480B–1T parameters)?
The vocabulary coverage finding [5] — high-coverage experts hurt when dropped — implies a tension with specialization. If you make experts more specialized, you might make pruning easier, but you risk creating fragile specialists that cannot be removed. Which direction wins?
No paper in this set studies pruning during pre-training rather than post-training. Could you train an MoE from scratch knowing it will be pruned and get a better result?

Six months from now: has the community converged on REAP as the default one-shot pruning criterion, or has the merging community produced a variant that recovers router independence and closes the generative-task gap?

References

[1] Muzio et al., “SEER-MoE: Sparse Expert Efficiency through Regularization for Mixture-of-Experts”, arXiv:2404.05089, 2024.
[2] Chen et al., “Task-Specific Expert Pruning for Sparse Mixture-of-Experts”, arXiv:2206.00277, 2022.
[3] Lu et al., “Not All Experts are Equal: Efficient Expert Pruning and Skipping for Mixture-of-Experts Large Language Models”, ACL 2024.
[4] Liu et al., “Efficient Expert Pruning for Sparse Mixture-of-Experts Language Models: Enhancing Performance and Reducing Inference Costs”, arXiv:2407.00945, 2024.
[5] Jaiswal et al., “Finding Fantastic Experts in MoEs: A Unified Study for Expert Dropping Strategies and Observations”, arXiv:2504.05586, 2025.
[6] Koishekenov et al., “Memory-efficient NLLB-200: Language-specific Expert Pruning of a Massively Multilingual Machine Translation Model”, ACL 2023.
[7] Lasby et al., “REAP the Experts: Why Pruning Prevails for One-Shot MoE Compression”, arXiv:2510.13999, 2025.

LLM/VLM Compression Foundations

Sun, 10 May 2026 00:00:00 +0000

I started looking at model compression because the numbers didn’t add up. My GPU has 24GB of VRAM and the models I want to run need 40GB. The gap is a factor of about 1.7, which quantization claims to close. But then I found papers about pruning, and distillation, and token compression, and hardware-aware NAS, and suddenly the question wasn’t “which technique” but “which combination, in what order, for which hardware.”

This article is my attempt to organize what I’ve learned into a coherent map. It is not a survey — there are good surveys for that. It is a working notebook: what I understand, what surprised me, and what I still can’t explain.

Thesis: Compression works because neural networks are overparameterized for the expressivity they actually use. The hard part is knowing which bits are the ones that don’t matter — and that answer depends on what you’re compressing (text vs. vision-language), how you remove it (prune, quantize, or distill), what order you apply the steps (P-KD-Q), and what hardware runs the result.

Scope: This covers the foundations of LLM and VLM compression — the three pillars (pruning, quantization, distillation), token compression, NAS for compression, the empirical ordering evidence, failure modes, and hardware decision rules. It does not cover training-from-scratch efficiency, inference serving systems (vLLM, TensorRT-LLM) beyond their connection to compression, or retrieval-augmented generation.

Prerequisites: This assumes familiarity with transformer architectures, basic neural network training (backpropagation, gradient descent), floating-point representation, and cross-entropy loss.

1. Overparameterization is the precondition

If models weren’t overparameterized, compression wouldn’t work. The Lottery Ticket Hypothesis put this on the map in 2018: dense, randomly-initialized networks contain subnetworks that, trained in isolation, match the full network’s accuracy. For modern LLMs, the numbers are concrete: up to 30% of parameters can be pruned with negligible loss, and models retain 98-99% of their original capabilities at 15% pruning.

This overparameterization isn’t a mistake. Sparse architectures are hard to train from scratch. We train dense and then compress because that’s what the optimization surface allows.

The shape of the redundancy matters, and it differs by modality. This is something I initially underestimated — I thought all redundancy was weight-level, but the token-level and modality-dependent patterns are just as important for practical compression decisions.

Modality	Redundancy pattern	Scale
Images	Spatial — neighboring patches share textures/colors	—
Video	Spatiotemporal — consecutive frames share backgrounds; at 10fps, 1000 tokens/frame, a 90-min video yields ~54M tokens	54M tokens/video
Audio	Salient info concentrates in sparse, brief segments and specific frequency bands	—
MLLM sequences	>50% of tokens get minimal attention; multimodal tokens are >80% of sequences in reasoning tasks	>80% of sequence

All compression exploits some version of this: there are bits you can throw away because they don’t change the output. The question is which bits.

2. The three pillars, and two newer additions

The literature converges on five categories. Three dominate practice:

Method	Mechanism	What it reduces	Tuning required
Quantization	Lower-bit weight/activation representation	Memory, potentially speed	Often tuning-free for LLMs
Pruning	Remove unimportant weights or structures	Parameters, compute	Recovery training at high ratios
Distillation	Transfer knowledge from large → small model	Parameters, compute	Training a student

Token compression and Neural Architecture Search sit alongside these — newer, less universal, but important for specific scenarios.

2.1 Quantization: the hardware-sensitive frontier

Quantization converts float32/float16 weights to fewer bits. The fundamental tension: non-uniform quantization achieves higher accuracy because weights aren’t uniformly distributed, but uniform quantization gets hardware support. You cannot have both accuracy and hardware efficiency simultaneously with existing methods.

A critical asymmetry drives the research: weights are easy to quantize, activations are hard because of outlier distributions. SmoothQuant addresses this by migrating quantization difficulty from activations to weights via per-channel scaling:

$$Y = \left(X \cdot \text{diag}(s)^{-1}\right) \cdot \left(\text{diag}(s) \cdot W\right)$$

This enables W8A8 quantization with minimal accuracy loss; the SmoothQuant paper reports up to 1.56× speedup and 2× memory reduction. The idea is simple: smooth the activation outliers into the weights, where they do less damage. The execution, though, requires careful per-channel scaling factors.
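Here is a small numeric sketch of the rescaling, in plain Python so the arithmetic is visible. The scale rule used (per-channel activation maximum raised to alpha, divided by the weight maximum raised to 1 minus alpha) and alpha = 0.5 follow the SmoothQuant formulation as I understand it; the matrices are made up, and no actual INT8 quantization is performed.

```python
def matmul(a, b):
    """Plain-Python matrix product for small nested lists."""
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*b)] for row in a]


# Toy activations: input channel 1 is an outlier channel.
X = [[0.1, 50.0], [-0.2, -40.0]]
# Toy weights: row j of W multiplies input channel j.
W = [[0.5, -0.3], [0.02, 0.01]]

alpha = 0.5  # migration strength (assumed value)

# Per-channel scale: s_j = max|X[:, j]|^alpha / max|W[j, :]|^(1 - alpha)
s = []
for j in range(len(W)):
    act_max = max(abs(row[j]) for row in X)
    w_max = max(abs(w) for w in W[j])
    s.append(act_max ** alpha / w_max ** (1 - alpha))

X_smooth = [[x / s[j] for j, x in enumerate(row)] for row in X]   # X . diag(s)^-1
W_smooth = [[w * s[j] for w in row] for j, row in enumerate(W)]   # diag(s) . W

Y = matmul(X, W)
Y_smooth = matmul(X_smooth, W_smooth)

print("largest |activation| before:", max(abs(x) for row in X for x in row))
print("largest |activation| after: ", max(abs(x) for row in X_smooth for x in row))
print("outputs match:", all(abs(a - b) < 1e-9 for ra, rb in zip(Y, Y_smooth) for a, b in zip(ra, rb)))
```

The product is unchanged, but the activation range shrinks sharply while the weight range grows modestly, which is the trade the method is making.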

The outlier problem, quantified

ICQuant reveals the structure of the problem in a way I find unusually clean: the top 5% of weight outliers consume about 50% of the total value range — meaning one full quantization bit gets wasted on just 5% of the weights. About 97% of weight channels have uniformly-distributed outlier positions (verified across Llama2/3/4 and Qwen2.5 families), which enables a per-channel partitioning strategy: separate codebooks for outliers and inliers, combined with index coding that costs ≈0.3 bits/weight vs. ≈1 bit for prior approaches.

The production baseline and the frontier

FP8 (E4M3) on NVIDIA H100/B200 is the modern production baseline — essentially lossless 50% memory reduction from FP16. 4-bit PTQ (AWQ, GPTQ) achieves virtually lossless quantization for models above 70B parameters. QuIP/QuIP# pushes to 2 bits by multiplying weight and Hessian matrices with randomized Hadamard transforms to make entries approximately i.i.d. Gaussian, enabling E8 lattice codebook quantization.

At the extreme frontier: LittleBit reaches 0.1 bits/weight through latent factorization; iFairy uses complex numbers {±1, ±i} for 2-bit “multiplication-free” inference via sign flips.

Edge and VLM-specific quantization

Edge deployment demands specialized methods. Q-VLM minimizes cross-layer dependency errors in LVLMs using activation entropy as a proxy. MBQ accounts for differential sensitivity between vision and language tokens, achieving up to 1.4× decoding speedup with a custom W3 kernel. P4Q introduces learnable prompts and a lightweight low-bit adapter to realign post-quantization feature distributions.

KV-cache quantization deserves separate mention. In PaLM-540B with batch size 512 and context length 2048, the KV cache alone needs 3TB — three times the model parameters. KIVI-style KV-cache quantization is now table-stakes for long-context serving.

2.2 Pruning: three strategies, three hardware outcomes

Pruning’s real-world impact depends entirely on the pattern, because hardware can only exploit certain sparsity structures:

Pruning type	What’s removed	Hardware speedup	Examples
Unstructured	Individual weights	None without sparse kernels	SparseGPT, Wanda
Semi-structured	Fixed patterns (2:4, 4:8)	Yes on NVIDIA Ampere+	SparseGPT 2:4, Wanda N:M
Structured	Whole layers/heads/channels	Yes on commodity hardware	LLM-Pruner, NIRVANA, UKMP

The key insight — and the one I keep coming back to — is that unstructured sparsity achieves the best accuracy but delivers zero speedup without special hardware. Structured pruning physically reduces matrix dimensions — immediate gains on any hardware, at higher accuracy cost. Semi-structured 2:4 sparsity is NVIDIA’s compromise: hardware-supported on Ampere GPUs, but one-shot methods like SparseGPT and Wanda still suffer at 60-80% sparsity or with tight 2:4 constraints.
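To make the 2:4 pattern concrete: every group of four consecutive weights keeps exactly two nonzeros. The sketch below picks survivors by magnitude, which is an assumption for illustration; SparseGPT and Wanda use their own saliency scores to decide which two to keep.

```python
def prune_2_of_4(row):
    """Zero the two smallest-magnitude weights in every group of four."""
    if len(row) % 4 != 0:
        raise ValueError("row length must be a multiple of 4")
    out = []
    for start in range(0, len(row), 4):
        group = row[start:start + 4]
        # Indices of the two largest-magnitude weights in this group.
        keep = sorted(range(4), key=lambda i: abs(group[i]), reverse=True)[:2]
        out += [w if i in keep else 0.0 for i, w in enumerate(group)]
    return out


row = [0.9, -0.1, 0.05, -0.7, 0.2, 0.3, -0.4, 0.1]
pruned = prune_2_of_4(row)
print(pruned)
```

The constraint is local: a group with four important weights still loses two, which is one reason tight 2:4 patterns cost more accuracy than unstructured sparsity at the same 50% ratio.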

Beyond uniform sparsity: per-dimension pruning

A critical limitation of prior methods is uniform sparsity within layers: all output dimensions of a weight matrix get the same pruning ratio. TRIM demonstrates this is deeply suboptimal, because individual output dimensions differ significantly in sensitivity. By assigning unique per-row sparsity ratios via iterative metric-driven adjustment, TRIM reduces OPT-13B perplexity at 80% sparsity from 6461 (Wanda-based OWL) to 324, roughly a 95% reduction in perplexity at the same sparsity level.

NIRVANA redesigns structured pruning by combining magnitude-scaled gradient saliency ($|\partial f / \partial W \cdot W|$) with Adam-based NTK stability guarantees. The dual criterion balances output preservation with training stability — Proposition 4.1 proves $|\hat{\Theta} - \Theta| \leq O(\varepsilon)$ under the SignGD kernel.

Key design choices:

Adaptive sparsity allocation: parameter $\gamma$ controls MLP vs. attention pruning rates ($v_{\text{MLP}} = \gamma \cdot v_{\text{Attn}}$)
Hardware-aware dimension alignment: all hidden dimensions forced to multiples of 8 for Tensor Core compatibility
Global joint ranking across all layers/modules with a safeguard retaining ≥1 unit per layer to prevent collapse

At 50% sparsity, NIRVANA achieves WikiText2 perplexity (PPL) of 48.94 vs. 215.94 for LLM-Pruner on Llama3.1-8B. Ablation reveals that magnitude-based scoring alone causes extreme collapse (PPL ≈ 10⁵–10⁶), and removing adaptive allocation $\gamma$ raises PPL from 48.94 to 102.00.

FastForward Pruning reformulates sparsity allocation as a single-step RL problem. The RL state is defined solely by the global target sparsity $\sigma_t$ (enabling transfer learning), with a ratio-based reward function ($PPL_{\text{dense}} / PPL_{\text{pruned}}$) that is scale-invariant for portability across model sizes. Results: 3.4× faster than EAS (6.13 vs. 23.6 GPU-hr) on LLaMA-V1 7B at 20% sparsity, with better PPL (6.64 vs. 6.89).

VLM-specific structured pruning: UKMP

Text-only pruning methods fail on LVLMs because they treat the language backbone in isolation, ignoring the vision-language interface. UKMP (AAAI 2025) introduces the first unified structured pruning framework purpose-built for LVLMs.

The UKMI metric combines three innovations:

Adaptive dual normalization: block-wise normalization (by parameter volume) prevents large modules from dominating; modality-wise normalization balances vision and language components
First-order gradient saliency: UKMP discards the second-order Fisher term because the convergence assumption of second-order derivatives is invalid when parameters are frozen — they retain first-order gradients
Angle distribution entropy: entropy over 100 cosine bins weights the Taylor importance, penalizing parameters whose removal would cause large angular shifts in feature space

Recovery uses a weight recalling module: a low-rank $P_2 Q_2 W^p$ transformation parallel to LoRA, trained through three-phase progressive distillation (vision-only MSE → vision+language MSE → task loss + KL). This module is reparameterizable: it folds into base weights after training at no inference cost.
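Reparameterization is worth seeing once in numbers. The sketch below assumes the effective weight is $W^p + P_2 Q_2 W^p$, which is my reading of the description rather than a confirmed detail of UKMP; the matrices are tiny hypothetical examples.

```python
def matmul(a, b):
    """Plain-Python matrix product of two lists of rows."""
    return [[sum(a[i][k] * b[k][j] for k in range(len(b)))
             for j in range(len(b[0]))] for i in range(len(a))]


def add(a, b):
    """Element-wise sum of two matrices of the same shape."""
    return [[x + y for x, y in zip(ra, rb)] for ra, rb in zip(a, b)]


W_p = [[1.0, 2.0, 0.0], [0.0, 1.0, 3.0]]  # pruned weight, 2x3 (out x in)
P = [[0.5], [-1.0]]                        # 2x1
Q = [[2.0, 1.0]]                           # 1x2, so P @ Q is a rank-1 2x2 map
x = [[1.0], [2.0], [3.0]]                  # input column, 3x1

# Training-time view: base path plus a parallel low-rank path.
base = matmul(W_p, x)
y_parallel = add(base, matmul(P, matmul(Q, base)))

# Deployment view: fold the correction into one weight matrix.
W_folded = add(W_p, matmul(P, matmul(Q, W_p)))
y_folded = matmul(W_folded, x)
```

Both paths produce the same output, and `W_folded` has the same shape as `W_p`, which is why the module adds nothing at inference time once it is merged.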

At 50% pruning, UKMP achieves 47.81% VQAv2 accuracy (vs. 36.40% next-best) and 96.92 NoCaps CIDEr (vs. 85.51). Even at 20% pruning, the pruned BLIP-2 beats similarly-sized full BLIP-2 on OK-VQA and GQA.

2.3 Distillation: transfer without the baggage

Knowledge distillation trains a smaller student model to mimic a larger teacher. The three challenges: what knowledge to transfer, which algorithm to use, and how to design the student-teacher pair.

White-box distillation using KL-divergence at high temperature ($\tau = 4.0$) reveals the teacher’s confidence across the full vocabulary, enabling finer-grained transfer than black-box methods relying on text outputs alone.
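A minimal sketch of the temperature-softened KL loss, in plain Python so the arithmetic is visible. The logits are hypothetical, and the $\tau^2$ scaling is the conventional correction from the original distillation formulation rather than something stated in this section.

```python
import math


def softmax(logits, tau=1.0):
    """Softmax over logits divided by temperature tau."""
    scaled = [z / tau for z in logits]
    m = max(scaled)  # subtract the max for numerical stability
    exps = [math.exp(s - m) for s in scaled]
    total = sum(exps)
    return [e / total for e in exps]


def kd_kl(teacher_logits, student_logits, tau=4.0):
    """KL(teacher || student) on temperature-softened distributions.

    The tau**2 factor keeps gradient magnitudes comparable across temperatures.
    """
    p = softmax(teacher_logits, tau)
    q = softmax(student_logits, tau)
    return tau ** 2 * sum(pi * math.log(pi / qi) for pi, qi in zip(p, q))


teacher = [8.0, 2.0, 0.0]  # hypothetical logits over a 3-token vocabulary
student = [1.0, 1.0, 1.0]

sharp = softmax(teacher, tau=1.0)  # nearly one-hot
soft = softmax(teacher, tau=4.0)   # exposes the teacher's ranking of the other tokens
loss = kd_kl(teacher, student)
```

At $\tau = 1$ the top token takes almost all the probability mass; at $\tau = 4$ the second and third tokens become visible to the student, which is the extra signal a text-only black-box method cannot see.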

Curriculum distillation with selective reflection

SRD (Selective Reflection Distillation) demonstrates that not all training samples contribute equally — and that curriculum ordering matters. Easy-to-hard curriculum significantly outperforms reverse hard-to-easy ordering. An increasing temperature schedule ($\tau_0 = 1 \to \tau_n = 2$) is a key effectiveness driver; reversing it severely degrades results.
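The two ingredients, ordering and temperature schedule, can be sketched in a few lines. The linear schedule shape and the per-sample difficulty scores below are my assumptions; SRD's actual difficulty measure and schedule are not specified in this section.

```python
def temperature_schedule(step, total_steps, tau_start=1.0, tau_end=2.0):
    """Linearly increase the distillation temperature over training."""
    if total_steps <= 1:
        return tau_end
    frac = step / (total_steps - 1)
    return tau_start + frac * (tau_end - tau_start)


def easy_to_hard(samples, difficulty):
    """Order samples by ascending difficulty score."""
    return sorted(samples, key=difficulty)


# Hypothetical (sample id, difficulty) pairs.
samples = [("c", 0.9), ("a", 0.1), ("b", 0.4)]
ordered = easy_to_hard(samples, difficulty=lambda s: s[1])
taus = [temperature_schedule(i, len(ordered)) for i in range(len(ordered))]
```

Here the easiest sample is seen first at $\tau = 1.0$ and the hardest last at $\tau = 2.0$. Reversing either the ordering or the schedule gives the configurations the paragraph above reports as worse.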

SRD achieves up to 39% training time reduction while using 75% of data, and consistently improves ROUGE-L by 3.92–15.53% across all 7 tested KD methods on 5 benchmarks. It is plug-and-play — no changes to model architectures, loss functions, or KD algorithms. It even enables distilled students to surpass teacher performance (26.07 vs. 25.15 ROUGE-L for OpenLLaMA2).

VLM-specific distillation

VLMs present unique challenges because cross-modal alignment must be preserved.

Switch-KD (CVPR 2026) unifies vision-language knowledge transfer within a shared text-probability space. The Visual-Switch Distillation pathway switches student visual outputs into the teacher’s language pathway ($S\text{-ViT} \to T\text{-Projector} \to T\text{-LLM}$), producing visual-switch logits that represent the teacher’s output distribution conditioned on student-encoded visual representations. This is supervised by DBiLD loss, which uses the Kneedle algorithm for adaptive top-k boundary detection and bidirectional reverse KL alignment on pairwise logit differences — outperforming forward KL by 0.5 points.

Switch-KD-0.5B achieves +3.6 Avg10 over TinyLLaVA-0.5B across 10 multimodal benchmarks and matches the 3B teacher with half the parameters. However, it requires feature-space and vocabulary consistency between teacher and student.

Align-KD rests on a critical architectural finding: cross-modal alignment in VLMs occurs primarily at the first attention layer’s text-query-vision component ($A_{1, t \leftarrow v}$). Distilling only this targeted attention map achieves the same performance as distilling all maps while saving up to 50% computation. Distilling the wrong component is harmful: vision-query-vision attention KD collapses performance to 43.7 (vs. 64.4 baseline).
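To show what "distilling only the text-query-vision component" means mechanically, here is a small sketch. It assumes teacher and student expose first-layer attention maps of the same shape with the same token layout, and it uses MSE as the matching loss; both are simplifications of mine, not Align-KD's exact procedure.

```python
def text_query_vision_block(attn, text_idx, vision_idx):
    """Slice rows for text queries and columns for vision keys."""
    return [[attn[q][k] for k in vision_idx] for q in text_idx]


def mse(a, b):
    """Mean squared error between two equally shaped matrices."""
    diffs = [(x - y) ** 2 for ra, rb in zip(a, b) for x, y in zip(ra, rb)]
    return sum(diffs) / len(diffs)


# Hypothetical 4-token sequence: two vision tokens followed by two text tokens.
vision_idx, text_idx = [0, 1], [2, 3]

teacher_attn = [
    [0.5, 0.5, 0.0, 0.0],
    [0.5, 0.5, 0.0, 0.0],
    [0.4, 0.4, 0.2, 0.0],
    [0.3, 0.3, 0.2, 0.2],
]
student_attn = [
    [0.5, 0.5, 0.0, 0.0],
    [0.5, 0.5, 0.0, 0.0],
    [0.2, 0.6, 0.2, 0.0],
    [0.3, 0.3, 0.2, 0.2],
]

t_block = text_query_vision_block(teacher_attn, text_idx, vision_idx)
s_block = text_query_vision_block(student_attn, text_idx, vision_idx)
align_loss = mse(t_block, s_block)
```

Only a 2×2 corner of each 4×4 map enters the loss, which is where the compute saving comes from when the vision token count is large.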

Bridging black-box and white-box distillation

The strongest teachers (GPT-4, proprietary models) are black-box — only text outputs available via API. White-box KD requires internal parameters. GrayKD (AAAI 2026) bridges this with a single-stage framework using no proxy teacher. Black-box rationales are injected through a lightweight cross-attention module — student hidden states as queries, rationale embeddings as keys/values, with 15% random masking for augmentation.
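The injection step can be sketched as single-head cross-attention. This is a bare-bones illustration under my own assumptions: there are no learned projections, and the 15% masking is applied to rationale tokens, which is one plausible reading of the description rather than a confirmed detail of GrayKD.

```python
import math
import random


def cross_attention(queries, keys, values, keep_mask):
    """Single-head scaled dot-product attention over the unmasked keys."""
    d = len(queries[0])
    kept_keys = [k for k, keep in zip(keys, keep_mask) if keep]
    kept_values = [v for v, keep in zip(values, keep_mask) if keep]
    out = []
    for q in queries:
        scores = [sum(qi * ki for qi, ki in zip(q, k)) / math.sqrt(d)
                  for k in kept_keys]
        m = max(scores)
        weights = [math.exp(s - m) for s in scores]
        z = sum(weights)
        out.append([sum(w / z * v[j] for w, v in zip(weights, kept_values))
                    for j in range(len(kept_values[0]))])
    return out


def random_keep_mask(n_tokens, mask_prob=0.15, rng=random):
    """True means the rationale token stays visible."""
    mask = [rng.random() >= mask_prob for _ in range(n_tokens)]
    if not any(mask):  # never hide every rationale token
        mask[0] = True
    return mask


# Hypothetical shapes: 1 student token, 2 rationale tokens, width 2.
student_hidden = [[1.0, 0.0]]
rationale_emb = [[1.0, 0.0], [0.0, 1.0]]
rationale_val = [[1.0, 2.0], [3.0, 4.0]]

mixed = cross_attention(student_hidden, rationale_emb, rationale_val, [True, True])
masked = cross_attention(student_hidden, rationale_emb, rationale_val, [True, False])
```

With the second rationale token masked, the student query can only read the first one, so `masked` equals that token's value vector exactly.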

The efficiency gain is dramatic: GrayKD uses 610M parameters total vs. 2.06B for conventional KD pipelines. GrayKD Triple achieves 27.64 Avg ROUGE-L, beating PromptKD + White Teacher (26.44), despite using the same black-box GPT-4o-mini teacher as lower-scoring methods. Rationale diversity is the dominant factor: switching from multi-rationale to single-rationale reuse drops ROUGE-L by 1.14 points.

2.4 Token compression: compressing the input, not the model

Token compression operates upstream of the three traditional pillars: instead of compressing model weights, it compresses the input. Approaches are categorized by modality (image/video/audio) and mechanism (transformation-based, similarity-based, attention-based, query-based). The key advantage: token compression is post-optimization, requiring no retraining.

I find this category theoretically elegant but practically limited — it only helps when tokens dominate the compute budget, which is true for video and long-context multimodal tasks but less so for standard image+text inference.

2.5 NAS for compression

CompressNAS treats Tucker rank selection as a global search problem, using an MSE-based accuracy proxy comparing decomposed vs. reference layer feature vectors. Existing zero-cost proxies (NASWOT, GraSP, SNIP, ZiCo) fail to follow a monotonic trend at higher ranks. CompressNAS builds two lookup tables ($\Delta\text{acc}$, $\Delta\text{flash}$) and uses ILP-based NAS to select ranks globally given a hardware budget, reaching 8× compression of ResNet-18 on ImageNet with <4% accuracy drop.
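The lookup-table formulation is compact enough to sketch. The version below replaces the ILP solver with exhaustive search, which only works for toy sizes, and every number in the tables is made up. I also assume $\Delta\text{flash}$ means flash memory saved per choice.

```python
from itertools import product

# Hypothetical lookup tables: layer -> rank -> (accuracy drop, flash KB saved).
tables = {
    "conv1": {8: (0.9, 40), 16: (0.3, 25), 32: (0.0, 0)},
    "conv2": {8: (1.5, 90), 16: (0.6, 60), 32: (0.0, 0)},
}


def select_ranks(tables, min_saving):
    """Exhaustive stand-in for the ILP.

    Minimises the summed accuracy drop subject to a total flash-saving
    constraint. Returns (total_drop, {layer: rank}) or None if infeasible.
    """
    layers = list(tables)
    best = None
    for choice in product(*(tables[layer] for layer in layers)):
        drop = sum(tables[l][r][0] for l, r in zip(layers, choice))
        saved = sum(tables[l][r][1] for l, r in zip(layers, choice))
        if saved >= min_saving and (best is None or drop < best[0]):
            best = (drop, dict(zip(layers, choice)))
    return best


best = select_ranks(tables, min_saving=100)
```

The point of the global formulation is visible even at this size: the cheapest feasible plan compresses `conv1` hard and `conv2` moderately, a combination a per-layer greedy rule would not necessarily find.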

LLM-NAS solves a problem I hadn’t considered: LLM-driven architecture search exhibits exploration bias, repeatedly proposing designs within a narrow region of the search space. The fix is three innovations:

Complexity-driven partitioning into 6 disjoint niches defined by architectural complexity (nor_conv_3×3 count)
LLM-powered prompt co-evolution — prompts and architectures co-evolve across rounds
XGBoost zero-cost predictor aggregating 13 proxy metrics with Spearman correlation ~0.90 to ground truth

Search takes 3 minutes and 120 API calls vs. 2–17 GPU-days for supernet baselines. Removing partitioning drops hypervolume from 0.978 to 0.516. Removing the LLM entirely drops it to 0.843.

3. The P-KD-Q ordering: sequence matters

A systematic study on Qwen2.5-3B shows that compression ordering is not a detail — it determines whether the pipeline works at all.

Sequence	Compression	G-Eval	PPL	Verdict
P-KD-Q	3.68×	0.733	5.048	Best
KD-P-Q	3.68×	0.644	—	Intermediate
P-Q-KD	3.68×	0.610	—	Intermediate
KD-Q-P	3.68×	—	53.4	Collapse
Q-P-KD	3.68×	0.060	34.5	Near-zero
Q-KD-P	3.68×	0.080	24.1	Near-zero

The mechanism is specific and instructive, and worth pausing on because it explains an entire class of pipeline failures: NF4 quantization produces inference-only models incompatible with gradient-based training. Both quantize-first sequences score near zero and KD-Q-P collapses; P-Q-KD, where only distillation follows quantization, survives at 0.610 but still trails P-KD-Q. The P-KD-Q sequence lets each step compound: pruning reduces the search space, distillation transfers knowledge to the pruned architecture, quantization reduces precision with minimal added loss.

A practical note: quantization alone achieves 3.00× compression (5886→1959 MB). Adding pruning and distillation adds only 0.68× more (to 3.68×) at significant complexity cost. For many use cases, quantization alone is the right answer.
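The arithmetic behind that note, using only the figures quoted above. The implied size of the full pipeline is derived from the reported 3.68× ratio, not measured.

```python
dense_mb = 5886
quantized_mb = 1959

q_only_ratio = dense_mb / quantized_mb    # quantization alone
full_ratio = 3.68                         # reported for P-KD-Q
full_pipeline_mb = dense_mb / full_ratio  # implied size after P-KD-Q
extra_saving_mb = quantized_mb - full_pipeline_mb

print(f"quantization only: {q_only_ratio:.2f}x")
print(f"implied P-KD-Q size: {full_pipeline_mb:.0f} MB")
print(f"extra saving from P + KD: {extra_saving_mb:.0f} MB")
```

Pruning and distillation together buy roughly 360 MB beyond what quantization already delivers, which is the number to weigh against the added pipeline complexity.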

4. Where compression fails

4.1 The alignment cliff in VLMs

VLM compression has a failure mode absent in text-only LLMs. At low compression ratios, structural pruning damages multimodal alignment (vision ↔ language) more than the language backbone; at high ratios, both degrade. This means for mild compression, fine-tuning only the multimodal projector is sufficient — you are repairing the alignment bridge, not the entire model.

UKMP addresses this directly through modality-wise adaptive normalization and its weight recalling module’s progressive three-phase distillation. Text-only importance metrics (magnitude, gradient) cannot detect which parameters mediate the vision-language interface. The convergence assumption of Fisher information is also invalid for VLMs: frozen parameters retain first-order gradients, making second-order importance estimates actively misleading.

4.2 Extreme sparsity collapse

One-shot pruning methods degrade severely at 60-80% sparsity with semi-structured patterns. NIRVANA’s ablation shows magnitude-based scoring alone causes PPL ≈ 10⁵–10⁶ at 50% sparsity. Attention-only pruning causes catastrophic collapse; joint pruning of attention and MLP yields the smoothest degradation.

4.3 Early quantization destroys trainability

Applying NF4 quantization before any other technique destroys trainability. Q-KD-P and Q-P-KD sequences achieve near-zero G-Eval scores (0.080, 0.060). The gradient-free nature of NF4-quantized models means they cannot participate in subsequent distillation or pruning recovery.

4.4 Layer sensitivity isn’t uniform

In partial 2:4 sparsification, later layers are more sensitive than earlier ones — skipping the last third of the model yields the best accuracy. For LVLMs, widthwise pruning of attention heads and MLP neurons outperforms wholesale layer removal. And within a single layer, individual output dimensions differ dramatically in sensitivity.
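A minimal sketch of partial 2:4 sparsification, assuming magnitude as the selection criterion and row lengths divisible by 4. The layer contents are hypothetical, and real implementations operate on tensors rather than nested lists.

```python
def mask_2_of_4(row):
    """Keep the 2 largest-magnitude weights in every group of 4."""
    out = []
    for start in range(0, len(row), 4):
        group = row[start:start + 4]
        keep = sorted(range(len(group)), key=lambda j: abs(group[j]),
                      reverse=True)[:2]
        out.extend(w if j in keep else 0.0 for j, w in enumerate(group))
    return out


def sparsify_partial(layers, skip_fraction=1 / 3):
    """Apply 2:4 sparsity to early layers; leave the last skip_fraction dense."""
    n_sparse = len(layers) - round(len(layers) * skip_fraction)
    return [
        [mask_2_of_4(row) for row in layer] if i < n_sparse else layer
        for i, layer in enumerate(layers)
    ]


# Three hypothetical layers, each a 1x4 weight matrix.
layers = [[[0.1, -0.9, 0.5, 0.05]] for _ in range(3)]
result = sparsify_partial(layers)
```

The first two layers end up with exactly two non-zeros per group of four, and the final layer is returned untouched.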

5. How hardware changes everything

5.1 The hardware-taxonomy mismatch

The compression technique that looks best on paper often delivers zero real-world speedup. Models optimized for GPU do not run fast on CPU and mobile, and vice versa.

If your target is…	Prefer…	Avoid…
Datacenter GPU (A100/H100)	Semi-structured 2:4 + quantization	Pure unstructured
Edge/CPU/Mobile	Structured pruning (widthwise)	Any unstructured or semi-structured
Long-context serving	KV-cache quantization	—
High compression (≤45%)	Structured + distillation recovery	One-shot pruning alone

5.2 Memory bandwidth is the real bottleneck

During autoregressive decode, each token generation requires loading the entire model from memory — a classic memory-bandwidth-bound operation. This explains why quantization helps more than pruning for decode latency (smaller weights mean less data movement), why KV-cache quantization becomes critical at long contexts, and why joint algorithm-hardware optimization is the only path to order-of-magnitude gains.
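A back-of-the-envelope bound makes the bandwidth argument concrete. The device bandwidth and model size below are hypothetical round numbers, and the estimate ignores KV-cache traffic, activations, and any overlap, so treat it as a ceiling rather than a prediction.

```python
def decode_tokens_per_second(weight_bytes, bandwidth_bytes_per_s):
    """Upper bound on single-stream decode speed when every token
    requires streaming all active weights from memory once."""
    return bandwidth_bytes_per_s / weight_bytes


GB = 1e9
bandwidth = 1000 * GB  # hypothetical 1 TB/s device
params = 7e9           # hypothetical 7B dense model

fp16 = decode_tokens_per_second(params * 2.0, bandwidth)  # 2 bytes per weight
int4 = decode_tokens_per_second(params * 0.5, bandwidth)  # 0.5 bytes per weight

print(f"FP16 ceiling: {fp16:.0f} tok/s, 4-bit ceiling: {int4:.0f} tok/s")
```

Cutting bytes per weight by 4× raises the ceiling by 4×, regardless of how many FLOPs the device has. Pruning half the weights in an unstructured pattern moves this bound only if the storage format actually shrinks.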

The Titanus accelerator takes this to the extreme: chiplet-based digital computing-in-memory stores all static weights on-chip, eliminating repeated weight reloading during decode — a 39.4× reduction in off-chip memory access.

5.3 The edge reality

CLIP-B/16 at 149.6M parameters already exceeds Jetson Nano’s 4GB RAM (no dedicated GPU), causing frequent memory swaps that kill real-time performance. Edge deployment demands the full toolbox: pre-deployment compression, efficient fine-tuning, runtime optimization, and careful security/privacy handling.

5.4 A practical decision flow

For compressing an existing LVLM:

Extremely low resources, no recovery training: Widthwise pruning only. Accept accuracy loss.
Moderate compression (≤30%): Layerwise pruning + multimodal projector fine-tuning (5% of original data suffices).
High compression (≤45%): Widthwise pruning + supervised fine-tuning + hidden-state distillation.
For any combination: 4-bit quantization adds roughly 0.1 PPL on top of sparsity.
Always: P-KD-Q ordering. Never quantize before training.
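The decision flow above can be written as a small lookup function. The function name, the argument names, and the step labels are hypothetical; the thresholds are the ones listed, and quantization is always appended last to respect the P-KD-Q ordering.

```python
def recommend(compression, can_finetune, quantize=True):
    """Map a target compression ratio (fraction removed) to a recipe.

    Returns the ordered list of steps from the decision flow.
    """
    if not can_finetune:
        steps = ["widthwise pruning"]
    elif compression <= 0.30:
        steps = ["layerwise pruning", "multimodal projector fine-tuning"]
    elif compression <= 0.45:
        steps = ["widthwise pruning", "supervised fine-tuning",
                 "hidden-state distillation"]
    else:
        raise ValueError("target is beyond the range this decision flow covers")
    if quantize:
        steps.append("4-bit quantization")  # always last: never before training
    return steps


plan = recommend(0.40, can_finetune=True)
```

For a 40% target with fine-tuning available, `plan` lists widthwise pruning, supervised fine-tuning, hidden-state distillation, and then quantization as the final step.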

6. Where we are in 2025

The compression field has matured from experimental techniques into an engineering discipline with clear tiers:

Tier	Category	Examples	Status
Production	Foundational pruning	SparseGPT, Wanda, LLM-Pruner	Deployed
Production	4-bit quantization	AWQ, GPTQ, NF4 (QLoRA)	Deployed
Production	Inference engines	vLLM (PagedAttention), TensorRT-LLM	Deployed
Production	FP8 baseline	H100/B200 hardware-native	Deployed
Experimental	Extreme pruning	TRIM, NIRVANA, FastForward	Active research
Experimental	VLM-specific pruning	UKMP (UKMI + weight recalling)	Active research
Experimental	Ultra-low-bit quant	LittleBit (0.1-bit), ICQuant, iFairy	Active research
Experimental	Curriculum distillation	SRD	Active research
Experimental	VLM distillation	Switch-KD, Align-KD	Active research
Experimental	Black-box KD	GrayKD (610M params, no proxy)	Active research
Experimental	NAS for compression	CompressNAS, LLM-NAS, HAT	Active research

I expect several of the experimental rows to move to production within 12-18 months. TRIM-style per-dimension pruning and ICQuant-style index coding are both conceptually simple enough to integrate into existing pipelines. UKMP’s modality-aware pruning is clearly the right approach for VLMs — the question is whether it generalizes beyond BLIP-2 to LLaVA-style architectures.

7. A unifying mental model

Compression works because networks use fewer bits of information than they allocate parameters. The art is knowing which parameters carry that information. The answer depends on four things:

What redundancy exists: token-level, weight-level, layer-level — and modality-dependent. Video has spatiotemporal redundancy. MLLM sequences are >80% multimodal tokens that receive minimal attention. Within a single weight matrix, individual rows differ in sensitivity by orders of magnitude.
How you remove it: quantize, prune, or distill. Quantization targets precision. Pruning targets structure. Distillation targets knowledge transfer. Token compression targets the input directly.
What order you apply techniques: P-KD-Q is empirically optimal. Sequences that quantize before the remaining training steps lose quality, and the quantize-first ones fail catastrophically. This is not a heuristic; it follows from NF4’s gradient-free nature.
What hardware runs it: This is what determines whether a 50% parameter reduction translates to a 50% latency reduction or no reduction at all. Unstructured sparsity wins on accuracy but loses on every hardware metric. Structured pruning is the opposite. Semi-structured 2:4 is NVIDIA’s compromise.

What I find striking about the failure modes is how cleanly they carve the parameter space. In VLMs, the vision-language alignment is more fragile than the language backbone. In deep transformers, later layers carry disproportionate importance, and individual output dimensions within the same layer differ dramatically. Hardware is not an implementation detail — it defines which removal patterns become faster.

The frontier is hybrid, sequential, and precision-extreme: combining pruning, distillation, and quantization in the right order, with per-dimension granularity, pushing quantization to fractions of a bit — while ensuring the pipeline remains trainable throughout.

Boundary conditions

This model assumes the pretrained model is available. If you are training from scratch with a compression target, the entire framework shifts — you would design the architecture sparse from the start rather than compressing post-hoc.
The P-KD-Q ordering evidence comes from a single systematic study on Qwen2.5-3B. I have not seen replication on larger models or different architectures. The mechanism (NF4 gradient-free) is general, but the magnitude of the ordering effect at other scales is unknown.
UKMP has been validated primarily on BLIP-2. Its generalization to LLaVA, InternVL, or other VLM architectures is an open question.
Token compression is effective when tokens dominate compute (long video, long context). For single-image QA, the gains are modest.
The practical decision flow (Section 5.4) assumes access to fine-tuning resources. For truly zero-shot deployment, only quantization and token compression apply.
I have not covered training-time efficiency (mixed precision, gradient accumulation, ZeRO, FSDP), which interacts with compression in deployment pipelines but is a separate topic.

Open questions

Does the P-KD-Q ordering effect replicate on models above 70B parameters? The mechanism is architecture-agnostic, but the magnitude could scale differently.
Why do some layers tolerate 4-bit quantization while structurally similar layers fall apart at 6-bit? I suspect effective rank or singular value distribution, but I have not tested this.
Can UKMP’s modality-aware importance metric be adapted to video-language models, where the redundancy patterns are spatiotemporally structured rather than spatially structured?
What is the minimum viable recovery data for structured pruning of LVLMs at 50%+ sparsity? The current evidence says 5% of original data for moderate compression and full SFT for high compression, but the boundary between these regimes is fuzzy.
At what point does token compression become more practical than model compression for video understanding? The 54M-token number for 90-minute video suggests the crossover exists but I do not know where.

Six months from now: has UKMP been extended to LLaVA-style architectures, and does the modality-aware importance metric generalize beyond BLIP-2?

References

Frankle & Carbin, “The Lottery Ticket Hypothesis: Finding Sparse, Trainable Neural Networks”, ICLR 2019.
Xiao et al., “SmoothQuant: Accurate and Efficient Post-Training Quantization for Large Language Models”, ICML 2023.
Frantar & Alistarh, “SparseGPT: Massive Language Models Can Be Accurately Pruned in One-Shot”, ICML 2023.
Sun et al., “Wanda: A Simple and Effective Pruning Approach for Large Language Models”, ICLR 2024.
Ashkboos et al., “SliceGPT: Compress Large Language Models by Deleting Rows and Columns”, ICLR 2024.
Ma et al., “LLM-Pruner: On the Structural Pruning of Large Language Models”, NeurIPS 2023.
NIRVANA: “NIRVANA: Neural Implicit Removal via Verifiable Adam-based NTK Alignment for Structured Pruning”, 2025.
TRIM: “TRIM: Per-Dimension Structured Pruning for Large Language Models”, 2025.
FastForward: “FastForward Pruning: Efficient LLM Pruning via Single-Step Reinforcement Learning”, 2025.
UKMP: “UKMP: Unified Knowledge Maintenance Pruning for Vision-Language Models”, AAAI 2025.
ICQuant: “ICQuant: Index Coding Quantization for Large Language Models”, 2025.
Switch-KD: “Switch-KD: Knowledge Distillation with Visual Switch for Efficient Vision-Language Models”, CVPR 2026.
Align-KD: “Align-KD: Shallow-Layer Attention Alignment for Mobile Vision-Language Models”, 2025.
GrayKD: “GrayKD: Gray-Box Knowledge Distillation for Large Language Models”, AAAI 2026.
SRD: “Selective Reflection Distillation: Curriculum Knowledge Distillation for LLMs”, 2025.
CompressNAS: “CompressNAS: Neural Architecture Search for Model Compression”, 2025.
LLM-NAS: “LLM-NAS: Large Language Models for Neural Architecture Search”, 2025.
Compression Ordering: “A Systematic Study of Compression Ordering for Large Language Models”, 2025.
Multimodal Token Compression Survey, arXiv 2507.20198, 2025.
Efficient VLM Survey: “Efficient Vision-Language Models: A Survey”, 2025.

About

A former physicist turned machine learning engineer, I have a passion for learning and sharing knowledge. With a background in physics, I bring a unique perspective to software development, combining analytical thinking with creativity. I enjoy exploring new technologies and applying them to solve real-world problems. In my free time, I like to read, travel, and experiment with new frontiers in deep learning and artificial intelligence. This blog is a platform for me to share my insights, experiences, and projects in the world of software engineering and beyond. I hope to inspire others to pursue their passions and contribute to the ever-evolving field of technology.