<rss version="2.0" xmlns:atom="http://www.w3.org/2005/Atom"><channel><title>Sparsification on Feng's Blog</title><link>http://fengwang.github.io/tags/sparsification/</link><description>Recent content in Sparsification on Feng's Blog</description><generator>Hugo -- gohugo.io</generator><language>en-us</language><lastBuildDate>Mon, 11 May 2026 00:00:00 +0000</lastBuildDate><atom:link href="http://fengwang.github.io/tags/sparsification/index.xml" rel="self" type="application/rss+xml"/><item><title>MoE Expert Pruning: What Works, What Doesn't, and What We Still Don't Know</title><link>http://fengwang.github.io/posts/moe-expert-pruning/</link><pubDate>Mon, 11 May 2026 00:00:00 +0000</pubDate><guid>http://fengwang.github.io/posts/moe-expert-pruning/</guid><description>&lt;p&gt;I spent the last week reading seven papers on expert compression for Mixture-of-Experts models. I went in assuming the landscape was settled: expert pruning was a useful technique for shrinking models, expert merging was a promising alternative, and the choice between them was mostly a matter of taste. I came out with a very different picture. The pruning-vs-merging debate flipped in later 2025, and the reason it flipped tells you something fundamental about how these models actually work.&lt;/p&gt;
&lt;p&gt;The core tension is simple. Sparse Mixture-of-Experts models activate only a fraction of their parameters per token, but they still occupy all of them in memory. Mixtral 8×7B activates 2 of 8 experts per layer — yet all 8 experts (45B of 47B total parameters) must sit on the GPU. NLLB-200 has 1,536 experts; you need four 32GB GPUs just to load the thing. Expert compression asks: can we drop or combine the experts that rarely get used, and how much does it cost?&lt;/p&gt;
&lt;p&gt;The answer, across all seven papers, is surprising in two ways. First, you can drop far more experts than I expected, and the costs are concentrated in specific capabilities rather than spread evenly. Second, and more unexpectedly, &lt;strong&gt;when it comes to actual generative tasks — code, math, creative writing — expert pruning is decisively better than expert merging.&lt;/strong&gt; The merging methods that looked good on multiple-choice benchmarks collapse on tasks that require the model to actually generate tokens. The reason is structural, not empirical: merging removes the router&amp;rsquo;s fine-grained control over experts, and on generative tasks, that control matters.&lt;/p&gt;
&lt;h2 id="why-expert-pruning-works-at-all"&gt;Why expert pruning works at all&lt;/h2&gt;
&lt;p&gt;Here&amp;rsquo;s the one-sentence model: &lt;strong&gt;expert utilization in MoE models is long-tailed — a handful of experts do most of the work, and the rest are along for the ride.&lt;/strong&gt;&lt;/p&gt;
&lt;p&gt;Think of it like a restaurant kitchen during a dinner rush. You have eight chefs at eight stations. On a given night, two chefs handle 80% of the orders. The other six are standing by, occasionally contributing a garnish, mostly drawing salary. If you fire four of them and redistribute their stations, dinner still gets served — maybe even faster, because the remaining chefs spend less time coordinating. That is expert pruning.&lt;/p&gt;
&lt;p&gt;The numbers back this up. Heatmap analysis of Mixtral 8×7B on MMLU shows stark unevenness: Expert #2 in Layers 26 and 30 is heavily activated while Expert #7 in Layers 22 and 23 is barely touched [1]. The same pre-trained MoE model produces substantially different expert contribution patterns when fine-tuned on different tasks [2]. This task-specificity cuts both ways — it means pruning must be calibrated to the deployment domain, but it also means aggressive pruning is possible for narrow use cases.&lt;/p&gt;
&lt;p&gt;&lt;strong&gt;The distribution is not just long-tailed — it&amp;rsquo;s task-dependent.&lt;/strong&gt; An expert that dominates on MNLI might be silent on CoLA. This is the core insight behind every pruning strategy: you are not removing universal knowledge. You are removing specialists that your specific task does not call.&lt;/p&gt;
&lt;p&gt;Where the analogy breaks: experts are not independent chefs. The router learns to distribute tokens across experts during pre-training, and removing experts changes the routing distribution for the survivors. This is why naive pruning based on activation frequency alone performs &lt;em&gt;worse than random pruning&lt;/em&gt; [3]. The router expects a full kitchen.&lt;/p&gt;
&lt;h3 id="knowledge-redundancy-the-surprising-overcapacity-result"&gt;Knowledge redundancy: the surprising overcapacity result&lt;/h3&gt;
&lt;p&gt;Here is the finding that made me stop and re-read: pruning 4 of 8 experts in Mixtral 8×7B-Instruct &lt;em&gt;improves&lt;/em&gt; SQuAD accuracy from 53.4% to 75.4%, without updating any remaining expert parameters [4]. This is not a typo. Removing half the experts makes the model &lt;em&gt;better&lt;/em&gt; at question answering.&lt;/p&gt;
&lt;p&gt;The mechanism: pruning simplifies the routing problem. With 8 experts, the router must learn to partition the hidden space across many specialists — a hard optimization problem. With 4, the routing is easier, and the remaining experts each get a cleaner slice of the input distribution. The router stops sending ambiguous tokens to the wrong specialist.&lt;/p&gt;
&lt;p&gt;This overcapacity effect is not universal, but it recurs: on data-limited downstream tasks, a single-expert model can outperform the full multi-expert counterpart. After fine-tuning a pruned Mixtral 8×7B on MetaMathQA, the 7-expert model slightly &lt;em&gt;exceeds&lt;/em&gt; the original 8-expert model on GSM8K (81.50 vs. 81.43) [3]. A single expert in Mixtral 8×7B-Instruct operates without model collapse [4].&lt;/p&gt;
&lt;h2 id="pruning-vs-merging-the-debate-that-flipped-in-2025"&gt;Pruning vs. merging: the debate that flipped in 2025&lt;/h2&gt;
&lt;p&gt;Until mid-2025, the story seemed clear. Expert merging — clustering and averaging experts rather than discarding them — was winning. M-SMoE and HC-SMoE showed that merging outperformed pruning when measured by perplexity and multiple-choice (MC) question answering benchmarks. If you only looked at those numbers, merging was the smarter choice. Retain information from all experts. Avoid the binary brutality of pruning.&lt;/p&gt;
&lt;p&gt;Then REAP showed up and asked: what happens when you actually make these models &lt;em&gt;generate&lt;/em&gt; tokens?&lt;/p&gt;
&lt;p&gt;&lt;strong&gt;The answer is a head-on collision between the two approaches.&lt;/strong&gt; On code generation, REAP achieves a mean accuracy decrease of only 1.9% at 25% compression and 6.9% at 50% compression. Merging methods? HC-SMoE and M-SMoE degrade more than 5% at 25% and more than 20% at 50% [7]. On creative writing and mathematical reasoning, the same pattern holds. Merging is not slightly worse — it is qualitatively broken on generative tasks at 50% compression.&lt;/p&gt;
&lt;p&gt;Lasby et al. didn&amp;rsquo;t just report the numbers. They derived &lt;em&gt;why&lt;/em&gt; this has to be the case. When a router selects two experts f_i and f_j for a token, it produces a dynamic mixture r(x)·f_i(x) + (1−r(x))·f_j(x), where the mixing ratio r(x) depends on the input. After merging, the router must apply the summed gate to a constant convex combination — a static merged expert. The merged model must approximate a dynamic, input-dependent target with a static one. The resulting irreducible error is proportional to the router&amp;rsquo;s policy variability Var[r(x)] and the functional gap between the merged experts ∥Δ_ij∥ [7].&lt;/p&gt;
&lt;p&gt;Pruning doesn&amp;rsquo;t have this problem. When you prune expert j, the router still controls each surviving expert independently. Pruning only incurs error when the pruned expert was in the top-k set, and that error is proportional to its gate-value g_j — it does not penalize policy variability at all [7]. The mathematical difference is clean: pruning is a coordinate subspace operation that preserves the functional manifold&amp;rsquo;s topology. Merging introduces novel functions and collapses the manifold toward its center — by up to 100× reduction in spread in late layers of high-granularity models [7].&lt;/p&gt;
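&lt;p&gt;To make the asymmetry concrete, here is a deliberately simplified numerical sketch of my own (not from [7], which derives the top-2 case analytically). It uses hard top-1 routing over three toy experts: pruning an expert only perturbs the tokens that were routed to it, while merging two experts into a fixed average perturbs every token routed to either of them, because the router can no longer tell them apart.&lt;/p&gt;
&lt;pre&gt;&lt;code class="language-python"&gt;import numpy as np

rng = np.random.default_rng(0)
d, n_tokens = 16, 5000

# Three toy "experts" as random linear maps; i and j are similar but not identical.
W_i = rng.normal(size=(d, d)) / np.sqrt(d)
W_j = W_i + 0.5 * rng.normal(size=(d, d)) / np.sqrt(d)
W_other = rng.normal(size=(d, d)) / np.sqrt(d)

x = rng.normal(size=(n_tokens, d))
# Toy top-1 router: 45% of tokens go to expert i, 5% to j, 50% elsewhere.
route = rng.choice(["i", "j", "other"], p=[0.45, 0.05, 0.5], size=n_tokens)

def forward(x, route, drop_j=False, merge_ij=False):
    out = np.empty_like(x)
    W_m = 0.5 * (W_i + W_j)            # static merged expert: a fixed convex combination
    for name, W in [("i", W_i), ("j", W_j), ("other", W_other)]:
        mask = route == name
        if merge_ij and name in ("i", "j"):
            out[mask] = x[mask] @ W_m  # router has lost its fine-grained control over i vs. j
        elif drop_j and name == "j":
            out[mask] = x[mask] @ W_i  # pruned: j's tokens fall back to the surviving expert
        else:
            out[mask] = x[mask] @ W
    return out

y_full = forward(x, route)
mse_prune = np.mean((forward(x, route, drop_j=True) - y_full) ** 2)
mse_merge = np.mean((forward(x, route, merge_ij=True) - y_full) ** 2)
print(f"pruning error (only j's 5% of tokens touched):     {mse_prune:.4f}")
print(f"merging error (all of i's and j's tokens touched): {mse_merge:.4f}")
&lt;/code&gt;&lt;/pre&gt;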
&lt;p&gt;&lt;strong&gt;Here&amp;rsquo;s what this means in practice.&lt;/strong&gt; I can now look at a compressed model and predict failure modes based on the operation, not just the sparsity level. Merged model outputs have significantly lower N-gram diversity and their logits diverge from the original model more rapidly during auto-regressive generation [7]. The tokens drift. The model stops sounding like itself. MC benchmarks missed this entirely because they never asked the model to string tokens together — they only asked it to rank answer choices in a single forward pass.&lt;/p&gt;
&lt;p&gt;One more uncomfortable finding: when merging &lt;em&gt;does&lt;/em&gt; work well, look closer. HC-SMoE produces a high prevalence of &lt;strong&gt;singleton clusters&lt;/strong&gt; — single-expert clusters that are functionally indistinguishable from keeping the expert unmerged [7]. The &amp;ldquo;merging&amp;rdquo; that succeeds is pruning plus a few mega-clusters of the truly redundant experts. And those mega-clusters are fragile: restricting the maximum cluster size to 32 experts causes large accuracy drops [7].&lt;/p&gt;
&lt;p&gt;A separate problem compounds this. The L2-distance between clustered expert weights, even after weight-matching permutation, greatly exceeds the distance between pretrained and instruction-fine-tuned checkpoints. Singular-vector alignment remains poor [7]. Merging experts is fundamentally harder than the widely successful technique of model merging, and we should stop assuming the two are similar problems.&lt;/p&gt;
&lt;h2 id="how-to-score-expert-importance"&gt;How to score expert importance&lt;/h2&gt;
&lt;p&gt;The choice of importance criterion is the single biggest lever in expert pruning. I organize the criteria by what information they use.&lt;/p&gt;
&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Criterion&lt;/th&gt;
&lt;th&gt;Source&lt;/th&gt;
&lt;th&gt;What it measures&lt;/th&gt;
&lt;th&gt;Best result&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;Alpha score (accumulated gating weight)&lt;/td&gt;
&lt;td&gt;Chen et al. 2022 [2]&lt;/td&gt;
&lt;td&gt;Weighted contribution to output&lt;/td&gt;
&lt;td&gt;Single expert preserves 99.3% of full MoE&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Soft counting (accumulated softmax)&lt;/td&gt;
&lt;td&gt;Muzio et al. 2024 [1]&lt;/td&gt;
&lt;td&gt;Confidence margin of selection&lt;/td&gt;
&lt;td&gt;25% sparsity: 3.85 pp MMLU drop&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Min-EAN (activation norm)&lt;/td&gt;
&lt;td&gt;Jaiswal et al. 2025 [5]&lt;/td&gt;
&lt;td&gt;Minimum activation magnitude&lt;/td&gt;
&lt;td&gt;14.02 PPL at 75% sparsity&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;REAP&lt;/strong&gt; (conditional g_j∥f_j∥)&lt;/td&gt;
&lt;td&gt;Lasby et al. 2025 [7]&lt;/td&gt;
&lt;td&gt;Gate × activation, conditional&lt;/td&gt;
&lt;td&gt;Near-lossless at 50%, up to 1T params&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Importance product (top1 × exp(conf))&lt;/td&gt;
&lt;td&gt;Koishekenov et al. 2023 [6]&lt;/td&gt;
&lt;td&gt;Combined activity and confidence&lt;/td&gt;
&lt;td&gt;80% pruning, chrF++ Δ = −0.2&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Activation frequency alone&lt;/td&gt;
&lt;td&gt;Lu et al. 2024 [3]&lt;/td&gt;
&lt;td&gt;Simple token count&lt;/td&gt;
&lt;td&gt;Worse than random&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;
&lt;p&gt;REAP deserves special attention because it&amp;rsquo;s the first criterion explicitly designed to minimize the reconstruction error bound. Its saliency score computes the conditional average of g_j(x)·∥f_j(x)∥ over only those tokens where expert j is active [7]. This decouples functional impact from usage frequency — a specialist expert that activates rarely but contributes heavily when it does won&amp;rsquo;t be pruned just because it&amp;rsquo;s infrequent. Min-EAN held the previous crown among 16 criteria benchmarked by MC-Suite [5]. REAP now looks like the new baseline for generative tasks, especially at scale.&lt;/p&gt;
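&lt;p&gt;A minimal sketch of that conditional saliency, assuming you have already recorded router weights, expert outputs, and the top-k mask on a calibration set (the function names and the one-shot selection loop are mine; per-layer handling and other details in [7] may differ):&lt;/p&gt;
&lt;pre&gt;&lt;code class="language-python"&gt;import torch

def reap_style_saliency(gate_weights, expert_outputs, topk_mask):
    """Average g_j(x) * ||f_j(x)|| over only the tokens where expert j was selected.

    gate_weights:   (n_tokens, n_experts) router softmax weights
    expert_outputs: (n_tokens, n_experts, d_model) per-expert outputs f_j(x)
                    (in practice only materialized for routed tokens)
    topk_mask:      (n_tokens, n_experts) bool, True where expert j is in the top-k set
    """
    act_norm = expert_outputs.norm(dim=-1)           # ||f_j(x)||, shape (n_tokens, n_experts)
    contrib = gate_weights * act_norm * topk_mask    # zero out tokens the expert never saw
    n_routed = topk_mask.sum(dim=0).clamp(min=1)     # how many tokens each expert handled
    return contrib.sum(dim=0) / n_routed             # conditional mean, one score per expert

def experts_to_prune(saliency, keep_ratio=0.5):
    """One-shot selection: drop the lowest-saliency experts in this layer."""
    n_keep = max(1, int(round(saliency.numel() * keep_ratio)))
    order = torch.argsort(saliency, descending=True)
    return order[n_keep:]                            # indices of experts to remove
&lt;/code&gt;&lt;/pre&gt;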
&lt;p&gt;&lt;strong&gt;The easy heuristic is still wrong.&lt;/strong&gt; Simple activation frequency — counting how many tokens each expert processes — does worse than random selection [3]. The router&amp;rsquo;s assignment frequency is not the same as contribution.&lt;/p&gt;
&lt;p&gt;Domain-specific calibration delivers the biggest gap I&amp;rsquo;ve seen in any compression result. When REAP calibrates on C4 (general pre-training data) instead of domain-specific data (evol-codealpaca for code), code generation accuracy collapses — some compressed models produce 0% accuracy, failing to output coherent code at all [7]. This is not a matter of degree. The calibrating dataset determines whether the pruned model works or is completely useless on the target task. And this was already visible in earlier work: using MATH instead of C4 for calibration shifts expert selections in 28 of 32 layers of Mixtral 8×7B [3].&lt;/p&gt;
&lt;h2 id="pruning-strategies-the-choices-that-matter"&gt;Pruning strategies: the choices that matter&lt;/h2&gt;
&lt;p&gt;Once you have an importance score, you need to decide &lt;em&gt;how&lt;/em&gt; to use it. Three choices define your strategy.&lt;/p&gt;
&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Choice&lt;/th&gt;
&lt;th&gt;Options&lt;/th&gt;
&lt;th&gt;Trade-off&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;Scope&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;Global vs. layer-wise&lt;/td&gt;
&lt;td&gt;Global: better quality but variable per-layer counts. Layer-wise: fixed memory layout but lower ceiling&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;Schedule&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;One-shot vs. iterative&lt;/td&gt;
&lt;td&gt;One-shot: fast but importance rankings are stale post-pruning. Iterative: ~2× better PPL but needs re-estimation&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;Timing&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;Eager vs. staged&lt;/td&gt;
&lt;td&gt;Eager: more optimization steps for survivors. Staged: better importance estimates from longer observation&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;
&lt;p&gt;&lt;strong&gt;Global vs. layer-wise.&lt;/strong&gt; Global pruning — sorting all experts across all layers by a single importance ranking — outperforms layer-wise on quality because it avoids the constraint of keeping a fixed number per layer [1]. But it creates deployment headaches: variable per-layer expert counts mean variable memory usage across tasks, requiring model recreation for each configuration [6]. Layer-wise pruning gives predictable memory layouts at the cost of some quality.&lt;/p&gt;
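&lt;p&gt;A small sketch of the difference, using a placeholder matrix of per-layer, per-expert importance scores (the scores and the minimum-keep safeguard are illustrative assumptions, not from any of the papers):&lt;/p&gt;
&lt;pre&gt;&lt;code class="language-python"&gt;import numpy as np

def layerwise_prune(scores, drop_per_layer):
    """Drop a fixed number of lowest-scoring experts in every layer."""
    keep = np.ones_like(scores, dtype=bool)
    for layer in range(scores.shape[0]):
        worst = np.argsort(scores[layer])[:drop_per_layer]
        keep[layer, worst] = False
    return keep

def global_prune(scores, total_drop, min_keep_per_layer=1):
    """Drop the globally lowest-scoring experts, never emptying a layer."""
    keep = np.ones_like(scores, dtype=bool)
    dropped = 0
    for flat in np.argsort(scores, axis=None):        # lowest scores first, across all layers
        if dropped == total_drop:
            break
        layer, expert = np.unravel_index(flat, scores.shape)
        if keep[layer].sum() - 1 &amp;gt;= min_keep_per_layer:
            keep[layer, expert] = False
            dropped += 1
    return keep

rng = np.random.default_rng(0)
scores = rng.random((4, 8))                           # 4 layers x 8 experts
print(layerwise_prune(scores, 2).sum(axis=1))         # always 6 survivors per layer
print(global_prune(scores, 8).sum(axis=1))            # variable per-layer counts
&lt;/code&gt;&lt;/pre&gt;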
&lt;p&gt;&lt;strong&gt;One-shot vs. iterative.&lt;/strong&gt; One-shot pruning drops experts in a single pass. The problem: after you remove experts, the importance rankings of the survivors change. Iterative pruning re-estimates importance after each round, achieving ~2× better perplexity. Add task-agnostic finetuning between rounds and you get ~3× better [5]. One-shot and iterative pruning identify &lt;em&gt;substantially different subsets&lt;/em&gt; of experts at the same sparsity level — they produce effectively different subnetworks [5]. REAP demonstrates that with the right criterion, one-shot pruning can be remarkably effective even at 50% compression on models up to 1T parameters [7], but the iterative advantage likely still holds.&lt;/p&gt;
&lt;p&gt;&lt;strong&gt;Eager vs. staged.&lt;/strong&gt; Eager (progressive) pruning drops experts early using a dynamic threshold T = β / Z where Z is the number of surviving experts [2]. The earlier you drop, the more training steps you can dedicate to the selected expert. Eager consistently wins [2].&lt;/p&gt;
&lt;h3 id="the-nllb-200-special-case-language-specific-pruning"&gt;The NLLB-200 special case: language-specific pruning&lt;/h3&gt;
&lt;p&gt;The NLLB-200 translation model surfaces a phenomenon that the Mixtral papers miss: &lt;strong&gt;language-specific expert emergence&lt;/strong&gt;. In the decoder, Jaccard similarity of selected experts is 68–87% for the same target language versus only 13–39% for different target languages [6]. Per-language pruning (source language for encoder, target language for decoder) performs as well as per-language-pair pruning while requiring only L configurations instead of L² [6]. An unbalanced 3:1 encoder-to-decoder ratio yields the best quality [6].&lt;/p&gt;
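&lt;p&gt;The measurement itself is simple enough to sketch: collect the set of (layer, expert) pairs that fire while decoding into each target language, then compare the sets. The data below is made up for illustration; only the metric matches [6].&lt;/p&gt;
&lt;pre&gt;&lt;code class="language-python"&gt;def jaccard(a, b):
    """Jaccard similarity between two sets of (layer, expert_id) pairs."""
    a, b = set(a), set(b)
    return len(a.intersection(b)) / len(a.union(b))

# Hypothetical decoder-expert usage per target language, gathered over a calibration set.
experts_for_lang = {
    "deu": {(0, 1), (0, 5), (1, 2), (1, 7)},
    "fra": {(0, 1), (0, 5), (1, 2), (1, 3)},
    "zho": {(0, 0), (0, 6), (1, 4), (1, 7)},
}
print(jaccard(experts_for_lang["deu"], experts_for_lang["fra"]))  # overlap for one target language pattern
print(jaccard(experts_for_lang["deu"], experts_for_lang["zho"]))  # much lower across languages
&lt;/code&gt;&lt;/pre&gt;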
&lt;h2 id="beyond-pruning-complementary-techniques"&gt;Beyond pruning: complementary techniques&lt;/h2&gt;
&lt;p&gt;Static expert pruning rarely stands alone. Four complementary techniques compound its gains, and there&amp;rsquo;s now a clearer distinction between approaches that help and approaches that hurt.&lt;/p&gt;
&lt;h3 id="expert-merging-the-post-pruning-variant--and-why-its-different"&gt;Expert merging (the post-pruning variant — and why it&amp;rsquo;s different)&lt;/h3&gt;
&lt;p&gt;EEP&amp;rsquo;s expert merging is not the same thing as HC-SMoE or M-SMoE. EEP merges pruned expert knowledge into survivors &lt;em&gt;after&lt;/em&gt; pruning, using learned Router Mapping and Expert Merging matrices, adding 5–7% accuracy improvement [4]. This is a knowledge transfer operation — the pruned experts are already gone and their useful information is folded into the survivors. It&amp;rsquo;s fundamentally different from the HC-SMoE/M-SMoE approach of replacing entire expert groups with merged averages, which removes router independence and causes the collapse described above. The EEP variant is a net positive. The HC-SMoE variant is not, unless you&amp;rsquo;re only evaluating on multiple-choice.&lt;/p&gt;
&lt;h3 id="dynamic-expert-skipping"&gt;Dynamic expert skipping&lt;/h3&gt;
&lt;p&gt;Static pruning removes experts permanently. Dynamic skipping removes them &lt;em&gt;conditionally&lt;/em&gt; — dropping the second-ranked expert for a token when its routing weight is below a threshold β times the top expert&amp;rsquo;s weight, yielding ~50% skipping probability [3]. The key finding: skipping is &lt;em&gt;complementary&lt;/em&gt; to pruning. A model pruned to 6 experts with skipping achieves the same speedup as pruning alone to 4 experts, but with higher accuracy [3]. You get the speedup without the full accuracy cost.&lt;/p&gt;
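&lt;p&gt;The skip rule itself fits in a few lines. This is my reading of the rule described above (a threshold β on the runner-up gate relative to the top gate); the final renormalization detail is an assumption on my part:&lt;/p&gt;
&lt;pre&gt;&lt;code class="language-python"&gt;import torch

def route_with_skipping(router_logits, beta=0.5):
    """Top-2 routing, but skip the runner-up expert when its gate is too small.

    router_logits: (n_tokens, n_experts). Returns per-token expert indices and
    renormalized gate weights; index -1 / weight 0.0 marks a skipped runner-up.
    """
    gates = torch.softmax(router_logits, dim=-1)
    top2_w, top2_idx = gates.topk(2, dim=-1)                    # (n_tokens, 2)
    skip = torch.lt(top2_w[:, 1], beta * top2_w[:, 0])          # runner-up below beta * top gate
    top2_idx[:, 1] = torch.where(skip, torch.full_like(top2_idx[:, 1], -1), top2_idx[:, 1])
    top2_w[:, 1] = torch.where(skip, torch.zeros_like(top2_w[:, 1]), top2_w[:, 1])
    top2_w = top2_w / top2_w.sum(dim=-1, keepdim=True)          # renormalize surviving gates
    return top2_idx, top2_w
&lt;/code&gt;&lt;/pre&gt;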
&lt;h3 id="active-expert-reduction-and-finetuning"&gt;Active expert reduction and finetuning&lt;/h3&gt;
&lt;p&gt;Switching from top-2 to top-1 expert activation reduces forward-pass FLOPs by ~27% in Mixtral [1], but zero-shot top-1 routing drops SST5 accuracy from 50.8% to 42.6%. Recovery via entropy-based gating regularization plus annealing top-k reduction closes most of this gap (51.8% vs. 53.6% top-2) [1].&lt;/p&gt;
&lt;p&gt;Task-agnostic finetuning (~1M tokens, beyond which the benefits saturate) corrects the skewed load distribution caused by removing router entries. It doesn&amp;rsquo;t change which experts are selected — it mitigates impact through load rebalancing. This finetuning is central enough that iterative prune-estimate-finetune cycles produce what Jaiswal et al. call &lt;strong&gt;MoE Lottery Subnetworks&lt;/strong&gt; [5].&lt;/p&gt;
&lt;h3 id="quantization-after-pruning"&gt;Quantization after pruning&lt;/h3&gt;
&lt;p&gt;Pruning combines naturally with quantization without additional steps, unlike merging which requires block-scale reconciliation for block quantization formats [7]. Combining REAP with 4-bit quantization on Kimi-K2 achieves 87.5% total size reduction — a compression rate neither technique can reach alone [7].&lt;/p&gt;
&lt;h2 id="what-the-numbers-actually-say"&gt;What the numbers actually say&lt;/h2&gt;
&lt;p&gt;Across all seven papers, the efficiency-performance trade-off is more favorable than I expected.&lt;/p&gt;
&lt;p&gt;&lt;strong&gt;At moderate sparsity (25–50% experts removed), the accuracy cost on generative tasks is remarkably low — provided you prune, not merge.&lt;/strong&gt; REAP achieves a 1.9% mean accuracy decrease at 25% compression and 6.9% at 50% on coding benchmarks [7]. On Qwen3-Coder-480B and Kimi-K2, pruning 50% of experts drops code generation accuracy by only 1.2% [7]. On SWE-Bench (agentic software engineering), REAP-pruned Kimi-K2 at 50% compression actually slightly &lt;em&gt;exceeds&lt;/em&gt; the baseline (0.576 vs. 0.554) [7].&lt;/p&gt;
&lt;p&gt;Compare with merging at the same compression: HC-SMoE and M-SMoE see &amp;gt;5% accuracy decrease at 25% and &amp;gt;20% at 50% on the same coding benchmarks [7]. Merging looks reasonable on MC benchmarks (~4% decrease at 25%) but the MC numbers don&amp;rsquo;t predict generative performance. This gap — between discriminative and generative evaluation — is what the pre-REAP literature missed.&lt;/p&gt;
&lt;p&gt;&lt;strong&gt;At high sparsity (75–80% experts removed), the numbers depend heavily on task type and recovery technique.&lt;/strong&gt; At 75% sparsity, Min-EAN achieves 14.02 PPL versus 34.47 random [5]. NLLB-200 at 80% pruning achieves chrF++ 36.61 versus 36.81 full — a delta of −0.2 [6]. Expert dropping predominantly degrades instruction-following, not pretraining knowledge or reasoning; these capabilities can be substantially restored through K-shot examples or fine-tuning [5].&lt;/p&gt;
&lt;p&gt;&lt;strong&gt;The fastest path to deployment, based on the evidence, is: Base model → expert pruning → finetuning → instruction tuning.&lt;/strong&gt; Expert dropping yields greater benefits before instruction tuning than after [5]. With SFT after pruning, high-sparsity models can outperform full counterparts on easier tasks like BoolQ and ARC-easy.&lt;/p&gt;
&lt;h2 id="where-the-standard-story-breaks"&gt;Where the standard story breaks&lt;/h2&gt;
&lt;p&gt;&lt;strong&gt;The standard story:&lt;/strong&gt; expert utilization is long-tailed. You prune the tail. Light finetuning recovers the loss. Any compression method that reduces the expert count should work about as well.&lt;/p&gt;
&lt;p&gt;This story is wrong in ways that matter, and REAP is the paper that forced the correction.&lt;/p&gt;
&lt;p&gt;&lt;strong&gt;Pruning and merging are not interchangeable.&lt;/strong&gt; They produce qualitatively different models with different failure modes. Merging loses the router&amp;rsquo;s input-dependent control — an irreducible error proportional to the router&amp;rsquo;s policy variability. Pruning preserves it. On discriminative tasks, the difference is hidden because ranking answers in a single forward pass doesn&amp;rsquo;t require the model to maintain coherent generation. On generative tasks, the difference is dramatic [7].&lt;/p&gt;
&lt;p&gt;&lt;strong&gt;Discriminative metrics like perplexity and MC accuracy are poor proxies for generative quality.&lt;/strong&gt; This sounds obvious in retrospect, but the field relied on these metrics to claim merging &amp;gt; pruning. Jaiswal et al. had already warned that perplexity can be misleading for compressed LLMs [5]. REAP proved it with a clean experiment: merging methods that looked competitive on MC benchmarks collapsed on code generation to the point of producing 0% accuracy outputs [7]. If you evaluate a compressed model only on MC, you haven&amp;rsquo;t evaluated it at all for real-world use.&lt;/p&gt;
&lt;p&gt;&lt;strong&gt;Expert-level sparsification still beats weight pruning, and the argument is now stronger.&lt;/strong&gt; Across equivalent sparsity levels, dropping whole experts outperforms Wanda by ~3.6% average accuracy and ~16.2% on ARC-c [5]. And expert pruning preserves manifold topology while weight pruning may not — the geometric analysis from REAP [7] provides a structural argument for why whole-expert removal is the more principled approach.&lt;/p&gt;
&lt;p&gt;&lt;strong&gt;High-vocabulary-coverage experts hurt when dropped&lt;/strong&gt; — the specialist-generalist tension is real. If an expert handles many distinct tokens, removing it does outsized damage [5]. This suggests that pre-training methods that push experts toward specialization may make pruning easier in one sense (more experts are &amp;ldquo;dispensable&amp;rdquo;) but harder in another (the remaining generalist experts carry structural load that can&amp;rsquo;t be removed).&lt;/p&gt;
&lt;p&gt;&lt;strong&gt;Dominant experts have lower stable-rank&lt;/strong&gt; — a clean signal for identification but not yet exploited for additional compression [5].&lt;/p&gt;
&lt;p&gt;&lt;strong&gt;The second-pass degradation puzzle.&lt;/strong&gt; Two-pass eager-drop pruning degrades performance compared to a single pass, with average GLUE dropping by 0.58 points [2]. More iteration is not always better. But the REAP paper shows that a strong criterion in a single pass can go remarkably far — one-shot REAP on a 1T-parameter model at 50% compression is near-lossless on code [7]. The lesson isn&amp;rsquo;t &amp;ldquo;one-shot &amp;gt; iterative.&amp;rdquo; It&amp;rsquo;s that criterion quality and scale-appropriate calibration dominate the one-shot vs. iterative trade-off.&lt;/p&gt;
&lt;h2 id="boundary-conditions"&gt;Boundary conditions&lt;/h2&gt;
&lt;ul&gt;
&lt;li&gt;The enumeration-based pruning approach in Lu et al. [3] works for 4–8 experts per layer but becomes computationally intractable at 32+ experts. The combinatorial explosion is unresolved.&lt;/li&gt;
&lt;li&gt;Gradient-free methods like EEP&amp;rsquo;s evolutionary strategy [4] have been studied only on the Mixtral family. Whether they generalize to architectures with many more experts is unknown.&lt;/li&gt;
&lt;li&gt;HC-SMoE&amp;rsquo;s mega-clusters containing tens of experts are fragile — restricting maximum cluster size to 32 causes large accuracy drops [7]. Coherently merging many experts remains an open problem.&lt;/li&gt;
&lt;li&gt;Hallucination and over-generation have been observed in pruned translation models, with global threshold methods more sensitive than fixed-per-layer pruning [6].&lt;/li&gt;
&lt;li&gt;All seven papers study static expert counts per layer. None address dynamic architectures where expert count varies by input complexity.&lt;/li&gt;
&lt;li&gt;Qwen2-MoE experts are notably homogeneous — the &amp;ldquo;expert specialization&amp;rdquo; narrative is architecture-dependent [4].&lt;/li&gt;
&lt;li&gt;Merging methods require recording activations from every expert for every token during calibration, making them more expensive at scale than pruning methods [7].&lt;/li&gt;
&lt;li&gt;The pruning-vs-merging analysis from REAP [7] applies to one-shot, no-fine-tuning compression. Whether fine-tuning after merging can recover the policy variability loss is not addressed.&lt;/li&gt;
&lt;/ul&gt;
&lt;h2 id="open-questions"&gt;Open questions&lt;/h2&gt;
&lt;ol&gt;
&lt;li&gt;The REAP criterion&amp;rsquo;s derivation minimizes a reconstruction error bound assuming one-shot pruning. Can the same router-gate × activation-norm logic be extended to iterative pruning, and does it produce even better results?&lt;/li&gt;
&lt;li&gt;Merging fails on generative tasks because it removes router independence. But could you &lt;em&gt;train&lt;/em&gt; a model to be merge-friendly — by regularizing expert functional similarity or router policy smoothness — and get the memory savings of merging without the generative collapse?&lt;/li&gt;
&lt;li&gt;What is the interaction between expert pruning and quantization at scale? REAP showed the combination works [7], but only on one model family (Kimi-K2 at 4-bit). Do pruned experts tolerate lower-bit quantization better or worse than full experts?&lt;/li&gt;
&lt;li&gt;The &amp;ldquo;MoE Lottery Subnetworks&amp;rdquo; framing [5] has only been studied up to Mixtral 8×22B. Does it hold at the scale REAP demonstrated (480B–1T parameters)?&lt;/li&gt;
&lt;li&gt;The vocabulary coverage finding [5] — high-coverage experts hurt when dropped — implies a tension with specialization. If you make experts more specialized, you might make pruning easier, but you risk creating fragile specialists that cannot be removed. Which direction wins?&lt;/li&gt;
&lt;li&gt;No paper in this set studies pruning during pre-training rather than post-training. Could you train an MoE from scratch knowing it will be pruned and get a better result?&lt;/li&gt;
&lt;/ol&gt;
&lt;p&gt;A question to revisit six months from now: has the community converged on REAP as the default one-shot pruning criterion, or has the merging community produced a variant that recovers router independence and closes the generative-task gap?&lt;/p&gt;
&lt;h2 id="references"&gt;References&lt;/h2&gt;
&lt;ol&gt;
&lt;li&gt;Muzio et al., &amp;ldquo;SEER-MoE: Sparse Expert Efficiency through Regularization for Mixture-of-Experts&amp;rdquo;, arXiv:2404.05089, 2024.&lt;/li&gt;
&lt;li&gt;Chen et al., &amp;ldquo;Task-Specific Expert Pruning for Sparse Mixture-of-Experts&amp;rdquo;, arXiv:2206.00277, 2022.&lt;/li&gt;
&lt;li&gt;Lu et al., &amp;ldquo;Not All Experts are Equal: Efficient Expert Pruning and Skipping for Mixture-of-Experts Large Language Models&amp;rdquo;, ACL 2024.&lt;/li&gt;
&lt;li&gt;Liu et al., &amp;ldquo;Efficient Expert Pruning for Sparse Mixture-of-Experts Language Models: Enhancing Performance and Reducing Inference Costs&amp;rdquo;, arXiv:2407.00945, 2024.&lt;/li&gt;
&lt;li&gt;Jaiswal et al., &amp;ldquo;Finding Fantastic Experts in MoEs: A Unified Study for Expert Dropping Strategies and Observations&amp;rdquo;, arXiv:2504.05586, 2025.&lt;/li&gt;
&lt;li&gt;Koishekenov et al., &amp;ldquo;Memory-efficient NLLB-200: Language-specific Expert Pruning of a Massively Multilingual Machine Translation Model&amp;rdquo;, ACL 2023.&lt;/li&gt;
&lt;li&gt;Lasby et al., &amp;ldquo;REAP the Experts: Why Pruning Prevails for One-Shot MoE Compression&amp;rdquo;, arXiv:2510.13999, 2025.&lt;/li&gt;
&lt;/ol&gt;</description></item><item><title>LLM/VLM Compression Foundations</title><link>http://fengwang.github.io/posts/llm-vlm-compression-foundations-clean/</link><pubDate>Sun, 10 May 2026 00:00:00 +0000</pubDate><guid>http://fengwang.github.io/posts/llm-vlm-compression-foundations-clean/</guid><description>&lt;p&gt;I started looking at model compression because the numbers didn&amp;rsquo;t add up. My GPU has 24GB of VRAM and the models I want to run need 40GB. The gap is a factor of two, which quantization claims to solve. But then I found papers about pruning, and distillation, and token compression, and hardware-aware NAS, and suddenly the question wasn&amp;rsquo;t &amp;ldquo;which technique&amp;rdquo; but &amp;ldquo;which combination, in what order, for which hardware.&amp;rdquo;&lt;/p&gt;
&lt;p&gt;This article is my attempt to organize what I&amp;rsquo;ve learned into a coherent map. It is not a survey — there are good surveys for that. It is a working notebook: what I understand, what surprised me, and what I still can&amp;rsquo;t explain.&lt;/p&gt;
&lt;p&gt;&lt;strong&gt;Thesis:&lt;/strong&gt; Compression works because neural networks are overparameterized for the expressivity they actually use. The hard part is knowing which bits are the ones that don&amp;rsquo;t matter — and that answer depends on what you&amp;rsquo;re compressing (text vs. vision-language), how you remove it (prune, quantize, or distill), what order you apply the steps (P-KD-Q), and what hardware runs the result.&lt;/p&gt;
&lt;p&gt;&lt;strong&gt;Scope:&lt;/strong&gt; This covers the foundations of LLM and VLM compression — the three pillars (pruning, quantization, distillation), token compression, NAS for compression, the empirical ordering evidence, failure modes, and hardware decision rules. It does not cover training-from-scratch efficiency, inference serving systems (vLLM, TensorRT-LLM) beyond their connection to compression, or retrieval-augmented generation.&lt;/p&gt;
&lt;p&gt;&lt;strong&gt;Prerequisites:&lt;/strong&gt; This assumes familiarity with transformer architectures, basic neural network training (backpropagation, gradient descent), floating-point representation, and cross-entropy loss.&lt;/p&gt;
&lt;h2 id="1-overparameterization-is-the-precondition"&gt;1. Overparameterization is the precondition&lt;/h2&gt;
&lt;p&gt;If models weren&amp;rsquo;t overparameterized, compression wouldn&amp;rsquo;t work. The Lottery Ticket Hypothesis established this formally in 2018: dense, randomly-initialized networks contain subnetworks that, trained in isolation, match the full network&amp;rsquo;s accuracy. For modern LLMs, the numbers are concrete — up to 30% of parameters can be pruned with negligible loss, and models hold 98-99% of original capabilities at just 15% pruning.&lt;/p&gt;
&lt;p&gt;This overparameterization isn&amp;rsquo;t a mistake. Sparse architectures are hard to train from scratch. We train dense and then compress because that&amp;rsquo;s what the optimization surface allows.&lt;/p&gt;
&lt;p&gt;The shape of the redundancy matters, and it differs by modality. This is something I initially underestimated — I thought all redundancy was weight-level, but the token-level and modality-dependent patterns are just as important for practical compression decisions.&lt;/p&gt;
&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Modality&lt;/th&gt;
&lt;th&gt;Redundancy pattern&lt;/th&gt;
&lt;th&gt;Scale&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;Images&lt;/td&gt;
&lt;td&gt;Spatial — neighboring patches share textures/colors&lt;/td&gt;
&lt;td&gt;—&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Video&lt;/td&gt;
&lt;td&gt;Spatiotemporal — consecutive frames share backgrounds; at 10fps, 1000 tokens/frame, a 90-min video yields ~54M tokens&lt;/td&gt;
&lt;td&gt;54M tokens/video&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Audio&lt;/td&gt;
&lt;td&gt;Salient info concentrates in sparse, brief segments and specific frequency bands&lt;/td&gt;
&lt;td&gt;—&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;MLLM sequences&lt;/td&gt;
&lt;td&gt;&amp;gt;50% of tokens get minimal attention; multimodal tokens are &amp;gt;80% of sequences in reasoning tasks&lt;/td&gt;
&lt;td&gt;&amp;gt;80% of sequence&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;
&lt;p&gt;All compression exploits some version of this: there are bits you can throw away because they don&amp;rsquo;t change the output. The question is which bits.&lt;/p&gt;
&lt;h2 id="2-the-three-pillars-and-two-newer-additions"&gt;2. The three pillars, and two newer additions&lt;/h2&gt;
&lt;p&gt;The literature converges on five categories. Three dominate practice:&lt;/p&gt;
&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Method&lt;/th&gt;
&lt;th&gt;Mechanism&lt;/th&gt;
&lt;th&gt;What it reduces&lt;/th&gt;
&lt;th&gt;Tuning required&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;Quantization&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;Lower-bit weight/activation representation&lt;/td&gt;
&lt;td&gt;Memory, potentially speed&lt;/td&gt;
&lt;td&gt;Often tuning-free for LLMs&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;Pruning&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;Remove unimportant weights or structures&lt;/td&gt;
&lt;td&gt;Parameters, compute&lt;/td&gt;
&lt;td&gt;Recovery training at high ratios&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;Distillation&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;Transfer knowledge from large → small model&lt;/td&gt;
&lt;td&gt;Parameters, compute&lt;/td&gt;
&lt;td&gt;Training a student&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;
&lt;p&gt;Token compression and Neural Architecture Search sit alongside these — newer, less universal, but important for specific scenarios.&lt;/p&gt;
&lt;h3 id="21-quantization-the-hardware-sensitive-frontier"&gt;2.1 Quantization: the hardware-sensitive frontier&lt;/h3&gt;
&lt;p&gt;Quantization converts float32/float16 weights to fewer bits. The fundamental tension: non-uniform quantization achieves higher accuracy because weights aren&amp;rsquo;t uniformly distributed, but uniform quantization gets hardware support. You cannot have both accuracy and hardware efficiency simultaneously with existing methods.&lt;/p&gt;
&lt;p&gt;A critical asymmetry drives the research: &lt;strong&gt;weights are easy to quantize, activations are hard&lt;/strong&gt; because of outlier distributions. SmoothQuant addresses this by migrating quantization difficulty from activations to weights via per-channel scaling:&lt;/p&gt;
&lt;p&gt;$$Y = X \cdot \text{diag}(s)^{-1} \times \text{diag}(s) \cdot W$$&lt;/p&gt;
&lt;p&gt;This enables W8A8 quantization with minimal accuracy loss and a 2× throughput gain. The idea is simple — smooth the activation outliers into the weights where they do less damage — but the execution requires careful per-channel scaling factors.&lt;/p&gt;
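&lt;p&gt;A small numpy sketch of the migration, using the per-channel scale $s_j = \max|X_j|^{\alpha} / \max|W_j|^{1-\alpha}$ with $\alpha = 0.5$. This is my shorthand rendering of the SmoothQuant idea; the exact calibration procedure in the paper has more moving parts:&lt;/p&gt;
&lt;pre&gt;&lt;code class="language-python"&gt;import numpy as np

def smooth_scales(X, W, alpha=0.5):
    """Per-input-channel smoothing scales: s_j = max|X_j|^alpha / max|W_j|^(1 - alpha)."""
    act_max = np.abs(X).max(axis=0)                # per channel, over a calibration batch
    w_max = np.abs(W).max(axis=1)                  # per input channel of W (W is [in, out] here)
    return act_max ** alpha / w_max ** (1.0 - alpha)

rng = np.random.default_rng(0)
X = rng.normal(size=(64, 128))
X[:, 7] *= 30.0                                    # one outlier activation channel
W = rng.normal(size=(128, 256)) * 0.02

s = smooth_scales(X, W)
X_smooth = X / s                                   # activation outliers shrink ...
W_smooth = W * s[:, None]                          # ... and migrate into the weights
assert np.allclose(X @ W, X_smooth @ W_smooth)     # the product is mathematically unchanged
print(np.abs(X).max(), np.abs(X_smooth).max())     # a much tamer activation range to quantize
&lt;/code&gt;&lt;/pre&gt;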
&lt;h4 id="the-outlier-problem-quantified"&gt;The outlier problem, quantified&lt;/h4&gt;
&lt;p&gt;ICQuant reveals the structure of the problem in a way I find unusually clean: &lt;strong&gt;the top 5% of weight outliers consume about 50% of the total value range&lt;/strong&gt; — meaning one full quantization bit gets wasted on just 5% of the weights. About 97% of weight channels have uniformly-distributed outlier positions (verified across Llama2/3/4 and Qwen2.5 families), which enables a per-channel partitioning strategy: separate codebooks for outliers and inliers, combined with index coding that costs ≈0.3 bits/weight vs. ≈1 bit for prior approaches.&lt;/p&gt;
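&lt;p&gt;You can see the bit-waste effect on synthetic data. The sketch below quantizes a heavy-tailed toy weight vector with one grid over the full range, then with separate grids for the top-5% outliers and the rest (ignoring the index-coding overhead ICQuant actually pays). The data and the 4-bit setting are illustrative assumptions, not the paper&amp;rsquo;s setup:&lt;/p&gt;
&lt;pre&gt;&lt;code class="language-python"&gt;import numpy as np

def uniform_quantize(x, n_bits):
    """Uniform quantization over the min-max range of x."""
    lo, hi = x.min(), x.max()
    step = (hi - lo) / (2 ** n_bits - 1)
    return np.round((x - lo) / step) * step + lo

rng = np.random.default_rng(0)
w = rng.standard_t(df=4, size=100_000) * 0.02        # heavy-tailed toy "weight channel"

order = np.argsort(np.abs(w))
outlier_idx = order[-len(w) // 20:]                  # top 5% by magnitude
inlier_idx = order[:-len(w) // 20]

b = 4
q_all = uniform_quantize(w, b)                       # one grid, stretched by the outliers
q_in = uniform_quantize(w[inlier_idx], b)            # separate grid for the inliers
q_out = uniform_quantize(w[outlier_idx], b)          # and another for the outliers

mse_all = np.mean((q_all - w) ** 2)
mse_split = (np.sum((q_in - w[inlier_idx]) ** 2) +
             np.sum((q_out - w[outlier_idx]) ** 2)) / len(w)
print(f"single grid MSE: {mse_all:.2e}   split grids MSE: {mse_split:.2e}")
&lt;/code&gt;&lt;/pre&gt;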
&lt;h4 id="the-production-baseline-and-the-frontier"&gt;The production baseline and the frontier&lt;/h4&gt;
&lt;p&gt;&lt;strong&gt;FP8 (E4M3) on NVIDIA H100/B200&lt;/strong&gt; is the modern production baseline — essentially lossless 50% memory reduction from FP16. &lt;strong&gt;4-bit PTQ (AWQ, GPTQ)&lt;/strong&gt; achieves virtually lossless quantization for models above 70B parameters. QuIP/QuIP# pushes to 2 bits by multiplying weight and Hessian matrices with randomized Hadamard transforms to make entries approximately i.i.d. Gaussian, enabling E8 lattice codebook quantization.&lt;/p&gt;
&lt;p&gt;At the extreme frontier: LittleBit reaches 0.1 bits/weight through latent factorization; iFairy uses complex numbers {±1, ±i} for 2-bit &amp;ldquo;multiplication-free&amp;rdquo; inference via sign flips.&lt;/p&gt;
&lt;h4 id="edge-and-vlm-specific-quantization"&gt;Edge and VLM-specific quantization&lt;/h4&gt;
&lt;p&gt;Edge deployment demands specialized methods. Q-VLM minimizes cross-layer dependency errors in LVLMs using activation entropy as a proxy. MBQ accounts for differential sensitivity between vision and language tokens, achieving up to 1.4× decoding speedup with a custom W3 kernel. P4Q introduces learnable prompts and a lightweight low-bit adapter to realign post-quantization feature distributions.&lt;/p&gt;
&lt;p&gt;KV-cache quantization deserves separate mention. In PaLM-540B with batch size 512 and context length 2048, the KV cache alone needs 3TB — three times the model parameters. KIVI-style KV-cache quantization is now table-stakes for long-context serving.&lt;/p&gt;
&lt;h3 id="22-pruning-three-strategies-three-hardware-outcomes"&gt;2.2 Pruning: three strategies, three hardware outcomes&lt;/h3&gt;
&lt;p&gt;Pruning&amp;rsquo;s real-world impact depends entirely on the pattern, because hardware can only exploit certain sparsity structures:&lt;/p&gt;
&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Pruning type&lt;/th&gt;
&lt;th&gt;What&amp;rsquo;s removed&lt;/th&gt;
&lt;th&gt;Hardware speedup&lt;/th&gt;
&lt;th&gt;Examples&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;Unstructured&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;Individual weights&lt;/td&gt;
&lt;td&gt;None without sparse kernels&lt;/td&gt;
&lt;td&gt;SparseGPT, Wanda&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;Semi-structured&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;Fixed patterns (2:4, 4:8)&lt;/td&gt;
&lt;td&gt;Yes on NVIDIA Ampere+&lt;/td&gt;
&lt;td&gt;SparseGPT 2:4, Wanda N:M&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;Structured&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;Whole layers/heads/channels&lt;/td&gt;
&lt;td&gt;Yes on commodity hardware&lt;/td&gt;
&lt;td&gt;LLM-Pruner, NIRVANA, UKMP&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;
&lt;p&gt;The key insight — and the one I keep coming back to — is that unstructured sparsity achieves the best accuracy but delivers zero speedup without special hardware. Structured pruning physically reduces matrix dimensions — immediate gains on any hardware, at higher accuracy cost. Semi-structured 2:4 sparsity is NVIDIA&amp;rsquo;s compromise: hardware-supported on Ampere GPUs, but one-shot methods like SparseGPT and Wanda still suffer at 60-80% sparsity or with tight 2:4 constraints.&lt;/p&gt;
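&lt;p&gt;For concreteness, here is what the 2:4 pattern means mechanically: in every contiguous group of four weights along the input dimension, keep the two largest and zero the rest. The magnitude criterion below is the simplest possible choice (SparseGPT and Wanda use smarter saliency); it is a sketch of the pattern, not of those methods:&lt;/p&gt;
&lt;pre&gt;&lt;code class="language-python"&gt;import torch

def mask_2_of_4(weight):
    """Keep the 2 largest-magnitude weights in every group of 4 along the last dim."""
    out_dim, in_dim = weight.shape
    assert in_dim % 4 == 0, "2:4 sparsity needs the input dim to be a multiple of 4"
    groups = weight.abs().reshape(out_dim, in_dim // 4, 4)
    survivors = groups.topk(2, dim=-1).indices                 # positions of the 2 kept weights
    mask = torch.zeros_like(groups, dtype=torch.bool)
    mask.scatter_(-1, survivors, torch.ones_like(survivors, dtype=torch.bool))
    return mask.reshape(out_dim, in_dim)

w = torch.randn(8, 16)
m = mask_2_of_4(w)
print(m.float().mean())        # exactly 0.5: half the weights survive
w_sparse = w * m               # dense storage here; sparse tensor cores use a packed format
&lt;/code&gt;&lt;/pre&gt;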
&lt;h4 id="beyond-uniform-sparsity-per-dimension-pruning"&gt;Beyond uniform sparsity: per-dimension pruning&lt;/h4&gt;
&lt;p&gt;A critical limitation of prior methods is uniform sparsity within layers — all output dimensions of a weight matrix get the same pruning ratio. TRIM demonstrates this is deeply suboptimal: individual output dimensions differ significantly in sensitivity. By assigning unique per-row sparsity ratios via iterative metric-driven adjustment, TRIM reduces OPT-13B perplexity at 80% sparsity from 6461 (Wanda-based OWL) to 324 — &lt;strong&gt;over 95% reduction in perplexity&lt;/strong&gt; at the same sparsity level.&lt;/p&gt;
&lt;p&gt;&lt;strong&gt;NIRVANA&lt;/strong&gt; redesigns structured pruning by combining magnitude-scaled gradient saliency ($|\partial f / \partial W \cdot W|$) with Adam-based NTK stability guarantees. The dual criterion balances output preservation with training stability — Proposition 4.1 proves $|\hat{\Theta} - \Theta| \leq O(\varepsilon)$ under the SignGD kernel.&lt;/p&gt;
&lt;p&gt;Key design choices:&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;&lt;strong&gt;Adaptive sparsity allocation&lt;/strong&gt;: parameter $\gamma$ controls MLP vs. attention pruning rates ($v_{\text{MLP}} = \gamma \cdot v_{\text{Attn}}$)&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Hardware-aware dimension alignment&lt;/strong&gt;: all hidden dimensions forced to multiples of 8 for Tensor Core compatibility&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Global joint ranking&lt;/strong&gt; across all layers/modules with a safeguard retaining ≥1 unit per layer to prevent collapse&lt;/li&gt;
&lt;/ul&gt;
&lt;p&gt;At 50% sparsity, NIRVANA achieves WikiText2 perplexity (PPL) of 48.94 vs. 215.94 for LLM-Pruner on Llama3.1-8B. Ablation reveals that magnitude-based scoring alone causes extreme collapse (PPL ≈ 10⁵–10⁶), and removing adaptive allocation $\gamma$ raises PPL from 48.94 to 102.00.&lt;/p&gt;
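&lt;p&gt;The magnitude-scaled gradient term is just the classic first-order Taylor score, which is easy to sketch. The snippet below scores hidden units of a toy MLP by the sum of $|\partial f / \partial W \cdot W|$ over their weights; the NTK stability guarantee, the adaptive $\gamma$ allocation, and the dimension-alignment step are all omitted, so this is the flavor of the criterion rather than NIRVANA itself:&lt;/p&gt;
&lt;pre&gt;&lt;code class="language-python"&gt;import torch
import torch.nn as nn

def taylor_unit_saliency(linear):
    """Score each output unit of a linear layer by the sum over its weights of |dL/dW * W|.

    Call after loss.backward() so that .grad is populated.
    """
    g = linear.weight.grad                        # same shape as weight: (out, in)
    return (g * linear.weight).abs().sum(dim=1)   # one score per output unit

mlp = nn.Sequential(nn.Linear(32, 64), nn.GELU(), nn.Linear(64, 32))
x = torch.randn(16, 32)
loss = mlp(x).pow(2).mean()                       # stand-in for the real calibration loss
loss.backward()

scores = taylor_unit_saliency(mlp[0])             # saliency of each of the 64 hidden units
n_drop = int(0.5 * scores.numel())
drop = scores.argsort()[:n_drop]                  # lowest-saliency units to remove
# Structured pruning would now delete these rows of mlp[0] and the matching columns of mlp[2].
&lt;/code&gt;&lt;/pre&gt;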
&lt;p&gt;&lt;strong&gt;FastForward Pruning&lt;/strong&gt; reformulates sparsity allocation as a single-step RL problem. The RL state is defined solely by the global target sparsity $\sigma_t$ (enabling transfer learning), with a ratio-based reward function ( $PPL_{\text{dense}} / PPL_{\text{pruned}}$ ) that is scale-invariant for portability across model sizes. Results: 3.4× faster than EAS (6.13 vs. 23.6 GPU-hr) on LLaMA-V1 7B at 20% sparsity, with better PPL (6.64 vs. 6.89).&lt;/p&gt;
&lt;h4 id="vlm-specific-structured-pruning-ukmp"&gt;VLM-specific structured pruning: UKMP&lt;/h4&gt;
&lt;p&gt;Text-only pruning methods fail on LVLMs because they treat the language backbone in isolation, ignoring the vision-language interface. &lt;strong&gt;UKMP (AAAI 2025)&lt;/strong&gt; introduces the first unified structured pruning framework purpose-built for LVLMs.&lt;/p&gt;
&lt;p&gt;The &lt;strong&gt;UKMI metric&lt;/strong&gt; combines three innovations:&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;&lt;strong&gt;Adaptive dual normalization&lt;/strong&gt;: block-wise normalization (by parameter volume) prevents large modules from dominating; modality-wise normalization balances vision and language components&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;First-order gradient saliency&lt;/strong&gt;: UKMP discards the second-order Fisher term because the convergence assumption behind it does not hold when parameters are frozen; frozen parameters still receive first-order gradients, so those alone drive the saliency&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Angle distribution entropy&lt;/strong&gt;: entropy over 100 cosine bins weights the Taylor importance, penalizing parameters whose removal would cause large angular shifts in feature space&lt;/li&gt;
&lt;/ul&gt;
&lt;p&gt;Recovery uses a &lt;strong&gt;weight recalling module&lt;/strong&gt;: low-rank $P_2 Q_2 W^p$ transformation parallel to LoRA, trained through three-phase progressive distillation (vision-only MSE → vision+language MSE → task loss + KL). This module is reparameterizable — it folds into base weights after training at no inference cost.&lt;/p&gt;
&lt;p&gt;At 50% pruning, UKMP achieves 47.81% VQAv2 accuracy (vs. 36.40% next-best) and 96.92 NoCaps CIDEr (vs. 85.51). Even at 20% pruning, the pruned BLIP-2 beats similarly-sized full BLIP-2 on OK-VQA and GQA.&lt;/p&gt;
&lt;h3 id="23-distillation-transfer-without-the-baggage"&gt;2.3 Distillation: transfer without the baggage&lt;/h3&gt;
&lt;p&gt;Knowledge distillation trains a smaller student model to mimic a larger teacher. The three challenges: what knowledge to transfer, which algorithm to use, and how to design the student-teacher pair.&lt;/p&gt;
&lt;p&gt;White-box distillation using KL-divergence at high temperature ($\tau = 4.0$) reveals the teacher&amp;rsquo;s confidence across the full vocabulary, enabling finer-grained transfer than black-box methods relying on text outputs alone.&lt;/p&gt;
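&lt;p&gt;For reference, the temperature-scaled KL objective that paragraph describes is only a few lines. This is the generic Hinton-style formulation (the $\tau^2$ rescaling and the mixing with cross-entropy are standard practice, not specific to any one paper here):&lt;/p&gt;
&lt;pre&gt;&lt;code class="language-python"&gt;import torch
import torch.nn.functional as F

def white_box_kd_loss(student_logits, teacher_logits, tau=4.0):
    """KL(teacher || student) over the full vocabulary at temperature tau."""
    log_p_student = F.log_softmax(student_logits / tau, dim=-1)
    p_teacher = F.softmax(teacher_logits / tau, dim=-1)
    kd = F.kl_div(log_p_student, p_teacher, reduction="batchmean")
    return kd * tau ** 2    # keeps gradient magnitudes comparable across temperatures

# Usually mixed with the ordinary next-token cross-entropy on the ground-truth labels:
# loss = alpha * white_box_kd_loss(s_logits, t_logits) + (1 - alpha) * ce_loss
&lt;/code&gt;&lt;/pre&gt;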
&lt;h4 id="curriculum-distillation-with-selective-reflection"&gt;Curriculum distillation with selective reflection&lt;/h4&gt;
&lt;p&gt;&lt;strong&gt;SRD (Selective Reflection Distillation)&lt;/strong&gt; demonstrates that not all training samples contribute equally — and that curriculum ordering matters. Easy-to-hard curriculum significantly outperforms reverse hard-to-easy ordering. An increasing temperature schedule ($\tau_0 = 1 \to \tau_n = 2$) is a key effectiveness driver; reversing it severely degrades results.&lt;/p&gt;
&lt;p&gt;SRD achieves up to 39% training time reduction while using 75% of data, and consistently improves ROUGE-L by 3.92–15.53% across all 7 tested KD methods on 5 benchmarks. It is plug-and-play — no changes to model architectures, loss functions, or KD algorithms. It even enables distilled students to surpass teacher performance (26.07 vs. 25.15 ROUGE-L for OpenLLaMA2).&lt;/p&gt;
&lt;h4 id="vlm-specific-distillation"&gt;VLM-specific distillation&lt;/h4&gt;
&lt;p&gt;VLMs present unique challenges because cross-modal alignment must be preserved.&lt;/p&gt;
&lt;p&gt;&lt;strong&gt;Switch-KD (CVPR 2026)&lt;/strong&gt; unifies vision-language knowledge transfer within a shared text-probability space. The Visual-Switch Distillation pathway switches student visual outputs into the teacher&amp;rsquo;s language pathway ($S\text{-ViT} \to T\text{-Projector} \to T\text{-LLM}$), producing visual-switch logits that represent the teacher&amp;rsquo;s output distribution conditioned on student-encoded visual representations. This is supervised by &lt;strong&gt;DBiLD loss&lt;/strong&gt;, which uses the Kneedle algorithm for adaptive top-k boundary detection and bidirectional reverse KL alignment on pairwise logit differences — outperforming forward KL by 0.5 points.&lt;/p&gt;
&lt;p&gt;Switch-KD-0.5B achieves +3.6 Avg10 over TinyLLaVA-0.5B across 10 multimodal benchmarks and matches the 3B teacher with half the parameters. However, it requires feature-space and vocabulary consistency between teacher and student.&lt;/p&gt;
&lt;p&gt;&lt;strong&gt;Align-KD&lt;/strong&gt; rests on a critical architectural finding: cross-modal alignment in VLMs occurs primarily at the first attention layer&amp;rsquo;s text-query-vision component ($A_{1, t \leftarrow v}$). Distilling only this targeted attention map achieves the same performance as distilling all maps while saving up to 50% computation. Distilling the wrong component is harmful: vision-query-vision attention KD collapses performance to 43.7 (vs. 64.4 baseline).&lt;/p&gt;
&lt;h4 id="bridging-black-box-and-white-box-distillation"&gt;Bridging black-box and white-box distillation&lt;/h4&gt;
&lt;p&gt;The strongest teachers (GPT-4, proprietary models) are black-box — only text outputs available via API. White-box KD requires internal parameters. &lt;strong&gt;GrayKD (AAAI 2026)&lt;/strong&gt; bridges this with a single-stage framework using no proxy teacher. Black-box rationales are injected through a lightweight cross-attention module — student hidden states as queries, rationale embeddings as keys/values, with 15% random masking for augmentation.&lt;/p&gt;
&lt;p&gt;The efficiency gain is dramatic: GrayKD uses 610M parameters total vs. 2.06B for conventional KD pipelines. GrayKD Triple achieves 27.64 Avg Rouge-L, beating PromptKD + White Teacher (26.44) — despite using the same black-box GPT-4o-mini teacher as lower-scoring methods. Rationale diversity is the dominant factor: switching from multi-rationale to single-rationale reuse drops Rouge-L by 1.14 points.&lt;/p&gt;
&lt;h3 id="24-token-compression-compressing-the-input-not-the-model"&gt;2.4 Token compression: compressing the input, not the model&lt;/h3&gt;
&lt;p&gt;Token compression operates upstream of the three traditional pillars: instead of compressing model weights, it compresses the input. Approaches are categorized by modality (image/video/audio) and mechanism (transformation-based, similarity-based, attention-based, query-based). The key advantage: token compression is post-optimization, requiring no retraining.&lt;/p&gt;
&lt;p&gt;I find this category theoretically elegant but practically limited — it only helps when tokens dominate the compute budget, which is true for video and long-context multimodal tasks but less so for standard image+text inference.&lt;/p&gt;
&lt;h3 id="25-nas-for-compression"&gt;2.5 NAS for compression&lt;/h3&gt;
&lt;p&gt;&lt;strong&gt;CompressNAS&lt;/strong&gt; treats Tucker rank selection as a global search problem, using an MSE-based accuracy proxy comparing decomposed vs. reference layer feature vectors. Existing zero-cost proxies (NASWOT, GraSP, SNIP, ZiCo) fail monotonic trends at higher ranks. CompressNAS builds two lookup tables ($\Delta\text{acc}$, $\Delta\text{flash}$) and uses ILP-based NAS to select ranks globally given a hardware budget — 8× compression of ResNet-18 on ImageNet with &amp;lt;4% accuracy drop.&lt;/p&gt;
&lt;p&gt;&lt;strong&gt;LLM-NAS&lt;/strong&gt; solves a problem I hadn&amp;rsquo;t considered: LLM-driven architecture search exhibits exploration bias, repeatedly proposing designs within a narrow region of the search space. The fix is three innovations:&lt;/p&gt;
&lt;ol&gt;
&lt;li&gt;&lt;strong&gt;Complexity-driven partitioning&lt;/strong&gt; into 6 disjoint niches defined by architectural complexity (nor_conv_3×3 count)&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;LLM-powered prompt co-evolution&lt;/strong&gt; — prompts and architectures co-evolve across rounds&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;XGBoost zero-cost predictor&lt;/strong&gt; aggregating 13 proxy metrics with Spearman correlation ~0.90 to ground truth&lt;/li&gt;
&lt;/ol&gt;
&lt;p&gt;Search takes 3 minutes and 120 API calls vs. 2–17 GPU-days for supernet baselines. Removing partitioning drops hypervolume from 0.978 to 0.516. Removing the LLM entirely drops it to 0.843.&lt;/p&gt;
&lt;h2 id="3-the-p-kd-q-ordering-sequence-matters"&gt;3. The P-KD-Q ordering: sequence matters&lt;/h2&gt;
&lt;p&gt;A systematic study on Qwen2.5-3B shows that compression ordering is not a detail — it determines whether the pipeline works at all.&lt;/p&gt;
&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Sequence&lt;/th&gt;
&lt;th&gt;Compression&lt;/th&gt;
&lt;th&gt;G-Eval&lt;/th&gt;
&lt;th&gt;PPL&lt;/th&gt;
&lt;th&gt;Verdict&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;P-KD-Q&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;3.68×&lt;/td&gt;
&lt;td&gt;0.733&lt;/td&gt;
&lt;td&gt;5.048&lt;/td&gt;
&lt;td&gt;Best&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;KD-P-Q&lt;/td&gt;
&lt;td&gt;3.68×&lt;/td&gt;
&lt;td&gt;0.644&lt;/td&gt;
&lt;td&gt;—&lt;/td&gt;
&lt;td&gt;Intermediate&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;P-Q-KD&lt;/td&gt;
&lt;td&gt;3.68×&lt;/td&gt;
&lt;td&gt;0.610&lt;/td&gt;
&lt;td&gt;—&lt;/td&gt;
&lt;td&gt;Intermediate&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;KD-Q-P&lt;/td&gt;
&lt;td&gt;3.68×&lt;/td&gt;
&lt;td&gt;—&lt;/td&gt;
&lt;td&gt;53.4&lt;/td&gt;
&lt;td&gt;Collapse&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Q-P-KD&lt;/td&gt;
&lt;td&gt;3.68×&lt;/td&gt;
&lt;td&gt;0.060&lt;/td&gt;
&lt;td&gt;34.5&lt;/td&gt;
&lt;td&gt;Near-zero&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Q-KD-P&lt;/td&gt;
&lt;td&gt;3.68×&lt;/td&gt;
&lt;td&gt;0.080&lt;/td&gt;
&lt;td&gt;24.1&lt;/td&gt;
&lt;td&gt;Near-zero&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;
&lt;p&gt;The mechanism is specific and instructive — and worth pausing on because it explains an entire class of pipeline failures: NF4 quantization produces inference-only models incompatible with gradient-based training. Any sequence with Q before training steps is dead on arrival. The P-KD-Q sequence lets each step compound: pruning reduces the search space, distillation transfers knowledge to the pruned architecture, quantization reduces precision with minimal added loss.&lt;/p&gt;
&lt;p&gt;A practical note: quantization alone achieves 3.00× compression (5886→1959 MB). Adding pruning and distillation lifts the ratio only to 3.68×, at significant complexity cost. For many use cases, quantization alone is the right answer.&lt;/p&gt;
&lt;h2 id="4-where-compression-fails"&gt;4. Where compression fails&lt;/h2&gt;
&lt;h3 id="41-the-alignment-cliff-in-vlms"&gt;4.1 The alignment cliff in VLMs&lt;/h3&gt;
&lt;p&gt;VLM compression has a failure mode absent in text-only LLMs. At low compression ratios, structural pruning damages multimodal alignment (vision ↔ language) more than the language backbone; at high ratios, both degrade. This means for mild compression, fine-tuning only the multimodal projector is sufficient — you are repairing the alignment bridge, not the entire model.&lt;/p&gt;
&lt;p&gt;UKMP addresses this directly through modality-wise adaptive normalization and its weight recalling module&amp;rsquo;s progressive three-phase distillation. Text-only importance metrics (magnitude, gradient) cannot detect which parameters mediate the vision-language interface. The convergence assumption of Fisher information is also invalid for VLMs: frozen parameters retain first-order gradients, making second-order importance estimates actively misleading.&lt;/p&gt;
&lt;h3 id="42-extreme-sparsity-collapse"&gt;4.2 Extreme sparsity collapse&lt;/h3&gt;
&lt;p&gt;One-shot pruning methods degrade severely at 60-80% sparsity with semi-structured patterns. NIRVANA&amp;rsquo;s ablation shows magnitude-based scoring alone causes PPL ≈ 10⁵–10⁶ at 50% sparsity. Attention-only pruning causes catastrophic collapse; joint pruning of attention and MLP yields the smoothest degradation.&lt;/p&gt;
&lt;h3 id="43-early-quantization-destroys-trainability"&gt;4.3 Early quantization destroys trainability&lt;/h3&gt;
&lt;p&gt;Applying NF4 quantization before any other technique destroys trainability. Q-KD-P and Q-P-KD sequences achieve near-zero G-Eval scores (0.080, 0.060). The gradient-free nature of NF4-quantized models means they cannot participate in subsequent distillation or pruning recovery.&lt;/p&gt;
&lt;h3 id="44-layer-sensitivity-isnt-uniform"&gt;4.4 Layer sensitivity isn&amp;rsquo;t uniform&lt;/h3&gt;
&lt;p&gt;In partial 2:4 sparsification, later layers are more sensitive than earlier ones — skipping the last third of the model yields the best accuracy. For LVLMs, widthwise pruning of attention heads and MLP neurons outperforms wholesale layer removal. And within a single layer, individual output dimensions differ dramatically in sensitivity.&lt;/p&gt;
&lt;h2 id="5-how-hardware-changes-everything"&gt;5. How hardware changes everything&lt;/h2&gt;
&lt;h3 id="51-the-hardware-taxonomy-mismatch"&gt;5.1 The hardware-taxonomy mismatch&lt;/h3&gt;
&lt;p&gt;The compression technique that looks best on paper often delivers zero real-world speedup. Models optimized for GPU do not run fast on CPU and mobile, and vice versa.&lt;/p&gt;
&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;If your target is&amp;hellip;&lt;/th&gt;
&lt;th&gt;Prefer&amp;hellip;&lt;/th&gt;
&lt;th&gt;Avoid&amp;hellip;&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;Datacenter GPU (A100/H100)&lt;/td&gt;
&lt;td&gt;Semi-structured 2:4 + quantization&lt;/td&gt;
&lt;td&gt;Pure unstructured&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Edge/CPU/Mobile&lt;/td&gt;
&lt;td&gt;Structured pruning (widthwise)&lt;/td&gt;
&lt;td&gt;Any unstructured or semi-structured&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Long-context serving&lt;/td&gt;
&lt;td&gt;KV-cache quantization&lt;/td&gt;
&lt;td&gt;—&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Extreme compression (≤45%)&lt;/td&gt;
&lt;td&gt;Structured + distillation recovery&lt;/td&gt;
&lt;td&gt;One-shot pruning alone&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;
&lt;h3 id="52-memory-bandwidth-is-the-real-bottleneck"&gt;5.2 Memory bandwidth is the real bottleneck&lt;/h3&gt;
&lt;p&gt;During autoregressive decode, each token generation requires loading the entire model from memory — a classic memory-bandwidth-bound operation. This explains why quantization helps more than pruning for decode latency (smaller weights mean less data movement), why KV-cache quantization becomes critical at long contexts, and why joint algorithm-hardware optimization is the only path to order-of-magnitude gains.&lt;/p&gt;
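&lt;p&gt;A back-of-envelope version of that argument, with illustrative numbers rather than measurements (bandwidth and model sizes are round figures I picked for the example):&lt;/p&gt;
&lt;pre&gt;&lt;code class="language-python"&gt;# Decode is memory-bound: each generated token reloads (roughly) all weights,
# so an upper bound on tokens/sec is bandwidth divided by bytes moved per token.

def decode_tokens_per_sec(params_billion, bytes_per_param, bandwidth_gb_s):
    model_gb = params_billion * bytes_per_param
    return bandwidth_gb_s / model_gb

hbm = 3000  # GB/s, roughly H100-class HBM (illustrative)
print(decode_tokens_per_sec(7, 2.0, hbm))  # fp16 7B:  ~214 tok/s ceiling
print(decode_tokens_per_sec(7, 0.5, hbm))  # 4-bit 7B: ~857 tok/s ceiling
# Quantization raises the ceiling ~4x; pruning only helps if the removed
# weights actually stop being loaded (structured or 2:4 patterns).
&lt;/code&gt;&lt;/pre&gt;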
&lt;p&gt;The Titanus accelerator takes this to the extreme: chiplet-based digital computing-in-memory stores all static weights on-chip, eliminating repeated weight reloading during decode — a 39.4× reduction in off-chip memory access.&lt;/p&gt;
&lt;h3 id="53-the-edge-reality"&gt;5.3 The edge reality&lt;/h3&gt;
&lt;p&gt;CLIP-B/16, at 149.6M parameters, already overwhelms a Jetson Nano&amp;rsquo;s 4GB of RAM (shared between CPU and GPU, with no dedicated GPU memory) once activations and framework overhead are counted, causing frequent memory swaps that kill real-time performance. Edge deployment demands the full toolbox: pre-deployment compression, efficient fine-tuning, runtime optimization, and careful security/privacy handling.&lt;/p&gt;
&lt;h3 id="54-a-practical-decision-flow"&gt;5.4 A practical decision flow&lt;/h3&gt;
&lt;p&gt;For compressing an existing LVLM (summarized as a small helper function after the list):&lt;/p&gt;
&lt;ol&gt;
&lt;li&gt;&lt;strong&gt;Extremely low resources, no recovery training&lt;/strong&gt;: Widthwise pruning only. Accept accuracy loss.&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Moderate compression (≤30%)&lt;/strong&gt;: Layerwise pruning + multimodal projector fine-tuning (5% of original data suffices).&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;High compression (≤45%)&lt;/strong&gt;: Widthwise pruning + supervised fine-tuning + hidden-state distillation.&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;For any combination&lt;/strong&gt;: 4-bit quantization adds ~+0.1 PPL on top of sparsity.&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Always: P-KD-Q ordering.&lt;/strong&gt; Never quantize before training.&lt;/li&gt;
&lt;/ol&gt;
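&lt;p&gt;The same flow as a small helper function, purely as a summary of the list above (thresholds and wording are mine; I read the percentages as the fraction of parameters removed):&lt;/p&gt;
&lt;pre&gt;&lt;code class="language-python"&gt;def lvlm_compression_plan(target_removal, can_finetune):
    """Map a compression target to the recipe in the list above (rough)."""
    if not can_finetune:
        return ["widthwise structured pruning only", "accept the accuracy loss"]
    if target_removal &amp;lt;= 0.30:
        plan = ["layerwise pruning",
                "fine-tune only the multimodal projector (~5% of original data)"]
    elif target_removal &amp;lt;= 0.45:
        plan = ["widthwise pruning", "supervised fine-tuning",
                "hidden-state distillation"]
    else:
        plan = ["expect heavy degradation; revisit the target"]
    plan.append("4-bit quantization last (P-KD-Q ordering, ~+0.1 PPL)")
    return plan

# usage sketch:
# lvlm_compression_plan(0.30, can_finetune=True)
&lt;/code&gt;&lt;/pre&gt;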
&lt;h2 id="6-where-we-are-in-2025"&gt;6. Where we are in 2025&lt;/h2&gt;
&lt;p&gt;The compression field has matured from experimental techniques into an engineering discipline with clear tiers:&lt;/p&gt;
&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Tier&lt;/th&gt;
&lt;th&gt;Category&lt;/th&gt;
&lt;th&gt;Examples&lt;/th&gt;
&lt;th&gt;Status&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;Production&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;Foundational pruning&lt;/td&gt;
&lt;td&gt;SparseGPT, Wanda, LLM-Pruner&lt;/td&gt;
&lt;td&gt;Deployed&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;Production&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;4-bit quantization&lt;/td&gt;
&lt;td&gt;AWQ, GPTQ, NF4 (QLoRA)&lt;/td&gt;
&lt;td&gt;Deployed&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;Production&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;Inference engines&lt;/td&gt;
&lt;td&gt;vLLM (PagedAttention), TensorRT-LLM&lt;/td&gt;
&lt;td&gt;Deployed&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;Production&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;FP8 baseline&lt;/td&gt;
&lt;td&gt;H100/B200 hardware-native&lt;/td&gt;
&lt;td&gt;Deployed&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;Experimental&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;Extreme pruning&lt;/td&gt;
&lt;td&gt;TRIM, NIRVANA, FastForward&lt;/td&gt;
&lt;td&gt;Active research&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;Experimental&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;VLM-specific pruning&lt;/td&gt;
&lt;td&gt;UKMP (UKMI + weight recalling)&lt;/td&gt;
&lt;td&gt;Active research&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;Experimental&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;Ultra-low-bit quant&lt;/td&gt;
&lt;td&gt;LittleBit (0.1-bit), ICQuant, iFairy&lt;/td&gt;
&lt;td&gt;Active research&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;Experimental&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;Curriculum distillation&lt;/td&gt;
&lt;td&gt;SRD&lt;/td&gt;
&lt;td&gt;Active research&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;Experimental&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;VLM distillation&lt;/td&gt;
&lt;td&gt;Switch-KD, Align-KD&lt;/td&gt;
&lt;td&gt;Active research&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;Experimental&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;Black-box KD&lt;/td&gt;
&lt;td&gt;GrayKD (610M params, no proxy)&lt;/td&gt;
&lt;td&gt;Active research&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;Experimental&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;NAS for compression&lt;/td&gt;
&lt;td&gt;CompressNAS, LLM-NAS, HAT&lt;/td&gt;
&lt;td&gt;Active research&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;
&lt;p&gt;I expect several of the experimental rows to move to production within 12-18 months. TRIM-style per-dimension pruning and ICQuant-style index coding are both conceptually simple enough to integrate into existing pipelines. UKMP&amp;rsquo;s modality-aware pruning is clearly the right approach for VLMs — the question is whether it generalizes beyond BLIP-2 to LLaVA-style architectures.&lt;/p&gt;
&lt;h2 id="7-a-unifying-mental-model"&gt;7. A unifying mental model&lt;/h2&gt;
&lt;p&gt;Compression works because networks store far fewer bits of information than their parameter budget could hold. The art is knowing which parameters carry that information. The answer depends on four things:&lt;/p&gt;
&lt;ol&gt;
&lt;li&gt;
&lt;p&gt;&lt;strong&gt;What redundancy exists&lt;/strong&gt;: token-level, weight-level, layer-level — and modality-dependent. Video has spatiotemporal redundancy. MLLM sequences are &amp;gt;80% multimodal tokens that receive minimal attention. Within a single weight matrix, individual rows differ in sensitivity by orders of magnitude.&lt;/p&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;p&gt;&lt;strong&gt;How you remove it&lt;/strong&gt;: quantize, prune, or distill. Quantization targets precision. Pruning targets structure. Distillation targets knowledge transfer. Token compression targets the input directly.&lt;/p&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;p&gt;&lt;strong&gt;What order you apply techniques&lt;/strong&gt;: P-KD-Q is empirically optimal. Any sequence with Q before training steps fails catastrophically. This is not a heuristic — it follows from NF4&amp;rsquo;s gradient-free nature.&lt;/p&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;p&gt;&lt;strong&gt;What hardware runs it&lt;/strong&gt;: This is what determines whether a 50% parameter reduction translates to a 50% latency reduction or no reduction at all. Unstructured sparsity wins on accuracy but loses on every hardware metric. Structured pruning is the opposite. Semi-structured 2:4 is NVIDIA&amp;rsquo;s compromise.&lt;/p&gt;
&lt;/li&gt;
&lt;/ol&gt;
&lt;p&gt;What I find striking about the failure modes is how cleanly they carve the parameter space. In VLMs, the vision-language alignment is more fragile than the language backbone. In deep transformers, later layers carry disproportionate importance, and individual output dimensions within the same layer differ dramatically. Hardware is not an implementation detail — it defines which removal patterns become faster.&lt;/p&gt;
&lt;p&gt;The frontier is hybrid, sequential, and precision-extreme: combining pruning, distillation, and quantization in the right order, with per-dimension granularity, pushing quantization to fractions of a bit — while ensuring the pipeline remains trainable throughout.&lt;/p&gt;
&lt;h2 id="boundary-conditions"&gt;Boundary conditions&lt;/h2&gt;
&lt;ul&gt;
&lt;li&gt;This model assumes the pretrained model is available. If you are training from scratch with a compression target, the entire framework shifts — you would design the architecture sparse from the start rather than compressing post-hoc.&lt;/li&gt;
&lt;li&gt;The P-KD-Q ordering evidence comes from a single systematic study on Qwen2.5-3B. I have not seen replication on larger models or different architectures. The mechanism (NF4 gradient-free) is general, but the magnitude of the ordering effect at other scales is unknown.&lt;/li&gt;
&lt;li&gt;UKMP has been validated primarily on BLIP-2. Its generalization to LLaVA, InternVL, or other VLM architectures is an open question.&lt;/li&gt;
&lt;li&gt;Token compression is effective when tokens dominate compute (long video, long context). For single-image QA, the gains are modest.&lt;/li&gt;
&lt;li&gt;The practical decision flow (Section 5.4) assumes access to fine-tuning resources. For truly zero-shot deployment, only quantization and token compression apply.&lt;/li&gt;
&lt;li&gt;I have not covered training-time efficiency (mixed precision, gradient accumulation, ZeRO, FSDP), which interacts with compression in deployment pipelines but is a separate topic.&lt;/li&gt;
&lt;/ul&gt;
&lt;h2 id="open-questions"&gt;Open questions&lt;/h2&gt;
&lt;ul&gt;
&lt;li&gt;Does the P-KD-Q ordering effect replicate on models above 70B parameters? The mechanism is architecture-agnostic, but the magnitude could scale differently.&lt;/li&gt;
&lt;li&gt;Why do some layers tolerate 4-bit quantization while structurally similar layers fall apart at 6-bit? I suspect effective rank or singular value distribution, but I have not tested this (a sketch of that diagnostic follows this list).&lt;/li&gt;
&lt;li&gt;Can UKMP&amp;rsquo;s modality-aware importance metric be adapted to video-language models, where the redundancy patterns are spatiotemporally structured rather than spatially structured?&lt;/li&gt;
&lt;li&gt;What is the minimum viable recovery data for structured pruning of LVLMs at 50%+ sparsity? The current evidence says 5% of original data for moderate compression and full SFT for high compression, but the boundary between these regimes is fuzzy.&lt;/li&gt;
&lt;li&gt;At what point does token compression become more practical than model compression for video understanding? The 54M-token number for 90-minute video suggests the crossover exists but I do not know where.&lt;/li&gt;
&lt;/ul&gt;
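&lt;p&gt;On the 4-bit vs 6-bit question above: effective rank is cheap to compute, so the hypothesis is at least easy to test. A sketch of the standard definition (the exponential of the entropy of the normalized singular values):&lt;/p&gt;
&lt;pre&gt;&lt;code class="language-python"&gt;import torch

def effective_rank(weight):
    # Roy-Vetterli effective rank: exp of the Shannon entropy of the
    # singular value distribution, normalized to sum to 1
    s = torch.linalg.svdvals(weight.float())
    p = s / s.sum()
    entropy = -(p * torch.log(p + 1e-12)).sum()
    return torch.exp(entropy).item()

# usage sketch: compare effective_rank(layer.weight) for layers that survive
# 4-bit quantization against structurally similar layers that fail at 6-bit
&lt;/code&gt;&lt;/pre&gt;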
&lt;p&gt;A question to revisit in six months: has UKMP been extended to LLaVA-style architectures, and does the modality-aware importance metric generalize beyond BLIP-2?&lt;/p&gt;
&lt;h2 id="references"&gt;References&lt;/h2&gt;
&lt;ol&gt;
&lt;li&gt;Frankle &amp;amp; Carbin, &amp;ldquo;The Lottery Ticket Hypothesis: Finding Sparse, Trainable Neural Networks&amp;rdquo;, ICLR 2019.&lt;/li&gt;
&lt;li&gt;Xiao et al., &amp;ldquo;SmoothQuant: Accurate and Efficient Post-Training Quantization for Large Language Models&amp;rdquo;, ICML 2023.&lt;/li&gt;
&lt;li&gt;Frantar &amp;amp; Alistarh, &amp;ldquo;SparseGPT: Massive Language Models Can Be Accurately Pruned in One-Shot&amp;rdquo;, ICML 2023.&lt;/li&gt;
&lt;li&gt;Sun et al., &amp;ldquo;Wanda: A Simple and Effective Pruning Approach for Large Language Models&amp;rdquo;, ICLR 2024.&lt;/li&gt;
&lt;li&gt;Ashkboos et al., &amp;ldquo;SliceGPT: Compress Large Language Models by Deleting Rows and Columns&amp;rdquo;, ICLR 2024.&lt;/li&gt;
&lt;li&gt;Ma et al., &amp;ldquo;LLM-Pruner: On the Structural Pruning of Large Language Models&amp;rdquo;, NeurIPS 2023.&lt;/li&gt;
&lt;li&gt;NIRVANA: &amp;ldquo;NIRVANA: Neural Implicit Removal via Verifiable Adam-based NTK Alignment for Structured Pruning&amp;rdquo;, 2025.&lt;/li&gt;
&lt;li&gt;TRIM: &amp;ldquo;TRIM: Per-Dimension Structured Pruning for Large Language Models&amp;rdquo;, 2025.&lt;/li&gt;
&lt;li&gt;FastForward: &amp;ldquo;FastForward Pruning: Efficient LLM Pruning via Single-Step Reinforcement Learning&amp;rdquo;, 2025.&lt;/li&gt;
&lt;li&gt;UKMP: &amp;ldquo;UKMP: Unified Knowledge Maintenance Pruning for Vision-Language Models&amp;rdquo;, AAAI 2025.&lt;/li&gt;
&lt;li&gt;ICQuant: &amp;ldquo;ICQuant: Index Coding Quantization for Large Language Models&amp;rdquo;, 2025.&lt;/li&gt;
&lt;li&gt;Switch-KD: &amp;ldquo;Switch-KD: Knowledge Distillation with Visual Switch for Efficient Vision-Language Models&amp;rdquo;, CVPR 2026.&lt;/li&gt;
&lt;li&gt;Align-KD: &amp;ldquo;Align-KD: Shallow-Layer Attention Alignment for Mobile Vision-Language Models&amp;rdquo;, 2025.&lt;/li&gt;
&lt;li&gt;GrayKD: &amp;ldquo;GrayKD: Gray-Box Knowledge Distillation for Large Language Models&amp;rdquo;, AAAI 2026.&lt;/li&gt;
&lt;li&gt;SRD: &amp;ldquo;Selective Reflection Distillation: Curriculum Knowledge Distillation for LLMs&amp;rdquo;, 2025.&lt;/li&gt;
&lt;li&gt;CompressNAS: &amp;ldquo;CompressNAS: Neural Architecture Search for Model Compression&amp;rdquo;, 2025.&lt;/li&gt;
&lt;li&gt;LLM-NAS: &amp;ldquo;LLM-NAS: Large Language Models for Neural Architecture Search&amp;rdquo;, 2025.&lt;/li&gt;
&lt;li&gt;Compression Ordering: &amp;ldquo;A Systematic Study of Compression Ordering for Large Language Models&amp;rdquo;, 2025.&lt;/li&gt;
&lt;li&gt;Multimodal Token Compression Survey, arXiv 2507.20198, 2025.&lt;/li&gt;
&lt;li&gt;Efficient VLM Survey: &amp;ldquo;Efficient Vision-Language Models: A Survey&amp;rdquo;, 2025.&lt;/li&gt;
&lt;/ol&gt;</description></item><item><title>About</title><link>http://fengwang.github.io/about/</link><pubDate>Mon, 01 Jan 0001 00:00:00 +0000</pubDate><guid>http://fengwang.github.io/about/</guid><description>&lt;p&gt;A former physicist turned machine learning engineer, I have a passion for learning and sharing knowledge. With a background in physics, I bring a unique perspective to software development, combining analytical thinking with creativity. I enjoy exploring new technologies and applying them to solve real-world problems. In my free time, I like to read, travel, and experiment with new frontiers in deep learning and artificial intelligence. This blog is a platform for me to share my insights, experiences, and projects in the world of software engineering and beyond. I hope to inspire others to pursue their passions and contribute to the ever-evolving field of technology.&lt;/p&gt;</description></item></channel></rss>