Feng's Blog

Pruning Qwen3.6-35B-A3B for RTX 5090: what I learned pushing MoE compression to its limit on a single GPU
2026-05-18
Seven sessions of REAP-pruning a 256-expert MoE model on a single GPU taught me that calibration data composition matters more than the pruning algorithm, that agentic calibration cannot lift capacity-bound agentic benchmarks, and that the right recipe at the right compression depth can gain 17 BugFind points with no change in parameter count.
expert-pruning REAP RTX5090 Qwen calibration-data 4-bit-unbatching agentic-benchmarks benchlocal
Pruning Qwen3.6-35B-A3B in Practice: Seven Rounds of Experiments Pushing MoE Compression to Its Limit on a Single RTX 5090
2026-05-18
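A record of seven rounds of REAP pruning experiments compressing a 256-expert MoE model on a single consumer GPU. Key findings: calibration data composition matters far more than the pruning algorithm itself; agentic-oriented calibration cannot lift capacity-bound agentic benchmarks; and the right recipe at the right compression depth can raise BugFind by 17 points without changing the parameter count.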
expert-pruning REAP RTX5090 Qwen calibration-data 4-bit-unbatching agentic-benchmarks benchlocal chinese
MoE Expert Pruning: What Works, What Doesn't, and What We Still Don't Know
2026-05-11
A survey of expert pruning techniques for sparse Mixture-of-Experts language models, covering why pruning works, the pruning-vs-merging debate, how to score expert importance, the strategies that matter, and where the standard story breaks.
moe expert-pruning sparsification efficient-inference model-deployment mixtral nllb
LLM/VLM Compression Foundations
2026-05-10
A working notebook on compressing LLMs and VLMs — why overparameterization is the precondition, how pruning, quantization, and distillation interact, why P-KD-Q ordering dominates, and where compression breaks.
pruning quantization knowledge-distillation token-compression vision-language-models neural-architecture-search hardware-aware-ml