Pruning Qwen3.6-35B-A3B for RTX 5090: what I learned pushing MoE compression to its limit on a single GPU
2026-05-18
Six days of REAP-pruning a 256-expert MoE model to fit 32 GiB taught me that calibration data composition — not the pruning algorithm — is the primary quality lever, and that SFT with 4-bit frozen experts actively degrades generalization.
expert-pruning, REAP, RTX5090, Qwen, calibration-data, 4-bit-unbatching, post-SFT-degradation
MoE Expert Pruning: What Works, What Doesn't, and What We Still Don't Know
2026-05-11
A survey of expert pruning techniques for sparse Mixture-of-Experts language models, covering why pruning works, the pruning-vs-merging debate, how to score expert importance, the strategies that matter, and where the standard story breaks.
moe, expert-pruning, sparsification, efficient-inference, model-deployment, mixtral, nllb
LLM/VLM Compression Foundations
2026-05-10
A working notebook on compressing LLMs and VLMs — why overparameterization is the precondition, how pruning, quantization, and distillation interact, why P-KD-Q ordering dominates, and where compression breaks.
pruning, quantization, knowledge-distillation, token-compression, vision-language-models, neural-architecture-search, hardware-aware-ml