4-Bit-Unbatching

Pruning Qwen3.6-35B-A3B for RTX 5090: what I learned pushing MoE compression to its limit on a single GPU
2026-05-18
Six days of REAP-pruning a 256-expert MoE model to fit in 32 GiB of VRAM taught me that calibration data composition, not the pruning algorithm, is the primary quality lever, and that SFT with 4-bit frozen experts actively degrades generalization.
expert-pruning REAP RTX5090 Qwen calibration-data 4-bit-unbatching post-SFT-degradation