Pruning Qwen3.6-35B-A3B for RTX 5090: what I learned pushing MoE compression to its limit on a single GPU
2026-05-18
Seven sessions of REAP-pruning a 256-expert MoE model on a single GPU taught me three things: calibration data composition matters more than the pruning algorithm; agentic calibration cannot lift agentic benchmarks that are capacity-bound; and the right recipe at the right compression depth can add 17 BugFind points with no change in parameter count.
expert-pruning REAP RTX5090 Qwen calibration-data 4-bit-unbatching agentic-benchmarks benchlocal