Efficient-Inference

Two ways to diffuse text: DFlash block diffusion vs DiffusionGemma
2026-06-10
A side-by-side comparison of two recent approaches that apply diffusion to text generation: DFlash, which uses a tiny block diffusion model as a speculative drafter, and DiffusionGemma, which builds a standalone text diffusion model on the Gemma 4 backbone.
diffusion-language-models speculative-decoding efficient-inference text-generation parallel-decoding
MoE Expert Pruning: What Works, What Doesn't, and What We Still Don't Know
2026-05-11
A survey of expert pruning techniques for sparse Mixture-of-Experts language models, covering why pruning works, the pruning-vs-merging debate, how to score expert importance, the strategies that matter, and where the standard story breaks.
moe expert-pruning sparsification efficient-inference model-deployment mixtral nllb