Feng's Blog
  • MoE Expert Pruning: What Works, What Doesn't, and What We Still Don't Know

    2026-05-11

    A survey of expert pruning techniques for sparse Mixture-of-Experts language models, covering why pruning works, the pruning-vs-merging debate, how to score expert importance, the strategies that matter, and where the standard story breaks.

    moe · expert-pruning · sparsification · efficient-inference · model-deployment · mixtral · nllb

  • LLM/VLM Compression Foundations

    2026-05-10

    A working notebook on compressing LLMs and VLMs — why overparameterization is the precondition, how pruning, quantization, and distillation interact, why P-KD-Q ordering dominates, and where compression breaks.

    pruning · quantization · knowledge-distillation · token-compression · vision-language-models · neural-architecture-search · hardware-aware-ml
