<rss version="2.0" xmlns:atom="http://www.w3.org/2005/Atom"><channel><title>Pruning on Feng's Blog</title><link>http://fengwang.github.io/tags/pruning/</link><description>Recent content in Pruning on Feng's Blog</description><generator>Hugo -- gohugo.io</generator><language>en-us</language><lastBuildDate>Sun, 10 May 2026 00:00:00 +0000</lastBuildDate><atom:link href="http://fengwang.github.io/tags/pruning/index.xml" rel="self" type="application/rss+xml"/><item><title>LLM/VLM Compression Foundations</title><link>http://fengwang.github.io/posts/llm-vlm-compression-foundations-clean/</link><pubDate>Sun, 10 May 2026 00:00:00 +0000</pubDate><guid>http://fengwang.github.io/posts/llm-vlm-compression-foundations-clean/</guid><description>&lt;p&gt;I started looking at model compression because the numbers didn&amp;rsquo;t add up. My GPU has 24GB of VRAM and the models I want to run need 40GB. The gap is a factor of two, which quantization claims to solve. But then I found papers about pruning, and distillation, and token compression, and hardware-aware NAS, and suddenly the question wasn&amp;rsquo;t &amp;ldquo;which technique&amp;rdquo; but &amp;ldquo;which combination, in what order, for which hardware.&amp;rdquo;&lt;/p&gt;
&lt;p&gt;This article is my attempt to organize what I&amp;rsquo;ve learned into a coherent map. It is not a survey — there are good surveys for that. It is a working notebook: what I understand, what surprised me, and what I still can&amp;rsquo;t explain.&lt;/p&gt;
&lt;p&gt;&lt;strong&gt;Thesis:&lt;/strong&gt; Compression works because neural networks are overparameterized for the expressivity they actually use. The hard part is knowing which bits are the ones that don&amp;rsquo;t matter — and that answer depends on what you&amp;rsquo;re compressing (text vs. vision-language), how you remove it (prune, quantize, or distill), what order you apply the steps (P-KD-Q), and what hardware runs the result.&lt;/p&gt;
&lt;p&gt;&lt;strong&gt;Scope:&lt;/strong&gt; This covers the foundations of LLM and VLM compression — the three pillars (pruning, quantization, distillation), token compression, NAS for compression, the empirical ordering evidence, failure modes, and hardware decision rules. It does not cover training-from-scratch efficiency, inference serving systems (vLLM, TensorRT-LLM) beyond their connection to compression, or retrieval-augmented generation.&lt;/p&gt;
&lt;p&gt;&lt;strong&gt;Prerequisites:&lt;/strong&gt; This assumes familiarity with transformer architectures, basic neural network training (backpropagation, gradient descent), floating-point representation, and cross-entropy loss.&lt;/p&gt;
&lt;h2 id="1-overparameterization-is-the-precondition"&gt;1. Overparameterization is the precondition&lt;/h2&gt;
&lt;p&gt;If models weren&amp;rsquo;t overparameterized, compression wouldn&amp;rsquo;t work. The Lottery Ticket Hypothesis established this formally in 2018: dense, randomly-initialized networks contain subnetworks that, trained in isolation, match the full network&amp;rsquo;s accuracy. For modern LLMs, the numbers are concrete — up to 30% of parameters can be pruned with negligible loss, and models retain 98-99% of their original capability when roughly 15% of parameters are pruned.&lt;/p&gt;
&lt;p&gt;This overparameterization isn&amp;rsquo;t a mistake. Sparse architectures are hard to train from scratch. We train dense and then compress because that&amp;rsquo;s what the optimization surface allows.&lt;/p&gt;
&lt;p&gt;The shape of the redundancy matters, and it differs by modality. This is something I initially underestimated — I thought all redundancy was weight-level, but the token-level and modality-dependent patterns are just as important for practical compression decisions.&lt;/p&gt;
&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Modality&lt;/th&gt;
&lt;th&gt;Redundancy pattern&lt;/th&gt;
&lt;th&gt;Scale&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;Images&lt;/td&gt;
&lt;td&gt;Spatial — neighboring patches share textures/colors&lt;/td&gt;
&lt;td&gt;—&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Video&lt;/td&gt;
&lt;td&gt;Spatiotemporal — consecutive frames share backgrounds; at 10fps, 1000 tokens/frame, a 90-min video yields ~54M tokens&lt;/td&gt;
&lt;td&gt;54M tokens/video&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Audio&lt;/td&gt;
&lt;td&gt;Salient info concentrates in sparse, brief segments and specific frequency bands&lt;/td&gt;
&lt;td&gt;—&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;MLLM sequences&lt;/td&gt;
&lt;td&gt;&amp;gt;50% of tokens get minimal attention; multimodal tokens are &amp;gt;80% of sequences in reasoning tasks&lt;/td&gt;
&lt;td&gt;&amp;gt;80% of sequence&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;
&lt;p&gt;All compression exploits some version of this: there are bits you can throw away because they don&amp;rsquo;t change the output. The question is which bits.&lt;/p&gt;
&lt;h2 id="2-the-three-pillars-and-two-newer-additions"&gt;2. The three pillars, and two newer additions&lt;/h2&gt;
&lt;p&gt;The literature converges on five categories. Three dominate practice:&lt;/p&gt;
&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Method&lt;/th&gt;
&lt;th&gt;Mechanism&lt;/th&gt;
&lt;th&gt;What it reduces&lt;/th&gt;
&lt;th&gt;Tuning required&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;Quantization&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;Lower-bit weight/activation representation&lt;/td&gt;
&lt;td&gt;Memory, potentially speed&lt;/td&gt;
&lt;td&gt;Often tuning-free for LLMs&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;Pruning&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;Remove unimportant weights or structures&lt;/td&gt;
&lt;td&gt;Parameters, compute&lt;/td&gt;
&lt;td&gt;Recovery training at high ratios&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;Distillation&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;Transfer knowledge from large → small model&lt;/td&gt;
&lt;td&gt;Parameters, compute&lt;/td&gt;
&lt;td&gt;Training a student&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;
&lt;p&gt;Token compression and Neural Architecture Search sit alongside these — newer, less universal, but important for specific scenarios.&lt;/p&gt;
&lt;h3 id="21-quantization-the-hardware-sensitive-frontier"&gt;2.1 Quantization: the hardware-sensitive frontier&lt;/h3&gt;
&lt;p&gt;Quantization converts float32/float16 weights to fewer bits. The fundamental tension: non-uniform quantization achieves higher accuracy because weights aren&amp;rsquo;t uniformly distributed, but uniform quantization gets hardware support. You cannot have both accuracy and hardware efficiency simultaneously with existing methods.&lt;/p&gt;
&lt;p&gt;A critical asymmetry drives the research: &lt;strong&gt;weights are easy to quantize, activations are hard&lt;/strong&gt; because of outlier distributions. SmoothQuant addresses this by migrating quantization difficulty from activations to weights via per-channel scaling:&lt;/p&gt;
&lt;p&gt;$$Y = \left(X \cdot \text{diag}(s)^{-1}\right)\left(\text{diag}(s) \cdot W\right)$$&lt;/p&gt;
&lt;p&gt;This enables W8A8 quantization with minimal accuracy loss and a 2× throughput gain. The idea is simple — smooth the activation outliers into the weights where they do less damage — but the execution requires careful per-channel scaling factors.&lt;/p&gt;
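&lt;p&gt;To make the migration concrete, here is a minimal sketch of the smoothing step in plain PyTorch. This is my own toy version, not the official SmoothQuant code; the migration strength &lt;code&gt;alpha&lt;/code&gt; and the tensor shapes are illustrative assumptions.&lt;/p&gt;
&lt;div class="highlight"&gt;&lt;pre tabindex="0"&gt;&lt;code class="language-python" data-lang="python"&gt;import torch

def smooth_pair(x, w, alpha=0.5):
    # x: (tokens, in_features) calibration activations
    # w: (in_features, out_features) weight of the following linear layer
    # Per-input-channel scale: s_j = max|X_j|^alpha / max|W_j|^(1 - alpha)
    act_max = x.abs().amax(dim=0).clamp(min=1e-5)
    w_max = w.abs().amax(dim=1).clamp(min=1e-5)
    s = act_max.pow(alpha) / w_max.pow(1.0 - alpha)
    # Y = (X diag(s)^-1)(diag(s) W): mathematically identical, but the scaled
    # activations have fewer outliers and quantize far more easily.
    return x / s, w * s.unsqueeze(1)

x = torch.randn(128, 4096) * (torch.rand(4096) * 10)   # synthetic outlier-heavy activations
w = torch.randn(4096, 4096) * 0.02
x_s, w_s = smooth_pair(x, w)
assert torch.allclose(x @ w, x_s @ w_s, atol=1e-2)     # output unchanged before quantization
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;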
&lt;h4 id="the-outlier-problem-quantified"&gt;The outlier problem, quantified&lt;/h4&gt;
&lt;p&gt;ICQuant reveals the structure of the problem in a way I find unusually clean: &lt;strong&gt;the top 5% of weight outliers consume about 50% of the total value range&lt;/strong&gt; — meaning one full quantization bit gets wasted on just 5% of the weights. About 97% of weight channels have uniformly-distributed outlier positions (verified across Llama2/3/4 and Qwen2.5 families), which enables a per-channel partitioning strategy: separate codebooks for outliers and inliers, combined with index coding that costs ≈0.3 bits/weight vs. ≈1 bit for prior approaches.&lt;/p&gt;
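&lt;p&gt;A quick way to sanity-check the range-wasting effect on any weight matrix is sketched below. It is only an illustration: the Gaussian stand-in is my assumption, and real LLM weight channels have heavier tails, so the printed share will differ from the paper&amp;rsquo;s roughly 50% figure.&lt;/p&gt;
&lt;div class="highlight"&gt;&lt;pre tabindex="0"&gt;&lt;code class="language-python" data-lang="python"&gt;import torch

w = torch.randn(4096 * 4096)                     # stand-in for one flattened weight matrix
k = int(0.05 * w.numel())
outlier_cut = w.abs().topk(k).values.min()       # threshold separating the top-5% outliers
full_range = w.abs().max()

share = (1 - outlier_cut / full_range).item()
print(f'top 5% of weights occupy {share:.0%} of the value range')
# A uniform grid over the full range therefore spends that share of its levels
# on 5% of the weights, which is roughly one quantization bit wasted.
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;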
&lt;h4 id="the-production-baseline-and-the-frontier"&gt;The production baseline and the frontier&lt;/h4&gt;
&lt;p&gt;&lt;strong&gt;FP8 (E4M3) on NVIDIA H100/B200&lt;/strong&gt; is the modern production baseline — essentially lossless 50% memory reduction from FP16. &lt;strong&gt;4-bit PTQ (AWQ, GPTQ)&lt;/strong&gt; achieves virtually lossless quantization for models above 70B parameters. QuIP/QuIP# pushes to 2 bits by multiplying weight and Hessian matrices with randomized Hadamard transforms to make entries approximately i.i.d. Gaussian, enabling E8 lattice codebook quantization.&lt;/p&gt;
&lt;p&gt;At the extreme frontier: LittleBit reaches 0.1 bits/weight through latent factorization; iFairy uses complex numbers {±1, ±i} for 2-bit &amp;ldquo;multiplication-free&amp;rdquo; inference via sign flips.&lt;/p&gt;
&lt;h4 id="edge-and-vlm-specific-quantization"&gt;Edge and VLM-specific quantization&lt;/h4&gt;
&lt;p&gt;Edge deployment demands specialized methods. Q-VLM minimizes cross-layer dependency errors in LVLMs using activation entropy as a proxy. MBQ accounts for differential sensitivity between vision and language tokens, achieving up to 1.4× decoding speedup with a custom W3 kernel. P4Q introduces learnable prompts and a lightweight low-bit adapter to realign post-quantization feature distributions.&lt;/p&gt;
&lt;p&gt;KV-cache quantization deserves separate mention. In PaLM-540B with batch size 512 and context length 2048, the KV cache alone needs 3TB — three times the model parameters. KIVI-style KV-cache quantization is now table-stakes for long-context serving.&lt;/p&gt;
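&lt;p&gt;For intuition, here is a minimal per-channel KV quantization sketch in the spirit of KIVI. It is not the actual KIVI implementation; the asymmetric 4-bit scheme and sharing statistics along the sequence dimension are my illustrative assumptions, and packing two 4-bit values per byte is omitted.&lt;/p&gt;
&lt;div class="highlight"&gt;&lt;pre tabindex="0"&gt;&lt;code class="language-python" data-lang="python"&gt;import torch

def quantize_kv(kv, n_bits=4):
    # kv: (batch, heads, seq_len, head_dim). Share min/scale along the sequence
    # dimension, i.e. per-channel statistics, since key outliers persist across positions.
    qmax = 2 ** n_bits - 1
    lo = kv.amin(dim=2, keepdim=True)
    scale = (kv.amax(dim=2, keepdim=True) - lo).clamp(min=1e-6) / qmax
    q = ((kv - lo) / scale).round().clamp(0, qmax).to(torch.uint8)
    return q, scale, lo                # store these three tensors instead of fp16 kv

def dequantize_kv(q, scale, lo):
    return q.float() * scale + lo

k = torch.randn(1, 32, 2048, 128)      # one layer of keys for a 2k-token context
q, s, z = quantize_kv(k)
err = (dequantize_kv(q, s, z) - k).abs().mean()
print(f'mean abs error: {err.item():.4f}; {k.numel() * 2} B fp16 vs ~{k.numel() // 2} B packed int4')
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;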
&lt;h3 id="22-pruning-three-strategies-three-hardware-outcomes"&gt;2.2 Pruning: three strategies, three hardware outcomes&lt;/h3&gt;
&lt;p&gt;Pruning&amp;rsquo;s real-world impact depends entirely on the pattern, because hardware can only exploit certain sparsity structures:&lt;/p&gt;
&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Pruning type&lt;/th&gt;
&lt;th&gt;What&amp;rsquo;s removed&lt;/th&gt;
&lt;th&gt;Hardware speedup&lt;/th&gt;
&lt;th&gt;Examples&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;Unstructured&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;Individual weights&lt;/td&gt;
&lt;td&gt;None without sparse kernels&lt;/td&gt;
&lt;td&gt;SparseGPT, Wanda&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;Semi-structured&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;Fixed patterns (2:4, 4:8)&lt;/td&gt;
&lt;td&gt;Yes on NVIDIA Ampere+&lt;/td&gt;
&lt;td&gt;SparseGPT 2:4, Wanda N:M&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;Structured&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;Whole layers/heads/channels&lt;/td&gt;
&lt;td&gt;Yes on commodity hardware&lt;/td&gt;
&lt;td&gt;LLM-Pruner, NIRVANA, UKMP&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;
&lt;p&gt;The key insight — and the one I keep coming back to — is that unstructured sparsity achieves the best accuracy but delivers zero speedup without special hardware. Structured pruning physically reduces matrix dimensions — immediate gains on any hardware, at higher accuracy cost. Semi-structured 2:4 sparsity is NVIDIA&amp;rsquo;s compromise: hardware-supported on Ampere GPUs, but one-shot methods like SparseGPT and Wanda still suffer at 60-80% sparsity or with tight 2:4 constraints.&lt;/p&gt;
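&lt;p&gt;To make the 2:4 constraint concrete, here is a toy magnitude-based mask. This is only a sketch, not SparseGPT or Wanda; both use better importance scores than raw magnitude, but the structural constraint they must satisfy is exactly this one.&lt;/p&gt;
&lt;div class="highlight"&gt;&lt;pre tabindex="0"&gt;&lt;code class="language-python" data-lang="python"&gt;import torch

def mask_2_4(w):
    # Keep the two largest-magnitude weights in every contiguous group of four
    # along the input dimension; Ampere+ sparse tensor cores can skip the zeros.
    out_f, in_f = w.shape
    groups = w.abs().reshape(out_f, in_f // 4, 4)
    idx = groups.topk(2, dim=-1).indices
    mask = torch.zeros_like(groups, dtype=torch.bool)
    mask.scatter_(-1, idx, True)
    return mask.reshape(out_f, in_f)

w = torch.randn(8, 16)
m = mask_2_4(w)
w_sparse = w * m                        # exactly 50% of the weights survive in every row
print(m.int())
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;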
&lt;h4 id="beyond-uniform-sparsity-per-dimension-pruning"&gt;Beyond uniform sparsity: per-dimension pruning&lt;/h4&gt;
&lt;p&gt;A critical limitation of prior methods is uniform sparsity within layers — all output dimensions of a weight matrix get the same pruning ratio. TRIM demonstrates this is deeply suboptimal: individual output dimensions differ significantly in sensitivity. By assigning unique per-row sparsity ratios via iterative metric-driven adjustment, TRIM reduces OPT-13B perplexity at 80% sparsity from 6461 (Wanda-based OWL) to 324 — &lt;strong&gt;over 95% reduction in perplexity&lt;/strong&gt; at the same sparsity level.&lt;/p&gt;
&lt;p&gt;&lt;strong&gt;NIRVANA&lt;/strong&gt; redesigns structured pruning by combining magnitude-scaled gradient saliency ($|\partial f / \partial W \cdot W|$) with Adam-based NTK stability guarantees. The dual criterion balances output preservation with training stability — Proposition 4.1 proves $|\hat{\Theta} - \Theta| \leq O(\varepsilon)$ under the SignGD kernel.&lt;/p&gt;
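&lt;p&gt;The magnitude-scaled gradient term is just first-order Taylor importance, which is small enough to sketch. The snippet below is my own simplification for a single projection matrix; it does not reproduce the NTK-stability side of NIRVANA, and the toy loss and head count are assumptions.&lt;/p&gt;
&lt;div class="highlight"&gt;&lt;pre tabindex="0"&gt;&lt;code class="language-python" data-lang="python"&gt;import torch
import torch.nn as nn

def head_saliency(layer, x, loss_fn, num_heads):
    # First-order Taylor importance |dL/dW * W|, summed over the slice of the
    # projection that belongs to each attention head: an estimate of how much
    # the loss would move if that head were removed.
    loss = loss_fn(layer(x))
    grad = torch.autograd.grad(loss, layer.weight)[0]
    taylor = (grad * layer.weight).abs()
    return taylor.reshape(num_heads, -1).sum(dim=1)

proj = nn.Linear(512, 512, bias=False)
x = torch.randn(4, 512)
scores = head_saliency(proj, x, lambda y: y.pow(2).mean(), num_heads=8)
print(scores.argsort()[:2])             # the two least important heads under this toy loss
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;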
&lt;p&gt;Key design choices:&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;&lt;strong&gt;Adaptive sparsity allocation&lt;/strong&gt;: parameter $\gamma$ controls MLP vs. attention pruning rates ($v_{\text{MLP}} = \gamma \cdot v_{\text{Attn}}$)&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Hardware-aware dimension alignment&lt;/strong&gt;: all hidden dimensions forced to multiples of 8 for Tensor Core compatibility&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Global joint ranking&lt;/strong&gt; across all layers/modules with a safeguard retaining ≥1 unit per layer to prevent collapse&lt;/li&gt;
&lt;/ul&gt;
&lt;p&gt;At 50% sparsity, NIRVANA achieves WikiText2 perplexity (PPL) of 48.94 vs. 215.94 for LLM-Pruner on Llama3.1-8B. Ablation reveals that magnitude-based scoring alone causes extreme collapse (PPL ≈ 10⁵–10⁶), and removing adaptive allocation $\gamma$ raises PPL from 48.94 to 102.00.&lt;/p&gt;
&lt;p&gt;&lt;strong&gt;FastForward Pruning&lt;/strong&gt; reformulates sparsity allocation as a single-step RL problem. The RL state is defined solely by the global target sparsity $\sigma_t$ (enabling transfer learning), with a ratio-based reward function ( $PPL_{\text{dense}} / PPL_{\text{pruned}}$ ) that is scale-invariant for portability across model sizes. Results: 3.4× faster than EAS (6.13 vs. 23.6 GPU-hr) on LLaMA-V1 7B at 20% sparsity, with better PPL (6.64 vs. 6.89).&lt;/p&gt;
&lt;h4 id="vlm-specific-structured-pruning-ukmp"&gt;VLM-specific structured pruning: UKMP&lt;/h4&gt;
&lt;p&gt;Text-only pruning methods fail on LVLMs because they treat the language backbone in isolation, ignoring the vision-language interface. &lt;strong&gt;UKMP (AAAI 2025)&lt;/strong&gt; introduces the first unified structured pruning framework purpose-built for LVLMs.&lt;/p&gt;
&lt;p&gt;The &lt;strong&gt;UKMI metric&lt;/strong&gt; combines three innovations:&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;&lt;strong&gt;Adaptive dual normalization&lt;/strong&gt;: block-wise normalization (by parameter volume) prevents large modules from dominating; modality-wise normalization balances vision and language components&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;First-order gradient saliency&lt;/strong&gt;: UKMP discards the second-order Fisher term because the convergence assumption of second-order derivatives is invalid when parameters are frozen — they retain first-order gradients&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Angle distribution entropy&lt;/strong&gt;: entropy over 100 cosine bins weights the Taylor importance, penalizing parameters whose removal would cause large angular shifts in feature space&lt;/li&gt;
&lt;/ul&gt;
&lt;p&gt;Recovery uses a &lt;strong&gt;weight recalling module&lt;/strong&gt;: low-rank $P_2 Q_2 W^p$ transformation parallel to LoRA, trained through three-phase progressive distillation (vision-only MSE → vision+language MSE → task loss + KL). This module is reparameterizable — it folds into base weights after training at no inference cost.&lt;/p&gt;
&lt;p&gt;At 50% pruning, UKMP achieves 47.81% VQAv2 accuracy (vs. 36.40% next-best) and 96.92 NoCaps CIDEr (vs. 85.51). Even at 20% pruning, the pruned BLIP-2 beats similarly-sized full BLIP-2 on OK-VQA and GQA.&lt;/p&gt;
&lt;h3 id="23-distillation-transfer-without-the-baggage"&gt;2.3 Distillation: transfer without the baggage&lt;/h3&gt;
&lt;p&gt;Knowledge distillation trains a smaller student model to mimic a larger teacher. The three challenges: what knowledge to transfer, which algorithm to use, and how to design the student-teacher pair.&lt;/p&gt;
&lt;p&gt;White-box distillation using KL-divergence at high temperature ($\tau = 4.0$) reveals the teacher&amp;rsquo;s confidence across the full vocabulary, enabling finer-grained transfer than black-box methods relying on text outputs alone.&lt;/p&gt;
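&lt;p&gt;The white-box objective itself is short enough to write out. A minimal sketch, using the $\tau = 4.0$ temperature from above and a toy batch; the vocabulary size and logits are placeholders.&lt;/p&gt;
&lt;div class="highlight"&gt;&lt;pre tabindex="0"&gt;&lt;code class="language-python" data-lang="python"&gt;import torch
import torch.nn.functional as F

def kd_loss(student_logits, teacher_logits, tau=4.0):
    # Soften both distributions, match them with KL divergence, and rescale by
    # tau^2 so gradient magnitudes stay comparable across temperatures.
    s_log_probs = F.log_softmax(student_logits / tau, dim=-1)
    t_probs = F.softmax(teacher_logits / tau, dim=-1)
    return F.kl_div(s_log_probs, t_probs, reduction='batchmean') * tau ** 2

teacher_logits = torch.randn(8, 32000)                        # frozen teacher, (batch, vocab)
student_logits = torch.randn(8, 32000, requires_grad=True)
loss = kd_loss(student_logits, teacher_logits)
loss.backward()                                               # gradients flow only into the student
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;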
&lt;h4 id="curriculum-distillation-with-selective-reflection"&gt;Curriculum distillation with selective reflection&lt;/h4&gt;
&lt;p&gt;&lt;strong&gt;SRD (Selective Reflection Distillation)&lt;/strong&gt; demonstrates that not all training samples contribute equally — and that curriculum ordering matters. Easy-to-hard curriculum significantly outperforms reverse hard-to-easy ordering. An increasing temperature schedule ($\tau_0 = 1 \to \tau_n = 2$) is a key effectiveness driver; reversing it severely degrades results.&lt;/p&gt;
&lt;p&gt;SRD achieves up to 39% training time reduction while using 75% of data, and consistently improves ROUGE-L by 3.92–15.53% across all 7 tested KD methods on 5 benchmarks. It is plug-and-play — no changes to model architectures, loss functions, or KD algorithms. It even enables distilled students to surpass teacher performance (26.07 vs. 25.15 ROUGE-L for OpenLLaMA2).&lt;/p&gt;
&lt;h4 id="vlm-specific-distillation"&gt;VLM-specific distillation&lt;/h4&gt;
&lt;p&gt;VLMs present unique challenges because cross-modal alignment must be preserved.&lt;/p&gt;
&lt;p&gt;&lt;strong&gt;Switch-KD (CVPR 2026)&lt;/strong&gt; unifies vision-language knowledge transfer within a shared text-probability space. The Visual-Switch Distillation pathway switches student visual outputs into the teacher&amp;rsquo;s language pathway ($S\text{-ViT} \to T\text{-Projector} \to T\text{-LLM}$), producing visual-switch logits that represent the teacher&amp;rsquo;s output distribution conditioned on student-encoded visual representations. This is supervised by &lt;strong&gt;DBiLD loss&lt;/strong&gt;, which uses the Kneedle algorithm for adaptive top-k boundary detection and bidirectional reverse KL alignment on pairwise logit differences — outperforming forward KL by 0.5 points.&lt;/p&gt;
&lt;p&gt;Switch-KD-0.5B achieves +3.6 Avg10 over TinyLLaVA-0.5B across 10 multimodal benchmarks and matches the 3B teacher with half the parameters. However, it requires feature-space and vocabulary consistency between teacher and student.&lt;/p&gt;
&lt;p&gt;&lt;strong&gt;Align-KD&lt;/strong&gt; rests on a critical architectural finding: cross-modal alignment in VLMs occurs primarily at the first attention layer&amp;rsquo;s text-query-vision component ($A_{1, t \leftarrow v}$). Distilling only this targeted attention map achieves the same performance as distilling all maps while saving up to 50% computation. Distilling the wrong component is harmful: vision-query-vision attention KD collapses performance to 43.7 (vs. 64.4 baseline).&lt;/p&gt;
&lt;h4 id="bridging-black-box-and-white-box-distillation"&gt;Bridging black-box and white-box distillation&lt;/h4&gt;
&lt;p&gt;The strongest teachers (GPT-4, proprietary models) are black-box — only text outputs available via API. White-box KD requires internal parameters. &lt;strong&gt;GrayKD (AAAI 2026)&lt;/strong&gt; bridges this with a single-stage framework using no proxy teacher. Black-box rationales are injected through a lightweight cross-attention module — student hidden states as queries, rationale embeddings as keys/values, with 15% random masking for augmentation.&lt;/p&gt;
&lt;p&gt;The efficiency gain is dramatic: GrayKD uses 610M parameters total vs. 2.06B for conventional KD pipelines. GrayKD Triple achieves 27.64 Avg Rouge-L, beating PromptKD + White Teacher (26.44) — despite using the same black-box GPT-4o-mini teacher as lower-scoring methods. Rationale diversity is the dominant factor: switching from multi-rationale to single-rationale reuse drops Rouge-L by 1.14 points.&lt;/p&gt;
&lt;h3 id="24-token-compression-compressing-the-input-not-the-model"&gt;2.4 Token compression: compressing the input, not the model&lt;/h3&gt;
&lt;p&gt;Token compression operates upstream of the three traditional pillars: instead of compressing model weights, it compresses the input. Approaches are categorized by modality (image/video/audio) and mechanism (transformation-based, similarity-based, attention-based, query-based). The key advantage: token compression is post-optimization, requiring no retraining.&lt;/p&gt;
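&lt;p&gt;As one example of the similarity-based family, here is a toy bipartite token-merging step in the spirit of ToMe. It is a sketch under simplifying assumptions: real implementations merge inside attention blocks, track how many patches each merged token represents, and choose the split more carefully.&lt;/p&gt;
&lt;div class="highlight"&gt;&lt;pre tabindex="0"&gt;&lt;code class="language-python" data-lang="python"&gt;import torch
import torch.nn.functional as F

def merge_similar_tokens(tokens, r):
    # tokens: (n, d). Split alternately into source/destination sets, find each
    # source token's most similar destination, and fold the r most redundant
    # source tokens into their matches. The sequence shrinks by r tokens.
    src, dst = tokens[0::2].clone(), tokens[1::2].clone()
    sim = F.normalize(src, dim=-1) @ F.normalize(dst, dim=-1).T
    best_sim, best_dst = sim.max(dim=-1)
    merge_idx = best_sim.topk(r).indices
    keep = torch.ones(src.shape[0], dtype=torch.bool)
    keep[merge_idx] = False
    dst[best_dst[merge_idx]] = 0.5 * (dst[best_dst[merge_idx]] + src[merge_idx])
    return torch.cat([src[keep], dst], dim=0)

vision_tokens = torch.randn(576, 1024)                 # e.g. a 24x24 ViT patch grid
print(merge_similar_tokens(vision_tokens, r=64).shape) # torch.Size([512, 1024])
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;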
&lt;p&gt;I find this category theoretically elegant but practically limited — it only helps when tokens dominate the compute budget, which is true for video and long-context multimodal tasks but less so for standard image+text inference.&lt;/p&gt;
&lt;h3 id="25-nas-for-compression"&gt;2.5 NAS for compression&lt;/h3&gt;
&lt;p&gt;&lt;strong&gt;CompressNAS&lt;/strong&gt; treats Tucker rank selection as a global search problem, using an MSE-based accuracy proxy comparing decomposed vs. reference layer feature vectors. Existing zero-cost proxies (NASWOT, GraSP, SNIP, ZiCo) fail monotonic trends at higher ranks. CompressNAS builds two lookup tables ($\Delta\text{acc}$, $\Delta\text{flash}$) and uses ILP-based NAS to select ranks globally given a hardware budget — 8× compression of ResNet-18 on ImageNet with &amp;lt;4% accuracy drop.&lt;/p&gt;
&lt;p&gt;&lt;strong&gt;LLM-NAS&lt;/strong&gt; solves a problem I hadn&amp;rsquo;t considered: LLM-driven architecture search exhibits exploration bias, repeatedly proposing designs within a narrow region of the search space. The fix is three innovations:&lt;/p&gt;
&lt;ol&gt;
&lt;li&gt;&lt;strong&gt;Complexity-driven partitioning&lt;/strong&gt; into 6 disjoint niches defined by architectural complexity (nor_conv_3×3 count)&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;LLM-powered prompt co-evolution&lt;/strong&gt; — prompts and architectures co-evolve across rounds&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;XGBoost zero-cost predictor&lt;/strong&gt; aggregating 13 proxy metrics with Spearman correlation ~0.90 to ground truth&lt;/li&gt;
&lt;/ol&gt;
&lt;p&gt;Search takes 3 minutes and 120 API calls vs. 2–17 GPU-days for supernet baselines. Removing partitioning drops hypervolume from 0.978 to 0.516. Removing the LLM entirely drops it to 0.843.&lt;/p&gt;
&lt;h2 id="3-the-p-kd-q-ordering-sequence-matters"&gt;3. The P-KD-Q ordering: sequence matters&lt;/h2&gt;
&lt;p&gt;A systematic study on Qwen2.5-3B shows that compression ordering is not a detail — it determines whether the pipeline works at all.&lt;/p&gt;
&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Sequence&lt;/th&gt;
&lt;th&gt;Compression&lt;/th&gt;
&lt;th&gt;G-Eval&lt;/th&gt;
&lt;th&gt;PPL&lt;/th&gt;
&lt;th&gt;Verdict&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;P-KD-Q&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;3.68×&lt;/td&gt;
&lt;td&gt;0.733&lt;/td&gt;
&lt;td&gt;5.048&lt;/td&gt;
&lt;td&gt;Best&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;KD-P-Q&lt;/td&gt;
&lt;td&gt;3.68×&lt;/td&gt;
&lt;td&gt;0.644&lt;/td&gt;
&lt;td&gt;—&lt;/td&gt;
&lt;td&gt;Intermediate&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;P-Q-KD&lt;/td&gt;
&lt;td&gt;3.68×&lt;/td&gt;
&lt;td&gt;0.610&lt;/td&gt;
&lt;td&gt;—&lt;/td&gt;
&lt;td&gt;Intermediate&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;KD-Q-P&lt;/td&gt;
&lt;td&gt;3.68×&lt;/td&gt;
&lt;td&gt;—&lt;/td&gt;
&lt;td&gt;53.4&lt;/td&gt;
&lt;td&gt;Collapse&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Q-P-KD&lt;/td&gt;
&lt;td&gt;3.68×&lt;/td&gt;
&lt;td&gt;0.060&lt;/td&gt;
&lt;td&gt;34.5&lt;/td&gt;
&lt;td&gt;Near-zero&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Q-KD-P&lt;/td&gt;
&lt;td&gt;3.68×&lt;/td&gt;
&lt;td&gt;0.080&lt;/td&gt;
&lt;td&gt;24.1&lt;/td&gt;
&lt;td&gt;Near-zero&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;
&lt;p&gt;The mechanism is specific and instructive — and worth pausing on because it explains an entire class of pipeline failures: NF4 quantization produces inference-only models incompatible with gradient-based training. Any sequence with Q before training steps is dead on arrival. The P-KD-Q sequence lets each step compound: pruning reduces the search space, distillation transfers knowledge to the pruned architecture, quantization reduces precision with minimal added loss.&lt;/p&gt;
&lt;p&gt;A practical note: quantization alone achieves 3.00× compression (5886→1959 MB). Adding pruning and distillation raises this only to 3.68×, at significant complexity cost. For many use cases, quantization alone is the right answer.&lt;/p&gt;
&lt;h2 id="4-where-compression-fails"&gt;4. Where compression fails&lt;/h2&gt;
&lt;h3 id="41-the-alignment-cliff-in-vlms"&gt;4.1 The alignment cliff in VLMs&lt;/h3&gt;
&lt;p&gt;VLM compression has a failure mode absent in text-only LLMs. At low compression ratios, structural pruning damages multimodal alignment (vision ↔ language) more than the language backbone; at high ratios, both degrade. This means for mild compression, fine-tuning only the multimodal projector is sufficient — you are repairing the alignment bridge, not the entire model.&lt;/p&gt;
&lt;p&gt;UKMP addresses this directly through modality-wise adaptive normalization and its weight recalling module&amp;rsquo;s progressive three-phase distillation. Text-only importance metrics (magnitude, gradient) cannot detect which parameters mediate the vision-language interface. The convergence assumption of Fisher information is also invalid for VLMs: frozen parameters retain first-order gradients, making second-order importance estimates actively misleading.&lt;/p&gt;
&lt;h3 id="42-extreme-sparsity-collapse"&gt;4.2 Extreme sparsity collapse&lt;/h3&gt;
&lt;p&gt;One-shot pruning methods degrade severely at 60-80% sparsity with semi-structured patterns. NIRVANA&amp;rsquo;s ablation shows magnitude-based scoring alone causes PPL ≈ 10⁵–10⁶ at 50% sparsity. Attention-only pruning causes catastrophic collapse; joint pruning of attention and MLP yields the smoothest degradation.&lt;/p&gt;
&lt;h3 id="43-early-quantization-destroys-trainability"&gt;4.3 Early quantization destroys trainability&lt;/h3&gt;
&lt;p&gt;Applying NF4 quantization before any other technique destroys trainability. Q-KD-P and Q-P-KD sequences achieve near-zero G-Eval scores (0.080, 0.060). The gradient-free nature of NF4-quantized models means they cannot participate in subsequent distillation or pruning recovery.&lt;/p&gt;
&lt;h3 id="44-layer-sensitivity-isnt-uniform"&gt;4.4 Layer sensitivity isn&amp;rsquo;t uniform&lt;/h3&gt;
&lt;p&gt;In partial 2:4 sparsification, later layers are more sensitive than earlier ones — skipping the last third of the model yields the best accuracy. For LVLMs, widthwise pruning of attention heads and MLP neurons outperforms wholesale layer removal. And within a single layer, individual output dimensions differ dramatically in sensitivity.&lt;/p&gt;
&lt;h2 id="5-how-hardware-changes-everything"&gt;5. How hardware changes everything&lt;/h2&gt;
&lt;h3 id="51-the-hardware-taxonomy-mismatch"&gt;5.1 The hardware-taxonomy mismatch&lt;/h3&gt;
&lt;p&gt;The compression technique that looks best on paper often delivers zero real-world speedup. Models optimized for GPU do not run fast on CPU and mobile, and vice versa.&lt;/p&gt;
&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;If your target is&amp;hellip;&lt;/th&gt;
&lt;th&gt;Prefer&amp;hellip;&lt;/th&gt;
&lt;th&gt;Avoid&amp;hellip;&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;Datacenter GPU (A100/H100)&lt;/td&gt;
&lt;td&gt;Semi-structured 2:4 + quantization&lt;/td&gt;
&lt;td&gt;Pure unstructured&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Edge/CPU/Mobile&lt;/td&gt;
&lt;td&gt;Structured pruning (widthwise)&lt;/td&gt;
&lt;td&gt;Any unstructured or semi-structured&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Long-context serving&lt;/td&gt;
&lt;td&gt;KV-cache quantization&lt;/td&gt;
&lt;td&gt;—&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Extreme compression (≤45%)&lt;/td&gt;
&lt;td&gt;Structured + distillation recovery&lt;/td&gt;
&lt;td&gt;One-shot pruning alone&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;
&lt;h3 id="52-memory-bandwidth-is-the-real-bottleneck"&gt;5.2 Memory bandwidth is the real bottleneck&lt;/h3&gt;
&lt;p&gt;During autoregressive decode, each token generation requires loading the entire model from memory — a classic memory-bandwidth-bound operation. This explains why quantization helps more than pruning for decode latency (smaller weights mean less data movement), why KV-cache quantization becomes critical at long contexts, and why joint algorithm-hardware optimization is the only path to order-of-magnitude gains.&lt;/p&gt;
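&lt;p&gt;A back-of-the-envelope check makes the bandwidth argument tangible. The numbers below are illustrative assumptions (a 70B-parameter model and H100-class HBM bandwidth), not measurements.&lt;/p&gt;
&lt;div class="highlight"&gt;&lt;pre tabindex="0"&gt;&lt;code class="language-python" data-lang="python"&gt;# Decode is dominated by streaming the weights once per generated token.
params = 70e9                     # 70B-parameter model
bandwidth = 3.35e12               # ~3.35 TB/s of HBM bandwidth (H100-class)

for bytes_per_weight, label in [(2.0, 'fp16'), (1.0, 'fp8/int8'), (0.5, 'int4')]:
    weight_bytes = params * bytes_per_weight
    t_per_token = weight_bytes / bandwidth             # seconds per token at batch size 1
    print(f'{label}: {weight_bytes / 1e9:.0f} GB moved per token, '
          f'so at most {1 / t_per_token:.0f} tokens/s per sequence')
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;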
&lt;p&gt;The Titanus accelerator takes this to the extreme: chiplet-based digital computing-in-memory stores all static weights on-chip, eliminating repeated weight reloading during decode — a 39.4× reduction in off-chip memory access.&lt;/p&gt;
&lt;h3 id="53-the-edge-reality"&gt;5.3 The edge reality&lt;/h3&gt;
&lt;p&gt;CLIP-B/16 at 149.6M parameters already exceeds Jetson Nano&amp;rsquo;s 4GB RAM (no dedicated GPU), causing frequent memory swaps that kill real-time performance. Edge deployment demands the full toolbox: pre-deployment compression, efficient fine-tuning, runtime optimization, and careful security/privacy handling.&lt;/p&gt;
&lt;h3 id="54-a-practical-decision-flow"&gt;5.4 A practical decision flow&lt;/h3&gt;
&lt;p&gt;For compressing an existing LVLM:&lt;/p&gt;
&lt;ol&gt;
&lt;li&gt;&lt;strong&gt;Extremely low resources, no recovery training&lt;/strong&gt;: Widthwise pruning only. Accept accuracy loss.&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Moderate compression (≤30%)&lt;/strong&gt;: Layerwise pruning + multimodal projector fine-tuning (5% of original data suffices).&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;High compression (≤45%)&lt;/strong&gt;: Widthwise pruning + supervised fine-tuning + hidden-state distillation.&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;For any combination&lt;/strong&gt;: 4-bit quantization adds ~+0.1 PPL on top of sparsity.&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Always: P-KD-Q ordering.&lt;/strong&gt; Never quantize before training.&lt;/li&gt;
&lt;/ol&gt;
&lt;h2 id="6-where-we-are-in-2025"&gt;6. Where we are in 2025&lt;/h2&gt;
&lt;p&gt;The compression field has matured from experimental techniques into an engineering discipline with clear tiers:&lt;/p&gt;
&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Tier&lt;/th&gt;
&lt;th&gt;Category&lt;/th&gt;
&lt;th&gt;Examples&lt;/th&gt;
&lt;th&gt;Status&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;Production&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;Foundational pruning&lt;/td&gt;
&lt;td&gt;SparseGPT, Wanda, LLM-Pruner&lt;/td&gt;
&lt;td&gt;Deployed&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;Production&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;4-bit quantization&lt;/td&gt;
&lt;td&gt;AWQ, GPTQ, NF4 (QLoRA)&lt;/td&gt;
&lt;td&gt;Deployed&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;Production&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;Inference engines&lt;/td&gt;
&lt;td&gt;vLLM (PagedAttention), TensorRT-LLM&lt;/td&gt;
&lt;td&gt;Deployed&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;Production&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;FP8 baseline&lt;/td&gt;
&lt;td&gt;H100/B200 hardware-native&lt;/td&gt;
&lt;td&gt;Deployed&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;Experimental&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;Extreme pruning&lt;/td&gt;
&lt;td&gt;TRIM, NIRVANA, FastForward&lt;/td&gt;
&lt;td&gt;Active research&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;Experimental&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;VLM-specific pruning&lt;/td&gt;
&lt;td&gt;UKMP (UKMI + weight recalling)&lt;/td&gt;
&lt;td&gt;Active research&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;Experimental&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;Ultra-low-bit quant&lt;/td&gt;
&lt;td&gt;LittleBit (0.1-bit), ICQuant, iFairy&lt;/td&gt;
&lt;td&gt;Active research&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;Experimental&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;Curriculum distillation&lt;/td&gt;
&lt;td&gt;SRD&lt;/td&gt;
&lt;td&gt;Active research&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;Experimental&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;VLM distillation&lt;/td&gt;
&lt;td&gt;Switch-KD, Align-KD&lt;/td&gt;
&lt;td&gt;Active research&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;Experimental&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;Black-box KD&lt;/td&gt;
&lt;td&gt;GrayKD (610M params, no proxy)&lt;/td&gt;
&lt;td&gt;Active research&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;Experimental&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;NAS for compression&lt;/td&gt;
&lt;td&gt;CompressNAS, LLM-NAS, HAT&lt;/td&gt;
&lt;td&gt;Active research&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;
&lt;p&gt;I expect several of the experimental rows to move to production within 12-18 months. TRIM-style per-dimension pruning and ICQuant-style index coding are both conceptually simple enough to integrate into existing pipelines. UKMP&amp;rsquo;s modality-aware pruning is clearly the right approach for VLMs — the question is whether it generalizes beyond BLIP-2 to LLaVA-style architectures.&lt;/p&gt;
&lt;h2 id="7-a-unifying-mental-model"&gt;7. A unifying mental model&lt;/h2&gt;
&lt;p&gt;Compression works because networks use fewer bits of information than they allocate parameters. The art is knowing which parameters carry that information. The answer depends on four things:&lt;/p&gt;
&lt;ol&gt;
&lt;li&gt;
&lt;p&gt;&lt;strong&gt;What redundancy exists&lt;/strong&gt;: token-level, weight-level, layer-level — and modality-dependent. Video has spatiotemporal redundancy. MLLM sequences are &amp;gt;80% multimodal tokens that receive minimal attention. Within a single weight matrix, individual rows differ in sensitivity by orders of magnitude.&lt;/p&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;p&gt;&lt;strong&gt;How you remove it&lt;/strong&gt;: quantize, prune, or distill. Quantization targets precision. Pruning targets structure. Distillation targets knowledge transfer. Token compression targets the input directly.&lt;/p&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;p&gt;&lt;strong&gt;What order you apply techniques&lt;/strong&gt;: P-KD-Q is empirically optimal. Any sequence with Q before training steps fails catastrophically. This is not a heuristic — it follows from NF4&amp;rsquo;s gradient-free nature.&lt;/p&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;p&gt;&lt;strong&gt;What hardware runs it&lt;/strong&gt;: This is what determines whether a 50% parameter reduction translates to a 50% latency reduction or no reduction at all. Unstructured sparsity wins on accuracy but loses on every hardware metric. Structured pruning is the opposite. Semi-structured 2:4 is NVIDIA&amp;rsquo;s compromise.&lt;/p&gt;
&lt;/li&gt;
&lt;/ol&gt;
&lt;p&gt;What I find striking about the failure modes is how cleanly they carve the parameter space. In VLMs, the vision-language alignment is more fragile than the language backbone. In deep transformers, later layers carry disproportionate importance, and individual output dimensions within the same layer differ dramatically. Hardware is not an implementation detail — it defines which removal patterns become faster.&lt;/p&gt;
&lt;p&gt;The frontier is hybrid, sequential, and precision-extreme: combining pruning, distillation, and quantization in the right order, with per-dimension granularity, pushing quantization to fractions of a bit — while ensuring the pipeline remains trainable throughout.&lt;/p&gt;
&lt;h2 id="boundary-conditions"&gt;Boundary conditions&lt;/h2&gt;
&lt;ul&gt;
&lt;li&gt;This model assumes the pretrained model is available. If you are training from scratch with a compression target, the entire framework shifts — you would design the architecture sparse from the start rather than compressing post-hoc.&lt;/li&gt;
&lt;li&gt;The P-KD-Q ordering evidence comes from a single systematic study on Qwen2.5-3B. I have not seen replication on larger models or different architectures. The mechanism (NF4 gradient-free) is general, but the magnitude of the ordering effect at other scales is unknown.&lt;/li&gt;
&lt;li&gt;UKMP has been validated primarily on BLIP-2. Its generalization to LLaVA, InternVL, or other VLM architectures is an open question.&lt;/li&gt;
&lt;li&gt;Token compression is effective when tokens dominate compute (long video, long context). For single-image QA, the gains are modest.&lt;/li&gt;
&lt;li&gt;The practical decision flow (Section 5.4) assumes access to fine-tuning resources. For truly zero-shot deployment, only quantization and token compression apply.&lt;/li&gt;
&lt;li&gt;I have not covered training-time efficiency (mixed precision, gradient accumulation, ZeRO, FSDP), which interacts with compression in deployment pipelines but is a separate topic.&lt;/li&gt;
&lt;/ul&gt;
&lt;h2 id="open-questions"&gt;Open questions&lt;/h2&gt;
&lt;ul&gt;
&lt;li&gt;Does the P-KD-Q ordering effect replicate on models above 70B parameters? The mechanism is architecture-agnostic, but the magnitude could scale differently.&lt;/li&gt;
&lt;li&gt;Why do some layers tolerate 4-bit quantization while structurally similar layers fall apart at 6-bit? I suspect effective rank or singular value distribution, but I have not tested this.&lt;/li&gt;
&lt;li&gt;Can UKMP&amp;rsquo;s modality-aware importance metric be adapted to video-language models, where the redundancy patterns are spatiotemporally structured rather than spatially structured?&lt;/li&gt;
&lt;li&gt;What is the minimum viable recovery data for structured pruning of LVLMs at 50%+ sparsity? The current evidence says 5% of original data for moderate compression and full SFT for high compression, but the boundary between these regimes is fuzzy.&lt;/li&gt;
&lt;li&gt;At what point does token compression become more practical than model compression for video understanding? The 54M-token number for 90-minute video suggests the crossover exists but I do not know where.&lt;/li&gt;
&lt;/ul&gt;
&lt;p&gt;[[Q]] Six months from now: has UKMP been extended to LLaVA-style architectures, and does the modality-aware importance metric generalize beyond BLIP-2?&lt;/p&gt;
&lt;h2 id="references"&gt;References&lt;/h2&gt;
&lt;ol&gt;
&lt;li&gt;Frankle &amp;amp; Carbin, &amp;ldquo;The Lottery Ticket Hypothesis: Finding Sparse, Trainable Neural Networks&amp;rdquo;, ICLR 2019.&lt;/li&gt;
&lt;li&gt;Xiao et al., &amp;ldquo;SmoothQuant: Accurate and Efficient Post-Training Quantization for Large Language Models&amp;rdquo;, ICML 2023.&lt;/li&gt;
&lt;li&gt;Frantar &amp;amp; Alistarh, &amp;ldquo;SparseGPT: Massive Language Models Can Be Accurately Pruned in One-Shot&amp;rdquo;, ICML 2023.&lt;/li&gt;
&lt;li&gt;Sun et al., &amp;ldquo;Wanda: A Simple and Effective Pruning Approach for Large Language Models&amp;rdquo;, ICLR 2024.&lt;/li&gt;
&lt;li&gt;Ashkboos et al., &amp;ldquo;SliceGPT: Compress Large Language Models by Deleting Rows and Columns&amp;rdquo;, ICLR 2024.&lt;/li&gt;
&lt;li&gt;Ma et al., &amp;ldquo;LLM-Pruner: On the Structural Pruning of Large Language Models&amp;rdquo;, NeurIPS 2023.&lt;/li&gt;
&lt;li&gt;NIRVANA: &amp;ldquo;NIRVANA: Neural Implicit Removal via Verifiable Adam-based NTK Alignment for Structured Pruning&amp;rdquo;, 2025.&lt;/li&gt;
&lt;li&gt;TRIM: &amp;ldquo;TRIM: Per-Dimension Structured Pruning for Large Language Models&amp;rdquo;, 2025.&lt;/li&gt;
&lt;li&gt;FastForward: &amp;ldquo;FastForward Pruning: Efficient LLM Pruning via Single-Step Reinforcement Learning&amp;rdquo;, 2025.&lt;/li&gt;
&lt;li&gt;UKMP: &amp;ldquo;UKMP: Unified Knowledge Maintenance Pruning for Vision-Language Models&amp;rdquo;, AAAI 2025.&lt;/li&gt;
&lt;li&gt;ICQuant: &amp;ldquo;ICQuant: Index Coding Quantization for Large Language Models&amp;rdquo;, 2025.&lt;/li&gt;
&lt;li&gt;Switch-KD: &amp;ldquo;Switch-KD: Knowledge Distillation with Visual Switch for Efficient Vision-Language Models&amp;rdquo;, CVPR 2026.&lt;/li&gt;
&lt;li&gt;Align-KD: &amp;ldquo;Align-KD: Shallow-Layer Attention Alignment for Mobile Vision-Language Models&amp;rdquo;, 2025.&lt;/li&gt;
&lt;li&gt;GrayKD: &amp;ldquo;GrayKD: Gray-Box Knowledge Distillation for Large Language Models&amp;rdquo;, AAAI 2026.&lt;/li&gt;
&lt;li&gt;SRD: &amp;ldquo;Selective Reflection Distillation: Curriculum Knowledge Distillation for LLMs&amp;rdquo;, 2025.&lt;/li&gt;
&lt;li&gt;CompressNAS: &amp;ldquo;CompressNAS: Neural Architecture Search for Model Compression&amp;rdquo;, 2025.&lt;/li&gt;
&lt;li&gt;LLM-NAS: &amp;ldquo;LLM-NAS: Large Language Models for Neural Architecture Search&amp;rdquo;, 2025.&lt;/li&gt;
&lt;li&gt;Compression Ordering: &amp;ldquo;A Systematic Study of Compression Ordering for Large Language Models&amp;rdquo;, 2025.&lt;/li&gt;
&lt;li&gt;Multimodal Token Compression Survey, arXiv 2507.20198, 2025.&lt;/li&gt;
&lt;li&gt;Efficient VLM Survey: &amp;ldquo;Efficient Vision-Language Models: A Survey&amp;rdquo;, 2025.&lt;/li&gt;
&lt;/ol&gt;</description></item><item><title>In the Age of AI, Where Can the Individual Make a Home?</title><link>http://fengwang.github.io/posts/man-vs-ai-cn/</link><pubDate>Thu, 26 Mar 2026 00:00:00 +0000</pubDate><guid>http://fengwang.github.io/posts/man-vs-ai-cn/</guid><description>&lt;h2 id="在不确定性中锚定自我的行动纲领"&gt;An action program for anchoring the self amid uncertainty&lt;/h2&gt;
&lt;p&gt;Our generation has run squarely into a fork in history. We watch the technological flood sweep society into an accelerating sprint, kicking up an unprecedented cloud of collective anxiety behind it. In this aggressive wave of artificial intelligence in particular, a soul-searching question rises quietly like an undercurrent from the seabed: when the intelligence and efficiency of machines come infinitely close to ours, and even begin to pull ahead, where exactly should the value of ordinary people like us be placed? My mind refuses to dwell on doomsday pessimism; that is too dull. Instead, I want to treat this seemingly irreversible trend as a starting point for exploration, and work out how, with a faint light of knowledge and a willingness to go all in, we can forge meaning for ourselves amid the noise and win a little more room for subjective freedom in the future.&lt;/p&gt;
&lt;h3 id="一解构价值重构的底层逻辑"&gt;一、解构“价值重构”的底层逻辑&lt;/h3&gt;
&lt;p&gt;The way human value is defined has always shifted like a beach reworked by the tides of history. In the age of raw survival, strength and speed could be traded directly for food; later, land and the means of production became the hard currency; then the industrial revolution&amp;rsquo;s machines roared in and displaced countless laborers and craftsmen. Every technological leap quietly re-weights the scales of the factors of production. When digital connectivity first emerged, many believed a golden age of the individual had finally arrived. What actually happened? The stage for work simply moved from physical factories to online platforms built by capital, and the lion&amp;rsquo;s share of the gains still went into the pockets of the platform owners.&lt;/p&gt;
&lt;p&gt;Turning it over and over, I find that these so-called advances speak of &amp;ldquo;liberating humanity&amp;rdquo; while playing a different game underneath: quietly and systematically reducing the irreplaceability of people in production. They break the abilities unique to humans down into tools that can be copied and controlled, and with that, the individual&amp;rsquo;s bargaining power in the distribution of wealth is ruthlessly pushed down.&lt;/p&gt;
&lt;p&gt;Now, with the arrival of AI, and especially as it reaches into cognition, creativity, and even emotion, this game of removing people from the value chain has entered its endgame. For the first time, machines have extended replacement into the intellectual high ground we are proudest of. The tasks we once insisted &amp;ldquo;only a human can do&amp;rdquo; are being taken over, openly, in ways that are cheaper and more efficient. What does that mean? It means the moat we used to dig through years of study and accumulated expertise is being filled in at a visible pace. The core factors of production are no longer simply &amp;ldquo;capital + labor&amp;rdquo; but &amp;ldquo;capital + AI&amp;rdquo;, with the weight of labor being driven toward zero. This is not ordinary technological unemployment; it is a deep structural restructuring of how social wealth is distributed, and it may well push society toward an unprecedented degree of polarization.&lt;/p&gt;
&lt;h3 id="二破除幻想直面现实壁垒"&gt;二、破除幻想，直面现实壁垒&lt;/h3&gt;
&lt;p&gt;面对眼前这个庞大得让人有点喘不过气来的结构性挑战，社会上当然不缺那些急着要“破局”的声音。你看，有人把希望寄托在“技术平权”上，天真地以为只要人人都能用上AI工具，就能把差距给抹平了；也有人扯着嗓子呼吁“制度干预”，指望着靠政策来重新分配蛋糕；更有甚者，还乐观得有点可爱，觉得等AI把所有“有用”的活儿都干完了，人类的价值就能回归到艺术、情感这类“无用”之事上去了。&lt;/p&gt;
&lt;p&gt;可我这个人，向来喜欢把事情掰开了揉碎了看。一番理性审视下来，这些听起来很美的“破局”路径，恐怕都得撞上现实的几堵高墙：&lt;/p&gt;
&lt;p&gt;&lt;strong&gt;首先，什么“技术平权”，在我看来不过是个美丽的幻象。&lt;/strong&gt; 咱们个人能拿到的，顶多是AI的“使用权”，注意，是“使用权”，可不是“所有权”！AI最核心的壁垒——那天文数字般的算力、海量的数据、还有动辄几十上百亿参数的大模型——依然牢牢地攥在少数资本巨头手里。你想想，游戏的规则，制定权永远属于那些掌握生产资料的人，不是吗？&lt;/p&gt;
&lt;p&gt;&lt;strong&gt;其次，指望“制度干预”来力挽狂澜，也面临着一个“囚徒困境”。&lt;/strong&gt; 资本这东西，早就把触角伸进了全球各国的政策体系，渗透之深，超乎想象。再加上各国之间为了抢占科技高地，那竞争简直是刺刀见红。在这种背景下，任何一个国家想要限制AI发展，都可能陷入“我限制了，别人没限制，那我不是傻了吗”的尴尬境地，结果往往是大家谁也别想管，只能眼睁睁看着资本的权重持续无限放大。&lt;/p&gt;
&lt;p&gt;&lt;strong&gt;最后，那些关于“价值重定义”的美好愿景，也得面对残酷的现实。&lt;/strong&gt; 没错，当AI把“有用”的活儿都抢走了，我们是能把重心放回艺术、情感这些“无用之事”上，可问题是，在现行的财富分配体系下，这些“无用之事”在经济上根本没有议价权啊！除非整个社会的基本价值体系来一次地动山摇式的根本性重构，否则，它们怎么可能成为我们个体安身立命的坚实基础呢？&lt;/p&gt;
&lt;p&gt;所以你看，全民之所以焦虑得夜不能寐，根源压根儿不是什么“学不会AI”，而是我们这些清醒的旁观者，已经看透了：无论你个体如何拼命折腾，都很难成为AI时代的规则制定者，充其量，也只能是被这股洪流裹挟着前行的一员罢了。&lt;/p&gt;
&lt;p&gt;然而，就算眼前这终局看起来板上钉钉、不可逆转，我心里头却依然有那么一份坚信：人类的能动性，可没那么容易就全然丧失。技术迭代的速度确实是指数级的，像坐上了火箭，可社会利益格局的重构啊，往往是线性的，像老牛拉破车。拦着AI全面落地，从来不只是技术本身，更多的是人性与利益那剪不断理还乱的复杂交织。你想想，那些既得利益群体的本能抗拒、国家对社会稳定的优先考量、还有咱们中国商业社会里头根深蒂固的人情世故和信任网络，这些可都是AI想全面替代，也得碰壁的无形壁垒。它们为我们争取到了一个宝贵的5到10年的“窗口期”。这个窗口期，可不是让我们躺平任嘲、坐以待毙的，恰恰相反，它提供了一个清醒的、不抱任何幻想的现实出口，一个让我们能够积累、能够转型、能够战略布局的绝佳机遇期。&lt;/p&gt;
&lt;h3 id="三行动纲领在不确定中铸就主体性"&gt;三、行动纲领：在不确定中铸就主体性&lt;/h3&gt;
&lt;p&gt;启蒙思想教我们要理性自律，存在主义又在耳边低语，呼唤我们积极介入这个世界。到了AI时代，面对个体价值被大水稀释的宏大叙事，我们更该回归到最朴素的本源：把知识当作指路的火把，把行动当作渡河的舟楫，在这不确定性的大海里，牢牢地锚定住那个“我”，为自己铸就一份实实在在的主体性自由。&lt;/p&gt;
&lt;h4 id="个体层面的自我铸就"&gt;个体层面的自我铸就：&lt;/h4&gt;
&lt;p&gt;先来说说我们最能直接掌控、也最有希望出成绩的领域——个体层面的自我铸就。毕竟，自己这亩三分地，总得先耕耘好不是？&lt;/p&gt;
&lt;p&gt;&lt;strong&gt;First, work hard to cultivate the &amp;ldquo;meta-skills&amp;rdquo; and composite abilities that AI cannot yet reach.&lt;/strong&gt; These are our signature strengths:
You need &lt;strong&gt;critical thinking&lt;/strong&gt; sharp enough to see the essence behind appearances, to judge independently amid information overload and murky situations, and to tackle the open-ended, nonlinear problems that give AI a headache.
&lt;strong&gt;Innovation and creativity&lt;/strong&gt; matter even more. However capable AI becomes, it remains a super-recombinator; genuinely original thought, artistic expression that strikes the heart, and cross-domain fusion are still beyond it. We should use AI as a handy tool for sparking ideas and raising efficiency, not let it think for us.
Do not forget &lt;strong&gt;emotional intelligence and human collaboration&lt;/strong&gt;. Leadership, communication, empathy, negotiation, conflict resolution, community building: these are the core ingredients of trust and human connection, a soft power machines cannot replicate.
We also need to act as AI&amp;rsquo;s &lt;strong&gt;&amp;ldquo;value gatekeepers&amp;rdquo;&lt;/strong&gt;: in human-machine collaboration the final ethical responsibility rests with us. Ultimately, we must make sure this double-edged sword genuinely serves human welfare rather than turning on us.
Last, and most critical, is &lt;strong&gt;lifelong learning and radical adaptability&lt;/strong&gt;. Do not fear change; treat learning as a survival instinct, keep mastering AI&amp;rsquo;s new languages and new tools, and let AI become a powerful amplifier of our own abilities rather than something that smothers them.&lt;/p&gt;
&lt;p&gt;&lt;strong&gt;Second, actively accumulate the kinds of capital that make us &amp;ldquo;antifragile&amp;rdquo;.&lt;/strong&gt; With grain in the barn, the heart does not panic:
&lt;strong&gt;Material capital&lt;/strong&gt; is indispensable. Use this precious window to earn what can be earned and to build up assets, adding financial slack so there is something to fall back on when the storm hits.
&lt;strong&gt;Social capital&lt;/strong&gt; is priceless. Cultivate your relationships and networks of trust, and take part in your communities. Deep human connection and mutual support form a &amp;ldquo;value fortress&amp;rdquo; AI can never replicate, and in a crisis they can save you.
&lt;strong&gt;Cognitive capital&lt;/strong&gt; should not be neglected either. Do not just learn how to use AI; dig into the logic underneath it, so you can form higher-order judgment and strategic foresight instead of being led around by every new buzzword.
Finally, remember &lt;strong&gt;health capital&lt;/strong&gt;. Body and mind are the capital for everything else; only with both intact will you have the resilience to handle the challenges and pressure ahead and to see this upheaval through.&lt;/p&gt;
&lt;p&gt;&lt;strong&gt;Last of all, and most profound, we need to re-examine and rebuild the meaning of life.&lt;/strong&gt;
Stop staring exclusively at economic output and the cult of efficiency; it is time to shift the center of gravity toward experience, connection, creation, and self-realization. Inspiration found in making art, value found in volunteering, warmth found in family life, wildness recovered outdoors: these non-economic domains often hold the deeper reserves of meaning and satisfaction.
It is also worth learning to embrace &amp;ldquo;slow living&amp;rdquo; and &amp;ldquo;anti-efficiency&amp;rdquo; values as a counterweight to the hyper-utilitarianism and speed worship that AI drags along with it. Protecting the uniquely human, unquantifiable dimensions of experience is our last line of defense as human beings.&lt;/p&gt;
&lt;h4 id="组织与社会层面的积极介入"&gt;组织与社会层面的积极介入：&lt;/h4&gt;
&lt;p&gt;当然，我们个体的努力并非孤立无援。涓涓细流汇成江海，通过积极的参与和倡议，我们完全可以汇聚起一股力量，去影响那些更广阔的系统，让它不至于偏航太远。&lt;/p&gt;
&lt;p&gt;&lt;strong&gt;我们可以倡导“人机共生”的理念，推动岗位重塑。&lt;/strong&gt; 鼓励那些组织和企业，别老想着让AI来“取代”人，而应该把它看作是人类能力的“增强器”。去设计那些人机协作的工作流程，创造出需要人类独特判断和创造力的新型“超级职位”，让AI成为我们的助手，而不是掘墓人。&lt;/p&gt;
&lt;p&gt;&lt;strong&gt;我们要不遗余力地推动教育体系的革新。&lt;/strong&gt; 大声疾呼，让教育别再是填鸭式的知识灌输，而是要彻底转向能力培养。把批判性思维、创造力、情商和伦理道德这些核心素养，摆到教学的C位，为我们的下一代，为他们能更好地适应AI时代做好万全准备。&lt;/p&gt;
&lt;p&gt;&lt;strong&gt;同时，也要密切关注并支持新型分配机制的探索。&lt;/strong&gt; 比如全民基本收入（UBI）、数据所有权、AI税等等，这些听起来有点超前的想法，或许正是未来社会财富分配的新方向。积极去了解它们的讨论和实践，促使技术带来的红利，能更广泛、更公平地惠及整个社会，而不是只流向少数人的口袋。&lt;/p&gt;
&lt;p&gt;&lt;strong&gt;别忘了，还要积极参与到AI伦理与治理框架的构建中去。&lt;/strong&gt; 关注AI的透明度够不够、它的决策能不能被解释、出了问题谁来负责。我们要大声倡导，把人类最核心的价值观融入AI的设计与应用中，给这匹脱缰的野马套上缰绳，避免技术最终失控，反噬我们自身。&lt;/p&gt;
&lt;h3 id="四锚定人类价值共塑未来"&gt;四、锚定人类价值，共塑未来&lt;/h3&gt;
&lt;p&gt;说到底，我们今天所面对的，确实像是一场对人类个体价值的“终极清算”，一场足以颠覆现有社会结构的深远变革。但话说回来，我们这些浸润在启蒙理性与存在主义思潮中的个体，怎能轻易就把这“终局”当成无可逃避的宿命呢？知识，从来都是打破蒙昧、指引我们前行的火炬；而行动本身，更是在我们看清了世界的荒谬与挑战之后，依然选择勇敢地投身其中，为自己、也为我们所在的社群，一砖一瓦地铸就意义的过程。&lt;/p&gt;
&lt;p&gt;A truly &amp;ldquo;dignified&amp;rdquo; way to live is certainly not to be told &amp;ldquo;the tide has already turned&amp;rdquo; and then lie down and let fate do as it pleases. It should mean that, in the middle of this technological flood, human society can build, through deliberation rather than blind following and through active effort rather than passive acceptance, a genuinely sustainable future centered on human welfare and aimed at subjective freedom. This is not about standing in front of the chariot or overturning iron laws; it is about first understanding those laws and then finding, in their gaps, the greatest possible room for human agency. So let us take reason as the lighthouse that points the way and courage as the sail that carries us through the waves, and, amid the towering surf of AI, not merely understand the world but roll up our sleeves and actively remake it, anchoring a solid and bright future for human value and dignity.&lt;/p&gt;</description></item><item><title>Hugo Quick Introduction</title><link>http://fengwang.github.io/posts/my-first-post/</link><pubDate>Tue, 10 Mar 2026 21:20:12 +0100</pubDate><guid>http://fengwang.github.io/posts/my-first-post/</guid><description>&lt;h3 id="first-post-with-hugo"&gt;First post with Hugo&lt;/h3&gt;
&lt;p&gt;In the root folder, execute the command&lt;/p&gt;
&lt;div class="highlight"&gt;&lt;pre tabindex="0" class="chroma"&gt;&lt;code class="language-bash" data-lang="bash"&gt;&lt;span class="line"&gt;&lt;span class="cl"&gt;hugo new content content/posts/my-first-post.md
&lt;/span&gt;&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;&lt;p&gt;This will create a new markdown file in the &lt;code&gt;content/posts&lt;/code&gt; directory with the name &lt;code&gt;my-first-post.md&lt;/code&gt;. The file will contain some default front matter, which you can edit to include your desired title, date, and other metadata.&lt;/p&gt;
&lt;pre tabindex="0"&gt;&lt;code&gt;---
title: &amp;#34;My First Post&amp;#34;
date: 2026-03-10T21:20:12+01:00
draft: true
---
&lt;/code&gt;&lt;/pre&gt;&lt;p&gt;This metadata is introduced in Hugo as &amp;ldquo;front matter&amp;rdquo;. The &lt;code&gt;title&lt;/code&gt; field specifies the title of the post, the &lt;code&gt;date&lt;/code&gt; field indicates when the post was created, and the &lt;code&gt;draft&lt;/code&gt; field is set to &lt;code&gt;true&lt;/code&gt;, which means that the post will not be published until you change it to &lt;code&gt;false&lt;/code&gt;. More information about front matter can be found in the &lt;a href="https://gohugo.io/content-management/front-matter/"target="_blank" rel="noopener noreferrer"&gt;Hugo documentation&lt;/a&gt;.&lt;/p&gt;
&lt;h3 id="quick-review-of-the-post-content"&gt;Quick review of the post content&lt;/h3&gt;
&lt;p&gt;Execute the command&lt;/p&gt;
&lt;div class="highlight"&gt;&lt;pre tabindex="0" class="chroma"&gt;&lt;code class="language-bash" data-lang="bash"&gt;&lt;span class="line"&gt;&lt;span class="cl"&gt;hugo server -D
&lt;/span&gt;&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;&lt;p&gt;Then open your web browser and navigate to &lt;code&gt;http://localhost:1313/&lt;/code&gt;. You should see your new post listed on the homepage. Click on the post title to view the full content of the post. Since the &lt;code&gt;draft&lt;/code&gt; field is set to &lt;code&gt;true&lt;/code&gt;, the post will not be visible on the public site until you change it to &lt;code&gt;false&lt;/code&gt; and rebuild your site.&lt;/p&gt;
&lt;h3 id="publish-the-post"&gt;Publish the post&lt;/h3&gt;
&lt;p&gt;To publish the post, open the &lt;code&gt;my-first-post.md&lt;/code&gt; file and change the &lt;code&gt;draft&lt;/code&gt; field from &lt;code&gt;true&lt;/code&gt; to &lt;code&gt;false&lt;/code&gt;. Then save the file and rebuild your site by running the command:&lt;/p&gt;
&lt;div class="highlight"&gt;&lt;pre tabindex="0" class="chroma"&gt;&lt;code class="language-bash" data-lang="bash"&gt;&lt;span class="line"&gt;&lt;span class="cl"&gt;hugo
&lt;/span&gt;&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;&lt;p&gt;After rebuilding, your post will be published and visible on the public site. You can verify this by navigating to &lt;code&gt;http://localhost:1313/&lt;/code&gt; again and checking that your post is now listed on the homepage.&lt;/p&gt;</description></item><item><title>About</title><link>http://fengwang.github.io/about/</link><pubDate>Mon, 01 Jan 0001 00:00:00 +0000</pubDate><guid>http://fengwang.github.io/about/</guid><description>&lt;p&gt;A former physicist turned machine learning engineer, I have a passion for learning and sharing knowledge. With a background in physics, I bring a unique perspective to software development, combining analytical thinking with creativity. I enjoy exploring new technologies and applying them to solve real-world problems. In my free time, I like to read, travel, and experiment with new frontiers in deep learning and artificial intelligence. This blog is a platform for me to share my insights, experiences, and projects in the world of software engineering and beyond. I hope to inspire others to pursue their passions and contribute to the ever-evolving field of technology.&lt;/p&gt;</description></item></channel></rss>