Calibration-Based Expert Pruning for Mixture-of-Experts Language Models

GOBA-AI-Labs

Independent Research

February 2026

Abstract

We present a framework for compressing Mixture-of-Experts (MoE) language models by selectively removing experts based on their measured importance during inference. Unlike weight-based metrics that estimate importance from static model parameters, our calibration-based approach scores experts through actual inference on diverse workloads, producing significantly more accurate importance rankings. We introduce three complementary techniques: layer-adaptive expert allocation, which allows each layer to retain a different number of experts based on its sensitivity; language-aware expert protection, which detects and preserves language-specialized experts during compression; and zerobias router optimization, a zero-cost post-processing step that recovers quality at pruning cliff points by neutralizing stale router biases. We validate our approach on three model families: GPT-OSS-20B (lossless compression, MMLU 78% preserved at 10.4 GB), Qwen3-30B-A3B (language-aware pruning, MMLU 79% with thinking at 14 GB), and Qwen3-Coder-Next 80B (50% pruning, MMLU 72% at 24.4 GB). Across all models, we identify a universal pruning cliff phenomenon where quality transitions sharply from preserved to destroyed within a narrow pruning range, and show that the Gini coefficient of expert importance distribution predicts cliff sharpness. Our framework requires no GPU training, no gradient computation, and completes in under one hour on consumer hardware.

1. Introduction

Mixture-of-Experts (MoE) language models achieve frontier-level quality by activating only a fraction of their total parameters per token. Recent models such as DeepSeek-V3 (671B total, 37B active), Qwen3-Coder-Next (80B total, ~3B active per token), and GPT-OSS-20B (21B total, 3.6B active) demonstrate that MoE architectures can match or exceed dense models at a fraction of the inference cost. However, their total parameter count—often tens to hundreds of gigabytes in quantized form—places them beyond the memory capacity of consumer hardware.

Existing compression techniques address the size problem but not the structural opportunity MoE models present. Post-training quantization (GPTQ, AWQ, GGUF Q4) reduces precision uniformly across all parameters, treating the model as a monolithic block. Knowledge distillation can produce smaller student models but requires expensive retraining. Neither approach exploits the fundamental property that distinguishes MoE from dense models: the existence of discrete, independently-parameterized expert subnetworks that can be selectively retained or removed based on their functional contribution.

We propose calibration-based expert pruning: measuring expert importance through actual inference on representative workloads, then removing the least important experts from each layer. This operates directly on quantized GGUF model files, requiring no dequantization, no gradient computation, and no retraining. The output is a valid GGUF file with fewer experts per layer, ready for immediate inference.

Our contributions are:

  1. Calibration-based importance scoring that significantly outperforms weight-based metrics (+14pp MMLU, +20pp on Japanese tasks).
  2. Layer-adaptive expert allocation that preserves quality by allowing each layer to retain a dynamically determined number of experts.
  3. Language-aware expert protection that detects and preserves language-specialized experts, enabling market-specific compression.
  4. Zerobias router optimization that recovers quality at pruning cliff points by neutralizing stale router biases, extending the lossless compression frontier at zero cost.
  5. Cross-model validation on three architectures (32, 128, and 512 experts per layer) demonstrating that pruning cliffs are universal and predictable.

2. Method

2.1 Overview

Our pruning pipeline operates in four stages: (1) calibration data collection, (2) importance scoring, (3) pruning plan generation with layer-adaptive allocation, and (4) GGUF file pruning. The pipeline takes a quantized GGUF model and a set of calibration prompts as input, and produces a pruned GGUF file with variable expert counts per layer.
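
As a structural sketch, the four stages can be mocked end-to-end on a toy in-memory model. All names, shapes, and the GGUF-style tensor keys below are illustrative assumptions, not the real tool's API; stages 1–3 are stubbed so the data flow is visible:

```python
import numpy as np

rng = np.random.default_rng(0)
n_layers, n_experts, keep = 2, 8, 6

# Toy "model": one weight matrix per expert, keyed in a GGUF-like style.
# The tensor naming convention here is hypothetical.
model = {f"blk.{l}.expert.{e}": rng.normal(size=(4, 4))
         for l in range(n_layers) for e in range(n_experts)}

# Stages 1-2 (calibration + importance scoring), stubbed with random scores.
importance = {l: rng.random(n_experts) for l in range(n_layers)}

# Stage 3: pruning plan. Uniform top-`keep` per layer for simplicity;
# the real pipeline allocates per-layer counts adaptively.
plan = {l: set(np.argsort(importance[l])[-keep:]) for l in range(n_layers)}

# Stage 4: materialize the pruned model by copying only retained experts.
pruned = {name: w for name, w in model.items()
          if int(name.split(".")[3]) in plan[int(name.split(".")[1])]}
```

The output dictionary stands in for the pruned GGUF file: each layer keeps only its planned expert subset, and all retained tensors are copied unchanged.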

2.2 Calibration-Based Importance Scoring

Each calibration prompt is run through the full model, and for every layer and expert, we record how frequently the router selects that expert and how strongly it prefers it when selected. The importance score combines these two signals: an expert that is both frequently activated and strongly preferred receives a high score, while an expert that is rarely used or weakly gated receives a low score.
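
Concretely, one plausible way to combine the two signals is activation frequency multiplied by mean gate weight when selected. This is a sketch of the idea, not necessarily the framework's exact scoring rule:

```python
import numpy as np

def expert_importance(topk_ids, topk_weights, n_experts):
    """Score experts from router logs collected during calibration.
    topk_ids / topk_weights: (tokens, k) arrays of the router's top-k
    expert indices and gate weights per token."""
    freq = np.zeros(n_experts)
    gate_sum = np.zeros(n_experts)
    for ids, ws in zip(topk_ids, topk_weights):
        for e, w in zip(ids, ws):
            freq[e] += 1
            gate_sum[e] += w
    # Mean gate weight when selected (zero for never-selected experts).
    mean_gate = np.divide(gate_sum, freq, out=np.zeros(n_experts), where=freq > 0)
    frac = freq / len(topk_ids)      # fraction of tokens routed to this expert
    return frac * mean_gate          # high = both frequent and strongly gated
```

An expert that is always selected with high gate weight scores near the top; one that is never routed to scores exactly zero and is the first pruning candidate.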

The calibration set should span the workloads the pruned model is expected to handle. For general-purpose compression, we use prompts covering code generation, mathematical reasoning, factual recall, and natural language question answering. For language-specific compression, we add prompts in the target language(s).

Key finding: Calibration-based scoring significantly outperforms weight-based scoring. On Qwen3-30B-A3B at 80% expert retention, calibration achieves MMLU 74% while weight-based scoring reaches only 60% (+14pp). On Japanese evaluation, calibration achieves 85% versus 65% (+20pp). The two methods select fundamentally different expert retention sets: weight norms do not predict inference-time importance.

2.3 Layer-Adaptive Expert Allocation

Not all layers in an MoE model are equally sensitive to expert removal. Some layers have highly specialized experts where removing any one causes significant quality loss, while others have redundant experts that can be safely removed. Our layer-adaptive approach computes per-layer importance distributions and allocates different retention counts to each layer based on the measured importance gap between retained and pruned experts.
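
One simple instantiation of per-layer allocation is coverage-based: each layer keeps the smallest expert set whose summed importance reaches a target fraction of the layer's total. This is an illustrative sketch, not necessarily the paper's exact gap criterion, and the `coverage` and `min_keep` knobs are assumptions:

```python
import numpy as np

def layer_budgets(importance_by_layer, coverage=0.95, min_keep=4):
    """For each layer, keep the smallest number of experts whose summed
    importance covers `coverage` of the layer's total. Layers with flat
    importance distributions keep many experts; skewed layers keep few."""
    budgets = []
    for imp in importance_by_layer:
        order = np.sort(np.asarray(imp, dtype=float))[::-1]  # descending
        cum = np.cumsum(order) / order.sum()
        k = int(np.searchsorted(cum, coverage)) + 1  # smallest covering k
        budgets.append(max(k, min_keep))
    return budgets
```

A perfectly uniform layer retains everything under this rule, while a layer dominated by a few experts is pruned hard, which is exactly the sensitivity behavior described above.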

This produces models where some layers retain nearly all experts while others have a significant number removed. The resulting GGUF files have a variable experts_per_layer metadata field that standard inference engines (llama.cpp) do not currently support. We developed moe-stream, an open-source Rust inference engine, to handle these variable-expert models. Models with uniform expert counts (e.g., GPT-OSS-20B pruned from 32 to 28 experts in all layers) remain compatible with llama.cpp.

2.4 Language-Aware Expert Protection

MoE models with sufficient expert count develop language-specialized experts during training. In Qwen3-30B-A3B (128 experts per layer), we identified 30 Japanese-specialist and 15 English-specialist experts through differential frequency analysis across multilingual calibration prompts. In contrast, GPT-OSS-20B (32 experts per layer) shows near-uniform routing (Gini = 0.041) with no language specialization, while GLM-5 (256 experts per layer) exhibits even stronger specialization (15 Japanese specialists, Gini = 0.444).

Expert count governs language specialization: 32 experts → 0 language specialists (Gini 0.041); 128 experts → 30 specialists (Gini 0.233); 256 experts → 15 specialists (Gini 0.444). Models with more experts develop clearer functional specialization, including language-specific routing.

For market-specific compression (e.g., Japanese market), we protect detected language-specialist experts from pruning regardless of their global importance score. This ensures that language capabilities are preserved even when aggressive compression is applied.
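
The differential frequency analysis can be sketched as follows: an expert is flagged as a language specialist when its activation frequency on target-language prompts greatly exceeds its frequency elsewhere. The `ratio` and `min_freq` thresholds are illustrative assumptions, not the paper's exact detection rule:

```python
import numpy as np

def language_specialists(freq_lang, freq_other, ratio=3.0, min_freq=0.01):
    """Return indices of experts whose activation frequency on
    target-language calibration prompts is at least `ratio` times their
    frequency on other prompts (and above a noise floor `min_freq`)."""
    freq_lang = np.asarray(freq_lang, dtype=float)
    freq_other = np.asarray(freq_other, dtype=float)
    return np.where((freq_lang >= min_freq) &
                    (freq_lang >= ratio * np.maximum(freq_other, 1e-9)))[0]
```

The returned indices would then be added to a protected set that the pruning plan may never remove, regardless of global importance rank.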

2.5 Zerobias Router Optimization

MoE routers contain learned bias terms that were calibrated during pre-training with the full expert set. After pruning, these biases may become miscalibrated: biases that previously directed tokens to now-absent experts create a routing vacuum, while remaining experts' biases no longer reflect their relative importance in the reduced set.

Zerobias sets all router biases to zero, forcing the router to rely solely on input-dependent routing weights. This is a zero-cost post-processing step—no training, no computation beyond modifying bias tensors in the GGUF file.
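
In memory, the operation is trivial. The sketch below assumes router bias tensors can be matched by a name suffix such as `ffn_gate_inp.bias`; actual GGUF tensor names vary by architecture, so the pattern is an assumption:

```python
import numpy as np

def apply_zerobias(tensors):
    """Zero every router (expert-gate) bias tensor in a model's tensor
    dict, leaving all other tensors untouched. The name suffix used for
    matching is hypothetical and architecture-dependent."""
    for name, t in tensors.items():
        if name.endswith("ffn_gate_inp.bias"):
            tensors[name] = np.zeros_like(t)
    return tensors
```

Because only small bias vectors are rewritten, the cost is negligible relative to the model size, which is what makes zerobias a zero-cost post-processing step.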

Zerobias is cliff-specific: At the pruning cliff (GPT-OSS-20B 27/32 experts), zerobias recovers +9pp MMLU (68% → 77%), nearly matching the safe operating point (28/32 = 78%). However, at the well-calibrated 28/32 point, zerobias is harmful (−14pp), because the original biases still encode useful routing information. Zerobias is beneficial only when the original biases are miscalibrated due to pruning beyond the model's redundancy margin.

3. Results

3.1 GPT-OSS-20B: Lossless Compression

GPT-OSS-20B is a 21B-parameter MoE model with 32 experts per layer, top-2 sigmoid routing, and MXFP4 format. Uniform pruning (same number of experts removed from each layer) is used because the model has too few experts for meaningful layer-adaptive allocation.

Table 1: GPT-OSS-20B pruning results (Q4_K_M, MMLU 100Q 0-shot, GSM8K 50Q 0-shot)
| Config | Size | Experts/Layer | MMLU | GSM8K | HumanEval |
|---|---|---|---|---|---|
| Original | 11.67 GB | 32 | 78% | | |
| Pruned 28/32 | 10.40 GB | 28 | 78% | 92% | 78% |
| Pruned 27/32 | ~10.1 GB | 27 | 68% | | |
| 27/32 + Zerobias | ~9.4 GB | 27 | 77% | 84% | |
| Pruned 26/32 | ~9.7 GB | 26 | 69% | | |

The 28/32 model achieves lossless compression: MMLU 78% (identical to original), GSM8K 92% (46/50), and HumanEval 78% (39/50). The file size reduction from 11.67 GB to 10.40 GB (−10.9%) is modest but comes at zero quality cost.

At 27/32 experts, a sharp pruning cliff appears: MMLU drops from 78% to 68% (−10pp) with a single expert removed per layer. Applying zerobias recovers most of this loss (77%, −1pp from original), producing a 9.4 GB model that is near-lossless. Notably, 26/32 without zerobias scores 69%—higher than 27/32 without zerobias (68%)—revealing that the cliff is a step function concentrated at the 28→27 transition.

3.2 Qwen3-30B-A3B: Language-Aware Pruning

Qwen3-30B-A3B is a 30B-parameter MoE model with 128 experts per layer across 48 layers. With more experts, layer-adaptive allocation and language-aware protection become effective.

Table 2: Qwen3-30B-A3B pruning curve (Q4_K_M, MMLU 100Q)
| Config | Size | Keep Rate | MMLU | Notes |
|---|---|---|---|---|
| Original | 17.3 GB | 100% | 77% | |
| Pruned 90% | 15.6 GB | 90% | 73% | −4pp |
| Pruned 80% (JP-aware) | 14.0 GB | 80% | 79% (think-ON) | JA 90% |
| Pruned 70% | 12.3 GB | 70% | 51% | Cliff (−26pp) |
| Pruned 60% | | 60% | | Collapse |

The 80% retention model (14.0 GB) with language-aware Japanese expert protection achieves MMLU 79% (with thinking enabled), GSM8K 92%, and Japanese quality 90%. This demonstrates that language-aware pruning can simultaneously compress and preserve multilingual quality.

A sharp cliff appears between 80% and 70% retention: MMLU falls from 72% to 51% (−21pp; the 72% figure is the 80%-retention score without thinking mode), with further pruning causing complete collapse. This establishes 80% retention (14 GB) as the practical lower bound for this model.

Calibration vs. Weight-Based Scoring

Table 3: Importance scoring method comparison (30B-A3B, 80% retention)
| Method | MMLU | Japanese | GSM8K |
|---|---|---|---|
| Calibration-based + JA protect | 74% | 85% | 92% |
| Weight-based + JA protect | 60% | 65% | |

Calibration-based scoring outperforms weight-based scoring by +14pp on MMLU and +20pp on Japanese evaluation at the same retention level. The two methods produce fundamentally different expert retention sets—weight norms do not predict inference-time importance.

3.3 Qwen3-Coder-Next 80B: Deep Pruning

Qwen3-Coder-Next is an 80B-parameter MoE model with 512 experts per layer across 48 layers (~3B active per token). The large expert count enables aggressive layer-adaptive pruning.

Table 4: Qwen3-Coder-Next 80B pruning (Q4_K_M, MMLU 100Q)
| Config | Size | Keep Rate | MMLU | Other |
|---|---|---|---|---|
| Original Q4 | ~48 GB | 100% | 77% | HumanEval 74% |
| v7 (44% pruned) | 27.68 GB | 56% | 70% | HumanEval 72%, LCB Easy 83% |
| 50% pruned | 24.4 GB | 50% | 72% | |
| 55% pruned | ~20 GB | 45% | 60% | Cliff (−12pp) |
| 65% pruned | ~17.9 GB | 35% | Random | Complete collapse |

The 50% pruned model (24.4 GB) achieves MMLU 72%—93.5% of the original quality while fitting within 24 GB consumer hardware memory. This is notably better than Q2 quantization of the same model, which would produce a similar file size (~25–28 GB) but at estimated MMLU 55–60% due to uniform precision loss across all weights.

A cliff appears between 50% and 45% retention (−12pp), with 35% retention producing random outputs. The 50% keep rate is the deepest viable compression for this model.

3.4 Expert Pruning vs. Quantization

Table 5: Expert pruning compared to aggressive quantization at similar sizes
| Approach | Target Size | Method | Remaining Precision | Quality Impact |
|---|---|---|---|---|
| Expert Pruning | 24.4 GB | Remove 50% of experts | Full Q4 (4-bit) | MMLU 72% |
| Q2 Quantization | ~25–28 GB | Reduce all weights to 2-bit | 2-bit | MMLU ~55–60% |

Expert pruning and quantization are orthogonal compression techniques. Expert pruning removes entire expert subnetworks while preserving full quantization precision on the remaining experts. Quantization reduces precision uniformly across all parameters. At the same file size, expert pruning achieves significantly higher quality because retained experts operate at their original precision, whereas aggressive quantization degrades every weight.

Furthermore, expert pruning can be applied on top of already-quantized models (as we do with Q4_K_M GGUF files), making the two techniques composable: quantization first for weight-level compression, then expert pruning for structural compression.

4. Cross-Model Findings

4.1 The Universal Pruning Cliff

All three model families exhibit a sharp pruning cliff—a narrow range of pruning rates within which quality transitions from fully preserved to fully destroyed. This is not gradual degradation but a phase transition.

Table 6: Pruning cliff characteristics across model families
| Model | Experts/Layer | Safe Pruning | Cliff | Gini |
|---|---|---|---|---|
| GPT-OSS-20B | 32 | 4 experts (~12.5%) | 28 → 27 (−10pp) | 0.041 |
| Qwen3-30B-A3B | 128 | ~26 experts (~20%) | 80% → 70% (−21pp) | 0.233 |
| Qwen3-80B | 512 | ~256 experts (~50%) | 50% → 45% (−12pp) | |

A unifying predictor emerges: the Gini coefficient of expert importance distribution predicts cliff sharpness. Models with low Gini (near-uniform importance, e.g., GPT-OSS at 0.041) exhibit sharper per-expert cliffs because every expert contributes substantially. Models with higher Gini (more skewed importance) exhibit more gradual degradation, allowing deeper pruning before the cliff.
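
The Gini coefficient referenced here can be computed directly from an expert-importance vector using the standard rank-based formula (this is the textbook definition, not code from the framework):

```python
import numpy as np

def gini(x):
    """Gini coefficient of a non-negative importance distribution:
    0 = perfectly uniform importance, approaching 1 = importance
    concentrated in very few experts."""
    x = np.sort(np.asarray(x, dtype=float))  # ascending
    n = len(x)
    i = np.arange(1, n + 1)
    # G = 2 * sum(i * x_i) / (n * sum(x)) - (n + 1) / n, with i = 1..n
    return 2.0 * np.sum(i * x) / (n * np.sum(x)) - (n + 1) / n
```

A 32-expert layer with fully uniform importance yields G = 0 (the GPT-OSS regime), while importance concentrated on a single expert yields G = 1 − 1/32, the skewed regime where deep pruning is possible before the cliff.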

4.2 Quality Degradation Order

Across all pruning experiments, we observe a consistent ordering of capability degradation as pruning increases:

  1. Code generation (most fragile) — degrades first, producing pseudocode before valid programs disappear
  2. Arithmetic — phase-transition-like errors (e.g., 15+27=45)
  3. Reasoning — degraded logical coherence
  4. Factual knowledge (most robust) — last to degrade, distributed across many experts

This ordering is significant for calibration design: ensuring the calibration set covers code generation (the most fragile capability) is critical, as code-only evaluation reveals failure modes invisible to QA-only testing.

5. Selected Negative Results

Over 22 research phases, we systematically evaluated numerous approaches that did not succeed. We summarize the most significant negative results as boundary conditions for future work.

Table 7: Summary of key negative results
| Approach | Result | Key Insight |
|---|---|---|
| Gate L2 norm pruning (REAP) | HumanEval 70% at 50% | Static weight metrics fail; calibration-based scoring required |
| Weight-based importance | MMLU 60% (vs. calibration 74%) | Weight norms do not predict inference-time importance |
| Uniform pruning ratio | 80B MMLU 64% | Layer-adaptive allocation critical for quality |
| EN-optimized force-penalty | MMLU 58% | Language experts also contribute to STEM reasoning |
| Routing boost + pruning | MMLU 56% (−21pp) | Boost and pruning plans must be jointly computed |
| Expert-as-Adapter (KD) | Layer MSE −15%, E2E −2pp | Layer-level improvement ≠ end-to-end improvement |
| TinyLoRA (13 params) | MMLU −4pp at cliff | Micro-parameter tuning insufficient for MoE recovery |
| GRPO router training | MMLU 67% (−5pp) | Gradient-based bias optimization underperforms zerobias |
| Zerobias iterative pruning | 26/32: 65% (−4pp) | Zerobias is cliff-specific; harmful beyond the cliff |
| Dense SLM pruning (4B) | FFN 25%: collapse | Dense models lack expert-level redundancy; MoE ≫ dense for compression |
| MoE abliteration | Max consistency 56% | Safety behavior distributed, not expert-localized |

Cross-cutting lesson: The most consistent finding across all negative results is that router bias calibration—not expert capacity—is the dominant factor in post-pruning quality. Expert-as-Adapter achieves 15% MSE reduction but −2pp end-to-end. TinyLoRA and GRPO router training both degrade quality. Only zerobias (at cliff points) and calibration-based pruning (avoiding the cliff) produce positive results. This implies that preserving the router's ability to distribute tokens correctly is more important than the representational capacity of any individual expert.

6. Inference: moe-stream

Layer-adaptive pruning produces models with different expert counts per layer, which standard inference engines (llama.cpp) do not support. We developed moe-stream, an open-source Rust inference engine that handles these variable-expert models.

Compatibility: models with uniform expert counts (GPT-OSS-20B 28/32, 27/32) work with both llama.cpp and moe-stream, while models with layer-adaptive counts (Qwen3-30B-A3B JP-80pct, Qwen3-Coder-Next 50pct) require moe-stream.

7. Conclusion

We have presented a practical framework for compressing MoE language models by selectively removing experts. The key principles are:

  1. Calibration over weights. Measuring expert importance through actual inference produces dramatically better results than static weight analysis (+14pp MMLU, +20pp on language tasks).
  2. Layer-adaptive allocation. Each layer has different sensitivity to pruning; adaptive allocation preserves quality where it matters most.
  3. Language-aware protection. Models with sufficient expert count develop language-specialized routing, and protecting these experts enables market-specific compression without quality loss.
  4. Zerobias at the cliff. When pruning reaches the cliff point, zeroing router biases is the most effective recovery technique—surpassing gradient-based optimization, expert adapter training, and micro-parameter tuning.
  5. The cliff is universal. All tested MoE architectures exhibit a sharp pruning cliff, predictable from the Gini coefficient of expert importance.

The practical outcome is that MoE models can be compressed significantly on consumer hardware with minimal quality loss: GPT-OSS-20B from 11.67 to 9.4 GB (MMLU 77%), Qwen3-30B-A3B from 17.3 to 14.0 GB (MMLU 79% with thinking), and Qwen3-Coder-Next 80B from ~48 to 24.4 GB (MMLU 72%). The entire pipeline requires no GPU training, no gradient computation, and completes in under one hour.

Pre-Pruned Models

All models are available on HuggingFace:

| Model | Size | MMLU | llama.cpp | moe-stream |
|---|---|---|---|---|
| PrunedHub GPT-OSS-20B-28x | 10.4 GB | 78% | Yes | Yes |
| PrunedHub GPT-OSS-20B-27x-Zerobias | ~9.4 GB | 77% | Yes | Yes |
| PrunedHub Qwen3-30B-A3B-JP-80pct | 14.0 GB | 79% | No | Required |
| PrunedHub Qwen3-Coder-Next-50pct | 24.4 GB | 72% | No | Required |

Citation

@misc{goba-ai-labs-expert-pruning-2026,
  title={Calibration-Based Expert Pruning for Mixture-of-Experts Language Models},
  author={GOBA-AI-Labs},
  year={2026},
  url={https://goba-ai-labs.github.io/paper/}
}