We remove up to 50% of MoE experts while preserving benchmark scores. Demonstrated on models up to 80B parameters; our goal is 400B+ on consumer hardware.
Pruned MoE models ready to run. GPT-OSS models work with llama.cpp; Qwen3 models require moe-stream.
openai/gpt-oss-20b
No measured quality loss across benchmarks: MMLU 78%, HumanEval 78%, GSM8K 92%. Fits in 16 GB of RAM.
Qwen/Qwen3-Coder-Next
An 80B model compressed to 24 GB with MMLU at 72%. At 44% pruning (27.7 GB): LCB Easy 83%, HumanEval 72%.
Qwen/Qwen3-30B-A3B
Language-aware pruning preserves Japanese quality. With thinking mode on: MMLU 79%, JA 90%.
openai/gpt-oss-20b
Router optimization recovers quality at the pruning cliff: only a 1 pp drop with 15.6% fewer experts.
Run pruned MoE models on consumer hardware.
SSD-streaming MoE inference engine for Apple Silicon, CUDA & Linux
Expert pruning, not aggressive quantization.
Expert importance measured through actual inference on diverse workloads. Significantly more accurate than static weight analysis.
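As a rough sketch of what activation-based importance scoring can look like (the function, array shapes, and synthetic logits below are illustrative assumptions, not this project's actual API): run calibration text through the model, capture router outputs, and credit each expert with the probability mass it receives when it is actually selected.

```python
# Illustrative sketch: score experts by the routing-probability mass
# they receive during calibration inference. Synthetic logits stand in
# for real router activations captured via forward hooks.
import numpy as np

def expert_importance(router_logits, top_k=2):
    # router_logits: (num_tokens, num_experts)
    probs = np.exp(router_logits - router_logits.max(axis=-1, keepdims=True))
    probs /= probs.sum(axis=-1, keepdims=True)
    # Only credit an expert when it lands in the token's top-k.
    topk_idx = np.argpartition(-probs, top_k - 1, axis=-1)[:, :top_k]
    mask = np.zeros_like(probs)
    np.put_along_axis(mask, topk_idx, 1.0, axis=-1)
    return (probs * mask).sum(axis=0)  # shape: (num_experts,)

# Toy calibration pass: 10k tokens routed across 32 experts.
rng = np.random.default_rng(0)
scores = expert_importance(rng.normal(size=(10_000, 32)), top_k=2)
prune_order = np.argsort(scores)  # least-important experts pruned first
print(prune_order[:8])
```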
Each layer retains a dynamically determined number of experts. Some layers are more sensitive to pruning — adaptive allocation preserves quality where it matters.
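A minimal sketch of how such adaptive allocation could work, assuming a measured per-layer sensitivity signal (the `sensitivity` values and function name below are hypothetical):

```python
# Illustrative sketch: split a global expert budget across layers in
# proportion to a measured per-layer pruning sensitivity. The
# `sensitivity` signal here is synthetic.
import numpy as np

def allocate_experts(sensitivity, num_experts, keep_ratio, min_keep=4):
    sensitivity = np.asarray(sensitivity, dtype=float)
    budget = keep_ratio * num_experts * len(sensitivity)
    weights = sensitivity / sensitivity.sum()
    keep = np.round(weights * budget).astype(int)
    # Sensitive layers keep more experts; every layer keeps a floor,
    # and no layer can keep more experts than it has.
    return np.clip(keep, min_keep, num_experts)

# 24 layers x 32 experts, keep ~60% overall; middle layers most sensitive.
sens = 1.0 + np.exp(-((np.arange(24) - 12) ** 2) / 32.0)
print(allocate_experts(sens, num_experts=32, keep_ratio=0.6))
```

Rounding and the per-layer floor mean the totals can drift slightly from the exact budget; a real allocator would redistribute the remainder.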
Automatic detection and protection of language-specialized experts. Japanese, Chinese, and other language capabilities preserved during compression.
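One plausible way to detect such experts, sketched under the assumption that per-expert routing mass has been measured separately on each language's calibration set (all names and values below are illustrative):

```python
# Illustrative sketch: flag experts whose routing mass is dominated by
# one language. `usage_en` / `usage_ja` stand in for per-expert routing
# mass measured on English and Japanese calibration text.
import numpy as np

def language_specialists(usage_a, usage_b, ratio=3.0, eps=1e-9):
    # Normalize each language's usage to a distribution, then flag
    # experts one language routes to >= `ratio` times more than the other.
    a = usage_a / (usage_a.sum() + eps)
    b = usage_b / (usage_b.sum() + eps)
    return np.where((a > ratio * b) | (b > ratio * a))[0]

rng = np.random.default_rng(1)
usage_en, usage_ja = rng.random(32), rng.random(32)
usage_ja[[3, 17]] *= 10  # pretend two experts fire mostly on Japanese
print("protected:", language_specialists(usage_en, usage_ja))
```

Experts flagged this way would simply be excluded from the pruning candidate set, regardless of their global importance score.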
Post-pruning router bias correction extends the lossless compression frontier. Zero cost, no retraining required.
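A minimal sketch of the bias-only idea, assuming per-expert routing frequencies measured before and after pruning (the function and numbers are illustrative, not the project's exact procedure):

```python
# Illustrative sketch: bias-only router correction. After pruning, add
# a per-expert bias to the surviving experts' router logits so their
# empirical selection frequencies move back toward the pre-pruning
# shares. No weights are retrained.
import numpy as np

def router_bias_correction(freq_before, freq_after, eps=1e-9):
    target = freq_before / (freq_before.sum() + eps)
    observed = freq_after / (freq_after.sum() + eps)
    # Log-ratio bias: shifts softmax routing toward the target shares.
    return np.log(target + eps) - np.log(observed + eps)

freq_before = np.array([0.30, 0.25, 0.25, 0.20])  # pre-pruning shares
freq_after = np.array([0.45, 0.20, 0.20, 0.15])   # measured after pruning
print(router_bias_correction(freq_before, freq_after))
```

Because only a bias vector is added, the correction costs nothing at inference time and requires no gradient updates.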
Expert pruning is just the beginning. Here's what becomes possible.
Frontier-class MoE models on a laptop with 24GB RAM — no $10K+ server GPU required. Democratizing access to the most capable AI.
Fewer experts means fewer FLOPs per token. Data centers could serve the same quality at half the compute — significant power savings at scale.
Pruned models create headroom for targeted reinforcement learning. Same size, potentially better performance — quality amplification, not just preservation.
MoE models become viable for on-device inference. Private, fast, and offline — bringing expert-level AI to phones and embedded systems.
All of this research (22 phases, 4 models, 50+ experiments) was conducted on a single MacBook Pro (M4 Pro, 24 GB RAM, 512 GB SSD). No data center. No corporate funding.
Your support directly funds GPU access to test on larger models, cloud compute for comprehensive benchmarking, and time to continue open-source research.
Support on Ko-fi