Half the experts.
Full quality.

We remove up to 50% of MoE experts while preserving benchmark scores. Proven on models up to 80B — our goal is 400B+ on consumer hardware.

45 GB → 24 GB
Qwen3-Coder-Next (80B) — 50% expert pruning, MMLU 72%

PrunedHub Models

Pruned MoE models ready to run. GPT-OSS models work with llama.cpp; Qwen3 models require moe-stream.

Inference Tools

Run pruned MoE models on consumer hardware.

moe-stream

Rust

SSD-streaming MoE inference engine for Apple Silicon, CUDA & Linux

80B models with 4GB RAM via NVMe SSD streaming
Layer-adaptive pruning support (experts_per_layer)
Q4 quantized matmul — +79% speedup
Metal GPU + CUDA + CPU hybrid inference
CI/CD with pre-built binaries for macOS & Linux
Python bindings (PyO3)
~55 tok/s GPU-resident · ~2 tok/s SSD streaming · MIT / Apache-2.0
View on GitHub
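For a feel of how the Python bindings might be used, here is a purely illustrative sketch. The actual moe-stream API may differ; every name below (the MoeStream class, the experts_per_layer and backend parameters, the model filename) is a placeholder, not the real interface.

```python
# Hypothetical usage sketch of the PyO3 bindings. Treat all names here as
# illustrative placeholders rather than moe-stream's actual API surface.
from moe_stream import MoeStream  # assumed module and class name

engine = MoeStream(
    model_path="qwen3-coder-next-pruned-q4.gguf",  # placeholder filename
    experts_per_layer="auto",  # layer-adaptive pruning profile (assumed option)
    backend="metal",           # or "cuda" / "cpu" (assumed backend strings)
)

# Stream tokens as they are generated.
for token in engine.generate("Write a quicksort in Rust.", max_tokens=256):
    print(token, end="", flush=True)
```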

How It Works

Expert pruning, not aggressive quantization.

🔍

Calibration-Based Scoring

Expert importance measured through actual inference on diverse workloads. Significantly more accurate than static weight analysis.
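As a rough illustration of what calibration-based scoring means in practice, the sketch below runs real inference over a calibration set and accumulates each expert's routing probability mass. The model.moe_layers attribute, the .router submodule, and the logits shape are assumptions for illustration, not the project's actual code.

```python
# Minimal sketch: score experts by routing mass observed during real inference,
# rather than by static weight statistics. Assumes model.moe_layers is a list of
# MoE blocks whose .router is an nn.Linear producing (tokens, num_experts) logits.
import torch

def score_experts(model, calib_loader, device="cpu"):
    num_layers = len(model.moe_layers)
    num_experts = model.moe_layers[0].router.out_features
    scores = torch.zeros(num_layers, num_experts)

    def make_hook(layer_idx):
        def hook(module, inputs, logits):
            probs = torch.softmax(logits, dim=-1)        # (tokens, num_experts)
            scores[layer_idx] += probs.sum(dim=0).cpu()  # accumulate routing mass
        return hook

    handles = [blk.router.register_forward_hook(make_hook(i))
               for i, blk in enumerate(model.moe_layers)]
    with torch.no_grad():
        for batch in calib_loader:   # diverse workloads, actual forward passes
            model(batch.to(device))
    for h in handles:
        h.remove()
    return scores / scores.sum(dim=1, keepdim=True)  # per-layer importance
```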

📊

Layer-Adaptive Allocation

Each layer retains a dynamically determined number of experts. Some layers are more sensitive to pruning — adaptive allocation preserves quality where it matters.
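A minimal sketch of one way such an allocation could work, assuming a per-layer sensitivity score has already been measured on calibration data. The function name and the greedy trimming strategy are illustrative, not the project's actual algorithm.

```python
# Hypothetical allocation sketch: split a global expert budget across layers in
# proportion to measured sensitivity, with a floor so no layer drops below a
# minimum expert count. Sensitivities would come from calibration runs.
def allocate_experts(sensitivity, total_budget, min_keep=2):
    total = sum(sensitivity)
    # Proportional share per layer, floored at min_keep experts.
    alloc = [max(min_keep, round(total_budget * s / total)) for s in sensitivity]
    # If rounding overshot the budget, trim the least sensitive layers first.
    order = sorted(range(len(sensitivity)), key=lambda i: sensitivity[i])
    i = 0
    while sum(alloc) > total_budget and i < len(order):
        if alloc[order[i]] > min_keep:
            alloc[order[i]] -= 1
        else:
            i += 1  # this layer is at the floor; move to the next one
    return alloc  # experts_per_layer-style list, one count per layer
```

For example, allocate_experts([1.0, 0.3, 0.7], total_budget=12) returns [6, 2, 4]: the most pruning-sensitive layer keeps the largest share of the budget.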

🌏

Language-Aware Optimization

Automatic detection and protection of language-specialized experts. Japanese, Chinese, and other language capabilities preserved during compression.
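One plausible way to detect language-specialized experts, assuming per-language calibration scores from a scorer like the one sketched above; the dominance threshold and the data layout are illustrative assumptions, not the project's actual heuristic.

```python
# Hypothetical sketch: score experts separately on per-language calibration
# sets, then pin any expert whose activation is dominated by a single language
# so that pruning cannot remove it.
def protected_experts(lang_scores, dominance=0.6):
    """lang_scores: dict mapping language -> [layers][experts] normalized scores."""
    langs = list(lang_scores)
    num_layers = len(lang_scores[langs[0]])
    num_experts = len(lang_scores[langs[0]][0])
    protected = set()
    for l in range(num_layers):
        for e in range(num_experts):
            total = sum(lang_scores[g][l][e] for g in langs)
            if total == 0:
                continue
            # "Specialized" if one language accounts for most of this expert's use.
            if max(lang_scores[g][l][e] for g in langs) / total >= dominance:
                protected.add((l, e))
    return protected  # (layer, expert) pairs exempt from pruning
```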

Zerobias Router Optimization

Post-pruning router bias correction extends the lossless compression frontier. Zero cost, no retraining required.
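A hedged sketch of what such a bias correction could look like: shift each surviving expert's router bias by the log-ratio between its pre- and post-pruning average routing probability, a closed-form update with no retraining. The function and its inputs are assumptions for illustration, not the project's actual zerobias implementation.

```python
# Hypothetical post-pruning bias correction: nudge the pruned router so that
# each kept expert's average routing probability (on calibration data) matches
# the share it received before pruning. No gradients, no retraining.
import torch

def correct_router_bias(router_bias, p_before, p_after, eps=1e-8):
    """
    router_bias: (kept_experts,) bias vector of the pruned router, updated in place
    p_before:    mean routing prob of each kept expert before pruning,
                 renormalized over the kept set
    p_after:     mean routing prob of each kept expert after pruning
    """
    # Adding log(p_before / p_after) to the logits rebalances the softmax
    # toward the pre-pruning distribution.
    delta = torch.log(p_before + eps) - torch.log(p_after + eps)
    router_bias += delta - delta.mean()  # mean-center: softmax is shift-invariant
    return router_bias
```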

Expert Pruning vs Quantization

|                     | Expert Pruning                    | Q2 Quantization                 |
|---------------------|-----------------------------------|---------------------------------|
| Approach            | Remove redundant experts entirely | Reduce precision of all weights |
| Quality impact      | Targeted, minimal                 | Uniform degradation             |
| Quality at ~24 GB   | MMLU 72%                          | MMLU ~55–60%                    |
| Remaining precision | Full Q4 precision                 | 2-bit precision                 |

Where This Is Going

Expert pruning is just the beginning. Here's what becomes possible.

💻

Consumer Hardware

Frontier-class MoE models on a laptop with 24GB RAM — no $10K+ server GPU required. Democratizing access to the most capable AI.

🔋

Lower Energy Cost

Fewer experts means fewer FLOPs per token. Data centers could serve the same quality at half the compute — significant power savings at scale.

🚀

Post-Pruning RL

Pruned models create headroom for targeted reinforcement learning. Same size, potentially better performance — quality amplification, not just preservation.

🌐

Edge Deployment

MoE models become viable for on-device inference. Private, fast, and offline — bringing expert-level AI to phones and embedded systems.

Built on a MacBook. Funded by You.

This entire research program, spanning 22 phases, 4 models, and 50+ experiments, was conducted on a single MacBook Pro (M4 Pro, 24 GB RAM, 512 GB SSD). No data center. No corporate funding.

Your support directly funds GPU access to test on larger models, cloud compute for comprehensive benchmarking, and time to continue open-source research.

Support on Ko-fi