Half the experts.
Full quality.

We remove up to 50% of MoE experts while preserving benchmark scores. Proven on models up to 80B — our goal is 400B+ on consumer hardware.

45 GB → 24 GB
Qwen3-Coder-Next (80B) — 50% expert pruning, MMLU 72%

PrunedHub Models

Pruned MoE models ready to run. GPT-OSS models work with llama.cpp; Qwen3 models require moe-stream.

Inference Tools

Run pruned MoE models on consumer hardware.

moe-stream

Rust

SSD-streaming MoE inference engine for Apple Silicon, CUDA & Linux

80B models with 4GB RAM via NVMe SSD streaming
Layer-adaptive pruning support (experts_per_layer)
Q4 quantized matmul — +79% speedup
Metal GPU + CUDA + CPU hybrid inference
CI/CD with pre-built binaries for macOS & Linux
Python bindings (PyO3)
~55 tok/s GPU-resident · ~2 tok/s SSD streaming · MIT / Apache-2.0
View on GitHub
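For a feel of how the Python bindings might be used, here is a purely illustrative sketch. The actual moe-stream API may differ; every name below (the MoeStream class, the experts_per_layer and backend parameters, the model filename) is a placeholder, not the real interface.

```python
# Hypothetical usage sketch of the PyO3 bindings. Treat all names here as
# illustrative placeholders rather than moe-stream's actual API surface.
from moe_stream import MoeStream  # assumed module and class name

engine = MoeStream(
    model_path="qwen3-coder-next-pruned-q4.gguf",  # placeholder filename
    experts_per_layer="auto",  # layer-adaptive pruning profile (assumed option)
    backend="metal",           # or "cuda" / "cpu" (assumed backend strings)
)

# Stream tokens as they are generated.
for token in engine.generate("Write a quicksort in Rust.", max_tokens=256):
    print(token, end="", flush=True)
```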

How It Works

Expert pruning, not aggressive quantization.

🔍

Calibration-Based Scoring

Expert importance measured through actual inference on diverse workloads. Significantly more accurate than static weight analysis.
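As a rough illustration of what calibration-based scoring means in practice, the sketch below runs real inference over a calibration set and accumulates each expert's routing probability mass. The model.moe_layers attribute, the .router submodule, and the logits shape are assumptions for illustration, not the project's actual code.

```python
# Minimal sketch: score experts by routing mass observed during real inference,
# rather than by static weight statistics. Assumes model.moe_layers is a list of
# MoE blocks whose .router is an nn.Linear producing (tokens, num_experts) logits.
import torch

def score_experts(model, calib_loader, device="cpu"):
    num_layers = len(model.moe_layers)
    num_experts = model.moe_layers[0].router.out_features
    scores = torch.zeros(num_layers, num_experts)

    def make_hook(layer_idx):
        def hook(module, inputs, logits):
            probs = torch.softmax(logits, dim=-1)        # (tokens, num_experts)
            scores[layer_idx] += probs.sum(dim=0).cpu()  # accumulate routing mass
        return hook

    handles = [blk.router.register_forward_hook(make_hook(i))
               for i, blk in enumerate(model.moe_layers)]
    with torch.no_grad():
        for batch in calib_loader:   # diverse workloads, actual forward passes
            model(batch.to(device))
    for h in handles:
        h.remove()
    return scores / scores.sum(dim=1, keepdim=True)  # per-layer importance
```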

📊

Layer-Adaptive Allocation

Each layer retains a dynamically determined number of experts. Some layers are more sensitive to pruning — adaptive allocation preserves quality where it matters.
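A minimal sketch of one way such an allocation could work, assuming a per-layer sensitivity score has already been measured on calibration data. The function name and the greedy trimming strategy are illustrative, not the project's actual algorithm.

```python
# Hypothetical allocation sketch: split a global expert budget across layers in
# proportion to measured sensitivity, with a floor so no layer drops below a
# minimum expert count. Sensitivities would come from calibration runs.
def allocate_experts(sensitivity, total_budget, min_keep=2):
    total = sum(sensitivity)
    # Proportional share per layer, floored at min_keep experts.
    alloc = [max(min_keep, round(total_budget * s / total)) for s in sensitivity]
    # If rounding overshot the budget, trim the least sensitive layers first.
    order = sorted(range(len(sensitivity)), key=lambda i: sensitivity[i])
    i = 0
    while sum(alloc) > total_budget and i < len(order):
        if alloc[order[i]] > min_keep:
            alloc[order[i]] -= 1
        else:
            i += 1  # this layer is at the floor; move to the next one
    return alloc  # experts_per_layer-style list, one count per layer
```

For example, allocate_experts([1.0, 0.3, 0.7], total_budget=12) returns [6, 2, 4]: the most pruning-sensitive layer keeps the largest share of the budget.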

🌏

Language-Aware Optimization

Automatic detection and protection of language-specialized experts. Japanese, Chinese, and other language capabilities preserved during compression.
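One plausible way to detect language-specialized experts, assuming per-language calibration scores from a scorer like the one sketched above; the dominance threshold and the data layout are illustrative assumptions, not the project's actual heuristic.

```python
# Hypothetical sketch: score experts separately on per-language calibration
# sets, then pin any expert whose activation is dominated by a single language
# so that pruning cannot remove it.
def protected_experts(lang_scores, dominance=0.6):
    """lang_scores: dict mapping language -> [layers][experts] normalized scores."""
    langs = list(lang_scores)
    num_layers = len(lang_scores[langs[0]])
    num_experts = len(lang_scores[langs[0]][0])
    protected = set()
    for l in range(num_layers):
        for e in range(num_experts):
            total = sum(lang_scores[g][l][e] for g in langs)
            if total == 0:
                continue
            # "Specialized" if one language accounts for most of this expert's use.
            if max(lang_scores[g][l][e] for g in langs) / total >= dominance:
                protected.add((l, e))
    return protected  # (layer, expert) pairs exempt from pruning
```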

Zerobias Router Optimization

Post-pruning router bias correction extends the lossless compression frontier. Zero cost, no retraining required.
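A hedged sketch of what such a bias correction could look like: shift each surviving expert's router bias by the log-ratio between its pre- and post-pruning average routing probability, a closed-form update with no retraining. The function and its inputs are assumptions for illustration, not the project's actual zerobias implementation.

```python
# Hypothetical post-pruning bias correction: nudge the pruned router so that
# each kept expert's average routing probability (on calibration data) matches
# the share it received before pruning. No gradients, no retraining.
import torch

def correct_router_bias(router_bias, p_before, p_after, eps=1e-8):
    """
    router_bias: (kept_experts,) bias vector of the pruned router, updated in place
    p_before:    mean routing prob of each kept expert before pruning,
                 renormalized over the kept set
    p_after:     mean routing prob of each kept expert after pruning
    """
    # Adding log(p_before / p_after) to the logits rebalances the softmax
    # toward the pre-pruning distribution.
    delta = torch.log(p_before + eps) - torch.log(p_after + eps)
    router_bias += delta - delta.mean()  # mean-center: softmax is shift-invariant
    return router_bias
```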

Expert Pruning vs Quantization

|                     | Expert Pruning                    | Q2 Quantization                 |
|---------------------|-----------------------------------|---------------------------------|
| Approach            | Remove redundant experts entirely | Reduce precision of all weights |
| Quality impact      | Targeted, minimal                 | Uniform degradation             |
| Quality at ~24 GB   | MMLU 72%                          | MMLU ~55–60%                    |
| Remaining precision | Full Q4 precision                 | 2-bit precision                 |

Where This Is Going

Expert pruning is just the beginning. Here's what becomes possible.

💻

Consumer Hardware

Frontier-class MoE models on a laptop with 24GB RAM — no $10K+ server GPU required. Democratizing access to the most capable AI.

🔋

Lower Energy Cost

Fewer experts means fewer FLOPs per token. Data centers could serve the same quality at half the compute — significant power savings at scale.

🚀

Post-Pruning RL

Pruned models create headroom for targeted reinforcement learning. Same size, potentially better performance — quality amplification, not just preservation.

🌐

Edge Deployment

MoE models become viable for on-device inference. Private, fast, and offline — bringing expert-level AI to phones and embedded systems.

Built on a MacBook. Funded by You.

This entire research program, spanning 22 phases, 4 models, and 50+ experiments, was conducted on a single MacBook Pro (M4 Pro, 24 GB RAM, 512 GB SSD). No data center. No corporate funding.

Your support directly funds GPU access to test on larger models, cloud compute for comprehensive benchmarking, and time to continue open-source research.

Support on Ko-fi