REAP: How Cerebras Is Cutting Trillion-Parameter Models in Half

Router-weighted Expert Activation Pruning is reshaping how we deploy frontier MoE models

April 4, 2026
model-compression mixture-of-experts expert-pruning REAP cerebras GLM-5 inference

If you've browsed HuggingFace lately, you've probably noticed a new pattern in model names. GLM-4.7-REAP-268B. DeepSeek-V3.2-REAP-345B. Qwen3-Coder-REAP-25B. Models you recognize, but smaller. Sometimes dramatically so. They all share four letters in the middle: REAP.

REAP, short for Router-weighted Expert Activation Pruning, is a compression technique published by researchers at Cerebras that removes up to half the parameters from Mixture-of-Experts language models while keeping most of their capability intact. The method has gained rapid traction since its publication, with both Cerebras and independent contributors publishing pruned checkpoints across a growing list of frontier models, including DeepSeek-V3.2, Kimi-K2, the GLM family, MiniMax-M2, and multiple Qwen variants.

The reason it's showing up everywhere is straightforward: it works, it's open-source, and it addresses the central bottleneck of modern AI deployment. These models are too large to run on hardware most organizations can actually access.

The Paper

"REAP the Experts: Why Pruning Prevails for One-Shot MoE Compression" was submitted to arXiv in October 2025 and accepted to ICLR 2026, one of the top machine learning conferences. It was written by Mike Lasby, Ivan Lazarevich, Nish Sinnadurai, Sean Lie, Yani Ioannou, and Vithursan Thangarasa, a team spanning Cerebras Systems and the University of Calgary's Schulich School of Engineering. The paper runs 29 pages with 9 figures and 12 tables, and is published under a CC BY 4.0 license.

The core question the paper asks: when you need to make a Mixture-of-Experts model smaller, is it better to merge similar experts together or prune the least important ones entirely?

Their answer is unambiguous: prune. And they prove it both theoretically and empirically.

What Mixture-of-Experts Actually Means

To understand why REAP matters, you need to understand the architecture it targets.

A standard dense language model like Llama 3 at 70 billion parameters activates every parameter for every token it processes. A Mixture-of-Experts model takes a different approach. It replaces the dense feed-forward layers with a collection of smaller "expert" sub-networks, then uses a lightweight router to select only a few of them per token.

GLM-5, for instance, has 744 billion total parameters spread across 256 experts, but only activates about 40 billion parameters per token (roughly 5.4% of the model). The router looks at each incoming token, evaluates which experts are most relevant, picks 8 of the 256, and routes the computation accordingly. The other 248 experts sit idle for that token.

This design decouples knowledge capacity from inference cost: you get the stored knowledge of a 744B model at roughly the computational cost of running a 40B model. The tradeoff is memory. Even though most experts aren't active for any given token, all 744 billion parameters still need to be loaded into GPU memory. You're paying for the full model in VRAM while only using a fraction of it at any moment.
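The routing mechanism described above can be sketched in a few lines of NumPy. Everything here is a toy stand-in (random weights, a small hidden size, ReLU experts are assumptions of the sketch); real models use learned parameters and fused kernels, but the control flow is the same: score all experts, keep the top 8, and mix their outputs by softmaxed gate weights.

```python
import numpy as np

rng = np.random.default_rng(0)
d_model, n_experts, top_k = 64, 256, 8

# Router: one linear projection from the token to a logit per expert.
router_w = rng.standard_normal((d_model, n_experts)) / np.sqrt(d_model)
# Each "expert" is a small two-layer MLP, stored stacked per expert.
w_in = rng.standard_normal((n_experts, d_model, 128)) / np.sqrt(d_model)
w_out = rng.standard_normal((n_experts, 128, d_model)) / np.sqrt(128)

def moe_forward(x):
    """Route a single token x through the top_k highest-scoring experts."""
    logits = x @ router_w                 # one score per expert
    top = np.argsort(logits)[-top_k:]     # indices of the 8 chosen experts
    gates = np.exp(logits[top])
    gates /= gates.sum()                  # softmax over the selected experts
    out = np.zeros_like(x)
    for g, e in zip(gates, top):
        h = np.maximum(x @ w_in[e], 0.0)  # expert MLP with ReLU
        out += g * (h @ w_out[e])         # gate-weighted contribution
    return out, top

token = rng.standard_normal(d_model)
y, chosen = moe_forward(token)
# Only 8 of the 256 expert MLPs ran for this token; the other 248 sat idle.
```

Note the asymmetry the sketch makes visible: only `top_k` expert weight matrices are touched per token, but all 256 must still be resident in memory for the router to choose among them.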

This is the problem REAP solves.

How REAP Works

The insight behind REAP starts with a simple observation: not all experts contribute equally. In a model with 256 experts per layer, the paper found that 9 to 12 experts per layer had zero activation. They were completely dead. Many others fired rarely and contributed negligibly to the output when they did.

REAP scores each expert using a saliency criterion that multiplies two signals:

Score(expert) = mean( router_gate_weight × ‖expert_output‖₂ )

The first term, router gate weight, measures how confidently the router selects that expert when it does get activated. The second term, expert output norm, measures the magnitude of the expert's actual contribution to the layer output. An expert that fires often but produces tiny outputs scores low. An expert that fires rarely but produces critical outputs scores high.

This is more principled than simpler approaches like pruning based on activation frequency alone. An expert that activates on 2% of tokens but produces strong outputs for all of them may be far more important than one that activates on 10% of tokens but barely moves the needle.

Once each expert has a REAP score, the method simply removes the lowest-scoring experts from each layer. No retraining, no fine-tuning, no gradient computation; it's a one-shot operation that completes in a fraction of the time a recovery-based method would require: score, rank, cut.
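A minimal sketch of that scoring-and-cutting step, assuming you have already recorded, for each expert, the router gate weight and expert output norm on every calibration token where it fired. The data below is synthetic and the exact averaging details in the paper may differ; this only illustrates the shape of the computation.

```python
import numpy as np

rng = np.random.default_rng(1)
n_experts, n_tokens = 64, 10_000

# gate[t, e] is the router weight if expert e fired on token t, else 0.
gate = np.where(rng.random((n_tokens, n_experts)) < 0.1,
                rng.random((n_tokens, n_experts)), 0.0)
# out_norm[t, e] stands in for ||expert e's output||_2 on token t.
out_norm = rng.random((n_tokens, n_experts))

def reap_scores(gate, out_norm):
    """Score(e) = mean over tokens where e fired of gate * output norm."""
    fired = gate > 0
    weighted = gate * out_norm
    counts = np.maximum(fired.sum(axis=0), 1)  # dead experts score 0, no div-by-zero
    return weighted.sum(axis=0) / counts

def prune(scores, ratio=0.5):
    """One-shot: rank experts by score, drop the lowest `ratio` fraction."""
    n_drop = int(len(scores) * ratio)
    return np.sort(np.argsort(scores)[n_drop:])  # surviving expert indices

scores = reap_scores(gate, out_norm)
kept = prune(scores, ratio=0.5)
# 50% compression: 64 experts -> 32 survivors, no gradients involved.
```

In a real pipeline the surviving experts' weights are copied into a smaller checkpoint and the router's output dimension is shrunk to match; the scoring pass itself is just forward-pass bookkeeping over a calibration set.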

Why Pruning Beats Merging

Before REAP, the dominant approach was expert merging, which combines similar experts into fewer, larger ones. Methods like HC-SMoE and M-SMoE cluster experts by similarity, then average their weights. It seems intuitive: if two experts do similar things, merge them and save a slot.

The paper's key theoretical contribution is proving this intuition wrong for generative tasks. They derive an irreducible error bound for merging that shows the problem is structural, not algorithmic. When you merge two experts into one averaged expert, the router loses its ability to dynamically mix between them based on the input.

The authors call this "functional subspace collapse." In the original model, the router might send 70% of weight to Expert A and 30% to Expert B for one token, then reverse the ratio for the next token. After merging, there's only one averaged expert, and the router's fine-grained, input-dependent mixing strategy is permanently destroyed. This error is proportional to how different the experts are and how much the router varies its mixing between them. It cannot be recovered without retraining.
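A toy linear example makes the collapse concrete. With two linear experts A and B merged by weight averaging, the residual works out to exactly (gA - gB)/2 times (A(x) - B(x)), matching the proportionality above: it grows with the gap between the experts and with how unevenly the router mixes them, and vanishes only when the gates are equal or the experts identical. This is an illustration of the structural argument, not the paper's actual derivation.

```python
import numpy as np

rng = np.random.default_rng(2)
d = 8
Wa = rng.standard_normal((d, d))       # linear expert A
Wb = rng.standard_normal((d, d))       # linear expert B
W_merged = (Wa + Wb) / 2               # weight-averaged merged expert

x = rng.standard_normal(d)
for ga, gb in [(0.7, 0.3), (0.3, 0.7), (0.5, 0.5)]:
    original = ga * (Wa @ x) + gb * (Wb @ x)       # dynamic mixing
    merged = (ga + gb) * (W_merged @ x)            # only the averaged expert remains
    err = original - merged
    # The residual is exactly (ga - gb)/2 * (A(x) - B(x)):
    predicted = (ga - gb) / 2 * (Wa @ x - Wb @ x)
    assert np.allclose(err, predicted)
# At equal gates (0.5, 0.5) the error is zero; everywhere else it is
# irreducible without retraining, because the averaged weights are fixed.
```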

Pruning avoids this entirely. When you remove an expert, the surviving experts keep their original weights and the router retains full dynamic control over them. The model loses some capacity, but the mechanism stays intact.

The empirical results confirm the theory. On creative writing tasks with Qwen3-30B at 50% compression, REAP retained a score of 0.718 while HC-SMoE (a popular merging method) collapsed to 0.008, essentially producing gibberish. On GLM-4.5-Air at the same compression, REAP scored 0.754 while HC-SMoE managed 0.593, a less dramatic gap but still a clear win. On code generation with Qwen3-30B at 50% compression, REAP scored 0.541 versus HC-SMoE's 0.364.

The Benchmarks

The paper evaluates REAP across models ranging from 20 billion to over 1 trillion parameters, at both 25% and 50% compression ratios. The results are organized across code generation, math reasoning, multiple-choice, and creative writing tasks.

At the high end, the numbers are striking. With Qwen3-Coder-480B at 50% expert pruning:

  • Non-agentic coding (HumanEval, MBPP, LiveCodeBench): retained 97.6% of baseline performance
  • Agentic coding (SWE-Bench Verified): retained 96.7% of baseline performance

That means half the experts were removed and coding ability barely moved.

On smaller models, the tradeoff is more visible but still favorable compared to alternatives. Across all models excluding the large-scale ones, REAP's mean accuracy decrease at 25% compression was 1.9% on coding tasks. At 50%, it was 6.9%. By comparison, both HC-SMoE and M-SMoE showed accuracy drops exceeding 5% at just 25% compression and over 20% at 50%.

On math reasoning at 50% compression, REAP scored 0.857 versus the baseline's roughly 0.89, a 3.7% decline. On multiple-choice benchmarks the declines were slightly larger, and REAP and HC-SMoE performed comparably there: around 4% at 25% compression and 13% at 50%.

These numbers paint a consistent picture: REAP's advantage is largest on generative tasks, which are exactly the tasks that matter most for practical deployment.

What This Means for GLM-5

GLM-5, released by Zhipu AI in February 2026, is one of the largest open-weight MoE models available. Its specifications make it a natural target for REAP:

  • 744 billion total parameters
  • 256 routed experts per MoE layer, 8 active per token
  • ~40 billion active parameters per inference pass
  • 78 hidden layers (3 dense + 75 MoE)
  • ~203K token context window (202,752 tokens)

Running GLM-5 at full precision requires well over a terabyte of VRAM. Even in FP8, the official checkpoint weighs in at hundreds of gigabytes, firmly in the territory of multi-node H100 clusters.
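The arithmetic behind those figures is simple: weight memory is parameter count times bits per parameter divided by 8, counting weights only (KV cache and activation memory come on top, and real checkpoints carry some overhead).

```python
# Back-of-envelope weight memory for GLM-5 at different precisions.
TOTAL_PARAMS = 744e9

def weight_gb(params: float, bits: int) -> float:
    """Weight storage in decimal gigabytes at the given bit width."""
    return params * bits / 8 / 1e9

print(f"BF16: {weight_gb(TOTAL_PARAMS, 16):.0f} GB")  # ~1488 GB: "well over a terabyte"
print(f"FP8:  {weight_gb(TOTAL_PARAMS, 8):.0f} GB")   # ~744 GB: "hundreds of gigabytes"
```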

A community member has already applied REAP at 50% compression to GLM-5, producing a checkpoint that cuts from 256 to 128 experts per layer. The pruned model reduces total parameters from ~744B to ~380B, with the active parameter count per token dropping only modestly, to ~33B. In FP8, the pruned checkpoint is 358GB on disk.

The planned compression pipeline illustrates where this is heading:

  1. REAP prune at 50% (744B to 380B). This step is done.
  2. Quantize to 3-bit (GPTQ/AutoRound), projected at ~110-120GB.
  3. Serve on 8x RTX 3090 (192GB total VRAM) via vLLM with tensor parallelism.

If that pipeline completes successfully, it would mean a 744 billion parameter model, one that was designed for data center hardware, running on consumer GPUs you can buy on eBay. The active parameter count per token would remain similar, meaning inference quality should stay close to the full model for tasks where the pruned experts aren't critical.

This checkpoint is still labeled experimental and its creator notes it may be broken in some configurations. But the progression speaks for itself.

What People Are Doing With It

The ecosystem around REAP has expanded quickly. Cerebras maintains an official HuggingFace collection with pruned checkpoints across model families:

  • DeepSeek-V3.2-REAP-345B-A37B, 50% pruned
  • GLM-4.7-REAP-268B-A32B, 25% pruned
  • GLM-4.7-REAP-218B-A32B, 40% pruned
  • GLM-4.7-Flash-REAP-23B-A3B, 25% pruned
  • Qwen3-Coder-REAP-25B-A3B, 20% pruned
  • Kimi-Linear-REAP-35B-A3B, 30% pruned
  • MiniMax-M2-REAP-162B-A10B and MiniMax-M2.1-REAP-172B-A10B

Independent contributors have expanded the roster further. OpenMOSE published Qwen3.5-REAP-262B-A17B, pruning ~35% of experts from the 512 in Qwen3.5. Others have produced various GLM-5 and GLM-4.7 variants at different compression levels.

Beyond static pruning, a project called "Reap It Yourself" (RIY) takes the concept a step further by integrating live expert profiling into vLLM. Instead of using generic calibration data, RIY monitors which experts your specific workload activates, then lets you mask inactive experts via HTTP API without modifying the checkpoint. The premise is that a medical research lab and a law firm will route through different experts in the same model, so the pruning should be workload-specific.
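The profiling idea behind RIY can be illustrated generically. This is not RIY's actual API or integration, just a sketch of the concept: tally which experts your own traffic routes to (routing decisions are simulated below with a skewed random draw), then treat experts that never fire as candidates for masking.

```python
import numpy as np
from collections import Counter

rng = np.random.default_rng(3)
n_experts, top_k, n_tokens = 64, 4, 5_000

# Simulate a skewed workload: 16 experts are never relevant to this
# domain; the rest have arbitrary popularity.
popularity = rng.random(n_experts)
never_used = rng.choice(n_experts, size=16, replace=False)
popularity[never_used] = 0.0
popularity /= popularity.sum()

# Tally routing decisions token by token, as a live profiler would.
counts = Counter()
for _ in range(n_tokens):
    chosen = rng.choice(n_experts, size=top_k, replace=False, p=popularity)
    counts.update(chosen.tolist())

# Experts that never fired on this traffic are candidates for masking.
inactive = sorted(e for e in range(n_experts) if counts[e] == 0)
```

The payoff is the one the RIY premise describes: two deployments of the same checkpoint can end up with different masks, because each mask reflects the expert usage of that deployment's actual traffic rather than a generic calibration set.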

REAP-pruned models are compatible with vLLM out of the box, requiring no source modifications. This matters for adoption: you can download a pruned checkpoint and serve it the same way you'd serve any other model.

The Limitations

REAP is not without constraints, and the paper is relatively transparent about them.

First, the method applies uniform sparsity, removing the same percentage of experts from every layer. A follow-up paper from March 2026, EvoESAP, demonstrated that non-uniform layer-wise allocation can improve results significantly, finding up to +19.6% improvement on math benchmarks at 50% sparsity by allowing some layers to keep more experts while pruning others more aggressively. The REAP team has acknowledged this limitation; EvoESAP's framework even uses REAP as one of its pluggable scoring criteria.

Second, the paper's benchmarks lean heavily on code generation and multiple-choice tasks. Creative writing and open-ended reasoning are tested but receive less coverage. The creative writing collapse seen with merging methods (0.008 for HC-SMoE at 50%) makes REAP look excellent by comparison, but REAP's own score of 0.718 on that task still represents a meaningful decline from the unpruned baseline. For applications where nuance and stylistic range matter, even REAP's quality loss at high compression ratios may be unacceptable.

Third, REAP is a one-shot method with no recovery mechanism. Once experts are removed, there's no fine-tuning step to let the remaining experts compensate. This makes it fast and practical but potentially leaves performance on the table compared to methods that include a recovery phase. The tradeoff is intentional. Cerebras positions REAP's simplicity as a feature, not a limitation, but it's worth noting.

Finally, these results are on MoE models specifically. Dense models like Llama, which don't have a router or expert structure, can't benefit from this approach at all. REAP is powerful precisely because MoE architectures already have a natural axis for compression (the expert dimension), but that also means its applicability is bounded by how many future models adopt MoE designs.

What the Future Looks Like

The broader trend REAP represents, making trillion-parameter models accessible on smaller hardware, is arguably the most consequential direction in AI infrastructure right now. Training keeps getting more expensive; inference needs to get cheaper.

The MoE architecture is becoming the default for frontier models. GLM-5 uses it. DeepSeek-V3 uses it. Qwen3 uses it. Kimi-K2 uses it. If the pattern holds, expert pruning techniques like REAP become a permanent part of the deployment toolkit. Not a research curiosity, but a standard step between "model released" and "model running in production."

The combination of REAP with quantization is particularly interesting. A 50% expert prune followed by 3-4 bit quantization can potentially compress a 744B model to fit on hardware costing under $10,000. That's a fundamentally different accessibility story than "requires a cluster of H100s."

Whether the quality holds up under that compound compression remains an open question. Each compression step introduces its own error, and the interaction between pruning and quantization isn't fully characterized yet. The community checkpoints pairing REAP with FP8 and lower quantization are early experiments, not production-validated deployments.

But the gap between "requires a data center" and "runs on a workstation" is closing faster than most people expected, and expert pruning is one of the main reasons why.


The REAP paper (arXiv: 2510.13999) is open access under CC BY 4.0. The codebase is available at github.com/CerebrasResearch/reap, and pruned model checkpoints are published in the Cerebras REAP collection on HuggingFace.


© 2026 PROGGERZ — A DIZYX PROJECT