Open-Weight Routing at Scale: GLM-5.1 vs Claude Opus 4.7 on TrueFoundry AI Gateway

Built for Speed: ~10ms Latency, Even Under Load

Blazingly fast way to build, track and deploy your models!

Handles 350+ RPS on just 1 vCPU — no tuning needed
Production-ready with full enterprise support

Get Started with Truefoundry Now Talk to the Expert

We ran 20 fixed prompts through TrueFoundry AI Gateway comparing four strategies: all Claude Opus 4.7, all Z.AI GLM-5.1, a Haiku classifier router (easy → open, hard → frontier), and an 80/20 virtual model. On this mix, classifier routing cut blended cost ~31% versus all- Opus ($15.72 vs $22.72 per 1M tokens) while scoring higher on our Sonnet judge (4.94 vs 4.85). All-open was cheapest ($3.00 / 1M) but slower and slightly lower quality. The takeaway: you do not need a single model string for every request — Gateway routing plus a cheap classifier can preserve frontier quality on hard tasks without paying frontier prices on easy ones.

Why this matters now

The open-weight wave is no longer theoretical. Models like GLM-5.1 ship with agentic coding positioning, 200K-token context, and list prices an order of magnitude below frontier APIs, while, Claude Opus 4.7 remains the reference for hard reasoning.

Platform teams face a familiar tradeoff:

Route everything to frontier → predictable quality, painful unit economics at volume.
Route everything to open-weight → attractive cost, uneven quality and latency tails on hard prompts.
Build custom routers → flexible, but you own classification logic, failover, billing reconciliation, and cache semantics across providers.

TrueFoundry AI Gateway sits in the middle: 1000+ LLMs through a unified OpenAI compatible API, virtual models with weight-based routing, semantic cache headers, and transparent pricing metrics for billing truth. We wanted to measure whether a simple EASY/HARD classifier — one Haiku call per request — could beat both extremes on cost and quality for a realistic 20-prompt workload.

What we compared (technical tour)

Open-weight baseline: GLM-5.1

GLM-5.1 is Z.AI's April 2026 flagship, accessed via TrueFoundry's Gateway, aimed at long-horizon agentic work — planning, tool use, and multi-step coding loops.

Frontier baseline: Claude Opus 4.7

Opus 4.7 is Anthropic's top-tier model for complex reasoning. Note: Opus 4.7uses a new tokenizer that can emit more tokens than older Claude models for the same text — cost comparisons should use measured token counts, not character counts.

App-level classifier router

Our router classifies each prompt as EASY or HARD in a single call (~8 output tokens). EASY → GLM-5.1; HARD → Opus 4.7. Quality scoring uses Claude Sonnet 4.6 as an LLM judge (1–5 against per-prompt rubrics).

Gateway virtual model (80/20)

We also tested a virtual-model in Gateway configured for weight-based routing (80% open / 20% frontier in the UI). This measures provider-side load balancing without app-level classification — a different knob than the Haiku router.

About our benchmark

Prompts: 20 tasks — 10 labeled easy (summarize, format JSON, translate) and 10 hard (distributed systems tradeoffs, SQL injection review, contract ambiguity, K8s OOM debug, etc.).

Metrics per strategy:

Metric	How we measured it
Cost	Token usage × public list prices; router strategy includes Haiku + Sonnet judge overhead
Latency	Wall-clock per request; report p50 / p95
Quality	Sonnet judge mean score 1–5 per prompt

What we did not claim: vendor SWE-bench scores, production traffic shapes.
‍

Vendor pricing context (May 2026)

Model	Input / 1M	Output / 1M	Source
Claude Opus 4.7	$5	$25	Anthropic pricing
Z.AI GLM 5.1	$0.98	$3.08	OpenRouter

GLM-5.1 is roughly 5× cheaper on input and ~8× cheaper on output than Opus 4.7 at list price — before routing, caching, or enterprise discounts. The interesting question is how much of that gap you keep after sending hard prompts to frontier.
‍

Our analysis (20-prompt run)

Cost per 1M tokens (this run's token mix)

Strategy	$ / 1M tokens	Total (20 prompts)	Quality (mean)	Latency p50
baseline_opus	$22.72	$0.28	4.85	9,094 ms
baseline_open	$3.00	$0.07	4.75	20,060 ms
router_classifier	$15.72	$0.28	4.94	14,944 ms
virtual_weighted	$7.19	$0.14	4.50	23,404 ms

‍

Router split (classifier)

The Haiku router sent 10/20 prompts to GLM-5.1 and 10/20 to Opus 4.7 — a 50/50 split on this prompt set (10 easy + 10 hard by design). Token volume followed suit: 7,774 tokens on GLM vs 10,072 on Opus for completion traffic.

Latency tails matter

Open-weight-only had the slowest p50 (20.1s) and an extreme p95 (~115s) — one long GLM completion on a hard prompt dominated the tail. Opus-only was fastest at p50 (9.1s) with a moderate p95 (~21s). The classifier landed in between on p50 (14.9s) with p95 ~26s.

Quality vs cost: the classifier sweet spot

Router vs all-Opus: ~31% lower blended $/1M ($15.72 vs $22.72) with higher mean judge score (4.94 vs 4.85). Total dollar cost for 20 prompts was essentially the same (~$0.28) because judge + router overhead offset GLM savings — at higher volume, the per-token gap compounds.
Router vs all-open: ~5.2× higher $/1M but +0.19 quality points. Cheapest is not best if hard prompts matter.
Virtual 80/20: $7.19 / 1M on a list-price blend estimate, but quality (4.50) trailed both baselines. Weight-based routing without task awareness is not a substitute for classification on this workload — validate the actual backend mix in Gateway Metrics, not just the virtual model id.

Why these results matter

Classification is cheap relative to frontier completions. One Haiku call per request is noise compared to a 1,024-token Opus completion on hard tasks. The router's economics work when easy traffic is a large share of volume — and when misroutes are rare.
List price ≠ your bill. Gateway may route through different providers, apply caching, or negotiate rates. We applied public list prices to measured tokens from our run; you should reconcile with Gateway Metrics → Download Raw Data before setting FinOps guardrails.
Latency and quality are coupled. Saving 31% on tokens does not help if p95 latency breaches SLOs. Our open-weight baseline showed that a single bad routing decision (sending a hard prompt only to GLM) can explode tail latency.
Two routing patterns, two stories. App-level EASY/HARD routing optimized quality-cost on this set. UI-level 80/20 virtual models optimized for operational simplicity but underperformed on quality here — useful for gradual rollouts, not a full replacement for task-aware routing.

Practical takeaways for platform teams

Start with a frontier + open-weight pair wired through one Gateway base URL. Swap models by changing the model string — no SDK fork per provider.
Add a cheap classifier (Haiku or similar) before you add complexity to virtual-model weights. Measure misroute rate on a gold subset of prompts.
Publish a prompt tier list (easy / hard) aligned with your rubrics — our 20-prompt set is a template, not your production distribution.
Reconcile cost in Gateway Metrics, not in notebook estimates. Export raw billing CSV and join on trace metadata
Layer semantic cache after routing stabilizes — semantic caching on easy, paraphrased prompts is where cache ROI usually appears (not measured in this baseline run).

How TrueFoundry AI Gateway made this possible

Unified OpenAI-compatible API — one client, base_url pointed at Gateway; same codepath for GLM, Opus, Haiku, and Sonnet.
Virtual models — weight-based 80/20 routing without application changes (docs).
Semantic cache — similarity-based response reuse (docs).
Observability — token usage, latency, and cost headers for reconciliation; ~3–4 ms latency and 350+ RPS on 1 vCPU at the gateway layer for high-throughput proxy scenarios.

Conclusion

Open-weight models like GLM-5.1 are priced to win easy traffic. Claude Opus 4.7 still earns its keep on hard prompts. The gap between them is large enough that routing matters more than model marketing.

On our 20-prompt harness through TrueFoundry AI Gateway, a Haiku classifier router delivered the best combined story: ~31% lower blended cost per million tokens than all- Opus, with a higher mean judge score (4.94 vs 4.85). All-open remained the cost floor; all- Opus the quality-and-speed ceiling for p50 latency.

‍

TrueFoundry AI Gateway delivers ~3–4 ms latency, handles 350+ RPS on 1 vCPU, scales horizontally with ease, and is production-ready, while LiteLLM suffers from high latency, struggles beyond moderate RPS, lacks built-in scaling, and is best for light or prototype workloads.

Built for Speed: ~10ms Latency, Even Under Load

Schedule your Demo Now