Blank white background with no objects or features visible.

NEW RESEARCH: 80% of AI costs are invisible at billing. 200+ leaders reveal where the money goes. Read →

Open-Weight Routing at Scale: GLM-5.1 vs Claude Opus 4.7 on TrueFoundry AI Gateway

By Jitender Kumar

Updated: May 18, 2026

We ran 20 fixed prompts through TrueFoundry AI Gateway comparing four strategies: all Claude Opus 4.7, all Z.AI GLM-5.1, a Haiku classifier router (easy → open, hard → frontier), and an 80/20 virtual model. On this mix, classifier routing cut blended cost ~31% versus all- Opus ($15.72 vs $22.72 per 1M tokens) while scoring higher on our Sonnet judge (4.94 vs 4.85). All-open was cheapest ($3.00 / 1M) but slower and slightly lower quality. The takeaway: you do not need a single model string for every request — Gateway routing plus a cheap classifier can preserve frontier quality on hard tasks without paying frontier prices on easy ones.

Why this matters now

The open-weight wave is no longer theoretical. Models like GLM-5.1 ship with agentic coding positioning, 200K-token context, and list prices an order of magnitude below frontier APIs, while, Claude Opus 4.7 remains the reference for hard reasoning.

Platform teams face a familiar tradeoff:

  • Route everything to frontier → predictable quality, painful unit economics at volume.
  • Route everything to open-weight → attractive cost, uneven quality and latency tails on hard prompts.
  • Build custom routers → flexible, but you own classification logic, failover, billing reconciliation, and cache semantics across providers.

TrueFoundry AI Gateway sits in the middle: 1000+ LLMs through a unified OpenAI compatible API, virtual models with weight-based routing, semantic cache headers, and transparent pricing metrics for billing truth. We wanted to measure whether a simple EASY/HARD classifier — one Haiku call per request — could beat both extremes on cost and quality for a realistic 20-prompt workload.

What we compared (technical tour)

Open-weight baseline: GLM-5.1

GLM-5.1 is Z.AI's April 2026 flagship, accessed via TrueFoundry's Gateway, aimed at long-horizon agentic work — planning, tool use, and multi-step coding loops.

Frontier baseline: Claude Opus 4.7

Opus 4.7 is Anthropic's top-tier model for complex reasoning. Note: Opus 4.7uses a new tokenizer that can emit more tokens than older Claude models for the same text — cost comparisons should use measured token counts, not character counts.

App-level classifier router

Our router classifies each prompt as EASY or HARD in a single call (~8 output tokens). EASY → GLM-5.1; HARD → Opus 4.7. Quality scoring uses Claude Sonnet 4.6 as an LLM judge (1–5 against per-prompt rubrics).

Gateway virtual model (80/20)

We also tested a virtual-model in Gateway configured for weight-based routing (80% open / 20% frontier in the UI). This measures provider-side load balancing without app-level classification — a different knob than the Haiku router.

About our benchmark

Prompts: 20 tasks — 10 labeled easy (summarize, format JSON, translate) and 10 hard (distributed systems tradeoffs, SQL injection review, contract ambiguity, K8s OOM debug, etc.).

Metrics per strategy:

Metric How we measured it
Cost Token usage × public list prices; router strategy includes Haiku + Sonnet judge overhead
Latency Wall-clock per request; report p50 / p95
Quality Sonnet judge mean score 1–5 per prompt

What we did not claim: vendor SWE-bench scores, production traffic shapes.

Vendor pricing context (May 2026)

Model Input / 1M Output / 1M Source
Claude Opus 4.7 $5 $25 Anthropic pricing
Z.AI GLM 5.1 $0.98 $3.08 OpenRouter


GLM-5.1 is roughly 5× cheaper on input and ~8× cheaper on output than Opus 4.7 at list price — before routing, caching, or enterprise discounts. The interesting question is how much of that gap you keep after sending hard prompts to frontier.

Our analysis (20-prompt run)

Cost per 1M tokens (this run's token mix)

Strategy $ / 1M tokens Total (20 prompts) Quality (mean) Latency p50
baseline_opus $22.72 $0.28 4.85 9,094 ms
baseline_open $3.00 $0.07 4.75 20,060 ms
router_classifier $15.72 $0.28 4.94 14,944 ms
virtual_weighted $7.19 $0.14 4.50 23,404 ms


Router split (classifier)

The Haiku router sent 10/20 prompts to GLM-5.1 and 10/20 to Opus 4.7 — a 50/50 split on this prompt set (10 easy + 10 hard by design). Token volume followed suit: 7,774 tokens on GLM vs 10,072 on Opus for completion traffic.

Latency tails matter

Open-weight-only had the slowest p50 (20.1s) and an extreme p95 (~115s) — one long GLM completion on a hard prompt dominated the tail. Opus-only was fastest at p50 (9.1s) with a moderate p95 (~21s). The classifier landed in between on p50 (14.9s) with p95 ~26s.

Quality vs cost: the classifier sweet spot

  • Router vs all-Opus: ~31% lower blended $/1M ($15.72 vs $22.72) with higher mean judge score (4.94 vs 4.85). Total dollar cost for 20 prompts was essentially the same (~$0.28) because judge + router overhead offset GLM savings — at higher volume, the per-token gap compounds.
  • Router vs all-open: ~5.2× higher $/1M but +0.19 quality points. Cheapest is not best if hard prompts matter.
  • Virtual 80/20: $7.19 / 1M on a list-price blend estimate, but quality (4.50) trailed both baselines. Weight-based routing without task awareness is not a substitute for classification on this workload — validate the actual backend mix in Gateway Metrics, not just the virtual model id.

Why these results matter

  1. Classification is cheap relative to frontier completions. One Haiku call per request is noise compared to a 1,024-token Opus completion on hard tasks. The router's economics work when easy traffic is a large share of volume — and when misroutes are rare.
  2. List price ≠ your bill. Gateway may route through different providers, apply caching, or negotiate rates. We applied public list prices to measured tokens from our run; you should reconcile with Gateway Metrics → Download Raw Data before setting FinOps guardrails.
  3. Latency and quality are coupled. Saving 31% on tokens does not help if p95 latency breaches SLOs. Our open-weight baseline showed that a single bad routing decision (sending a hard prompt only to GLM) can explode tail latency.
  4. Two routing patterns, two stories. App-level EASY/HARD routing optimized quality-cost on this set. UI-level 80/20 virtual models optimized for operational simplicity but underperformed on quality here — useful for gradual rollouts, not a full replacement for task-aware routing.

Practical takeaways for platform teams

  1. Start with a frontier + open-weight pair wired through one Gateway base URL. Swap models by changing the model string — no SDK fork per provider.
  2. Add a cheap classifier (Haiku or similar) before you add complexity to virtual-model weights. Measure misroute rate on a gold subset of prompts.
  3. Publish a prompt tier list (easy / hard) aligned with your rubrics — our 20-prompt set is a template, not your production distribution.
  4. Reconcile cost in Gateway Metrics, not in notebook estimates. Export raw billing CSV and join on trace metadata
  5. Layer semantic cache after routing stabilizes — semantic caching on easy, paraphrased prompts is where cache ROI usually appears (not measured in this baseline run).

How TrueFoundry AI Gateway made this possible

  • Unified OpenAI-compatible API — one client, base_url pointed at Gateway; same codepath for GLM, Opus, Haiku, and Sonnet.
  • Virtual models — weight-based 80/20 routing without application changes (docs).
  • Semantic cache — similarity-based response reuse (docs).
  • Observability — token usage, latency, and cost headers for reconciliation; ~3–4 ms latency and 350+ RPS on 1 vCPU at the gateway layer for high-throughput proxy scenarios.

Conclusion

Open-weight models like GLM-5.1 are priced to win easy traffic. Claude Opus 4.7 still earns its keep on hard prompts. The gap between them is large enough that routing matters more than model marketing.

On our 20-prompt harness through TrueFoundry AI Gateway, a Haiku classifier router delivered the best combined story: ~31% lower blended cost per million tokens than all- Opus, with a higher mean judge score (4.94 vs 4.85). All-open remained the cost floor; all- Opus the quality-and-speed ceiling for p50 latency.

The fastest way to build, govern and scale your AI

Sign Up
Table of Contents

Govern, Deploy and Trace AI in Your Own Infrastructure

Book a 30-min with our AI expert

Book a Demo

The fastest way to build, govern and scale your AI

Book Demo

Discover More

No items found.
May 21, 2026
|
5 min read

Gemini 3.5 Flash: When the Fast Model Becomes the Frontier Model

LLMs & GenAI
Types of AI agents governed by TrueFoundry enterprise control plane
May 20, 2026
|
5 min read

Types of AI Agents: Definitions, Roles, and What They Mean for Enterprise Deployment

No items found.
Comparing AI agents and agentic AI workloads in enterprise production
May 20, 2026
|
5 min read

AI Agents vs Agentic AI: What the Difference Actually Means in Production

No items found.
May 20, 2026
|
5 min read

Agent Gateway Series (Part 4 of 7) | FinOps for Autonomous Systems

No items found.
No items found.

Recent Blogs

Black left pointing arrow symbol on white background, directional indicator.
Black left pointing arrow symbol on white background, directional indicator.
Take a quick product tour
Start Product Tour
Product Tour