
Benchmarking LLM Guardrail Providers: A Data-Driven Comparison

February 20, 2026

Why LLM Applications Need Guardrails

Production LLM applications face a growing surface area of risk. Users can inadvertently leak personally identifiable information (PII) through conversational inputs. Models can generate toxic, violent, or sexually explicit content that violates platform policies. Adversarial users craft prompt injection attacks designed to override system instructions, extract confidential prompts, or bypass safety filters entirely.

The consequences are not hypothetical. A PII leak can trigger regulatory action under GDPR, CCPA, or HIPAA. Toxic outputs erode user trust and create brand liability. A successful prompt injection can expose proprietary system prompts or cause the model to execute unintended actions.

Prompt engineering and system instructions provide a first layer of defense, but they are insufficient on their own. Models can be coerced past instruction-level guardrails through encoding attacks, roleplay scenarios, or context manipulation. Automated guardrail systems — purpose-built classifiers that inspect inputs and outputs in real time — provide the defense-in-depth that production deployments require.

The challenge: the market now includes over a dozen guardrail providers, each with different strengths, latency profiles, and coverage gaps. How do you choose the right one for your use case?

TrueFoundry Guardrails: A Unified Gateway

TrueFoundry’s AI Gateway abstracts multiple guardrail providers behind a single OpenAI-compatible API (docs). Teams integrate once with the /v1/chat/completions endpoint and can swap providers through configuration - no code changes required.

The gateway supports two evaluation stages. Input-stage guardrails inspect user messages before they reach the LLM, blocking prompt injections, PII, or harmful content. Output-stage guardrails inspect model responses before they reach the user, catching hallucinations, toxic outputs, or leaked sensitive data.
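For illustration, a client call through the gateway looks like any other OpenAI-compatible request. The sketch below uses the official OpenAI Python SDK; the base URL, model identifier, and environment variable name are placeholders, and the guardrail policies themselves are attached in gateway configuration rather than in client code.

```python
# Minimal sketch: calling an LLM through the gateway with the standard OpenAI SDK.
# The base URL, model name, and env var are illustrative placeholders; guardrails
# are configured on the gateway side, so the client call stays unchanged.
import os
from openai import OpenAI

client = OpenAI(
    api_key=os.environ["TFY_API_KEY"],               # gateway credential (placeholder name)
    base_url="https://your-gateway.example.com/v1",  # hypothetical gateway endpoint
)

response = client.chat.completions.create(
    model="openai-main/gpt-4o-mini",  # placeholder model identifier
    messages=[{"role": "user", "content": "Summarize our refund policy."}],
)
print(response.choices[0].message.content)
```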

TrueFoundry organizes guardrails into five task types:

| Task | Mode | Stage | Docs |
|---|---|---|---|
| PII Detection | Mutate (redact) | Input + Output | Azure PII |
| Content Moderation | Validate (block) | Input + Output | Azure Content Safety |
| Prompt Injection | Validate (block) | Input + Output | Palo Alto Prisma |
| Hallucination Detection | Validate (block) | Output only | Hallucination Detection |
| Topic Detection | Validate (block) | Output only | Configure Guardrails |

This benchmarking study focuses on the first three tasks — PII Detection, Content Moderation, and Prompt Injection — which have the broadest provider coverage and the most mature evaluation datasets.

Evaluation Dataset Design

We constructed category-balanced evaluation datasets of 400 samples per task, designed for statistically meaningful comparison with tight confidence intervals. Each dataset maintains a roughly 50/50 split between positive (harmful/PII-containing) and negative (safe/clean) samples to ensure balanced evaluation of both detection and false positive rates.

PII Detection

| Category | Count | Description |
|---|---|---|
| Email | 40 | Email addresses in various formats |
| PhoneNumber | 25 | US/international phone formats |
| SSN | 25 | Social Security Numbers |
| Person | 25 | Personal names with context |
| Address | 25 | Physical mailing addresses |
| CreditCard | 25 | Credit/debit card numbers |
| IPAddress | 25 | IPv4 and IPv6 addresses |
| Mixed | 25 | Multiple PII types per sample |
| Clean | 185 | No PII present |

Content Moderation

| Category | Count | Description |
|---|---|---|
| Hate | 39 | Hate speech and discrimination |
| SelfHarm | 33 | Self-harm and suicide content |
| Illegal | 33 | Illegal activity instructions |
| Harassment | 31 | Targeted harassment and bullying |
| Violence | 25 | Threats and violent content |
| Other | 1 | Categories with <5 samples, merged for statistical reliability |
| Safe | 238 | Benign content |

Prompt Injection

| Category | Count | Description |
|---|---|---|
| DirectInjection | 43 | Explicit instruction override attempts |
| Jailbreak | 40 | Persona/mode-switching attacks (DAN, etc.) |
| IndirectInjection | 32 | Hidden instructions in structured data |
| EncodingAttack | 22 | Base64, hex, ROT13 encoded payloads |
| Roleplay | 21 | Creative fiction framing to bypass filters |
| ContextManipulation | 21 | Conversation history exploitation |
| SystemPromptExtraction | 21 | Attempts to extract system prompts |
| Benign | 200 | Legitimate technical questions |

Design decisions. Each dataset maintains approximately 50% safe/clean samples to measure false positive rates — a guardrail that flags everything is useless. Categories with fewer than 5 samples were merged into an “Other” category to ensure statistical reliability. Each sample carries per-provider ground truth labels (expected_triggers) because providers may legitimately disagree on edge cases. For example, a sample discussing “how AI safety guardrails work” is safe but touches security-adjacent language, and not all providers handle this distinction identically.

All samples were hand-curated locally rather than drawn from external benchmarks. This ensures precise control over category balance, difficulty distribution, and ground truth accuracy.
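As a rough illustration of the labeling scheme, a single evaluation record might look like the following. Only the expected_triggers idea comes from the methodology above; the surrounding field names and the provider key are assumptions, not the exact schema.

```python
# Hypothetical shape of one JSONL evaluation record. Only expected_triggers is
# described in the text; the other field names and provider key are placeholders.
sample = {
    "id": "pii-email-012",
    "category": "Email",
    "text": "Please send the invoice to jane.doe@example.com by Friday.",
    # Per-provider ground truth: providers may legitimately disagree on edge cases.
    "expected_triggers": {"azure-pii": True},
}

def expected_label(record: dict, provider: str) -> bool:
    """Return the ground-truth label for the given provider."""
    return record["expected_triggers"].get(provider, False)
```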

Evaluation Methodology

Every provider was evaluated against identical datasets through the TrueFoundry AI Gateway, ensuring a fair comparison with no per-provider data leakage.

Evaluation Pipeline

1. Dataset loading — JSONL datasets are loaded with automatic format detection (unified vs. legacy schema)
2. Async evaluation — Samples are dispatched concurrently using semaphore-based throttling (50 parallel requests) via the OpenAI-compatible /v1/chat/completions endpoint (see the sketch after this list)
3. Binary classification — Each sample produces a binary outcome: guardrail triggered (true) or not (false), compared against per-provider ground truth
4. Metrics aggregation — Standard classification metrics are computed across all samples
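A minimal sketch of the concurrency pattern in step 2, assuming the async OpenAI client and a placeholder gateway URL. The real harness also inspects which guardrail fired; this sketch simply treats a rejected request as a trigger.

```python
# Sketch of semaphore-throttled async evaluation (step 2). Endpoint, model name,
# and the "rejected request == guardrail triggered" shortcut are illustrative.
import asyncio
from openai import AsyncOpenAI

client = AsyncOpenAI(base_url="https://your-gateway.example.com/v1", api_key="...")
semaphore = asyncio.Semaphore(50)  # at most 50 requests in flight

async def evaluate_sample(sample: dict) -> bool:
    """Return True if the guardrail blocked/flagged this sample."""
    async with semaphore:
        try:
            await client.chat.completions.create(
                model="openai-main/gpt-4o-mini",  # placeholder model identifier
                messages=[{"role": "user", "content": sample["text"]}],
            )
            return False  # request passed the guardrail
        except Exception:
            return True   # in this sketch, a rejected request counts as a trigger

async def evaluate_all(samples: list[dict]) -> list[bool]:
    return await asyncio.gather(*(evaluate_sample(s) for s in samples))
```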

Metrics

| Metric | What it measures |
|---|---|
| Precision | Of everything the guardrail flagged, how much was actually harmful |
| Recall | Of all truly harmful content, how much did the guardrail catch |
| F1 Score | Single score balancing precision and recall — the primary comparison metric |
| Accuracy | Overall correctness across both harmful and safe samples |
| 95% Confidence Interval | Wilson score interval on accuracy, quantifying measurement uncertainty |

F1 Score serves as the primary ranking metric because it balances the trade-off between precision (avoiding false alarms) and recall (catching real threats). A high-precision, low-recall guardrail misses threats. A high-recall, low-precision guardrail blocks legitimate users.

With 400 samples per task, Wilson score confidence intervals give ±0.03–0.05 margin at 95% confidence, tight enough to distinguish meaningful performance differences between providers.
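These metrics can be reproduced from the binary outcomes with a few lines of code. The sketch below is a minimal illustration of the precision/recall/F1/accuracy computation and the Wilson score interval, not the exact evaluation code.

```python
# Classification metrics from ground-truth vs. predicted labels, plus a Wilson
# score confidence interval on accuracy.
import math

def wilson_interval(successes: int, n: int, z: float = 1.96) -> tuple[float, float]:
    """95% Wilson score interval for a proportion (here, accuracy)."""
    p = successes / n
    denom = 1 + z**2 / n
    center = (p + z**2 / (2 * n)) / denom
    margin = z * math.sqrt(p * (1 - p) / n + z**2 / (4 * n**2)) / denom
    return (center - margin, center + margin)

def classification_metrics(y_true: list[bool], y_pred: list[bool]) -> dict:
    tp = sum(t and p for t, p in zip(y_true, y_pred))
    fp = sum((not t) and p for t, p in zip(y_true, y_pred))
    fn = sum(t and (not p) for t, p in zip(y_true, y_pred))
    tn = sum((not t) and (not p) for t, p in zip(y_true, y_pred))
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
    return {
        "precision": precision,
        "recall": recall,
        "f1": f1,
        "accuracy": (tp + tn) / len(y_true),
        "accuracy_ci95": wilson_interval(tp + tn, len(y_true)),
    }
```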

Latency Tracking

We track latency at two levels:

• Client-side latency — End-to-end time measured in the evaluation harness, including network round- trip

• Server-side latency — Guardrail processing time only, extracted from TrueFoundry traces via the Spans API (tfy.guardrail.metric.latency_in_ms)

Server-side latency isolates the guardrail’s own processing time from network overhead, providing a more accurate comparison across providers.
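A minimal sketch of the client-side measurement follows; extracting server-side latency from gateway traces depends on the Spans API response format and is not shown here.

```python
# Client-side latency: wall-clock time around the full request, including network.
import time

def timed_call(send_fn, *args, **kwargs):
    """Run a request callable and return (result, elapsed_ms)."""
    start = time.perf_counter()
    result = send_fn(*args, **kwargs)
    elapsed_ms = (time.perf_counter() - start) * 1000
    return result, elapsed_ms
```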

Provider Comparison Results

PII Detection

| Provider | Precision | Recall | F1 Score | Accuracy | 95% CI | Latency |
|---|---|---|---|---|---|---|
| Azure PII | 1.000 | 0.865 | 0.928 | 0.928 | [0.898, 0.949] | 52.3ms |

Azure PII provides fine-grained entity-level detection with configurable PII categories (Email, PhoneNumber, SSN, Address, CreditCardNumber, IPAddress, Person) and language-aware processing. It achieves perfect precision (every flagged entity is genuine PII) alongside strong recall at 0.865, evaluated in Mutate mode where detected PII is redacted rather than blocked outright. The missed detections (0.135 recall gap) tend to concentrate in ambiguous contexts where PII entities appear in non-standard formats.

Content Moderation

| Provider | Precision | Recall | F1 Score | Accuracy | 95% CI | Latency |
|---|---|---|---|---|---|---|
| OpenAI Moderation | 0.922 | 0.877 | 0.899 | 0.920 | [0.889, 0.943] | 191.5ms |
| Azure Content Safety | 0.796 | 0.722 | 0.757 | 0.812 | [0.771, 0.847] | 52.2ms |
| PromptFoo | 0.617 | 0.568 | 0.592 | 0.683 | [0.636, 0.727] | 1118.2ms |

Content Moderation shows the clearest provider differentiation. OpenAI’s omni-moderation-latest model leads with a 0.899 F1 score, achieving strong balance between precision and recall across Hate, Violence, SelfHarm, and Harassment categories. Azure Content Safety trades lower accuracy for significantly faster response times (52ms vs. 192ms), making it a viable choice for latency-sensitive deployments. PromptFoo lags on both efficacy and latency in this evaluation, with its 1.1-second response times reflecting its LLM-based detection approach.

Prompt Injection

| Provider | Precision | Recall | F1 Score | Accuracy | 95% CI | Latency |
|---|---|---|---|---|---|---|
| Pangea | 0.750 | 0.990 | 0.853 | 0.830 | [0.790, 0.864] | 358.7ms |

Pangea demonstrates a high-recall detection strategy, catching 0.990 of injection attempts at the cost of more false positives (0.750 precision). This means it rarely misses an attack but will occasionally flag legitimate security-related questions. The safe samples in this dataset are deliberately security-adjacent (“How do AI safety guardrails work?”) to stress-test false positive rates, which partially explains the precision gap. For applications where missing an injection attack carries higher risk than occasional false alarms, Pangea’s recall-oriented profile is well-suited.

Key Takeaways

No single provider wins across all tasks. The guardrail landscape is specialized: providers optimized for PII detection may underperform on prompt injection, and vice versa. This is expected — each task demands fundamentally different detection strategies.

Precision and recall tell different stories. A provider with high precision but low recall is conservative - it rarely raises false alarms but misses real threats. The inverse catches everything but fatigues users with false positives. The right balance depends on your application’s risk tolerance.

A unified gateway enables informed selection. By evaluating all providers through a single integration point, teams can benchmark providers head-to-head on their own data and select the best provider per task — or combine multiple providers for defense-in-depth. Teams can also build custom guardrails for domain-specific needs.

Task-specific evaluation is non-negotiable. Generic “safety scores” obscure critical differences in provider behavior. Only by evaluating against curated, category-balanced datasets with per-provider ground truth can teams make informed procurement decisions. The benchmarking framework described here — 400 category-balanced samples per task, Wilson score confidence intervals, per-provider labels, dual latency tracking, and standard classification metrics — provides a reproducible methodology for any team evaluating guardrail solutions.
