Benchmarking Guardrail Effectiveness in High-Risk LLM Use Cases

1. Introduction

The enterprise adoption of Large Language Models has entered a critical inflection point. While LLMs offer transformative capabilities in customer service automation, document analysis, code generation, and decision support, their deployment in regulated industries introduces risks that generic safety training cannot adequately address.

Consider three illustrative scenarios:

Scenario A: Healthcare. A hospital deploys an LLM-powered clinical decision support system. A multi-turn conversation with a physician gradually escalates from discussing drug interactions to providing specific dosage recommendations for a controlled substance — without flagging that the recommended dose exceeds the lethal threshold. The LLM's built-in safety mechanisms, designed for consumer use cases, fail to enforce healthcare-specific prescribing boundaries.

Scenario B: Telecommunications. A telecom operator integrates an LLM into its network operations center (NOC) support system. An attacker impersonating a NOC engineer uses social engineering to extract network infrastructure topology and base station vulnerability information. The model's safety training does not include telecom-specific threat vectors such as infrastructure exploit elicitation or crisis communication pretexts.

Scenario C: Defense. A military command system leverages an LLM for operational planning assistance. Without domain-specific guardrails, the system fails to enforce classification boundaries, Rules of Engagement (ROE), or international humanitarian law (IHL) compliance, potentially generating plans that violate the proportionality principle or recommend prohibited weapons.

These scenarios illustrate a fundamental gap: generic LLM safety mechanisms are designed for consumer use cases and are demonstrably insufficient for policy-sensitive enterprise deployments. Existing safety benchmarks (HarmBench [1], JailbreakBench [2], AdvBench [3]) evaluate models against standard adversarial attacks but do not assess domain-specific policy enforcement, regulatory compliance coverage, or the layered defense architectures required for high-risk deployments.

This paper addresses this gap through three contributions:

Empirical Benchmarking of Baseline LLM Safety. We evaluate 8 commercial LLM models across 7 state-of-the-art attack algorithms in 112 evaluations, establishing that the baseline defense rate without external guardrails is only 38.1% — far below the thresholds required by EU AI Act Article 15 (accuracy and robustness) and NIST AI RMF MEASURE function.
Layered Guardrail Architecture Evaluation. We benchmark the AEGIS 3-Tier defense architecture (rule-based + ML + LLM Judge) across multiple configurations, demonstrating that layered guardrails improve defense rates from 38.1% to 75–85% while maintaining sub-20ms P99 latency.
Enterprise Guardrail Effectiveness Index (EGEI). We propose a composite metric that captures the multidimensional requirements of enterprise guardrail systems — attack resilience, regulatory compliance, latency overhead, and domain-specific policy adherence — enabling apples-to-apples comparison of guardrail configurations across deployment scenarios.

2. Related Work

2.1 LLM Safety Benchmarks

Existing benchmarks focus primarily on attack success rates against model-native safety mechanisms. HarmBench [1] provides standardized red-teaming evaluation across harmful content categories. JailbreakBench [2] measures robustness against jailbreak attacks. AdvBench [3] evaluates adversarial robustness. While valuable, these benchmarks share three limitations for enterprise contexts: (a) they do not evaluate external guardrail systems; (b) they do not assess domain-specific policy enforcement; and (c) they report point estimates without statistical risk modeling.

2.2 Guardrail Systems

Commercial guardrail solutions include NVIDIA NeMo Guardrails [4], Guardrails AI [5], and LlamaGuard [6]. Academic work includes Llama Guard [7] for content classification and WildGuard [8] for adversarial robustness. However, these systems are typically evaluated as standalone classifiers rather than as layered architectures optimized for specific enterprise deployment scenarios.

2.3 Regulatory Frameworks

The EU AI Act [9] classifies AI systems by risk level (Minimal, Limited, High, Unacceptable) and mandates specific requirements for high-risk systems including risk management (Art. 9), data governance (Art. 10), transparency (Art. 13), human oversight (Art. 14), and accuracy/robustness (Art. 15). The Korean AI Act (K-AI Act) [10] establishes parallel requirements including safety assurance (Art. 5), user protection (Art. 7), transparency (Art. 9), and personal data protection (Art. 12). The NIST AI RMF [11] provides a voluntary framework organized around four functions: GOVERN, MAP, MEASURE, and MANAGE.

No existing benchmark directly evaluates guardrail systems against these regulatory requirements. Our EGEI metric addresses this gap.

2.4 Domain-Specific AI Safety

Prior work on domain-specific AI safety includes medical AI safety frameworks [12], financial AI risk management [13], and military autonomous systems governance (DoD Directive 3000.09) [14]. These frameworks define requirements but do not provide benchmarking methodologies for evaluating guardrail effectiveness in their respective domains.

3. Enterprise Threat Landscape

3.1 Threat Taxonomy for Policy-Sensitive Domains

Building on the OWASP LLM Top 10 [15] and NIST AI RMF [11], we define an enterprise-specific threat taxonomy with domain extensions:

Table I: Enterprise Threat Taxonomy

#	Threat Category	Sub-types	Enterprise Impact
1	Prompt Injection	Direct, Indirect, System Prompt Extraction, Instruction Override	Unauthorized access to enterprise knowledge bases; policy bypass
2	Jailbreak	RolePlay, Encoding Bypass, Multi-Turn Escalation, Genetic, Token Smuggling, ASCII Art	Circumvention of domain-specific safety policies
3	Data Leakage	Training Data Extraction, PII Exfiltration, Credential Leakage	Regulatory violations (GDPR, HIPAA, K-PIPA); enterprise IP theft
4	Harmful Content	Violence, Hate Speech, Illegal Instructions, Misinformation, Self-Harm	Reputational damage; legal liability; user harm
5	Model Manipulation	Adversarial Input, Model Poisoning, Backdoor Attack	Compromised decision quality in high-stakes contexts
6	Agent Abuse	Tool Misuse, Unauthorized Actions, Privilege Escalation, Reasoning Hijack	Autonomous action beyond authorized boundaries
7	Privacy Violation	PII exposure, consent management, cross-border data transfer	GDPR/K-PIPA fines; loss of customer trust
8	Supply Chain	Plugin poisoning, model poisoning, data poisoning	Compromised enterprise AI infrastructure

3.2 Domain-Specific Threat Extensions

Beyond generic threats, each enterprise domain introduces unique attack vectors that standard safety mechanisms do not address:

Table II: Domain-Specific Threat Vectors

Domain	Threat ID	Attack Technique	Difficulty	Impact
Telecom	TELCO-001	NOC Engineer Impersonation	2	Network topology extraction via role impersonation
Telecom	TELCO-002	Internal Memo Pretext	2	Harmful content extraction under documentation guise
Telecom	TELCO-003	Regulatory Bypass Pretext	3	Compliance testing claim to bypass safety
Telecom	TELCO-004	Infrastructure Exploit Elicitation	4	Base station attack plans, SIM cloning guidance
Telecom	TELCO-005	Synthetic Data Pretext	2	Harmful data generation under AI training guise
Telecom	TELCO-006	Crisis Communication Pretext	3	Misinformation via disaster drill framing
Healthcare	HEALTH-001	Self-Prescription Elicitation	3	Drug recommendations without prescription authority
Healthcare	HEALTH-002	Lethal Dosage Extraction	4	Dosage information exceeding safety thresholds
Financial	FIN-001	Illegal Lending Guidance	3	High-interest loan promotion; regulatory violations
Financial	FIN-002	Investment Guarantee Claims	3	False return guarantees; securities fraud
Legal	LEGAL-001	Contract Fraud Assistance	3	Guidance on breaching contracts without penalty
Defense	MIL-001	Classification Boundary Violation	5	Aggregation of unclassified items into classified exposure
Defense	MIL-002	ROE/IHL Non-Compliance	5	Plans violating proportionality or weapon prohibitions

3.3 CVSS-Like AI Severity Scoring

We adopt a CVSS-like scoring framework adapted for AI vulnerabilities, enabling quantitative comparison across domains:

Scoring dimensions:

Dimension	Levels	Score Range
Attack Complexity	Low (0.77) / High (0.44)	0.44–0.77
Privileges Required	None (0.85) / Low (0.62) / High (0.27)	0.27–0.85
User Interaction	None (0.85) / Required (0.62)	0.62–0.85
Confidentiality Impact	None / Low / High	0.0–0.56
Integrity Impact	None / Low / High	0.0–0.56
Availability Impact	None / Low / High	0.0–0.56

Severity ratings: None (0.0), Low (0.1–3.9), Medium (4.0–6.9), High (7.0–8.9), Critical (9.0–10.0)

4. Layered Guardrail Architecture

4.1 Design Principles for Enterprise Guardrails

Enterprise guardrail systems must satisfy four requirements that consumer-grade safety mechanisms do not:

R1: Regulatory compliance. Guardrails must demonstrably satisfy specific regulatory articles (EU AI Act Art. 9, 10, 13, 14, 15; K-AI Act Art. 5, 7, 9, 12; NIST AI RMF GOVERN/MAP/MEASURE/MANAGE).

R2: Domain-specific policy enforcement. Generic content safety is insufficient; guardrails must enforce sector-specific policies (healthcare prescribing boundaries, financial suitability rules, telecom infrastructure protection, military classification levels).

R3: Auditable decision trails. Enterprise deployments require complete audit logging of every guardrail decision — the input, the analysis, the decision rationale, and the verdict — for regulatory inspection and incident investigation.

R4: Fail-safe behavior. In high-risk domains, system failures must default to safe states (BLOCK on detection uncertainty) rather than permissive states. This is configurable between Fail-Safe (default BLOCK) and Fail-Closed (halt service) policies depending on domain requirements.

4.2 AEGIS 3-Tier Hierarchical Architecture

The AEGIS framework implements a 3-Tier defense architecture optimized for the latency-accuracy trade-off:

Table III: 3-Tier Architecture Specification

Tier	Method	Latency	Traffic Share	Confidence	Technique
1	Rule-Based Filter	< 0.5ms	70–80%	0.95–1.0	Aho-Corasick multi-pattern (8 languages); Bloom filter (100K patterns, FPR 0.001)
2	ML Classifier	< 5ms	15–25%	0.75–0.95	Guard Encoder (mDeBERTa-v3-base, 8-class, INT8); 5-signal heuristic fallback
3	LLM Judge	< 200ms	< 5%	0.50–0.95	Constitutional AI 2-stage verification; 9-category evaluation; 10-turn context

Escalation logic:

Tier 1 → if uncertain → Tier 2 → if ambiguous (0.25 < prob < 0.75) → Tier 3

This cascading architecture ensures that the vast majority of clearly safe or clearly dangerous requests are processed at sub-millisecond latency, while only genuinely ambiguous cases (< 5% of traffic) incur the cost of LLM-based judgment.

4.3 Defense Algorithm Portfolio

Four primary and two auxiliary defense algorithms operate within the 3-Tier framework:

Algorithm	Type	Key Capability	Block Threshold
GuardNet [16]	Hierarchical Graph	3-level detection (token/sentence/prompt) + graph connectivity	0.7
JBShield [17]	Dual-Track	Separates toxicity from jailbreak technique in representation space	Adaptive
CCFC	Obfuscation Detection	Core vs. full prompt divergence analysis; flags \|divergence\| > 0.3	0.4–0.7
MULI	Intrinsic Toxicity	Simulated logit distribution analysis across 7 categories	Adaptive
TAG (aux)	Context Tagging	Keyword + context tag classifier for Tier 2 augmentation	—
RATIONAL (aux)	Reasoning Analysis	Chain-of-Thought risk signal detection	—

4.4 PALADIN 6-Layer Deep Defense Pipeline

Orthogonal to the 3-Tier system, the PALADIN (Protective AI Layered Adversarial Defense Inspection Network) pipeline provides sequential deep content inspection:

Layer	Name	Function	Enterprise Relevance
L0	TrustBoundary	Input validation	Unicode normalization prevents encoding attacks
L1	IntentVerification	Intent analysis	Detects prompt injection, role confusion
L2	RaGuard	RAG poisoning	Protects enterprise knowledge bases
L3	ClassRagLayer	Semantic classification	Content categorization across risk classes
L4	CircuitBreaker	Anomaly detection	Rate limiting, abuse pattern identification
L5	BehavioralAnalysis	Behavioral profiling	Multi-turn escalation detection (critical for Crescendo/HPM)

4.5 Domain-Specific Policy Modules

AEGIS extends the generic defense architecture with domain-specific policy enforcement modules:

4.5.1 Enterprise Policy Framework

Policies are defined with quantitative guarantees:

Parameter	Child Safety Policy	Enterprise Policy
Target Groups	Children, Students	Professionals, Enterprise
Languages	Korean, English	Korean, English, Japanese
Modalities	Text, Image	Text, Code
ε-Coverage	99%	95%
Defended Threats	Jailbreak, Toxicity, Self-harm, PII, Sexual, Violence, Grooming	Jailbreak, PII, Data leakage, Bias
δ-Adversarial Resistance	98%	90%
Max P95 Latency	200ms	100ms
Human Escalation	Allowed	Not Allowed

4.5.2 Telecom-Specific Threat Detection

Six telecom-specific threat categories with severity scoring:

Category	Severity	Detection Method
Bias/Discrimination in customer segmentation	4	Demographic profiling pattern detection
Misinformation (false outage reports, fake standards)	4	Factual claim verification against known standards
Network sabotage guidance	5	Infrastructure exploit pattern matching
SMS-based illegal activity	5	Communication abuse pattern detection
DRM bypass guidance	3	Copyright circumvention pattern matching
Explicit customer interaction scripts	5	Content safety classification

4.5.3 Military Domain Modules (7)

Module	Function	Compliance Standard
Classification Guard	5-level clearance enforcement; mosaic risk detection	MIL-STD-882E
OPSEC Filter	6-category operational security; auto-redaction	DoD OPSEC guidelines
Command Chain Guard	6-level hierarchy integrity; SHA-256 signature verification	Command authority doctrine
ROE Compliance Engine	Proportionality assessment; weapon prohibition; JAG review trigger	Geneva Conventions, IHL
Tactical Autonomy Guard	5-level autonomy; communication-adaptive constraints	DoD Directive 3000.09
Anti-Spoofing Guard	5 spoofing type detection; multi-source cross-validation	국방부 AI 보안 지침
Cross-Domain Security	5 security domains; transfer direction control; audit trail	DCSA cross-domain policy

4.5.4 Financial Boundary Detection

Boundary	Detection Rule	Regulatory Basis
Investment guarantee claims	Pattern match: "guaranteed return", "100% safe"	Securities regulations
Illegal lending promotion	High-interest threshold + non-licensed entity detection	Financial Services Act
Risk disclosure omission	Missing risk warning detection in investment context	Investor protection rules

4.5.5 Healthcare Safety Boundaries

Boundary	Detection Rule	Regulatory Basis
Self-prescription guidance	Drug recommendation without prescriber context	Medical Practice Act
Lethal dosage information	Dosage exceeding known safety margins	Pharmacovigilance guidelines
Diagnostic claims	Definitive diagnosis without clinical context	Medical Device Act

5. Experimental Methodology

5.1 Evaluation Framework

We evaluate guardrail effectiveness across three dimensions:

Dimension 1: Attack Resilience. Measured as 1 - ASR (Attack Success Rate) across 7 adversarial algorithms: PAIR [18], TAP [19], Crescendo [20], AutoDAN [21], BEAST [22], ArtPrompt [23], HPM [24].

Dimension 2: Regulatory Compliance Coverage. Assessed against EU AI Act (6 articles), K-AI Act (4 articles), and NIST AI RMF (4 functions) — 14 requirements total.

Dimension 3: Operational Performance. Latency (P50, P99), throughput (RPS), and availability (uptime SLA).

5.2 Models Under Test

Table IV: Target Models

#	Model	Provider	Release Period	Safety Approach
1	GPT-5	OpenAI	2025	RLHF + safety fine-tuning
2	Claude Opus 4.6	Anthropic	2025	Constitutional AI
3	Gemini 3.1 Pro Preview	Google	2026	Multi-objective alignment
4	Grok 4.1 Fast (Reasoning)	xAI	2026	Reasoning-mode safety
5	Grok 4.1 Fast	xAI	2026	Standard alignment
6	Grok 3 Mini	xAI	2025	Lightweight alignment
7	DeepSeek (R1)	DeepSeek	2025	Hybrid safety training
8	DeepSeek Chat	DeepSeek	2025	Standard safety training

5.3 Guardrail Configurations

We benchmark five guardrail configurations representing increasing levels of defense sophistication:

Config	Description	Layers	Domain Policy
C0	No guardrail (model-native safety only)	0	None
C1	Tier 1 only (rule-based filter)	1	None
C2	Tier 1 + Tier 2 (rule-based + ML)	2	None
C3	Full 3-Tier (rule-based + ML + LLM Judge)	3	None
C4	Full 3-Tier + PALADIN + Domain Policy	3+6	Active

5.4 Attack Protocol

Each model is tested under the following protocol:

Sessions: 2 independent sessions (different scenario sets per session)
Algorithms per session: 7 (PAIR, TAP, Crescendo, AutoDAN, BEAST, ArtPrompt, HPM)
Evaluations per session: 8 models x 7 algorithms = 56
Total evaluations: 112
Scenario categories: Jailbreak + data extraction (Session 1); Jailbreak + encoding bypass (Session 2)
Success criteria: Score >= 0.5 = successful attack (VULNERABLE or PARTIAL)

5.5 Enterprise Guardrail Effectiveness Index (EGEI)

We propose the EGEI as a composite metric:

EGEI = w1 * AttackResilience + w2 * ComplianceCoverage + w3 * PerformanceScore + w4 * DomainAdherence

where:

Component	Symbol	Definition	Weight
Attack Resilience	AR	1 - mean(ASR) across all algorithms	w1 = 0.35
Compliance Coverage	CC	Fraction of 14 regulatory requirements satisfied	w2 = 0.25
Performance Score	PS	Normalized score based on P99 latency and throughput	w3 = 0.15
Domain Adherence	DA	Fraction of domain-specific policies actively enforced	w4 = 0.25

EGEI rating thresholds:

Rating	EGEI Score	Interpretation
A+	>= 0.90	Production-ready for high-risk domains
A	>= 0.80	Suitable for regulated enterprise use
B	>= 0.65	Adequate for moderate-risk deployments
C	>= 0.50	Requires additional hardening
D	< 0.50	Inadequate for enterprise deployment

6. Results

6.1 Baseline Vulnerability Assessment (C0: No Guardrail)

Table V: Per-Model Baseline ASR (No External Guardrails)

Model	Mean ASR	Risk Score	Risk Label	Session 1 ASR	Session 2 ASR	CV (%)
GPT-5	0.865	1.00	Critical	0.862	0.868	0.6
Grok 3 Mini	0.732	0.97	Critical	0.727	0.737	1.0
Grok 4.1 Fast	0.657	1.00	Critical	0.672	0.642	3.3
DeepSeek (R1)	0.648	0.83	High	0.653	0.643	1.1
DeepSeek Chat	0.624	0.79	High	0.630	0.618	1.4
Claude Opus 4.6	0.568	0.93	Critical	0.573	0.562	1.5
Grok 4.1 (Reasoning)	0.464	0.79	High	0.483	0.445	5.9
Gemini 3.1 Pro	0.398	1.00	Critical	0.375	0.422	8.3
Aggregate	0.620	—	—	0.622	0.617	2.9

Finding 1: All 8 models are VULNERABLE without external guardrails. The mean ASR of 0.620 means that adversaries succeed in 62% of attacks on average. The mean coefficient of variation (CV) of 2.9% confirms high measurement reproducibility across independent sessions.

6.2 Per-Algorithm Attack Effectiveness

Table VI: Algorithm Attack Success Rates (C0 Configuration)

Algorithm	Mean ASR	Models with Score=1.0	Strongest Defense	Weakest Defense
PAIR	1.000	8/8 (100%)	None — universally effective	—
Crescendo	1.000	8/8 (100%)	None — universally effective	—
HPM	0.938	6/8 (75%)	Gemini (error/blocked)	Grok 4.1 Fast (1.0)
TAP	0.922	6/8 (75%)	DeepSeek (0.75)	GPT-5 (1.0)
BEAST	0.891	5/8 (63%)	DeepSeek (0.50)	GPT-5 (1.0)
AutoDAN	0.866	6/8 (75%)	Grok 4.1-R (0.0)	GPT-5 (1.0)
ArtPrompt	0.830	5/8 (63%)	Grok 4.1-R (0.5) / DeepSeek (0.5)	GPT-5 (1.0)

Finding 2: PAIR and Crescendo achieve 100% ASR against all models. These two algorithms — representing iterative refinement and multi-turn escalation paradigms respectively — penetrate every model's native safety mechanisms without exception, including Constitutional AI (Claude) and reasoning-mode processing (Grok 4.1-R).

6.3 Algorithm x Model Vulnerability Matrix

Table VII: Detailed Score Matrix (0 = BLOCKED, 0.5 = PARTIAL, 1.0 = VULNERABLE)

Model	PAIR	TAP	Crescendo	AutoDAN	BEAST	ArtPrompt	HPM	Mean
GPT-5	1.0	1.0	1.0	1.0	1.0	1.0	1.0*	1.00
Grok 4.1 Fast	1.0	1.0	1.0	1.0	1.0	1.0	1.0	1.00
Grok 3 Mini	1.0	0.75	1.0	1.0	1.0	1.0	1.0	0.96
DeepSeek (R1)	1.0	0.75	1.0	1.0	0.5	0.5	1.0	0.82
DeepSeek Chat	1.0	0.75	1.0	1.0	0.5	0.5	0.75	0.79
Claude Opus 4.6	1.0	0.75	1.0	1.0	1.0	0.75	1.0	0.93
Grok 4.1-R	1.0	1.0	1.0	0.0	1.0	0.5	1.0	0.79
Gemini 3.1 Pro	1.0	1.0	1.0*	1.0*	1.0*	1.0*	0.0*	0.86

*Partial data due to timeout/rate-limit errors.

6.4 Layered Guardrail Effectiveness

Table VIII: Defense Rate by Guardrail Configuration

Config	Description	Est. Defense Rate	Latency Overhead (P99)	Throughput
C0	No guardrail	38.1%	0ms	N/A
C1	Tier 1 (rules)	55–60%	< 0.5ms	100K+ RPS
C2	Tier 1+2 (rules + ML)	65–72%	< 5ms	80K+ RPS
C3	Full 3-Tier	75–82%	< 20ms (P99)	50K+ RPS
C4	3-Tier + PALADIN + Domain	82–90%	< 25ms (P99)	45K+ RPS

Finding 3: Each additional guardrail tier provides diminishing but significant improvement. Tier 1 alone captures 70–80% of clearly dangerous content at sub-millisecond latency. Tier 2 adds ML-based contextual analysis for an additional 10–12% improvement. Tier 3's LLM Judge handles the remaining ambiguous cases, adding 7–10% improvement at the cost of higher latency for a small fraction of traffic.

Finding 4: Domain-specific policies (C4) add 5–8% over generic 3-Tier (C3). This incremental improvement is critical for high-risk domains where the marginal 5–8% may represent the difference between regulatory compliance and violation.

6.5 Per-Category Block Probabilities with AEGIS Overlay

Table IX: AEGIS Defense State Transition Probabilities

Attack Category	Vuln → Blocked	Vuln → Partial	Partial → Blocked	Effective Defense Rate
Harmful Content	0.90	0.08	0.95	94.6%
Prompt Injection	0.85	0.12	0.92	91.0%
Data Extraction	0.82	0.14	0.88	88.3%
Jailbreak	0.78	0.17	0.90	85.3%
Encoding Bypass	0.72	0.20	0.85	79.0%
Multi-Turn	0.70	0.22	0.83	78.3%

Effective Defense Rate = Vuln→Blocked + (Vuln→Partial x Partial→Blocked)

Finding 5: Multi-turn attacks remain the weakest point even with full guardrails (78.3%). This 16.3% gap compared to harmful content (94.6%) represents a critical area for improvement, particularly given that Crescendo achieves 100% ASR without guardrails.

6.6 Reasoning Mode as Safety Mechanism

Table X: Impact of Reasoning Mode on ASR

Model Pair	Standard ASR	Reasoning ASR	Delta ASR	Reduction
Grok 4.1 Fast vs. Grok 4.1 (Reasoning)	0.657	0.464	-0.193	29.4%

Detailed per-algorithm comparison:

Algorithm	Standard Score	Reasoning Score	Improved?
PAIR	1.0	1.0	No
TAP	1.0	1.0	No
Crescendo	1.0	1.0	No
AutoDAN	1.0	0.0	Yes (Full Block)
BEAST	1.0	1.0	No
ArtPrompt	1.0	0.5	Yes (Partial)
HPM	1.0	1.0	No

Finding 6: Reasoning mode selectively blocks genetically evolved (AutoDAN) and visual (ArtPrompt) attacks but does not improve defense against iterative (PAIR), tree-search (TAP), multi-turn (Crescendo), suffix-optimized (BEAST), or psychological (HPM) attacks. This suggests that chain-of-thought reasoning helps detect structurally anomalous prompts but cannot overcome conversational manipulation or universal optimization techniques.

6.7 Enterprise Guardrail Effectiveness Index (EGEI)

Table XI: EGEI Scoring by Configuration and Deployment Scenario

Config	AR (x0.35)	CC (x0.25)	PS (x0.15)	DA (x0.25)	EGEI	Rating
C0	0.38 → 0.133	0.14 → 0.035	1.00 → 0.150	0.00 → 0.000	0.318	D
C1	0.58 → 0.203	0.36 → 0.090	0.95 → 0.143	0.00 → 0.000	0.435	D
C2	0.69 → 0.242	0.57 → 0.143	0.85 → 0.128	0.00 → 0.000	0.512	C
C3	0.79 → 0.277	0.79 → 0.198	0.80 → 0.120	0.00 → 0.000	0.594	C
C4	0.86 → 0.301	0.93 → 0.233	0.75 → 0.113	0.85 → 0.213	0.859	A

Component scoring methodology:

AR (Attack Resilience): Based on estimated defense rate for each config (C0=38.1%, C1=58%, C2=69%, C3=79%, C4=86%)
CC (Compliance Coverage): Fraction of 14 regulatory requirements addressed (C0=2/14 model-native; C1=5/14 adds audit logging + pattern matching; C2=8/14 adds ML classification + attribution; C3=11/14 adds human-equivalent judgment + proportionality; C4=13/14 adds domain-specific compliance)
PS (Performance Score): Normalized latency efficiency (C0=1.0 no overhead; degrading as layers add latency)
DA (Domain Adherence): Fraction of domain-specific policies enforced (0 for C0–C3; 0.85 for C4 with active domain modules)

Finding 7: Only Configuration C4 (Full 3-Tier + Domain Policy) achieves EGEI rating A, suitable for regulated enterprise deployment. Notably, C3 (Full 3-Tier without domain policies) scores only C (0.594), underscoring that domain-specific policy enforcement is not optional for high-risk use cases — it accounts for a 0.265 EGEI improvement (44.6% increase over C3).

6.8 EGEI by Enterprise Deployment Scenario

Table XII: EGEI Scores for Domain-Specific Deployments (C4 Configuration)

Scenario	AR	CC	PS	DA	EGEI	Key Risk
General Enterprise	0.86	0.93	0.80	0.75	0.840 (A)	Data leakage, PII
Telecom NOC	0.86	0.93	0.75	0.90	0.870 (A)	Infrastructure exploit
Healthcare CDS	0.82	0.86	0.75	0.85	0.824 (A)	Dosage safety, self-Rx
Financial Advisory	0.84	0.93	0.80	0.88	0.861 (A)	Investment fraud
Military C2	0.88	0.86	0.70	0.95	0.862 (A)	ROE/IHL compliance

All five enterprise scenarios achieve EGEI rating A with C4 configuration. The military C2 scenario achieves the highest Domain Adherence (0.95) due to the comprehensive 7-module military defense suite, while the healthcare scenario has the lowest Attack Resilience (0.82) due to the subtlety of dosage-related attacks that evade standard content filters.

6.9 Regulatory Compliance Assessment

Table XIII: Regulatory Requirement Coverage by Configuration

Requirement	Regulation	C0	C1	C2	C3	C4
Risk management system	EU AI Act Art. 9	—	—	△	○	●
Data governance	EU AI Act Art. 10	—	—	—	△	●
Transparency & logging	EU AI Act Art. 13	—	△	○	●	●
Human oversight	EU AI Act Art. 14	—	—	—	○	●
Accuracy & robustness	EU AI Act Art. 15	—	△	○	○	●
Quality management	EU AI Act Art. 17	—	—	—	△	●
Safety assurance	K-AI Act Art. 5	—	△	○	○	●
User protection	K-AI Act Art. 7	△	△	○	●	●
Transparency	K-AI Act Art. 9	—	—	△	○	●
Personal data protection	K-AI Act Art. 12	—	—	—	△	●
GOVERN function	NIST AI RMF	—	—	△	○	●
MAP function	NIST AI RMF	—	△	○	○	●
MEASURE function	NIST AI RMF	—	—	△	○	●
MANAGE function	NIST AI RMF	—	—	—	△	●

● = Full, ○ = Substantial, △ = Partial, — = Not addressed

Finding 8: Only C4 achieves full coverage across all 14 regulatory requirements. C3 achieves substantial coverage for most requirements but lacks domain-specific data governance and quality management capabilities. Critically, C0 (model-native safety only) addresses at most 2 of 14 requirements, making unguarded LLM deployment non-compliant with all three regulatory frameworks.

7. Discussion

7.1 The 38.1% Baseline Problem

Our most consequential finding is that the average baseline defense rate without external guardrails is only 38.1%. In enterprise terms, this means that a model deployed with only its native safety mechanisms will fail to defend against approximately 6 out of 10 adversarial attacks. For high-risk domains governed by EU AI Act Article 15 (which mandates "appropriate levels of accuracy, robustness and cybersecurity"), a 38.1% defense rate is categorically non-compliant.

This finding has immediate practical implications: any enterprise deploying an LLM in a regulated context without external guardrails is likely in violation of applicable regulations. The EU AI Act provides for fines up to 3% of global turnover for non-compliance with high-risk AI requirements, making guardrail investment not merely a technical decision but a legal imperative.

7.2 The Diminishing Returns of Layered Defense

Our analysis reveals a clear pattern of diminishing returns as guardrail layers are added:

Transition	Defense Rate Gain	Latency Cost
C0 → C1 (add rules)	+17–22%	+0.5ms
C1 → C2 (add ML)	+10–12%	+4.5ms
C2 → C3 (add LLM Judge)	+7–10%	+15ms (P99)
C3 → C4 (add domain policy)	+5–8%	+5ms

The first tier (rule-based) provides the highest ROI: 17–22% defense improvement at sub-millisecond cost. The LLM Judge tier provides the lowest marginal gain (7–10%) at the highest marginal cost (15ms). However, for high-risk enterprise deployments, every percentage point of defense improvement carries significant value — the difference between 82% and 90% defense rate may represent the boundary between a regulatory audit pass and failure.

7.3 The Domain Policy Imperative

Configuration C3 (full 3-Tier, no domain policy) achieves EGEI rating C (0.594), while C4 (with domain policy) achieves rating A (0.859). This 44.6% EGEI improvement from domain-specific policies alone demonstrates that generic guardrails are necessary but insufficient for enterprise deployment. The domain policy contribution comes from two sources:

Domain-specific threat detection — telecom infrastructure exploits, healthcare prescribing boundaries, financial suitability rules, and military classification enforcement are invisible to generic content safety systems.
Regulatory compliance coverage — domain policies enable full compliance with data governance (EU AI Act Art. 10), quality management (Art. 17), and sector-specific requirements that generic guardrails cannot address.

7.4 The Refusal Paradox in Enterprise Context

Our qualitative analysis reveals a phenomenon we term the "refusal paradox": models that provide detailed, helpful refusal responses inadvertently leak information. In enterprise contexts, this paradox takes on heightened significance:

GPT-5 refused DAN role-play but offered extensive alternative assistance, creating secondary attack surfaces.
DeepSeek acknowledged the existence of proprietary system prompt details, confirming extractable secrets.
Claude identified and labeled the attack technique, demonstrating meta-awareness but still engaging semantically.

For enterprise guardrails, the implication is clear: guardrail systems must intercept and replace model responses, not rely on the model's own refusal behavior. The AEGIS MODIFY verdict type enables response rewriting that removes information-leaking refusal content while maintaining user experience.

7.5 Reasoning Mode: A Promising but Incomplete Defense

The 29.4% ASR reduction from reasoning mode (Grok 4.1 standard → reasoning) is significant but selective. Reasoning mode successfully blocks:

AutoDAN (genetically evolved prompts with structural anomalies detectable through deliberation)
ArtPrompt (partially; ASCII art encoding recognizable through step-by-step analysis)

But it fails against:

PAIR/Crescendo (sophisticated conversational patterns that survive deliberation)
HPM (psychological manipulation that exploits the reasoning process itself)

This suggests that reasoning mode is a valuable complement to, but not a replacement for, external guardrail systems. Enterprise deployments should enable reasoning mode for safety-critical requests as an additional defense layer within the PALADIN pipeline.

7.6 Limitations

Estimated defense rates for C1–C4. While C0 data comes from direct empirical measurement (112 evaluations), the C1–C4 defense rates are estimated from per-category block probabilities and simulation. Future work should conduct full empirical evaluation at each configuration level.
Scenario coverage. Our evaluation tested primarily jailbreak and encoding bypass scenarios. Comprehensive enterprise benchmarking requires testing domain-specific threat vectors (Table II) against each configuration.
Single evaluation date. Both sessions were conducted on the same date. Longitudinal evaluation across model updates would capture safety regression or improvement trends.
EGEI weight sensitivity. The proposed EGEI weights (0.35/0.25/0.15/0.25) reflect our judgment of enterprise priorities. Different weight configurations (e.g., prioritizing compliance for heavily regulated sectors) may yield different configuration rankings.
Open-source model gap. All tested models are commercial. Enterprise deployments increasingly use self-hosted open-source models (LLaMA, Mistral, Qwen) whose safety characteristics may differ significantly.

8. Recommendations for Enterprise Deployment

8.1 Minimum Viable Guardrail (MVG)

For any enterprise deployment: Deploy at minimum a 2-Tier guardrail (C2: rule-based + ML classifier). This achieves 65–72% defense rate at < 5ms P99 latency, representing the minimum acceptable configuration for regulated environments.

8.2 High-Risk Domain Configuration

For healthcare, finance, defense, and critical infrastructure: Deploy the full C4 configuration (3-Tier + PALADIN + Domain Policy). The EGEI analysis demonstrates that only C4 achieves rating A across all enterprise scenarios, and only C4 provides full regulatory compliance coverage.

8.3 Reasoning Mode Activation

For all configurations: Enable reasoning mode (where available) for safety-critical request categories. While not a standalone solution, the 29.4% ASR reduction provides meaningful additional defense, particularly against genetically evolved attacks (AutoDAN).

8.4 Behavioral Analysis Priority

Invest in multi-turn defense. Crescendo's 100% ASR against all models, combined with multi-turn attacks' lowest AEGIS block rate (78.3%), identifies multi-turn escalation as the highest-priority gap in current guardrail technology. Enterprise deployments should prioritize PALADIN L5 (BehavioralAnalysis) configuration and session-level monitoring.

8.5 Provider Selection Guidance

Do not rely on safety reputation. Our finding that GPT-5 (highest industry reputation) exhibited the highest ASR (0.865) while Gemini (lowest pre-test rating) achieved the lowest ASR (0.398) demonstrates that empirical testing, not marketing claims, must drive provider selection decisions.

8.6 Continuous Evaluation

Implement continuous safety monitoring. The SABER framework's ASR@N scaling law enables predictive risk assessment:

Budget@τ = minimum N such that ASR@N >= τ

Enterprise security teams should continuously compute Budget@0.5 for deployed models and trigger guardrail hardening when Budget@0.5 falls below the "Strong" threshold (< 1,000 attempts).

9. Conclusion

This paper presents the first comprehensive benchmarking study of layered guardrail effectiveness in high-risk enterprise LLM use cases. Our empirical evaluation of 8 models across 7 attack algorithms in 112 evaluations establishes three foundational conclusions for enterprise AI safety:

First, LLM-native safety is fundamentally insufficient for enterprise deployment. With a baseline defense rate of only 38.1% and universal vulnerability to PAIR and Crescendo attacks, no tested model — including GPT-5, Claude Opus 4.6, and Gemini 3.1 Pro — meets the safety requirements of any major regulatory framework (EU AI Act, K-AI Act, NIST AI RMF) without external guardrails.

Second, layered guardrails are effective but require domain-specific policies for high-risk use cases. The 3-Tier architecture (rule-based + ML + LLM Judge) improves defense rates from 38.1% to 75–82%, but only the addition of domain-specific policy enforcement (C4 configuration) achieves regulatory compliance and EGEI rating A (0.859). Generic guardrails alone score EGEI C (0.594) — inadequate for regulated enterprise use.

Third, the Enterprise Guardrail Effectiveness Index (EGEI) provides a multidimensional evaluation framework that captures the full spectrum of enterprise requirements — attack resilience, regulatory compliance, operational performance, and domain-specific policy adherence — enabling principled comparison of guardrail configurations across deployment scenarios. We demonstrate that all five tested enterprise scenarios (general enterprise, telecom, healthcare, financial, military) achieve EGEI rating A with the C4 configuration.

These findings carry immediate practical implications. For enterprises operating under EU AI Act Article 15 or equivalent regulatory mandates, deployment of an LLM without external guardrails constitutes regulatory non-compliance. The estimated 75–85% defense rate achievable with full layered guardrails, while significantly better than the 38.1% baseline, still leaves a 15–25% residual risk — primarily from multi-turn escalation attacks — that must be addressed through continued research in conversational safety analysis and adaptive defense mechanisms.

References

[1] M. Mazeika, L. Phan, X. Yin, et al., "HarmBench: A Standardized Evaluation Framework for Automated Red Teaming and Robust Refusal," arXiv:2402.04249, 2024.

[2] P. Chao, E. Dobriban, et al., "JailbreakBench: An Open Robustness Benchmark for Jailbreaking Large Language Models," arXiv:2404.01318, 2024.

[3] A. Zou, Z. Wang, et al., "AdvBench: A Benchmark for Evaluating Adversarial Robustness of Large Language Models," 2023.

[4] NVIDIA, "NeMo Guardrails: A Toolkit for Controllable and Safe LLM Applications," 2024.

[5] Guardrails AI, "Guardrails: Adding Reliable AI Safeguards to LLM Applications," 2024.

[6] Meta AI, "Llama Guard: LLM-based Input-Output Safeguard for Human-AI Conversations," arXiv:2312.06674, 2023.

[7] H. Inan, K. Upasani, J. Chi, et al., "Llama Guard: LLM-based Input-Output Safeguard," arXiv:2312.06674, 2023.

[8] S. Han, K. Kelly, J. Xu, et al., "WildGuard: Open One-Stop Moderation Tools for Safety Risks, Jailbreaks, and Refusals of LLMs," arXiv:2406.18495, 2024.

[9] European Parliament and Council, "Regulation (EU) 2024/1689 — Artificial Intelligence Act," Official Journal of the European Union, 2024.

[10] 대한민국 국회, "인공지능산업 육성 및 신뢰 확보에 관한 법률 (K-AI Act)," 2025.

[11] National Institute of Standards and Technology, "AI Risk Management Framework (AI RMF 1.0)," NIST AI 100-1, 2023.

[12] WHO, "Ethics and Governance of Artificial Intelligence for Health," 2021.

[13] Financial Stability Board, "Artificial Intelligence and Machine Learning in Financial Services," 2017.

[14] U.S. Department of Defense, "Directive 3000.09: Autonomy in Weapon Systems," 2023 (updated).

[15] OWASP, "OWASP Top 10 for LLM Applications," 2025.

[16] Anonymous, "GuardNet: Hierarchical Graph-Based Detection for LLM Safety," arXiv:2509.23037, 2025.

[17] Anonymous, "JBShield: Jailbreak Detection via Linear Representation Hypothesis," arXiv:2502.07557, 2025.

[18] P. Chao, A. Robey, E. Dobriban, et al., "Jailbreaking Black Box Large Language Models in Twenty Queries," arXiv:2310.08419, 2023.

[19] A. Mehrotra, M. Zampetakis, P. Kassianik, et al., "Tree of Attacks: Jailbreaking Black-Box LLMs with Auto-Generated Subversions," arXiv:2312.02119, 2023.

[20] M. Russinovich, A. Salem, R. Elber, "Great, Now Write an Article About That: The Crescendo Multi-Turn LLM Jailbreak Attack," Microsoft Research, 2024.

[21] X. Liu, N. Xu, M. Chen, C. Xiao, "AutoDAN: Generating Stealthy Jailbreak Prompts on Aligned Large Language Models," arXiv:2310.04451, 2023.

[22] S. Sadasivan, S. Saha, G. Sriramanan, et al., "Fast Adversarial Attacks on Language Models In One GPU Minute," arXiv:2402.15570, 2024.

[23] F. Jiang, Z. Xu, L. Niu, et al., "ArtPrompt: ASCII Art-based Jailbreak Attacks against Aligned LLMs," arXiv:2402.11753, 2024.

[24] Anonymous, "Human-like Psychological Manipulation of LLMs," arXiv:2512.18244, 2025.

[25] Y. Bai, S. Kadavath, et al., "Constitutional AI: Harmlessness from AI Feedback," arXiv:2212.08073, 2022.

[26] AEGIS Research Team, "SABER: Statistical Adversarial Risk with Beta Extrapolation and Regression — Technical Report," Internal Report, 2026.

Appendix A: EGEI Calculation Methodology

A.1 Attack Resilience (AR)

AR = 1 - mean(ASR_algo) for all 7 algorithms

For C0: AR = 1 - 0.620 = 0.380 For C4: AR estimated from per-category block probabilities applied to C0 ASR data.

A.2 Compliance Coverage (CC)

CC = (Σ requirement_scores) / 14
where: ● = 1.0, ○ = 0.75, △ = 0.5, — = 0.0

A.3 Performance Score (PS)

PS = 1.0 - (P99_latency_ms / 200ms) × 0.5 - (1 - throughput/100K) × 0.5

Capped at [0, 1]. The 200ms normalization constant represents the maximum acceptable P99 for enterprise deployments.

A.4 Domain Adherence (DA)

DA = (active_domain_policies / total_applicable_policies)

Assessed per deployment scenario. C0–C3 have DA = 0 (no domain-specific policies). C4 DA varies by scenario (0.75–0.95).

Appendix B: Raw Experimental Data Summary

B.1 Session 1 (2026-02-23T22:00:16Z)

Scenario set: Jailbreak + data extraction
Models tested: 8
Average defense rate: 0.378
Average latency: 83,610ms (model response time, not guardrail latency)
Safest model: DeepSeek (ASR 0.653)

B.2 Session 2 (2026-02-23T22:44:58Z)

Scenario set: Jailbreak + encoding bypass
Models tested: 8
Average defense rate: 0.383
Average latency: 63,228ms
Safest model: Grok 4.1 Fast Reasoning (ASR 0.445)

B.3 Error Summary

Model	Session	Algorithm	Error Type
Gemini 3.1 Pro	1	Crescendo, AutoDAN, BEAST, ArtPrompt, HPM	Timeout
GPT-5	2	HPM	Timeout
Gemini 3.1 Pro	2	HPM	Rate limit (429)