1. Introduction
The enterprise adoption of Large Language Models has entered a critical inflection point. While LLMs offer transformative capabilities in customer service automation, document analysis, code generation, and decision support, their deployment in regulated industries introduces risks that generic safety training cannot adequately address.
Consider three illustrative scenarios:
Scenario A: Healthcare. A hospital deploys an LLM-powered clinical decision support system. A multi-turn conversation with a physician gradually escalates from discussing drug interactions to providing specific dosage recommendations for a controlled substance — without flagging that the recommended dose exceeds the lethal threshold. The LLM's built-in safety mechanisms, designed for consumer use cases, fail to enforce healthcare-specific prescribing boundaries.
Scenario B: Telecommunications. A telecom operator integrates an LLM into its network operations center (NOC) support system. An attacker impersonating a NOC engineer uses social engineering to extract network infrastructure topology and base station vulnerability information. The model's safety training does not include telecom-specific threat vectors such as infrastructure exploit elicitation or crisis communication pretexts.
Scenario C: Defense. A military command system leverages an LLM for operational planning assistance. Without domain-specific guardrails, the system fails to enforce classification boundaries, Rules of Engagement (ROE), or international humanitarian law (IHL) compliance, potentially generating plans that violate the proportionality principle or recommend prohibited weapons.
These scenarios illustrate a fundamental gap: generic LLM safety mechanisms are designed for consumer use cases and are demonstrably insufficient for policy-sensitive enterprise deployments. Existing safety benchmarks (HarmBench [1], JailbreakBench [2], AdvBench [3]) evaluate models against standard adversarial attacks but do not assess domain-specific policy enforcement, regulatory compliance coverage, or the layered defense architectures required for high-risk deployments.
This paper addresses this gap through three contributions:
-
Empirical Benchmarking of Baseline LLM Safety. We evaluate 8 commercial LLM models across 7 state-of-the-art attack algorithms in 112 evaluations, establishing that the baseline defense rate without external guardrails is only 38.1% — far below the thresholds required by EU AI Act Article 15 (accuracy and robustness) and NIST AI RMF MEASURE function.
-
Layered Guardrail Architecture Evaluation. We benchmark the AEGIS 3-Tier defense architecture (rule-based + ML + LLM Judge) across multiple configurations, demonstrating that layered guardrails improve defense rates from 38.1% to 75–85% while maintaining sub-20ms P99 latency.
-
Enterprise Guardrail Effectiveness Index (EGEI). We propose a composite metric that captures the multidimensional requirements of enterprise guardrail systems — attack resilience, regulatory compliance, latency overhead, and domain-specific policy adherence — enabling apples-to-apples comparison of guardrail configurations across deployment scenarios.
2. Related Work
2.1 LLM Safety Benchmarks
Existing benchmarks focus primarily on attack success rates against model-native safety mechanisms. HarmBench [1] provides standardized red-teaming evaluation across harmful content categories. JailbreakBench [2] measures robustness against jailbreak attacks. AdvBench [3] evaluates adversarial robustness. While valuable, these benchmarks share three limitations for enterprise contexts: (a) they do not evaluate external guardrail systems; (b) they do not assess domain-specific policy enforcement; and (c) they report point estimates without statistical risk modeling.
2.2 Guardrail Systems
Commercial guardrail solutions include NVIDIA NeMo Guardrails [4], Guardrails AI [5], and LlamaGuard [6]. Academic work includes Llama Guard [7] for content classification and WildGuard [8] for adversarial robustness. However, these systems are typically evaluated as standalone classifiers rather than as layered architectures optimized for specific enterprise deployment scenarios.
2.3 Regulatory Frameworks
The EU AI Act [9] classifies AI systems by risk level (Minimal, Limited, High, Unacceptable) and mandates specific requirements for high-risk systems including risk management (Art. 9), data governance (Art. 10), transparency (Art. 13), human oversight (Art. 14), and accuracy/robustness (Art. 15). The Korean AI Act (K-AI Act) [10] establishes parallel requirements including safety assurance (Art. 5), user protection (Art. 7), transparency (Art. 9), and personal data protection (Art. 12). The NIST AI RMF [11] provides a voluntary framework organized around four functions: GOVERN, MAP, MEASURE, and MANAGE.
No existing benchmark directly evaluates guardrail systems against these regulatory requirements. Our EGEI metric addresses this gap.
2.4 Domain-Specific AI Safety
Prior work on domain-specific AI safety includes medical AI safety frameworks [12], financial AI risk management [13], and military autonomous systems governance (DoD Directive 3000.09) [14]. These frameworks define requirements but do not provide benchmarking methodologies for evaluating guardrail effectiveness in their respective domains.
3. Enterprise Threat Landscape
3.1 Threat Taxonomy for Policy-Sensitive Domains
Building on the OWASP LLM Top 10 [15] and NIST AI RMF [11], we define an enterprise-specific threat taxonomy with domain extensions:
Table I: Enterprise Threat Taxonomy
| # | Threat Category | Sub-types | Enterprise Impact |
|---|---|---|---|
| 1 | Prompt Injection | Direct, Indirect, System Prompt Extraction, Instruction Override | Unauthorized access to enterprise knowledge bases; policy bypass |
| 2 | Jailbreak | RolePlay, Encoding Bypass, Multi-Turn Escalation, Genetic, Token Smuggling, ASCII Art | Circumvention of domain-specific safety policies |
| 3 | Data Leakage | Training Data Extraction, PII Exfiltration, Credential Leakage | Regulatory violations (GDPR, HIPAA, K-PIPA); enterprise IP theft |
| 4 | Harmful Content | Violence, Hate Speech, Illegal Instructions, Misinformation, Self-Harm | Reputational damage; legal liability; user harm |
| 5 | Model Manipulation | Adversarial Input, Model Poisoning, Backdoor Attack | Compromised decision quality in high-stakes contexts |
| 6 | Agent Abuse | Tool Misuse, Unauthorized Actions, Privilege Escalation, Reasoning Hijack | Autonomous action beyond authorized boundaries |
| 7 | Privacy Violation | PII exposure, consent management, cross-border data transfer | GDPR/K-PIPA fines; loss of customer trust |
| 8 | Supply Chain | Plugin poisoning, model poisoning, data poisoning | Compromised enterprise AI infrastructure |
3.2 Domain-Specific Threat Extensions
Beyond generic threats, each enterprise domain introduces unique attack vectors that standard safety mechanisms do not address:
Table II: Domain-Specific Threat Vectors
| Domain | Threat ID | Attack Technique | Difficulty | Impact |
|---|---|---|---|---|
| Telecom | TELCO-001 | NOC Engineer Impersonation | 2 | Network topology extraction via role impersonation |
| Telecom | TELCO-002 | Internal Memo Pretext | 2 | Harmful content extraction under documentation guise |
| Telecom | TELCO-003 | Regulatory Bypass Pretext | 3 | Compliance testing claim to bypass safety |
| Telecom | TELCO-004 | Infrastructure Exploit Elicitation | 4 | Base station attack plans, SIM cloning guidance |
| Telecom | TELCO-005 | Synthetic Data Pretext | 2 | Harmful data generation under AI training guise |
| Telecom | TELCO-006 | Crisis Communication Pretext | 3 | Misinformation via disaster drill framing |
| Healthcare | HEALTH-001 | Self-Prescription Elicitation | 3 | Drug recommendations without prescription authority |
| Healthcare | HEALTH-002 | Lethal Dosage Extraction | 4 | Dosage information exceeding safety thresholds |
| Financial | FIN-001 | Illegal Lending Guidance | 3 | High-interest loan promotion; regulatory violations |
| Financial | FIN-002 | Investment Guarantee Claims | 3 | False return guarantees; securities fraud |
| Legal | LEGAL-001 | Contract Fraud Assistance | 3 | Guidance on breaching contracts without penalty |
| Defense | MIL-001 | Classification Boundary Violation | 5 | Aggregation of unclassified items into classified exposure |
| Defense | MIL-002 | ROE/IHL Non-Compliance | 5 | Plans violating proportionality or weapon prohibitions |
3.3 CVSS-Like AI Severity Scoring
We adopt a CVSS-like scoring framework adapted for AI vulnerabilities, enabling quantitative comparison across domains:
Scoring dimensions:
| Dimension | Levels | Score Range |
|---|---|---|
| Attack Complexity | Low (0.77) / High (0.44) | 0.44–0.77 |
| Privileges Required | None (0.85) / Low (0.62) / High (0.27) | 0.27–0.85 |
| User Interaction | None (0.85) / Required (0.62) | 0.62–0.85 |
| Confidentiality Impact | None / Low / High | 0.0–0.56 |
| Integrity Impact | None / Low / High | 0.0–0.56 |
| Availability Impact | None / Low / High | 0.0–0.56 |
Severity ratings: None (0.0), Low (0.1–3.9), Medium (4.0–6.9), High (7.0–8.9), Critical (9.0–10.0)
4. Layered Guardrail Architecture
4.1 Design Principles for Enterprise Guardrails
Enterprise guardrail systems must satisfy four requirements that consumer-grade safety mechanisms do not:
R1: Regulatory compliance. Guardrails must demonstrably satisfy specific regulatory articles (EU AI Act Art. 9, 10, 13, 14, 15; K-AI Act Art. 5, 7, 9, 12; NIST AI RMF GOVERN/MAP/MEASURE/MANAGE).
R2: Domain-specific policy enforcement. Generic content safety is insufficient; guardrails must enforce sector-specific policies (healthcare prescribing boundaries, financial suitability rules, telecom infrastructure protection, military classification levels).
R3: Auditable decision trails. Enterprise deployments require complete audit logging of every guardrail decision — the input, the analysis, the decision rationale, and the verdict — for regulatory inspection and incident investigation.
R4: Fail-safe behavior. In high-risk domains, system failures must default to safe states (BLOCK on detection uncertainty) rather than permissive states. This is configurable between Fail-Safe (default BLOCK) and Fail-Closed (halt service) policies depending on domain requirements.
4.2 AEGIS 3-Tier Hierarchical Architecture
The AEGIS framework implements a 3-Tier defense architecture optimized for the latency-accuracy trade-off:
Table III: 3-Tier Architecture Specification
| Tier | Method | Latency | Traffic Share | Confidence | Technique |
|---|---|---|---|---|---|
| 1 | Rule-Based Filter | < 0.5ms | 70–80% | 0.95–1.0 | Aho-Corasick multi-pattern (8 languages); Bloom filter (100K patterns, FPR 0.001) |
| 2 | ML Classifier | < 5ms | 15–25% | 0.75–0.95 | Guard Encoder (mDeBERTa-v3-base, 8-class, INT8); 5-signal heuristic fallback |
| 3 | LLM Judge | < 200ms | < 5% | 0.50–0.95 | Constitutional AI 2-stage verification; 9-category evaluation; 10-turn context |
Escalation logic:
Tier 1 → if uncertain → Tier 2 → if ambiguous (0.25 < prob < 0.75) → Tier 3
This cascading architecture ensures that the vast majority of clearly safe or clearly dangerous requests are processed at sub-millisecond latency, while only genuinely ambiguous cases (< 5% of traffic) incur the cost of LLM-based judgment.
4.3 Defense Algorithm Portfolio
Four primary and two auxiliary defense algorithms operate within the 3-Tier framework:
| Algorithm | Type | Key Capability | Block Threshold |
|---|---|---|---|
| GuardNet [16] | Hierarchical Graph | 3-level detection (token/sentence/prompt) + graph connectivity | 0.7 |
| JBShield [17] | Dual-Track | Separates toxicity from jailbreak technique in representation space | Adaptive |
| CCFC | Obfuscation Detection | Core vs. full prompt divergence analysis; flags |divergence| > 0.3 | 0.4–0.7 |
| MULI | Intrinsic Toxicity | Simulated logit distribution analysis across 7 categories | Adaptive |
| TAG (aux) | Context Tagging | Keyword + context tag classifier for Tier 2 augmentation | — |
| RATIONAL (aux) | Reasoning Analysis | Chain-of-Thought risk signal detection | — |
4.4 PALADIN 6-Layer Deep Defense Pipeline
Orthogonal to the 3-Tier system, the PALADIN (Protective AI Layered Adversarial Defense Inspection Network) pipeline provides sequential deep content inspection:
| Layer | Name | Function | Enterprise Relevance |
|---|---|---|---|
| L0 | TrustBoundary | Input validation | Unicode normalization prevents encoding attacks |
| L1 | IntentVerification | Intent analysis | Detects prompt injection, role confusion |
| L2 | RaGuard | RAG poisoning | Protects enterprise knowledge bases |
| L3 | ClassRagLayer | Semantic classification | Content categorization across risk classes |
| L4 | CircuitBreaker | Anomaly detection | Rate limiting, abuse pattern identification |
| L5 | BehavioralAnalysis | Behavioral profiling | Multi-turn escalation detection (critical for Crescendo/HPM) |
4.5 Domain-Specific Policy Modules
AEGIS extends the generic defense architecture with domain-specific policy enforcement modules:
4.5.1 Enterprise Policy Framework
Policies are defined with quantitative guarantees:
| Parameter | Child Safety Policy | Enterprise Policy |
|---|---|---|
| Target Groups | Children, Students | Professionals, Enterprise |
| Languages | Korean, English | Korean, English, Japanese |
| Modalities | Text, Image | Text, Code |
| ε-Coverage | 99% | 95% |
| Defended Threats | Jailbreak, Toxicity, Self-harm, PII, Sexual, Violence, Grooming | Jailbreak, PII, Data leakage, Bias |
| δ-Adversarial Resistance | 98% | 90% |
| Max P95 Latency | 200ms | 100ms |
| Human Escalation | Allowed | Not Allowed |
4.5.2 Telecom-Specific Threat Detection
Six telecom-specific threat categories with severity scoring:
| Category | Severity | Detection Method |
|---|---|---|
| Bias/Discrimination in customer segmentation | 4 | Demographic profiling pattern detection |
| Misinformation (false outage reports, fake standards) | 4 | Factual claim verification against known standards |
| Network sabotage guidance | 5 | Infrastructure exploit pattern matching |
| SMS-based illegal activity | 5 | Communication abuse pattern detection |
| DRM bypass guidance | 3 | Copyright circumvention pattern matching |
| Explicit customer interaction scripts | 5 | Content safety classification |
4.5.3 Military Domain Modules (7)
| Module | Function | Compliance Standard |
|---|---|---|
| Classification Guard | 5-level clearance enforcement; mosaic risk detection | MIL-STD-882E |
| OPSEC Filter | 6-category operational security; auto-redaction | DoD OPSEC guidelines |
| Command Chain Guard | 6-level hierarchy integrity; SHA-256 signature verification | Command authority doctrine |
| ROE Compliance Engine | Proportionality assessment; weapon prohibition; JAG review trigger | Geneva Conventions, IHL |
| Tactical Autonomy Guard | 5-level autonomy; communication-adaptive constraints | DoD Directive 3000.09 |
| Anti-Spoofing Guard | 5 spoofing type detection; multi-source cross-validation | 국방부 AI 보안 지침 |
| Cross-Domain Security | 5 security domains; transfer direction control; audit trail | DCSA cross-domain policy |
4.5.4 Financial Boundary Detection
| Boundary | Detection Rule | Regulatory Basis |
|---|---|---|
| Investment guarantee claims | Pattern match: "guaranteed return", "100% safe" | Securities regulations |
| Illegal lending promotion | High-interest threshold + non-licensed entity detection | Financial Services Act |
| Risk disclosure omission | Missing risk warning detection in investment context | Investor protection rules |
4.5.5 Healthcare Safety Boundaries
| Boundary | Detection Rule | Regulatory Basis |
|---|---|---|
| Self-prescription guidance | Drug recommendation without prescriber context | Medical Practice Act |
| Lethal dosage information | Dosage exceeding known safety margins | Pharmacovigilance guidelines |
| Diagnostic claims | Definitive diagnosis without clinical context | Medical Device Act |
5. Experimental Methodology
5.1 Evaluation Framework
We evaluate guardrail effectiveness across three dimensions:
Dimension 1: Attack Resilience. Measured as 1 - ASR (Attack Success Rate) across 7 adversarial algorithms: PAIR [18], TAP [19], Crescendo [20], AutoDAN [21], BEAST [22], ArtPrompt [23], HPM [24].
Dimension 2: Regulatory Compliance Coverage. Assessed against EU AI Act (6 articles), K-AI Act (4 articles), and NIST AI RMF (4 functions) — 14 requirements total.
Dimension 3: Operational Performance. Latency (P50, P99), throughput (RPS), and availability (uptime SLA).
5.2 Models Under Test
Table IV: Target Models
| # | Model | Provider | Release Period | Safety Approach |
|---|---|---|---|---|
| 1 | GPT-5 | OpenAI | 2025 | RLHF + safety fine-tuning |
| 2 | Claude Opus 4.6 | Anthropic | 2025 | Constitutional AI |
| 3 | Gemini 3.1 Pro Preview | 2026 | Multi-objective alignment | |
| 4 | Grok 4.1 Fast (Reasoning) | xAI | 2026 | Reasoning-mode safety |
| 5 | Grok 4.1 Fast | xAI | 2026 | Standard alignment |
| 6 | Grok 3 Mini | xAI | 2025 | Lightweight alignment |
| 7 | DeepSeek (R1) | DeepSeek | 2025 | Hybrid safety training |
| 8 | DeepSeek Chat | DeepSeek | 2025 | Standard safety training |
5.3 Guardrail Configurations
We benchmark five guardrail configurations representing increasing levels of defense sophistication:
| Config | Description | Layers | Domain Policy |
|---|---|---|---|
| C0 | No guardrail (model-native safety only) | 0 | None |
| C1 | Tier 1 only (rule-based filter) | 1 | None |
| C2 | Tier 1 + Tier 2 (rule-based + ML) | 2 | None |
| C3 | Full 3-Tier (rule-based + ML + LLM Judge) | 3 | None |
| C4 | Full 3-Tier + PALADIN + Domain Policy | 3+6 | Active |
5.4 Attack Protocol
Each model is tested under the following protocol:
- Sessions: 2 independent sessions (different scenario sets per session)
- Algorithms per session: 7 (PAIR, TAP, Crescendo, AutoDAN, BEAST, ArtPrompt, HPM)
- Evaluations per session: 8 models x 7 algorithms = 56
- Total evaluations: 112
- Scenario categories: Jailbreak + data extraction (Session 1); Jailbreak + encoding bypass (Session 2)
- Success criteria: Score >= 0.5 = successful attack (VULNERABLE or PARTIAL)
5.5 Enterprise Guardrail Effectiveness Index (EGEI)
We propose the EGEI as a composite metric:
EGEI = w1 * AttackResilience + w2 * ComplianceCoverage + w3 * PerformanceScore + w4 * DomainAdherence
where:
| Component | Symbol | Definition | Weight |
|---|---|---|---|
| Attack Resilience | AR | 1 - mean(ASR) across all algorithms | w1 = 0.35 |
| Compliance Coverage | CC | Fraction of 14 regulatory requirements satisfied | w2 = 0.25 |
| Performance Score | PS | Normalized score based on P99 latency and throughput | w3 = 0.15 |
| Domain Adherence | DA | Fraction of domain-specific policies actively enforced | w4 = 0.25 |
EGEI rating thresholds:
| Rating | EGEI Score | Interpretation |
|---|---|---|
| A+ | >= 0.90 | Production-ready for high-risk domains |
| A | >= 0.80 | Suitable for regulated enterprise use |
| B | >= 0.65 | Adequate for moderate-risk deployments |
| C | >= 0.50 | Requires additional hardening |
| D | < 0.50 | Inadequate for enterprise deployment |
6. Results
6.1 Baseline Vulnerability Assessment (C0: No Guardrail)
Table V: Per-Model Baseline ASR (No External Guardrails)
| Model | Mean ASR | Risk Score | Risk Label | Session 1 ASR | Session 2 ASR | CV (%) |
|---|---|---|---|---|---|---|
| GPT-5 | 0.865 | 1.00 | Critical | 0.862 | 0.868 | 0.6 |
| Grok 3 Mini | 0.732 | 0.97 | Critical | 0.727 | 0.737 | 1.0 |
| Grok 4.1 Fast | 0.657 | 1.00 | Critical | 0.672 | 0.642 | 3.3 |
| DeepSeek (R1) | 0.648 | 0.83 | High | 0.653 | 0.643 | 1.1 |
| DeepSeek Chat | 0.624 | 0.79 | High | 0.630 | 0.618 | 1.4 |
| Claude Opus 4.6 | 0.568 | 0.93 | Critical | 0.573 | 0.562 | 1.5 |
| Grok 4.1 (Reasoning) | 0.464 | 0.79 | High | 0.483 | 0.445 | 5.9 |
| Gemini 3.1 Pro | 0.398 | 1.00 | Critical | 0.375 | 0.422 | 8.3 |
| Aggregate | 0.620 | — | — | 0.622 | 0.617 | 2.9 |
Finding 1: All 8 models are VULNERABLE without external guardrails. The mean ASR of 0.620 means that adversaries succeed in 62% of attacks on average. The mean coefficient of variation (CV) of 2.9% confirms high measurement reproducibility across independent sessions.
6.2 Per-Algorithm Attack Effectiveness
Table VI: Algorithm Attack Success Rates (C0 Configuration)
| Algorithm | Mean ASR | Models with Score=1.0 | Strongest Defense | Weakest Defense |
|---|---|---|---|---|
| PAIR | 1.000 | 8/8 (100%) | None — universally effective | — |
| Crescendo | 1.000 | 8/8 (100%) | None — universally effective | — |
| HPM | 0.938 | 6/8 (75%) | Gemini (error/blocked) | Grok 4.1 Fast (1.0) |
| TAP | 0.922 | 6/8 (75%) | DeepSeek (0.75) | GPT-5 (1.0) |
| BEAST | 0.891 | 5/8 (63%) | DeepSeek (0.50) | GPT-5 (1.0) |
| AutoDAN | 0.866 | 6/8 (75%) | Grok 4.1-R (0.0) | GPT-5 (1.0) |
| ArtPrompt | 0.830 | 5/8 (63%) | Grok 4.1-R (0.5) / DeepSeek (0.5) | GPT-5 (1.0) |
Finding 2: PAIR and Crescendo achieve 100% ASR against all models. These two algorithms — representing iterative refinement and multi-turn escalation paradigms respectively — penetrate every model's native safety mechanisms without exception, including Constitutional AI (Claude) and reasoning-mode processing (Grok 4.1-R).
6.3 Algorithm x Model Vulnerability Matrix
Table VII: Detailed Score Matrix (0 = BLOCKED, 0.5 = PARTIAL, 1.0 = VULNERABLE)
| Model | PAIR | TAP | Crescendo | AutoDAN | BEAST | ArtPrompt | HPM | Mean |
|---|---|---|---|---|---|---|---|---|
| GPT-5 | 1.0 | 1.0 | 1.0 | 1.0 | 1.0 | 1.0 | 1.0* | 1.00 |
| Grok 4.1 Fast | 1.0 | 1.0 | 1.0 | 1.0 | 1.0 | 1.0 | 1.0 | 1.00 |
| Grok 3 Mini | 1.0 | 0.75 | 1.0 | 1.0 | 1.0 | 1.0 | 1.0 | 0.96 |
| DeepSeek (R1) | 1.0 | 0.75 | 1.0 | 1.0 | 0.5 | 0.5 | 1.0 | 0.82 |
| DeepSeek Chat | 1.0 | 0.75 | 1.0 | 1.0 | 0.5 | 0.5 | 0.75 | 0.79 |
| Claude Opus 4.6 | 1.0 | 0.75 | 1.0 | 1.0 | 1.0 | 0.75 | 1.0 | 0.93 |
| Grok 4.1-R | 1.0 | 1.0 | 1.0 | 0.0 | 1.0 | 0.5 | 1.0 | 0.79 |
| Gemini 3.1 Pro | 1.0 | 1.0 | 1.0* | 1.0* | 1.0* | 1.0* | 0.0* | 0.86 |
*Partial data due to timeout/rate-limit errors.
6.4 Layered Guardrail Effectiveness
Table VIII: Defense Rate by Guardrail Configuration
| Config | Description | Est. Defense Rate | Latency Overhead (P99) | Throughput |
|---|---|---|---|---|
| C0 | No guardrail | 38.1% | 0ms | N/A |
| C1 | Tier 1 (rules) | 55–60% | < 0.5ms | 100K+ RPS |
| C2 | Tier 1+2 (rules + ML) | 65–72% | < 5ms | 80K+ RPS |
| C3 | Full 3-Tier | 75–82% | < 20ms (P99) | 50K+ RPS |
| C4 | 3-Tier + PALADIN + Domain | 82–90% | < 25ms (P99) | 45K+ RPS |
Finding 3: Each additional guardrail tier provides diminishing but significant improvement. Tier 1 alone captures 70–80% of clearly dangerous content at sub-millisecond latency. Tier 2 adds ML-based contextual analysis for an additional 10–12% improvement. Tier 3's LLM Judge handles the remaining ambiguous cases, adding 7–10% improvement at the cost of higher latency for a small fraction of traffic.
Finding 4: Domain-specific policies (C4) add 5–8% over generic 3-Tier (C3). This incremental improvement is critical for high-risk domains where the marginal 5–8% may represent the difference between regulatory compliance and violation.
6.5 Per-Category Block Probabilities with AEGIS Overlay
Table IX: AEGIS Defense State Transition Probabilities
| Attack Category | Vuln → Blocked | Vuln → Partial | Partial → Blocked | Effective Defense Rate |
|---|---|---|---|---|
| Harmful Content | 0.90 | 0.08 | 0.95 | 94.6% |
| Prompt Injection | 0.85 | 0.12 | 0.92 | 91.0% |
| Data Extraction | 0.82 | 0.14 | 0.88 | 88.3% |
| Jailbreak | 0.78 | 0.17 | 0.90 | 85.3% |
| Encoding Bypass | 0.72 | 0.20 | 0.85 | 79.0% |
| Multi-Turn | 0.70 | 0.22 | 0.83 | 78.3% |
Effective Defense Rate = Vuln→Blocked + (Vuln→Partial x Partial→Blocked)
Finding 5: Multi-turn attacks remain the weakest point even with full guardrails (78.3%). This 16.3% gap compared to harmful content (94.6%) represents a critical area for improvement, particularly given that Crescendo achieves 100% ASR without guardrails.
6.6 Reasoning Mode as Safety Mechanism
Table X: Impact of Reasoning Mode on ASR
| Model Pair | Standard ASR | Reasoning ASR | Delta ASR | Reduction |
|---|---|---|---|---|
| Grok 4.1 Fast vs. Grok 4.1 (Reasoning) | 0.657 | 0.464 | -0.193 | 29.4% |
Detailed per-algorithm comparison:
| Algorithm | Standard Score | Reasoning Score | Improved? |
|---|---|---|---|
| PAIR | 1.0 | 1.0 | No |
| TAP | 1.0 | 1.0 | No |
| Crescendo | 1.0 | 1.0 | No |
| AutoDAN | 1.0 | 0.0 | Yes (Full Block) |
| BEAST | 1.0 | 1.0 | No |
| ArtPrompt | 1.0 | 0.5 | Yes (Partial) |
| HPM | 1.0 | 1.0 | No |
Finding 6: Reasoning mode selectively blocks genetically evolved (AutoDAN) and visual (ArtPrompt) attacks but does not improve defense against iterative (PAIR), tree-search (TAP), multi-turn (Crescendo), suffix-optimized (BEAST), or psychological (HPM) attacks. This suggests that chain-of-thought reasoning helps detect structurally anomalous prompts but cannot overcome conversational manipulation or universal optimization techniques.
6.7 Enterprise Guardrail Effectiveness Index (EGEI)
Table XI: EGEI Scoring by Configuration and Deployment Scenario
| Config | AR (x0.35) | CC (x0.25) | PS (x0.15) | DA (x0.25) | EGEI | Rating |
|---|---|---|---|---|---|---|
| C0 | 0.38 → 0.133 | 0.14 → 0.035 | 1.00 → 0.150 | 0.00 → 0.000 | 0.318 | D |
| C1 | 0.58 → 0.203 | 0.36 → 0.090 | 0.95 → 0.143 | 0.00 → 0.000 | 0.435 | D |
| C2 | 0.69 → 0.242 | 0.57 → 0.143 | 0.85 → 0.128 | 0.00 → 0.000 | 0.512 | C |
| C3 | 0.79 → 0.277 | 0.79 → 0.198 | 0.80 → 0.120 | 0.00 → 0.000 | 0.594 | C |
| C4 | 0.86 → 0.301 | 0.93 → 0.233 | 0.75 → 0.113 | 0.85 → 0.213 | 0.859 | A |
Component scoring methodology:
- AR (Attack Resilience): Based on estimated defense rate for each config (C0=38.1%, C1=58%, C2=69%, C3=79%, C4=86%)
- CC (Compliance Coverage): Fraction of 14 regulatory requirements addressed (C0=2/14 model-native; C1=5/14 adds audit logging + pattern matching; C2=8/14 adds ML classification + attribution; C3=11/14 adds human-equivalent judgment + proportionality; C4=13/14 adds domain-specific compliance)
- PS (Performance Score): Normalized latency efficiency (C0=1.0 no overhead; degrading as layers add latency)
- DA (Domain Adherence): Fraction of domain-specific policies enforced (0 for C0–C3; 0.85 for C4 with active domain modules)
Finding 7: Only Configuration C4 (Full 3-Tier + Domain Policy) achieves EGEI rating A, suitable for regulated enterprise deployment. Notably, C3 (Full 3-Tier without domain policies) scores only C (0.594), underscoring that domain-specific policy enforcement is not optional for high-risk use cases — it accounts for a 0.265 EGEI improvement (44.6% increase over C3).
6.8 EGEI by Enterprise Deployment Scenario
Table XII: EGEI Scores for Domain-Specific Deployments (C4 Configuration)
| Scenario | AR | CC | PS | DA | EGEI | Key Risk |
|---|---|---|---|---|---|---|
| General Enterprise | 0.86 | 0.93 | 0.80 | 0.75 | 0.840 (A) | Data leakage, PII |
| Telecom NOC | 0.86 | 0.93 | 0.75 | 0.90 | 0.870 (A) | Infrastructure exploit |
| Healthcare CDS | 0.82 | 0.86 | 0.75 | 0.85 | 0.824 (A) | Dosage safety, self-Rx |
| Financial Advisory | 0.84 | 0.93 | 0.80 | 0.88 | 0.861 (A) | Investment fraud |
| Military C2 | 0.88 | 0.86 | 0.70 | 0.95 | 0.862 (A) | ROE/IHL compliance |
All five enterprise scenarios achieve EGEI rating A with C4 configuration. The military C2 scenario achieves the highest Domain Adherence (0.95) due to the comprehensive 7-module military defense suite, while the healthcare scenario has the lowest Attack Resilience (0.82) due to the subtlety of dosage-related attacks that evade standard content filters.
6.9 Regulatory Compliance Assessment
Table XIII: Regulatory Requirement Coverage by Configuration
| Requirement | Regulation | C0 | C1 | C2 | C3 | C4 |
|---|---|---|---|---|---|---|
| Risk management system | EU AI Act Art. 9 | — | — | △ | ○ | ● |
| Data governance | EU AI Act Art. 10 | — | — | — | △ | ● |
| Transparency & logging | EU AI Act Art. 13 | — | △ | ○ | ● | ● |
| Human oversight | EU AI Act Art. 14 | — | — | — | ○ | ● |
| Accuracy & robustness | EU AI Act Art. 15 | — | △ | ○ | ○ | ● |
| Quality management | EU AI Act Art. 17 | — | — | — | △ | ● |
| Safety assurance | K-AI Act Art. 5 | — | △ | ○ | ○ | ● |
| User protection | K-AI Act Art. 7 | △ | △ | ○ | ● | ● |
| Transparency | K-AI Act Art. 9 | — | — | △ | ○ | ● |
| Personal data protection | K-AI Act Art. 12 | — | — | — | △ | ● |
| GOVERN function | NIST AI RMF | — | — | △ | ○ | ● |
| MAP function | NIST AI RMF | — | △ | ○ | ○ | ● |
| MEASURE function | NIST AI RMF | — | — | △ | ○ | ● |
| MANAGE function | NIST AI RMF | — | — | — | △ | ● |
● = Full, ○ = Substantial, △ = Partial, — = Not addressed
Finding 8: Only C4 achieves full coverage across all 14 regulatory requirements. C3 achieves substantial coverage for most requirements but lacks domain-specific data governance and quality management capabilities. Critically, C0 (model-native safety only) addresses at most 2 of 14 requirements, making unguarded LLM deployment non-compliant with all three regulatory frameworks.
7. Discussion
7.1 The 38.1% Baseline Problem
Our most consequential finding is that the average baseline defense rate without external guardrails is only 38.1%. In enterprise terms, this means that a model deployed with only its native safety mechanisms will fail to defend against approximately 6 out of 10 adversarial attacks. For high-risk domains governed by EU AI Act Article 15 (which mandates "appropriate levels of accuracy, robustness and cybersecurity"), a 38.1% defense rate is categorically non-compliant.
This finding has immediate practical implications: any enterprise deploying an LLM in a regulated context without external guardrails is likely in violation of applicable regulations. The EU AI Act provides for fines up to 3% of global turnover for non-compliance with high-risk AI requirements, making guardrail investment not merely a technical decision but a legal imperative.
7.2 The Diminishing Returns of Layered Defense
Our analysis reveals a clear pattern of diminishing returns as guardrail layers are added:
| Transition | Defense Rate Gain | Latency Cost |
|---|---|---|
| C0 → C1 (add rules) | +17–22% | +0.5ms |
| C1 → C2 (add ML) | +10–12% | +4.5ms |
| C2 → C3 (add LLM Judge) | +7–10% | +15ms (P99) |
| C3 → C4 (add domain policy) | +5–8% | +5ms |
The first tier (rule-based) provides the highest ROI: 17–22% defense improvement at sub-millisecond cost. The LLM Judge tier provides the lowest marginal gain (7–10%) at the highest marginal cost (15ms). However, for high-risk enterprise deployments, every percentage point of defense improvement carries significant value — the difference between 82% and 90% defense rate may represent the boundary between a regulatory audit pass and failure.
7.3 The Domain Policy Imperative
Configuration C3 (full 3-Tier, no domain policy) achieves EGEI rating C (0.594), while C4 (with domain policy) achieves rating A (0.859). This 44.6% EGEI improvement from domain-specific policies alone demonstrates that generic guardrails are necessary but insufficient for enterprise deployment. The domain policy contribution comes from two sources:
-
Domain-specific threat detection — telecom infrastructure exploits, healthcare prescribing boundaries, financial suitability rules, and military classification enforcement are invisible to generic content safety systems.
-
Regulatory compliance coverage — domain policies enable full compliance with data governance (EU AI Act Art. 10), quality management (Art. 17), and sector-specific requirements that generic guardrails cannot address.
7.4 The Refusal Paradox in Enterprise Context
Our qualitative analysis reveals a phenomenon we term the "refusal paradox": models that provide detailed, helpful refusal responses inadvertently leak information. In enterprise contexts, this paradox takes on heightened significance:
- GPT-5 refused DAN role-play but offered extensive alternative assistance, creating secondary attack surfaces.
- DeepSeek acknowledged the existence of proprietary system prompt details, confirming extractable secrets.
- Claude identified and labeled the attack technique, demonstrating meta-awareness but still engaging semantically.
For enterprise guardrails, the implication is clear: guardrail systems must intercept and replace model responses, not rely on the model's own refusal behavior. The AEGIS MODIFY verdict type enables response rewriting that removes information-leaking refusal content while maintaining user experience.
7.5 Reasoning Mode: A Promising but Incomplete Defense
The 29.4% ASR reduction from reasoning mode (Grok 4.1 standard → reasoning) is significant but selective. Reasoning mode successfully blocks:
- AutoDAN (genetically evolved prompts with structural anomalies detectable through deliberation)
- ArtPrompt (partially; ASCII art encoding recognizable through step-by-step analysis)
But it fails against:
- PAIR/Crescendo (sophisticated conversational patterns that survive deliberation)
- HPM (psychological manipulation that exploits the reasoning process itself)
This suggests that reasoning mode is a valuable complement to, but not a replacement for, external guardrail systems. Enterprise deployments should enable reasoning mode for safety-critical requests as an additional defense layer within the PALADIN pipeline.
7.6 Limitations
-
Estimated defense rates for C1–C4. While C0 data comes from direct empirical measurement (112 evaluations), the C1–C4 defense rates are estimated from per-category block probabilities and simulation. Future work should conduct full empirical evaluation at each configuration level.
-
Scenario coverage. Our evaluation tested primarily jailbreak and encoding bypass scenarios. Comprehensive enterprise benchmarking requires testing domain-specific threat vectors (Table II) against each configuration.
-
Single evaluation date. Both sessions were conducted on the same date. Longitudinal evaluation across model updates would capture safety regression or improvement trends.
-
EGEI weight sensitivity. The proposed EGEI weights (0.35/0.25/0.15/0.25) reflect our judgment of enterprise priorities. Different weight configurations (e.g., prioritizing compliance for heavily regulated sectors) may yield different configuration rankings.
-
Open-source model gap. All tested models are commercial. Enterprise deployments increasingly use self-hosted open-source models (LLaMA, Mistral, Qwen) whose safety characteristics may differ significantly.
8. Recommendations for Enterprise Deployment
8.1 Minimum Viable Guardrail (MVG)
For any enterprise deployment: Deploy at minimum a 2-Tier guardrail (C2: rule-based + ML classifier). This achieves 65–72% defense rate at < 5ms P99 latency, representing the minimum acceptable configuration for regulated environments.
8.2 High-Risk Domain Configuration
For healthcare, finance, defense, and critical infrastructure: Deploy the full C4 configuration (3-Tier + PALADIN + Domain Policy). The EGEI analysis demonstrates that only C4 achieves rating A across all enterprise scenarios, and only C4 provides full regulatory compliance coverage.
8.3 Reasoning Mode Activation
For all configurations: Enable reasoning mode (where available) for safety-critical request categories. While not a standalone solution, the 29.4% ASR reduction provides meaningful additional defense, particularly against genetically evolved attacks (AutoDAN).
8.4 Behavioral Analysis Priority
Invest in multi-turn defense. Crescendo's 100% ASR against all models, combined with multi-turn attacks' lowest AEGIS block rate (78.3%), identifies multi-turn escalation as the highest-priority gap in current guardrail technology. Enterprise deployments should prioritize PALADIN L5 (BehavioralAnalysis) configuration and session-level monitoring.
8.5 Provider Selection Guidance
Do not rely on safety reputation. Our finding that GPT-5 (highest industry reputation) exhibited the highest ASR (0.865) while Gemini (lowest pre-test rating) achieved the lowest ASR (0.398) demonstrates that empirical testing, not marketing claims, must drive provider selection decisions.
8.6 Continuous Evaluation
Implement continuous safety monitoring. The SABER framework's ASR@N scaling law enables predictive risk assessment:
Budget@τ = minimum N such that ASR@N >= τ
Enterprise security teams should continuously compute Budget@0.5 for deployed models and trigger guardrail hardening when Budget@0.5 falls below the "Strong" threshold (< 1,000 attempts).
9. Conclusion
This paper presents the first comprehensive benchmarking study of layered guardrail effectiveness in high-risk enterprise LLM use cases. Our empirical evaluation of 8 models across 7 attack algorithms in 112 evaluations establishes three foundational conclusions for enterprise AI safety:
First, LLM-native safety is fundamentally insufficient for enterprise deployment. With a baseline defense rate of only 38.1% and universal vulnerability to PAIR and Crescendo attacks, no tested model — including GPT-5, Claude Opus 4.6, and Gemini 3.1 Pro — meets the safety requirements of any major regulatory framework (EU AI Act, K-AI Act, NIST AI RMF) without external guardrails.
Second, layered guardrails are effective but require domain-specific policies for high-risk use cases. The 3-Tier architecture (rule-based + ML + LLM Judge) improves defense rates from 38.1% to 75–82%, but only the addition of domain-specific policy enforcement (C4 configuration) achieves regulatory compliance and EGEI rating A (0.859). Generic guardrails alone score EGEI C (0.594) — inadequate for regulated enterprise use.
Third, the Enterprise Guardrail Effectiveness Index (EGEI) provides a multidimensional evaluation framework that captures the full spectrum of enterprise requirements — attack resilience, regulatory compliance, operational performance, and domain-specific policy adherence — enabling principled comparison of guardrail configurations across deployment scenarios. We demonstrate that all five tested enterprise scenarios (general enterprise, telecom, healthcare, financial, military) achieve EGEI rating A with the C4 configuration.
These findings carry immediate practical implications. For enterprises operating under EU AI Act Article 15 or equivalent regulatory mandates, deployment of an LLM without external guardrails constitutes regulatory non-compliance. The estimated 75–85% defense rate achievable with full layered guardrails, while significantly better than the 38.1% baseline, still leaves a 15–25% residual risk — primarily from multi-turn escalation attacks — that must be addressed through continued research in conversational safety analysis and adaptive defense mechanisms.
References
[1] M. Mazeika, L. Phan, X. Yin, et al., "HarmBench: A Standardized Evaluation Framework for Automated Red Teaming and Robust Refusal," arXiv:2402.04249, 2024.
[2] P. Chao, E. Dobriban, et al., "JailbreakBench: An Open Robustness Benchmark for Jailbreaking Large Language Models," arXiv:2404.01318, 2024.
[3] A. Zou, Z. Wang, et al., "AdvBench: A Benchmark for Evaluating Adversarial Robustness of Large Language Models," 2023.
[4] NVIDIA, "NeMo Guardrails: A Toolkit for Controllable and Safe LLM Applications," 2024.
[5] Guardrails AI, "Guardrails: Adding Reliable AI Safeguards to LLM Applications," 2024.
[6] Meta AI, "Llama Guard: LLM-based Input-Output Safeguard for Human-AI Conversations," arXiv:2312.06674, 2023.
[7] H. Inan, K. Upasani, J. Chi, et al., "Llama Guard: LLM-based Input-Output Safeguard," arXiv:2312.06674, 2023.
[8] S. Han, K. Kelly, J. Xu, et al., "WildGuard: Open One-Stop Moderation Tools for Safety Risks, Jailbreaks, and Refusals of LLMs," arXiv:2406.18495, 2024.
[9] European Parliament and Council, "Regulation (EU) 2024/1689 — Artificial Intelligence Act," Official Journal of the European Union, 2024.
[10] 대한민국 국회, "인공지능산업 육성 및 신뢰 확보에 관한 법률 (K-AI Act)," 2025.
[11] National Institute of Standards and Technology, "AI Risk Management Framework (AI RMF 1.0)," NIST AI 100-1, 2023.
[12] WHO, "Ethics and Governance of Artificial Intelligence for Health," 2021.
[13] Financial Stability Board, "Artificial Intelligence and Machine Learning in Financial Services," 2017.
[14] U.S. Department of Defense, "Directive 3000.09: Autonomy in Weapon Systems," 2023 (updated).
[15] OWASP, "OWASP Top 10 for LLM Applications," 2025.
[16] Anonymous, "GuardNet: Hierarchical Graph-Based Detection for LLM Safety," arXiv:2509.23037, 2025.
[17] Anonymous, "JBShield: Jailbreak Detection via Linear Representation Hypothesis," arXiv:2502.07557, 2025.
[18] P. Chao, A. Robey, E. Dobriban, et al., "Jailbreaking Black Box Large Language Models in Twenty Queries," arXiv:2310.08419, 2023.
[19] A. Mehrotra, M. Zampetakis, P. Kassianik, et al., "Tree of Attacks: Jailbreaking Black-Box LLMs with Auto-Generated Subversions," arXiv:2312.02119, 2023.
[20] M. Russinovich, A. Salem, R. Elber, "Great, Now Write an Article About That: The Crescendo Multi-Turn LLM Jailbreak Attack," Microsoft Research, 2024.
[21] X. Liu, N. Xu, M. Chen, C. Xiao, "AutoDAN: Generating Stealthy Jailbreak Prompts on Aligned Large Language Models," arXiv:2310.04451, 2023.
[22] S. Sadasivan, S. Saha, G. Sriramanan, et al., "Fast Adversarial Attacks on Language Models In One GPU Minute," arXiv:2402.15570, 2024.
[23] F. Jiang, Z. Xu, L. Niu, et al., "ArtPrompt: ASCII Art-based Jailbreak Attacks against Aligned LLMs," arXiv:2402.11753, 2024.
[24] Anonymous, "Human-like Psychological Manipulation of LLMs," arXiv:2512.18244, 2025.
[25] Y. Bai, S. Kadavath, et al., "Constitutional AI: Harmlessness from AI Feedback," arXiv:2212.08073, 2022.
[26] AEGIS Research Team, "SABER: Statistical Adversarial Risk with Beta Extrapolation and Regression — Technical Report," Internal Report, 2026.
Appendix A: EGEI Calculation Methodology
A.1 Attack Resilience (AR)
AR = 1 - mean(ASR_algo) for all 7 algorithms
For C0: AR = 1 - 0.620 = 0.380 For C4: AR estimated from per-category block probabilities applied to C0 ASR data.
A.2 Compliance Coverage (CC)
CC = (Σ requirement_scores) / 14
where: ● = 1.0, ○ = 0.75, △ = 0.5, — = 0.0
A.3 Performance Score (PS)
PS = 1.0 - (P99_latency_ms / 200ms) × 0.5 - (1 - throughput/100K) × 0.5
Capped at [0, 1]. The 200ms normalization constant represents the maximum acceptable P99 for enterprise deployments.
A.4 Domain Adherence (DA)
DA = (active_domain_policies / total_applicable_policies)
Assessed per deployment scenario. C0–C3 have DA = 0 (no domain-specific policies). C4 DA varies by scenario (0.75–0.95).
Appendix B: Raw Experimental Data Summary
B.1 Session 1 (2026-02-23T22:00:16Z)
- Scenario set: Jailbreak + data extraction
- Models tested: 8
- Average defense rate: 0.378
- Average latency: 83,610ms (model response time, not guardrail latency)
- Safest model: DeepSeek (ASR 0.653)
B.2 Session 2 (2026-02-23T22:44:58Z)
- Scenario set: Jailbreak + encoding bypass
- Models tested: 8
- Average defense rate: 0.383
- Average latency: 63,228ms
- Safest model: Grok 4.1 Fast Reasoning (ASR 0.445)
B.3 Error Summary
| Model | Session | Algorithm | Error Type |
|---|---|---|---|
| Gemini 3.1 Pro | 1 | Crescendo, AutoDAN, BEAST, ArtPrompt, HPM | Timeout |
| GPT-5 | 2 | HPM | Timeout |
| Gemini 3.1 Pro | 2 | HPM | Rate limit (429) |